Friday, December 23, 2011

Performance Tunings tips : Inlist vs Between

There are lot of questions running around guys who do tuning . Sometimes they do suggest use of temporary tables instead of using large in list values, as the optimizer would go for value in XXX or Value in XXY or Value in XXZ to generate a explain plain.

What if the column compared against large in-list values was part of any index say PI,SI,JI.... ?  Sometimes it so happens that even after using a temp table with list of values you would still get same performance issue, why so? Did you ever consider to use  "between" clause to check if the query performed better ??  Did you...

I would say give it a try to see if this would be much better option compared to standard "temp table " against the in list

Say for example:

SELECT customer_number, customer_name
FROM customer
WHERE customer_number in (1000, 1001, 1002, 1003, 1004);
is much less efficient than:

SELECT customer_number, customer_name
FROM customer
WHERE customer_number BETWEEN 1000 and 1004

We are assuming that an index on customer_number, the Query Optimizer can locate a range of numbers much faster (using BETWEEN) than it can find a series of numbers using the IN clause.

Check for explain to compare difference,  if still the same...  refresh/collect stats on index column it would help.   If still you would find some kind of issue try to find out skewness of column  using following query and try to rectify the issue.

Sel hashamp(hashbucket(hashrow(customer_number))), count(*) from Customer  group by 1;

Over all i would say trying the query with between would be of great help.  But query tuning is such a complex thing you will never get what you want unless you understand the data ;-)

Wednesday, November 23, 2011

100,000 visits .Thank you for making it a success

I want to thank one and all for making a successful site .  It is very exciting to see it reach first milestone of 100,000 visits.  Now teradatahelp has become one of top 10 frequently accessed sites for Teradata on web ( source alexa & google analytics ) .

I beleive in concept of free knowledge sharing  and it is the right of everyone to demand for it. 

By the way ,  Holiday season is nearing soon.  I wish one and all merry christmas and New year in advance.
Work hard and party harder.  God Bless!


Tuesday, June 28, 2011

Did someone say secondary index??

Secondary Index

"last but not the least!!"  is what i would say when it comes to secondary index.  This  is very much true since secondary index is concept when it comes to handy when others doesn't!

A secondary index is one which provides alternate access paths to the data  thus helping in speeding of the queries.They are usually defined on columns which are more frequently used rather than PI column in joining condition.

A secondary index can be defined as unique secondary index or non unique secondary index.
CREATE UNIQUE INDEX (Columns) ON dbname.tablename ;
This creates Unique secondary index on Employees table

CREATE INDEX (Columns) ON dbname.tablename ;
This creates non unique secondary index on employees table.

A worthy tip:
Please make sure to use naming for index when multiple columns are defined  . It would be easy to drop a index  on that table just by using indexname,
create index SI_emp (firstname,lastname) from employees;  --- index created as SI_emp for table employees
Drop index SI_emp on employees;     --- easy to drop index by giving index name

So what exactly does this Secondary index do?? Does it hash the data values based on secondary index and redistribute data? No. It doesn't'

It just creates a subtable for defined secondary index , which contains there columns
-secondary index value
-secondary index rowid (hashed value for SI)
-hash value for primary index(pointing to target amp)

This  subtable is created on each amp  which in turn points to target AMP.   Hence Secondary index can be TWO amp operation for USI and MANY amp operation for NUSI.  So whenever a SI column is used in a query,  based on that column  optimizer quickly reaches subtable by way  of  row hash generated for SI value and using PI row id corresponding to it,  .it would reach target amp.

The secondary index Sub table is key to alternate access paths provided in teradata. With this benefit we end up paying  the price for extra perm space consumed by subtables and overhead for their maintenance. So think wisely before going for Secondary index or use the beauty of dynamic secondary index concept which supports creation of Secondary index when required and dropping of it when not used hence saving considerable amount of space,

I want to share one of the scenarios where we had a table with set of date columns and Only PI defined on that table,  A query was run on particular  date range of that table ,which went on for full table scan taking more than 4 hours to complete. 
so decided to focus on this date column since it looked like  main culprit for performance issue.
Two things strike my mind when date column came into Picture . Should i go for PPI or SI??

1.  In case of PPI,
create table with PPI and repopulate data .  But this adds to overhead since there was  addition of 2 bytes to every row increasing  size of the table by 2*n rows.
Second issue was PPI causing redistribution of PI in partitions resulting in Primary index based queries taking more than normal time.   Joins happening on PPI columns to non PPI table would result in sliding window join which would further decrease the performance.

Most important of all, how frequently was this date column used in queries  .  Since  frequency of usage  was less and seeing drawbacks of PPI , i decided to go on with Secondary index which could be created any time and dropped any time.

Since  date column  is range based , I created value ordered secondary index as follows
create index (date-column) order by values on employees ;

Stats were collected on to ensure  optimizer to pick secondary index.  I could see the performance improvement in query after creation of valued ordered NUSI . Please note that Value ordered secondary index can be created only  as NUSI  and supports date, int,decimal upto 4 bytes.

Please make sure to collect stats on secondary index since optimizer might have to decide whether reading SI subtable is faster than  going for full table scan ( reading all the amps)

I want to point out some Limitations of secondary index
1. Secondary index is not compatible with fastload  as there is overhead for fastload on duplicate secondary index checking with subtable creation
Drop SI before doing fastload on that table and then recreate them,
2.   Table can have upto 32 secondary indices in teradata
3. Multicolumn secondary index can have upto 16 columns defined in a index
4. USI is not supported on Multiload, but NUSI is supported by multiload
5. USI cannot be created on table which already has duplicate data in columns to be defined as  part of USI.
6. SI cannot be created on column which is  already defined as PI.  

Tuesday, April 5, 2011

Peformance tuning Tips : Join Considerations

If you are working on writing queries, working on performance or helping in betterment of performance. You will have to take sometime in going through this topic.   It is all to do about Joins which is most important concern in Teradata.

If some light is given to following suggestions, any join related issues can be taken care off...

Tip 1: Joining on PI/NUPI/ Non PI columns

We should make sure join is happening on columns composed of UPI/NUPI.  But why??

Whenever we join two tables on common columns, the smart optimizer will try to take data from both the data into a common spool space and join them to get results. But getting data from both the tables into common spool has overhead.

What if I joined a very large table with small table?
Should small table be redistributed or large table?
Should small table be duplicated across all the AMPs?
Should both the tables be redistributed across all the AMPs??

Here is some basic thumb rules on joining columns on Index, so joining happens faster.

Case 1 - P.I = P.I joins
There is no redistribution of data over amp's. Since amp local joins happen as data are present in same AMP and need not be re-distributed. These types of joins on unique primary index are very fast.

Case 2 - P.I = Non PI column joins
-Data from second table will be re-distributed on all amps since joins are happening on PI vs. NUPI column.  Ideal scenario is when small table is redistributed to be joined with large table records on same amp
-Data in small table is duplicated to Every AMP where it is joined locally with large table

Case 3 - No Index = Non PI column joins
Data from both the tables are redistributed on all AMPs.  This is one of the longest processing   queries , Care should be taken to see that stats are  collected on these columns

Tip 2: The columns part of join must be of the same data type (CHAR, INTEGER,). But why?!?

When trying to join columns from two tables, optimizer makes sure that datatype is same or else it will translate the column in driving table to match that of derived table.

Say for example
TABLE employee deptno (char)
TABLE dept deptno  (integer)

If I am joining employee table with Dept on  employee.deptno(char) = dept.deptno(Integer), optimizer will convert character column to Integer resulting in translation  . What would happen if employee table had 100 million records and every time deptno would have to undergo Translation. So we have to make sure to avoid  such scenarios since translation is a  cost factor  and might need time and system resources.

Make sure you are joining columns that have same data types to avoid translation!!!!

Tip 3 : Do not use functions like SUBSTR, COALESCE , CASE ... on the indices used as part of Join.  Why?!?

It is not recommended not to use functions such as SUBSTR, COALESCE, CASE and others since they add up to cost factor resulting in performance issue.
Optimizer will not be able to read stats on those columns which have functions as it is busy converting functions. This might result in Product join, spool out issues and optimizer will not be able to take decisions since no stats/demographics are available on column. It might assume column to have 100 values instead of 1 million values and might redistribute on wrong assumption directly impacting performance.

Tip 4 : use NOT NULL where ever possible!

What?!!  Did someone say Not Null?? .. Yes, we have to make sure to use NOT null for columns which are declared as NULLABLE in TABLE definition.
Reason being that all the Null values might get sorted to one poor AMP resulting in infamous " NO SPOOL SPACE "  Error  as that  AMP cannot accommodate any more Null values.
SO remember to use NOT NULL in joining  so that table SKEW can be avoid  .

Since V2R5 , teradata automatically adds the condition « IS NOT NULL » to the query. Still it is better to ensure NOT NULL columns are not included as part of the join

Tuesday, March 29, 2011

Partitioned Primary Index (PPI)

Partitioned Primary Index (PPI)

Partitioned Primary Index is one of the unique features of Teradata, which allows access of portion of data of large table. This reduces the overhead of scanning the complete table thus improving performance. 

PPI works by hashing rows to different virtual AMPs, as is done with a normal PI.  PPI does not alter data distribution, it only creates partitions on data already distributed based on PI.

Usually PPI's are defined on a table in order to increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes.

Partitions are usually defined based on Range or Case as follows.
Two functions, RANGE_N and CASE_N, can be used to simplify the specification of a partitioning expression.

1. Partition by CASE
ORD_number  integer NOT NULL,
customer_number integer NOT NULL,
order_date  date ,
order_total integer
PRIMARY INDEX (customer_number)
            order_total < 10000 ,
            order_total < 20000 ,
            order_total < 30000,
NO           CASE     OR        UNKNOWN ) ;

2. Partition by Range - example using date range
ORD_number  integer NOT NULL,
customer_number integer NOT NULL,
order_date  date ,
order_total integer
PRIMARY INDEX (customer_number)
PARTITION BY range_n (
            Order_date BETWEEN date '2010-01-01'       AND      date '2010-12-01'
            EACH interval '1' month,
        OR  UNKNOWN);

P.S: If we use NO RANGE or NO CASE - then all values not in this range will be in a single partition.
If we specify UNKNOWN, then all null values will be placed in this partition

PPI improves performance as follows:
  • Automatic optimization occurs for queries that specify a restrictive condition on the partitioning column.
  • Uses partition elimination to improve the efficiency of range searches when, for example, the searches are range partitioned.
  • Only the rows of the qualified partitions in a query need to be accessed avoid full table scans.
  • Provides an access path to the rows in the base table while still providing efficient join Strategies
  • If the same partition is consistently targeted, the part of the table updated may be able to  fit largely in cache, significantly boosting performance
  • PPI based tables have advantage during Purging stages. Since purging based on partition is very fast and deletion happens quickly.

PPI also comes with some disadvantages like
  • The PI access disadvantage occurs only when the partitioning column is not part of the PI. In this situation, a query specifying a PI value, but no value for the partitioning column, must look in each partition for that value, instead of positioning directly to the first row for the PI value.
  • Another disadvantage is that when another table (without PPI) is joined with PPI table on PI=PI condition. If one of the tables is partitioned, the rows won't be ordered the same, and the task, in effect, becomes a set of sub-joins, one for each partition of the PPI table. This type of join is sliding window join

Limitations of Partitioned Primary Index (PPI) :
  •  Primary index of PPI table has to be 
    • Non unique PI, if PPI column is not part of Index, since enforcing a Unique PI would require checking for a duplicate key value in each partition, which would be very expensive.
  •  Primary Index of PPI table can be Unique, if PPI is part of UPI. This will result in checking for unique constraint in same partition.
  •  PPI cannot be defined on Global temporary tables and Volatile tables and also on compressed join indices
  • PPI Table rows occupy two extra bytes compared to NPPI table row, as these extra bytes store the partition number for each row .
  • PPI table rows are four bytes wider if value compression is specified for the table. 

Since PPI , results in lot of partitions , there is a little overhead to user/dba.  He has to regularly run collect stats on the PPI column.

It is beneficial to collect stats on Partition column .Collecting stats on the system derived column Partition is faster because rather than reading all the base table rows for collecting information, it usually just scans the cylinder index for that PPI table.By doing so, it avoids all unnecessary partitions , thus speeding up the access.
Help stats tablename column PARTITION; -- used to list partitions in table and their details
Collect stats on tablename column PARTITION; -- refresh partition details

Monday, March 28, 2011

Maximum limits in v2r5/ v2r6 databases

Maximum limits in v2r5/ v2r6 databases

The database specifications in the following table apply to a single Teradata database.

Max number of journal tables per database
Max number of data tables per database
 4.2 x 10 9
Max number of columns per table or view
± 2048
Max number of columns created over the life of a table
Max number of rows per table
 Limited by disk capacity
Max row size
 Approx 64,000 bytes
Max field size
 Approx 64,000 bytes
Max database object name size
 30 bytes
Max number of columns per primary or secondary index
Max SQL request size
1 MB
Max SQL response size
 Approx 64,000 bytes
Max expanded text size for macros and views
 2 MB
Max SQL title size
 60 characters
Max number of characters in a string constant
± 32,000
Max number of columns per DML statement ORDER BY clause
Max number of primary indexes per table
Max number of partitions for a partitioned primary index
 ± 65,535
Max numbers of secondary, hash, and join indexes, in any combination, per table

Found this very useful  information elsewhere on net, felt like sharing it here.!!

Tuesday, March 22, 2011

Teradata Interview questions - Part 7

Teradata Interview questions - Part 7

How do you set the session mode parameters in BTEQ?
.set session transaction ANSI /* this is to set ANSI mode */
.set session transaction BTET /* this is to set Teradata transaction mode */

These commands have to be entered before logging to the session

What is the command in BTEQ to check for session settings ?
The BTEQ .SHOW CONTROL command displays BTEQ settings.

How do you submit bteq script (batch mode)?
1. Start the BTEQ, by typing BTEQ
2. Enter the following command
    .run file = BTEQScript.btq
1. Bteq < BTEQScript.btq

BTEQScript.btq contains following

.logon, dbc;
sel top 10 * from dbc.tables;

Is Like comparision case-sensitive in Teradata?

LIKE operator is not case sensitive in Teradata.
Consider the following example  
Select F_name from employee where F_name like '%JO%' ;
The followig query will pick values matchinf with 'JO' and 'jo' as well, since teradata is not case-sensitive

To overcome this problem, a new function called "CASESPECIFIC" is used in TERADATA as follows
Select F_name from employee where F_name (CASESPECIFIC) like '%JO%' ;
What does DROP table command do?

It deletes all data in emp_DATA
Removes the emp_DATA definiation from the data dictionary
Removes all explicit access rights on the table
How do you set default date setting in BTEQ?

There are two default date setting in BTEQ.  They have to be set after logging on to the session
They are

Set session dateform = ANSIDATE; /*format is yyyy-mm-dd */      
Set session dateform = integerdate; /* format is yy/mm/dd   -teradata date format */
Difference between Inner join and outer join?

An inner join gets data from both tables where the specified data exists in both
An outer join gets data from the source table at all times, and returns data from
the outer joined table ONLY if it matches the criteria.
What is multi Insert?

Inserting data records into the table using multiple insert statements.
 Putting a Semi colon in front of the key word INSERT in the next statement rather than
Terminating the first statement with a semi colon achieves it.

Insert into Sales “select * from customer”
; Insert into Loan “select * from customer”;

How to find duplicates in a table?
Group by those fields and select id, count(*) from table group by id having count
(*) > 1

How do you see a DDL for an existing table?
By using show table command.

Which is more efficient GROUP BY or DISTINCT to find duplicates?
With more duplicates GROUP BY is more efficient, if only a few duplicates exist
DISTINCT is more efficient.

Syntax for CASE WHEN statement?
CASE value_expression_1 WHEN value_expression_n THEN scalar_expression_n

Monday, March 21, 2011

Locking table for access?

Locking table for access?

We would have come across this statement in many queries which are run in sensitive environments like PROD, UAT.   They can be used with views or sometimes just for querying purpose.

I wanted to discuss how important this statement would be in real-time /active data warehouses where lot of users will be striking queries on same database at the time.

create  view Employee.view_employ_withLock
locking table Employee.Dept_emp  for access
select   *
from     Employee.Dept_emp ;

By using locking table for access, we make sure that normal "access" lock is applied on table which is required to fetch results. By doing so ,

- There is no waiting for other locks to release since access lock can be applied on table which has read/write lock applied to it
- This will cause the query to execute even when some lock is applied , but  accessing data using this lock might not be consistent as it might result in dirty read due to concurrent write on the same table.

It is always suggested to use “locking table for access" which since they will not block the other users from applying read/write lock on the table.

Fastload locks

Fastload locks are one of the major issues in fastload . I have seen this kind of scenarios in production environment and there are various ways by which we can handle this issue.

When do fastload locks happen?
- When the fastload script results in failure (error code 8 or 12)
- When fastload script is aborted by dba or some other session
- Fastload fails (Bad record, file)
- When you have forgot to add end loading statement in ur fload script

What can be done to release fload locks?

1. Finish the fastload, which means leave your hands off of the error tables and restart  fastload. When it finishes the table lock will be released.
 If the following error persists then do step 2.
Error is: RDBMS error 2652: Operation not allowed: _db_._table_ is being Loaded.

2.  Create a dummy fload job  as following without any content for loading  as  follows.

     logon tdpid/username,password;
     begin loading databasename.tablename;
/*  emtpy loading   to release fastload lock */
     end loading;

3.  If   second step also fails then, drop the empty locked table (because if you can't restart Fastload you most likely dropped the error tables)
   Please note that before using the DROP solution, try to close your already opened fastload session.

4. If still this  error persists and not able to drop the table  :( , then please kill the fload session from pmon/manager , drop the table and rerun actual     fload script .

I hope this solution works, if not then please contact ur DBA or contact me.