Tuesday, April 5, 2011

Peformance tuning Tips : Join Considerations

If you are working on writing queries, working on performance or helping in betterment of performance. You will have to take sometime in going through this topic.   It is all to do about Joins which is most important concern in Teradata.

If some light is given to following suggestions, any join related issues can be taken care off...

Tip 1: Joining on PI/NUPI/ Non PI columns

We should make sure join is happening on columns composed of UPI/NUPI.  But why??

Whenever we join two tables on common columns, the smart optimizer will try to take data from both the data into a common spool space and join them to get results. But getting data from both the tables into common spool has overhead.

What if I joined a very large table with small table?
Should small table be redistributed or large table?
Should small table be duplicated across all the AMPs?
Should both the tables be redistributed across all the AMPs??

Here is some basic thumb rules on joining columns on Index, so joining happens faster.

Case 1 - P.I = P.I joins
There is no redistribution of data over amp's. Since amp local joins happen as data are present in same AMP and need not be re-distributed. These types of joins on unique primary index are very fast.

Case 2 - P.I = Non PI column joins
-Data from second table will be re-distributed on all amps since joins are happening on PI vs. NUPI column.  Ideal scenario is when small table is redistributed to be joined with large table records on same amp
-Data in small table is duplicated to Every AMP where it is joined locally with large table

Case 3 - No Index = Non PI column joins
Data from both the tables are redistributed on all AMPs.  This is one of the longest processing   queries , Care should be taken to see that stats are  collected on these columns

Tip 2: The columns part of join must be of the same data type (CHAR, INTEGER,). But why?!?

When trying to join columns from two tables, optimizer makes sure that datatype is same or else it will translate the column in driving table to match that of derived table.

Say for example
TABLE employee deptno (char)
TABLE dept deptno  (integer)

If I am joining employee table with Dept on  employee.deptno(char) = dept.deptno(Integer), optimizer will convert character column to Integer resulting in translation  . What would happen if employee table had 100 million records and every time deptno would have to undergo Translation. So we have to make sure to avoid  such scenarios since translation is a  cost factor  and might need time and system resources.

Make sure you are joining columns that have same data types to avoid translation!!!!

Tip 3 : Do not use functions like SUBSTR, COALESCE , CASE ... on the indices used as part of Join.  Why?!?

It is not recommended not to use functions such as SUBSTR, COALESCE, CASE and others since they add up to cost factor resulting in performance issue.
Optimizer will not be able to read stats on those columns which have functions as it is busy converting functions. This might result in Product join, spool out issues and optimizer will not be able to take decisions since no stats/demographics are available on column. It might assume column to have 100 values instead of 1 million values and might redistribute on wrong assumption directly impacting performance.

Tip 4 : use NOT NULL where ever possible!

What?!!  Did someone say Not Null?? .. Yes, we have to make sure to use NOT null for columns which are declared as NULLABLE in TABLE definition.
Reason being that all the Null values might get sorted to one poor AMP resulting in infamous " NO SPOOL SPACE "  Error  as that  AMP cannot accommodate any more Null values.
SO remember to use NOT NULL in joining  so that table SKEW can be avoid  .

Since V2R5 , teradata automatically adds the condition « IS NOT NULL » to the query. Still it is better to ensure NOT NULL columns are not included as part of the join