Thursday, November 25, 2010

Types of Teradata Joins

Teradata joins

A join takes place when we join two or more tables on a column or a set of columns; the result contains the data from the matching records of the tables involved. This universal concept is the same across databases.
In Teradata, the Optimizer (a very smart interpreter) determines which join strategy to use for a given query, keeping performance in mind.

Some of the common join types used in Teradata are:
- Inner join (can also be "self join" in some cases)
- Outer Join (Left, Right, Full)
- Cross join (Cartesian product join)

When a user submits a join query, the optimizer comes up with a plan to perform the join. These join strategies include:
- Merge Join
- Nested Join
- Hash Join
- Product join
- Exclusion Join


 Merge Join
--------------------

In a merge join, the rows to be joined must be present on the same AMP. If they are not, Teradata will either redistribute the data or duplicate the smaller table in spool, based on the row hash of the columns involved in the join/WHERE clause.
If the two tables being joined have the same primary index, the matching records already reside on the same AMP and no redistribution is required.

There are four common scenarios for a merge join, depending on whether redistribution is needed (a small SQL sketch follows the cases):
Case 1: If the join is on UPI = UPI, the records to be joined are already on the same AMP and no redistribution is required. This is the most efficient and fastest join strategy.
Case 2: If the join is on UPI = non-index column, the rows of the second table have to be redistributed across the AMPs based on the row hash of the join column, so they land with the matching rows of the first table.
Case 3: If the join is on non-index column = non-index column, both tables have to be redistributed so that matching rows end up on the same AMP. This is time consuming, since a complete redistribution of both tables takes place across all the AMPs.
Case 4: For a join on the primary index, if the referenced table (the second table in the join) is very small, that table is duplicated (copied) to every AMP instead.
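As a rough illustration of Case 1, assuming two made-up tables customer and orders that are both defined with PRIMARY INDEX (cust_id), a join written on that shared PI column can stay AMP-local, and EXPLAIN would normally show a merge join with no redistribution step:

SELECT c.cust_name, o.order_amt
FROM customer c
INNER JOIN orders o
ON c.cust_id = o.cust_id;   /* PI = PI join: matching rows hash to the same AMP */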

 Nested Join
-------------------
A nested join is one of the most precise join plans suggested by the optimizer. It works when the join uses a UPI/USI: a single row is retrieved from the first table, and the matching rows are then fetched from the second table via an index (primary or secondary) on the join column, and the matching results are returned.

Example:

SELECT EMP.Ename, DEP.Deptno, EMP.Salary
FROM
EMPLOYEE EMP,
DEPARTMENT DEP
WHERE EMP.Enum = DEP.Enum
AND EMP.Enum = 2345;    -- this results in a nested join

 Hash join
----------------
A hash join is another plan the optimizer may suggest based on the joining conditions. Functionally it is a close relative of the merge join. In a merge join, the joining happens AMP-locally on data sorted by row hash; in a hash join, the smaller table is held completely inside the AMP's memory, and the AMP joins it on the row hash against the larger table.

Advantages of hash joins:
1. They can be faster than merge joins, since the large table does not need to be sorted.
2. The join happens between the small table held in AMP memory and the large table in unsorted spool, so it completes quickly.

 Exclusion Join
-------------------------

This type of join is suggested by the optimizer when the query uses any of the following:
- NOT IN
- EXCEPT
- MINUS
- SET subtraction operations


SELECT EMP.Ename, EMP.Salary
FROM
EMPLOYEE EMP
WHERE EMP.Enum NOT IN
( SELECT Enum FROM
DEPARTMENT DEP
WHERE Enum IS NOT NULL );

Make sure to add the additional "<column> IS NOT NULL" filter to the subquery, since a NULL in the NOT IN list causes the query to return no rows at all.

An exclusion join for a NOT IN query has three scenarios (a NOT EXISTS rewrite is sketched after the cases):
Case 1: Matched data in the "NOT IN" subquery disqualifies that row.
Case 2: Non-matched data in the "NOT IN" subquery qualifies that row.
Case 3: Any unknown result in the "NOT IN" disqualifies that row (a NULL is the typical example of this scenario).
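Because of Case 3, a NOT IN against a nullable subquery column can return no rows at all. A common workaround (sketched here with the same hypothetical EMPLOYEE/DEPARTMENT tables) is to rewrite the query with NOT EXISTS, which is not affected by NULLs:

SELECT EMP.Ename, EMP.Salary
FROM EMPLOYEE EMP
WHERE NOT EXISTS
( SELECT 1
  FROM DEPARTMENT DEP
  WHERE DEP.Enum = EMP.Enum );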

How to Export without junk characters in FastExport?


Solution1: Create a small OUTMOD to strip off the two-byte VARCHAR length field.

Solution 2: Another alternative, if you are in a UNIX environment, is to use awk, sed, or cut on the file after it has been exported, to strip the leading two characters from each record.

STEP 1: Write the query in the FastExport script as follows

SELECT '|'||Col1||';'||Col2 FROM TAB1; -- the leading '|' marks where the real data starts; ';' separates the columns
The generated output will look like this:
^A|Col1;Col2

"^A" represents the junk (2-byte length) value generated during the export.

STEP 2: Write a UNIX script and pass the file name as the first parameter

#####UNIX SOLUTION #############
##create a file trim.ksh ############
## Run the file trim.ksh <user_filename>###

#!/bin/ksh
FileName=$1
cut -f2- -d'|' "$FileName" > "$FileName.tmp"      # keep everything after the first '|'
mv "$FileName.tmp" "$FileName"                    # replace the original file with the cleaned version

The result, after cutting all the junk characters up to and including the first pipe character, will be as follows:
Col1;Col2

Solution 3: By casting the entire result to CHAR

When exporting a table in record mode, make sure the exported data does not contain the extra 2 length bytes by casting the entire result to CHAR().

For example: if table1 has 2 columns, Col1 of type CHAR(6) and Col2 of type CHAR(2),
write the query as follows:

SELECT CAST(Col1 || Col2 AS CHAR(8)) FROM table1;

Note: If we write the query as "SELECT Col1 || Col2 FROM table1",
the exported records will contain a 2-byte length field at the beginning, because concatenation returns a VARCHAR() type, not a CHAR() type.

Thursday, November 18, 2010

Teradata interview questions PART 5

Which is faster – MultiLoad delete or Delete command?
Answer

MultiLoad delete is faster than the normal DELETE command, since the deletion happens in data blocks of 64 Kbytes, whereas the DELETE command deletes data row by row. Also, the transient journal maintains entries only for the DELETE command; the Teradata load utilities do not use the transient journal.


How to skip or get the first and last record from a flat file through MultiLoad?
Answer

The .IMPORT command in MultiLoad has an option to specify the record number from which processing should begin, i.e. 'FROM m', where 'm' is the logical record number (an integer) of the record in the identified data source at which processing is to start. If you specify 'm' as 2, processing starts from the second record, skipping the first.

THRU k and FOR n are two more options of the same .IMPORT command; they control where processing ends.

If FROM m (start record) and FOR n / THRU k (stop record) are not mentioned, MultiLoad processes records from the start to the end of the file, as sketched below.
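A minimal .IMPORT sketch combining these options; the file name, layout and label are placeholders, and the exact clause order may need adjusting for your MultiLoad version. Here the header record is skipped and the next 1000 records are processed:

/* start at record 2 (skip the header) and process 1000 records */
.IMPORT INFILE infile.txt
        FROM 2 FOR 1000
        LAYOUT data_layout
        APPLY insert_emp;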


What do High confidence, Low confidence and No confidence mean in EXPLAIN plan?
Answer

EXPLAIN shows the execution strategy, i.e. the different steps the query will go through, along with a confidence level for each row-count estimate (see the sketch below).

HIGH CONFIDENCE: Statistics are collected on the column(s) involved.
LOW CONFIDENCE: Statistics are not collected, but the WHERE condition is on an indexed column, so estimates can be based on sampling.
NO CONFIDENCE: Statistics are not collected and the condition is on a non-indexed column.
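As a small illustration (table and column names are made up), collecting statistics on the column used in the WHERE clause typically moves the estimate from low/no confidence to high confidence:

COLLECT STATISTICS ON mydb.sales COLUMN (sale_dt);

EXPLAIN
SELECT * FROM mydb.sales WHERE sale_dt = DATE '2010-11-01';
/* the row estimate for this retrieval step should now read "with high confidence" */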


What is the default join strategy in Teradata?
Answer
# 1
The Teradata cost-based optimizer decides the join strategy based on the optimum access path. The common strategies used are the Merge, Hash and Nested joins.

Answer 
# 2 
Three strategies the optimizer may follow are:
1. Duplication of the rows of one table to every AMP
--> This is opted for by the optimizer when the non-PI join column belongs to a small table.
2. Redistribution of the non-PI rows of one table to the AMPs containing the matching PI rows
--> This is opted for when the tables are large; the non-PI join column of one table is redistributed to the AMPs holding the matching PI rows.
3. Redistribution of the rows of both tables by the hash value of the join column
--> This is chosen when the join is on a column that is not the PI of either table. This requires the most spool space.


Can we collect stats on derived tables, volatile tables and temporary tables?
What is a global temporary table?
What is the use of a global temporary table?

Answer
1. No for volatile and derived tables, and yes for global temporary tables.
2. Global temporary tables are temp tables like volatile tables, but unlike volatile tables their definitions are retained in the data dictionary (a sketch of both follows).
3. They are used whenever there is a need for a temporary table with the same table definition for all users.
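A hedged sketch of the difference (names are made up): the global temporary table's definition is stored in the data dictionary and can be reused by every session, while the volatile table exists only for the session that created it.

/* Definition stored in the DD; each session gets its own empty instance */
CREATE GLOBAL TEMPORARY TABLE mydb.gtt_orders
( order_id  INTEGER
, order_amt DECIMAL(10,2)
) PRIMARY INDEX (order_id)
ON COMMIT PRESERVE ROWS;

/* Definition held only for the current session; gone at logoff */
CREATE VOLATILE TABLE vt_orders
( order_id  INTEGER
, order_amt DECIMAL(10,2)
) PRIMARY INDEX (order_id)
ON COMMIT PRESERVE ROWS;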


How to handle NULLs in Teradata?
How many columns can there be in a table?
How to find the average salary without using the AVG function?
Answer
1. Use ZEROIFNULL/NULLIFZERO in the SELECT, and NULL directly in the INSERT.
2. A table can have a maximum of 2048 columns.
3. Without using AVG, we can find the average salary by using SUM(sal)/COUNT(sal):
  sel sum(sal)/count(sal) as avgsal from tablename


Why doesn't FastLoad load duplicate rows, even into a MULTISET table?
Answer

FastLoad does not load duplicate rows, even into a MULTISET table, for the following reason.

Say the FastLoad job fails. Up to the point of failure, some number of rows has already been sent to the AMPs.
If you restart FastLoad, it starts loading records from the last checkpoint, so some of those rows are sent a second time. These are caught as duplicate rows when the data is sorted, and FastLoad discards them.
This restart logic is the reason FastLoad will not load duplicate rows into a MULTISET table: it must assume they are duplicates caused by the restart. So FastLoad can load data into a MULTISET table (a table that allows duplicate rows), but it will never load the duplicate rows themselves.


How many types of Skew exist?
Answer

Skew exists when Teradata resources (CPU, AMP, IO, disk, etc.) are used unequally. There are three major types of skew: CPU skew, AMP/data skew and IO skew.

- Data skew: data is not distributed equally across all the AMPs.
- CPU skew: some AMPs/sessions consume far more CPU than others.
- IO skew: some AMPs/sessions perform far more IO operations than others.


Why does varchar occupy 2 extra bytes?
Answer

The two bytes hold the binary length of the field, i.e. the exact number of characters actually stored in the VARCHAR.


What is the difference between a star schema and FSLDM?
Answer

Star Schema
--> Denormalized for query performance
--> A star schema has one fact table and many dimension tables to store the related data.

FSLDM (Financial Services Logical Data Model)
--> In 3rd normal form
--> Proposed and designed by NCR to cater to the needs of the financial sector, especially the banking domain.
--> It has many fact tables and dimension tables.
--> The FSLDM can be customized to cater to a bank's specific needs.
Ex. The major tables include Party (customers of the bank), Product (services offered by the bank), Event (an event that occurred, e.g. an ATM transaction), Agreement (a deal between the party and the bank for a product), etc.

Can you load multiple data files for the same target table using FastLoad?
Answer:

Yes, we can load a table from multiple data files in FastLoad.

Before giving the "END LOADING" statement, the user can define the next file path and issue the INSERT SQL for each source file, giving "END LOADING" only once at the end, as sketched below.
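A hedged FastLoad sketch of that approach (tdpid, user, table and file names are placeholders; command placement may vary slightly by version): repeat the DEFINE ... FILE= and INSERT pair for each source file and issue END LOADING only once.

LOGON tdpid/username,password;
SET RECORD VARTEXT "|";
BEGIN LOADING mydb.target_tab ERRORFILES mydb.target_err1, mydb.target_err2;

DEFINE col1 (VARCHAR(10)), col2 (VARCHAR(10)) FILE = file1.txt;
INSERT INTO mydb.target_tab VALUES (:col1, :col2);

DEFINE col1 (VARCHAR(10)), col2 (VARCHAR(10)) FILE = file2.txt;
INSERT INTO mydb.target_tab VALUES (:col1, :col2);

END LOADING;
LOGOFF;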


Explain the types of redistribution of data that happen when joining columns from two tables
Case 1 - PI = PI joins
Case 2 - PI = non-index column joins
Case 3 - non-index = non-index column joins
Answer:

Case 1 - There is no redistribution of data across AMPs. The joins are AMP-local, since the matching rows are already on the same AMP and need not be redistributed. These joins on a unique primary index are very fast.
Case 2 - The data from the second table is redistributed across all AMPs, since the join is on a PI vs. a non-index column. The ideal scenario is that the smaller table is redistributed to join with the large table's rows on the same AMP.
Case 3 - The data from both tables is redistributed across all AMPs. This is one of the longest-running types of query; care should be taken that statistics are collected on these join columns.

How to skip the header row in a FastLoad script?

RECORD 2;   /* this skips the first record (header row) in the source file */
DEFINE ...
How to find the list of indexes in Teradata?


IndexType   Description
P           Nonpartitioned Primary
Q           Partitioned Primary
S           Secondary
J           Join index
N           Hash index
K           Primary key
U           Unique constraint
V           Value-ordered secondary
H           Hash-ordered ALL covering secondary
O           Value-ordered ALL covering secondary
I           Ordering column of a composite secondary index
M           Multi-column statistics
D           Derived column partition statistics
1           Field1 column of a join or hash index
2           Field2 column of a join or hash index

SELECT
databasename, tablename, columnname, indextype, indexnumber, indexname
FROM  
dbc.indices
ORDER BY
databasename,  tablename, indexnumber;


How to identify PPI columns?
Answer:

SELECT databasename, tablename, columnposition, columnname
FROM dbc.indices
WHERE indextype = 'Q'
ORDER BY 1, 2, 3;


What are the scenarios in which a Full Table Scan (FTS) occurs?

1. The WHERE clause of the SELECT statement does not use either a primary index or a secondary index.
2. The SQL statement uses a partial value (LIKE or NOT LIKE) in the WHERE clause.
3. The SQL statement does not contain a WHERE clause.
4. The SQL statement uses a range in the WHERE clause, e.g. (col1 > 40 OR col1 <= 10000).


 
What are the differences between Teradata and ANSI session modes in Teradata?
Answer:

TERADATA MODE
-------------
1. Comparison is not case sensitive
2. CREATE TABLE defaults to SET tables
3. Each transaction is committed implicitly
4. Supports all Teradata commands
5. It follows BTET (Begin and End Transaction) mode


ANSI MODE
-------------
1. Comparison is case sensitive
2. CREATE TABLE defaults to MULTISET tables
3. Each transaction has to be committed explicitly
4. Does not support all Teradata commands
5. It does not follow BTET mode (see the BTEQ sketch below)
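In BTEQ the session mode can be chosen before logging on (a small sketch; tdpid, username and password are placeholders, and the .SET command must be issued before .LOGON):

/* ANSI mode */
.SET SESSION TRANSACTION ANSI;
.LOGON tdpid/username,password;

/* Teradata (BTET) mode */
.SET SESSION TRANSACTION BTET;
.LOGON tdpid/username,password;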


How do you create a materialized view in Teradata?
Answer:

There is no such thing as a "materialized view" in Teradata. The equivalent in Teradata would be a join index (or aggregate index) with a corresponding view put on top of it. The command to create one is "CREATE JOIN INDEX...(lots of options)".

Join indices are maintained when the data is maintained. They cannot be accessed directly, but are invoked when the optimizer determines it is beneficial. Even though you cannot invoke a join index directly, you create a view that looks like the join index and in that way, it would be similar to having direct access. However, this approach does not guarantee that the join index will be used.
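A hedged sketch with made-up tables: an aggregate join index plus a view carrying the same definition; the optimizer may then satisfy queries against the view from the join index.

CREATE JOIN INDEX mydb.sales_ji AS
SELECT store_id, SUM(sale_amt) AS tot_sales
FROM mydb.sales
GROUP BY store_id
PRIMARY INDEX (store_id);

REPLACE VIEW mydb.v_sales_summary AS
SELECT store_id, SUM(sale_amt) AS tot_sales
FROM mydb.sales
GROUP BY store_id;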


What is the difference between a user and a database in Teradata?
Answer:

- A user is essentially a database with a password; a database cannot have a password
- Both can contain tables, views and macros
- Both users and databases may or may not hold privileges
- Only users can log on, establish a session with the Teradata database and submit requests (see the sketch below)
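A small sketch (names, passwords and space values are placeholders): both objects get perm space, but only the user gets a password and can log on.

CREATE DATABASE sample_db FROM dbc AS PERM = 100000000;

CREATE USER sample_user FROM dbc AS
  PASSWORD = temp_pwd_1,
  PERM  = 100000000,
  SPOOL = 200000000;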

Tuesday, November 16, 2010

Teradata Performance Tuning - Basic Tips

Performance tuning thumb rules.

Here are very basic steps used to performance-tune (PT) any given query in a given environment. As a pre-requisite, make sure
- the user has proper SELECT rights and the appropriate profile settings
- enough space is available to run and test the queries

1. Run the explain plan (press F6, or prefix the query with EXPLAIN: "EXPLAIN sel * …")
Then look for potential problem indicators like
- No or low confidence
- Product join conditions
- "By way of an all-rows scan" - FTS
- Translate

Also check for
- DISTINCT or GROUP BY keywords in the SQL query
- IN / NOT IN keywords, and check the list of values generated for them

APPROACHES

A. In case of product join scenarios, check for
- proper usage of table aliases
- joins on matching columns
- usage of join keywords - i.e. specifying the type of join (e.g. inner or outer)
- use of UNION in case of "OR" scenarios
- statistics collected on the join columns; this is especially important if the columns you are joining on are not unique.

B. Collect stats (example statements follow this list)
- Run the command "DIAGNOSTIC HELPSTATS ON FOR SESSION"
- Gather information on the columns on which stats have to be collected
- Collect stats on the suggested columns
- Also check for stats missing on the PI, SI or columns used in joins - "HELP STATISTICS <databasename>.<tablename>"
- Make sure stats are re-collected when at least 10% of the data changes
- Remove unwanted stats, or stats which hardly improve the performance of the queries
- Collect stats on columns instead of indexes, since dropping an index drops its stats as well!
- Collect stats on indexes having multiple columns; this might be helpful when these columns are used in join conditions
- Check that stats are re-collected for tables whose structure has changed
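A few illustrative statements for the checks above (database, table and column names are placeholders):

DIAGNOSTIC HELPSTATS ON FOR SESSION;                          /* add stats recommendations to EXPLAIN output */
HELP STATISTICS mydb.sales;                                   /* see what is already collected, and when */
COLLECT STATISTICS ON mydb.sales COLUMN (sale_dt);            /* single column used in joins/filters */
COLLECT STATISTICS ON mydb.sales INDEX (store_id, sale_dt);   /* multi-column index, if such an index exists */
DROP STATISTICS ON mydb.sales COLUMN (old_col);               /* remove stats that no longer help */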

C. Full table scan scenarios
- Try to avoid FTS scenarios, as it might take a very long time to access all the data on every AMP in the system
- Make sure an SI is defined on the columns which are used as part of joins or as an alternate access path
- Collect stats on SI columns, otherwise the optimizer might still go for an FTS even when an SI is defined on that particular column

2. If intermediate tables are used to store results, make sure that
- they have the same PI as the source and destination tables

3. Tune to get the optimizer to join on the primary index of the largest table, when possible, to ensure that the large table is not redistributed across the AMPs.

4. For a large list of values, avoid using IN / NOT IN in SQL. Write the large list of values to a temporary table and use this table in the query, as sketched below.
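For example (a sketch with made-up names), the large value list can be loaded into a volatile table and joined, instead of being inlined with IN:

CREATE VOLATILE TABLE id_list
( cust_id INTEGER )
PRIMARY INDEX (cust_id)
ON COMMIT PRESERVE ROWS;

INSERT INTO id_list VALUES (1001);
INSERT INTO id_list VALUES (1002);   /* load the rest of the list, e.g. via a utility */

SELECT t.*
FROM big_table t
INNER JOIN id_list l
ON t.cust_id = l.cust_id;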

5. Know when to use EXISTS / NOT EXISTS conditions: they ignore unknown comparisons (e.g. a NULL value in the column results in unknown), and hence can lead to inconsistent results.

6. Inner vs. outer joins
Check which join works efficiently in the given scenario. Some examples:
- Outer joins can be used when a large table is joined with small tables (like a fact table joining with a dimension table on a reference column)
- Inner joins can be used when we need only the actual matching data, so no extra data is loaded into spool for processing
Please note, for outer join conditions (see the sketch below):
1. The filter condition for the inner table should be placed in the "ON" clause
2. The filter condition for the outer table should be placed in the "WHERE" clause
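A short sketch of notes 1 and 2 (hypothetical customer/orders tables): the inner-table filter sits in the ON clause so unmatched outer rows are kept, while the outer-table filter sits in WHERE:

SELECT c.cust_id, o.order_amt
FROM customer c
LEFT OUTER JOIN orders o
ON  c.cust_id = o.cust_id
AND o.order_type = 'WEB'        /* inner-table filter stays in ON */
WHERE c.region = 'APAC';        /* outer-table filter stays in WHERE */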

DIAGNOSTIC HELPSTATS

One of my favorite commands, and the most useful of the lot, is DIAGNOSTIC HELPSTATS.
This command is very useful in helping the user understand which columns should have
stats collected on them, so the optimizer can select the best plan.

  To include the stats-collection recommendations in the explain plan:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

At the end of the explain text, the recommended statistics for collection will appear as follows:

/*BEGIN RECOMMENDED STATS ->
 16) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_SIZE".  (HighConf)
 17) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_CODE".  (HighConf)
 18) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_DESC".  (HighConf) */
If you want EXPLAIN to stop showing recommendations for collecting stats, use the following:

DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;

DIAGNOSTIC HELPSTATS has some drawbacks:

  • It does not give any indication of stale stats
  • The stats it recommends should still be chosen carefully
  • Too many stats on a given table can impact batch runs of scripts and increases the overhead of stats maintenance
  • If recommended stats don't show any improvement in performance, DROP them!

Wednesday, November 10, 2010

Difference between count(*) and count(1) ?



COUNT is one of the most important functions used in any database, but many don't know that the result we get from COUNT can differ based on how it is used. Let us consider count(*) and count(1) and understand where the difference lies.

Say, for a given table with this data:

X       Y
----    ----
21      31
Null    32
Null    33
41      34
Null    Null

And this query:
     select count(*), count(x), count(1) from table;

Results: 5, 2, 5

* Count(*) counts all rows, including rows that contain only NULLs. It is the row count of the table.
* Count(x) counts all non-NULL occurrences of values in column x. If x has NULL values, count(x) will be less than count(*).
* Count(1) counts the occurrences of a "pseudo column" created by the constant 1. Since the constant 1 is never NULL, this behaves like count(*), i.e. the row count.

Wednesday, November 3, 2010

List of useful Data dictionary views

A list of useful data dictionary views which might come in handy in various situations!

1. DBC.Users
This view gives information about the current user.

2. DBC.SessionInfo
This view gives information about
- the users currently logged in and their session details

3. DBC.Databases
This view lists all the databases present in the given Teradata system. It also contains useful information like
- CreatorName
- OwnerName
- PermSpace
- SpoolSpace
- TempSpace

4. DBC.Indices
It gives information on the indexes created for a given table.

5. DBC.Tables
It gives information about all the tables (T), views (V), macros (M), triggers (G) and stored procedures (P).
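For example, a quick way to list the objects of each kind in a given database (replace 'mydb' with the database of interest):

SELECT databasename, tablename, tablekind
FROM dbc.tables
WHERE databasename = 'mydb'
ORDER BY tablekind, tablename;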

6. DBC.IndexConstraints
It provides information about partitioned primary index constraints.
'Q' indicates a table with a PPI.

7. DBC.DiskSpace
It provides information about disk space usage (including spool) for any database or account.
SELECT      DatabaseName
,CAST (SUM (MaxPerm) AS FORMAT 'zzz,zzz,zz9')
        ,CAST (SUM (CurrentPerm) AS FORMAT 'zzz,zzz,zz9')
        ,CAST (((SUM (CurrentPerm))/
                NULLIFZERO (SUM(MaxPerm)) * 100)
                AS FORMAT 'zz9.99%') AS "% Used"
FROM   DBC.DiskSpace
GROUP BY   1
ORDER BY   4 DESC ;

8. DBC.TableSize
It provides information about disk space usage (excluding spool) for any database, table or account
SELECT      Vproc
        ,CAST (TableName
                 AS FORMAT 'X(20)')
        ,CurrentPerm
        ,PeakPerm
FROM   DBC.TableSize
WHERE DatabaseName = USER
ORDER BY           TableName, Vproc ;

9. DBC.AllSpace
It provides information about disk space usage (including spool) for any database, table, or account.
SELECT      Vproc
        ,CAST (TableName AS
                FORMAT 'X(20)')
        ,MaxPerm
        ,CurrentPerm
FROM   DBC.AllSpace
WHERE DatabaseName = USER
ORDER BY   TableName, Vproc ;

10. DBC.ColumnStats, DBC.IndexStats and DBC.MultiColumnStats
These views are used to find statistics information on given tables.