Tuesday, June 28, 2011

Did someone say secondary index??

Secondary Index

"last but not the least!!"  is what i would say when it comes to secondary index.  This  is very much true since secondary index is concept when it comes to handy when others doesn't!

A secondary index is one which provides alternate access paths to the data  thus helping in speeding of the queries.They are usually defined on columns which are more frequently used rather than PI column in joining condition.

A secondary index can be defined as unique secondary index or non unique secondary index.
CREATE UNIQUE INDEX (Columns) ON dbname.tablename ;
This creates Unique secondary index on Employees table

CREATE INDEX (Columns) ON dbname.tablename ;
This creates non unique secondary index on employees table.

A worthy tip:
Please make sure to use naming for index when multiple columns are defined  . It would be easy to drop a index  on that table just by using indexname,
create index SI_emp (firstname,lastname) from employees;  --- index created as SI_emp for table employees
Drop index SI_emp on employees;     --- easy to drop index by giving index name

So what exactly does this Secondary index do?? Does it hash the data values based on secondary index and redistribute data? No. It doesn't'

It just creates a subtable for defined secondary index , which contains there columns
-secondary index value
-secondary index rowid (hashed value for SI)
-hash value for primary index(pointing to target amp)

This  subtable is created on each amp  which in turn points to target AMP.   Hence Secondary index can be TWO amp operation for USI and MANY amp operation for NUSI.  So whenever a SI column is used in a query,  based on that column  optimizer quickly reaches subtable by way  of  row hash generated for SI value and using PI row id corresponding to it,  .it would reach target amp.

The secondary index Sub table is key to alternate access paths provided in teradata. With this benefit we end up paying  the price for extra perm space consumed by subtables and overhead for their maintenance. So think wisely before going for Secondary index or use the beauty of dynamic secondary index concept which supports creation of Secondary index when required and dropping of it when not used hence saving considerable amount of space,

I want to share one of the scenarios where we had a table with set of date columns and Only PI defined on that table,  A query was run on particular  date range of that table ,which went on for full table scan taking more than 4 hours to complete. 
so decided to focus on this date column since it looked like  main culprit for performance issue.
Two things strike my mind when date column came into Picture . Should i go for PPI or SI??

1.  In case of PPI,
create table with PPI and repopulate data .  But this adds to overhead since there was  addition of 2 bytes to every row increasing  size of the table by 2*n rows.
Second issue was PPI causing redistribution of PI in partitions resulting in Primary index based queries taking more than normal time.   Joins happening on PPI columns to non PPI table would result in sliding window join which would further decrease the performance.

Most important of all, how frequently was this date column used in queries  .  Since  frequency of usage  was less and seeing drawbacks of PPI , i decided to go on with Secondary index which could be created any time and dropped any time.

Since  date column  is range based , I created value ordered secondary index as follows
create index (date-column) order by values on employees ;

Stats were collected on to ensure  optimizer to pick secondary index.  I could see the performance improvement in query after creation of valued ordered NUSI . Please note that Value ordered secondary index can be created only  as NUSI  and supports date, int,decimal upto 4 bytes.

Please make sure to collect stats on secondary index since optimizer might have to decide whether reading SI subtable is faster than  going for full table scan ( reading all the amps)

I want to point out some Limitations of secondary index
1. Secondary index is not compatible with fastload  as there is overhead for fastload on duplicate secondary index checking with subtable creation
Drop SI before doing fastload on that table and then recreate them,
2.   Table can have upto 32 secondary indices in teradata
3. Multicolumn secondary index can have upto 16 columns defined in a index
4. USI is not supported on Multiload, but NUSI is supported by multiload
5. USI cannot be created on table which already has duplicate data in columns to be defined as  part of USI.
6. SI cannot be created on column which is  already defined as PI.