Tuesday, March 29, 2011

Partitioned Primary Index (PPI)

Partitioned Primary Index (PPI)

Partitioned Primary Index is one of the unique features of Teradata, which allows access of portion of data of large table. This reduces the overhead of scanning the complete table thus improving performance. 

PPI works by hashing rows to different virtual AMPs, as is done with a normal PI.  PPI does not alter data distribution, it only creates partitions on data already distributed based on PI.

Usually PPI's are defined on a table in order to increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes.

Partitions are usually defined based on Range or Case as follows.
Two functions, RANGE_N and CASE_N, can be used to simplify the specification of a partitioning expression.

1. Partition by CASE
CREATE      TABLE ORDER_Table
(
ORD_number  integer NOT NULL,
customer_number integer NOT NULL,
order_date  date ,
order_total integer
)
PRIMARY INDEX (customer_number)
PARTITION BY case_n (
            order_total < 10000 ,
            order_total < 20000 ,
            order_total < 30000,
NO           CASE     OR        UNKNOWN ) ;

2. Partition by Range - example using date range
CREATE TABLE ORDER_Table
(
ORD_number  integer NOT NULL,
customer_number integer NOT NULL,
order_date  date ,
order_total integer
)
PRIMARY INDEX (customer_number)
PARTITION BY range_n (
            Order_date BETWEEN date '2010-01-01'       AND      date '2010-12-01'
            EACH interval '1' month,
NO RANGE
        OR  UNKNOWN);

P.S: If we use NO RANGE or NO CASE - then all values not in this range will be in a single partition.
If we specify UNKNOWN, then all null values will be placed in this partition


PPI improves performance as follows:
  • Automatic optimization occurs for queries that specify a restrictive condition on the partitioning column.
  • Uses partition elimination to improve the efficiency of range searches when, for example, the searches are range partitioned.
  • Only the rows of the qualified partitions in a query need to be accessed avoid full table scans.
  • Provides an access path to the rows in the base table while still providing efficient join Strategies
  • If the same partition is consistently targeted, the part of the table updated may be able to  fit largely in cache, significantly boosting performance
  • PPI based tables have advantage during Purging stages. Since purging based on partition is very fast and deletion happens quickly.

PPI also comes with some disadvantages like
  • The PI access disadvantage occurs only when the partitioning column is not part of the PI. In this situation, a query specifying a PI value, but no value for the partitioning column, must look in each partition for that value, instead of positioning directly to the first row for the PI value.
  • Another disadvantage is that when another table (without PPI) is joined with PPI table on PI=PI condition. If one of the tables is partitioned, the rows won't be ordered the same, and the task, in effect, becomes a set of sub-joins, one for each partition of the PPI table. This type of join is sliding window join

Limitations of Partitioned Primary Index (PPI) :
  •  Primary index of PPI table has to be 
    • Non unique PI, if PPI column is not part of Index, since enforcing a Unique PI would require checking for a duplicate key value in each partition, which would be very expensive.
  •  Primary Index of PPI table can be Unique, if PPI is part of UPI. This will result in checking for unique constraint in same partition.
  •  PPI cannot be defined on Global temporary tables and Volatile tables and also on compressed join indices
  • PPI Table rows occupy two extra bytes compared to NPPI table row, as these extra bytes store the partition number for each row .
  • PPI table rows are four bytes wider if value compression is specified for the table. 

Since PPI , results in lot of partitions , there is a little overhead to user/dba.  He has to regularly run collect stats on the PPI column.

It is beneficial to collect stats on Partition column .Collecting stats on the system derived column Partition is faster because rather than reading all the base table rows for collecting information, it usually just scans the cylinder index for that PPI table.By doing so, it avoids all unnecessary partitions , thus speeding up the access.
Help stats tablename column PARTITION; -- used to list partitions in table and their details
Collect stats on tablename column PARTITION; -- refresh partition details

13 comments:

  1. thanks for the summary on PPI!

    ReplyDelete
  2. HI SIR THIS IS CHANTI....
    SIR PLS TELL ME HOW TO DELETE DUP RECORDS IN MULTISET TABLE WITHOUT USING ANY TABLES,,, PLZ TELL ME.. THIS QN I FACED IN COGNIGENT INTERVIEW..

    ReplyDelete
    Replies
    1. This indeed is tricky and indeed not possible without second table .

      Teradata versions prior to V2R5 had some function similar to rowid( concept in Oracle) to delete the records .. It has been taken out in later versions due to performance issues .

      The closet approach is using row_number or rank with qualify to pick non duplicate records

      There is one more myth about Multiset & UPI combination. Now a days many clients are going for Multiset & UPI to achieve non duplication of records in table.

      hope this helps!!

      Vinay

      Delete
  3. Hi Could you pls explain what is the difference between value ordered NUSI and PPI?

    ReplyDelete
    Replies
    1. Please refer to my post on usage of secondary Index. It will answer your question :-)

      Vinay

      Delete
  4. This comment has been removed by a blog administrator.

    ReplyDelete
  5. hi.. could u pls explain why primary index defined on column otherthan ppi column would be a full table scan?

    ReplyDelete
  6. Hi ...
    the tutorial is excellent.But PPI can be defined on global temp tables ,non compressed join tables and volatile table also.
    Please check and respond if i am wrong.

    ReplyDelete
    Replies
    1. Hi Anwaar,

      Let me explain as to why PPI cannot be created on GTT's, volatile and Join tables..

      - One of the reasons why Teradata dint implement them is complexity and also the purpose of temporary tables is limited to session and minimal entries are made into data dictionary.
      Having PPI on these tables causes performance issues and unnecessary maintenance of details in Data dictionary.

      The benefits of having PPI on these tables is minimal and since data is not permanent , it doesn't make sense if PPI are present in these tables.

      I would really doubt if Teradata would implement PPI on them in future..

      Br,
      Vinay SHet

      Delete
  7. please tell me .....overhead of secondary index is how many bytes?

    ReplyDelete
    Replies
    1. Hi,

      Secondary index will need some space on amp to store the subtable
      Also stats needs to be upto date for secondary indicies ..especially for non unique

      But there is no overhead in bytes like PPI since secondary index is not used for generating rowhash for
      Primary index columns,,

      Br
      Vinay SHet

      Delete
  8. thank you for your info........!!!!!
    really helpful....

    with regards
    B.THEJA

    ReplyDelete