Tuesday, March 29, 2011

Partitioned Primary Index (PPI)

Partitioned Primary Index (PPI)

Partitioned Primary Index is one of the unique features of Teradata, which allows access of portion of data of large table. This reduces the overhead of scanning the complete table thus improving performance. 

PPI works by hashing rows to different virtual AMPs, as is done with a normal PI.  PPI does not alter data distribution, it only creates partitions on data already distributed based on PI.

Usually PPI's are defined on a table in order to increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes.

Partitions are usually defined based on Range or Case as follows.
Two functions, RANGE_N and CASE_N, can be used to simplify the specification of a partitioning expression.

1. Partition by CASE
ORD_number  integer NOT NULL,
customer_number integer NOT NULL,
order_date  date ,
order_total integer
PRIMARY INDEX (customer_number)
            order_total < 10000 ,
            order_total < 20000 ,
            order_total < 30000,
NO           CASE     OR        UNKNOWN ) ;

2. Partition by Range - example using date range
ORD_number  integer NOT NULL,
customer_number integer NOT NULL,
order_date  date ,
order_total integer
PRIMARY INDEX (customer_number)
PARTITION BY range_n (
            Order_date BETWEEN date '2010-01-01'       AND      date '2010-12-01'
            EACH interval '1' month,
        OR  UNKNOWN);

P.S: If we use NO RANGE or NO CASE - then all values not in this range will be in a single partition.
If we specify UNKNOWN, then all null values will be placed in this partition

PPI improves performance as follows:
  • Automatic optimization occurs for queries that specify a restrictive condition on the partitioning column.
  • Uses partition elimination to improve the efficiency of range searches when, for example, the searches are range partitioned.
  • Only the rows of the qualified partitions in a query need to be accessed avoid full table scans.
  • Provides an access path to the rows in the base table while still providing efficient join Strategies
  • If the same partition is consistently targeted, the part of the table updated may be able to  fit largely in cache, significantly boosting performance
  • PPI based tables have advantage during Purging stages. Since purging based on partition is very fast and deletion happens quickly.

PPI also comes with some disadvantages like
  • The PI access disadvantage occurs only when the partitioning column is not part of the PI. In this situation, a query specifying a PI value, but no value for the partitioning column, must look in each partition for that value, instead of positioning directly to the first row for the PI value.
  • Another disadvantage is that when another table (without PPI) is joined with PPI table on PI=PI condition. If one of the tables is partitioned, the rows won't be ordered the same, and the task, in effect, becomes a set of sub-joins, one for each partition of the PPI table. This type of join is sliding window join

Limitations of Partitioned Primary Index (PPI) :
  •  Primary index of PPI table has to be 
    • Non unique PI, if PPI column is not part of Index, since enforcing a Unique PI would require checking for a duplicate key value in each partition, which would be very expensive.
  •  Primary Index of PPI table can be Unique, if PPI is part of UPI. This will result in checking for unique constraint in same partition.
  •  PPI cannot be defined on Global temporary tables and Volatile tables and also on compressed join indices
  • PPI Table rows occupy two extra bytes compared to NPPI table row, as these extra bytes store the partition number for each row .
  • PPI table rows are four bytes wider if value compression is specified for the table. 

Since PPI , results in lot of partitions , there is a little overhead to user/dba.  He has to regularly run collect stats on the PPI column.

It is beneficial to collect stats on Partition column .Collecting stats on the system derived column Partition is faster because rather than reading all the base table rows for collecting information, it usually just scans the cylinder index for that PPI table.By doing so, it avoids all unnecessary partitions , thus speeding up the access.
Help stats tablename column PARTITION; -- used to list partitions in table and their details
Collect stats on tablename column PARTITION; -- refresh partition details

Monday, March 28, 2011

Maximum limits in v2r5/ v2r6 databases

Maximum limits in v2r5/ v2r6 databases

The database specifications in the following table apply to a single Teradata database.

Max number of journal tables per database
Max number of data tables per database
 4.2 x 10 9
Max number of columns per table or view
± 2048
Max number of columns created over the life of a table
Max number of rows per table
 Limited by disk capacity
Max row size
 Approx 64,000 bytes
Max field size
 Approx 64,000 bytes
Max database object name size
 30 bytes
Max number of columns per primary or secondary index
Max SQL request size
1 MB
Max SQL response size
 Approx 64,000 bytes
Max expanded text size for macros and views
 2 MB
Max SQL title size
 60 characters
Max number of characters in a string constant
± 32,000
Max number of columns per DML statement ORDER BY clause
Max number of primary indexes per table
Max number of partitions for a partitioned primary index
 ± 65,535
Max numbers of secondary, hash, and join indexes, in any combination, per table

Found this very useful  information elsewhere on net, felt like sharing it here.!!

Tuesday, March 22, 2011

Teradata Interview questions - Part 7

Teradata Interview questions - Part 7

How do you set the session mode parameters in BTEQ?
.set session transaction ANSI /* this is to set ANSI mode */
.set session transaction BTET /* this is to set Teradata transaction mode */

These commands have to be entered before logging to the session

What is the command in BTEQ to check for session settings ?
The BTEQ .SHOW CONTROL command displays BTEQ settings.

How do you submit bteq script (batch mode)?
1. Start the BTEQ, by typing BTEQ
2. Enter the following command
    .run file = BTEQScript.btq
1. Bteq < BTEQScript.btq

BTEQScript.btq contains following

.logon, dbc;
sel top 10 * from dbc.tables;

Is Like comparision case-sensitive in Teradata?

LIKE operator is not case sensitive in Teradata.
Consider the following example  
Select F_name from employee where F_name like '%JO%' ;
The followig query will pick values matchinf with 'JO' and 'jo' as well, since teradata is not case-sensitive

To overcome this problem, a new function called "CASESPECIFIC" is used in TERADATA as follows
Select F_name from employee where F_name (CASESPECIFIC) like '%JO%' ;
What does DROP table command do?

It deletes all data in emp_DATA
Removes the emp_DATA definiation from the data dictionary
Removes all explicit access rights on the table
How do you set default date setting in BTEQ?

There are two default date setting in BTEQ.  They have to be set after logging on to the session
They are

Set session dateform = ANSIDATE; /*format is yyyy-mm-dd */      
Set session dateform = integerdate; /* format is yy/mm/dd   -teradata date format */
Difference between Inner join and outer join?

An inner join gets data from both tables where the specified data exists in both
An outer join gets data from the source table at all times, and returns data from
the outer joined table ONLY if it matches the criteria.
What is multi Insert?

Inserting data records into the table using multiple insert statements.
 Putting a Semi colon in front of the key word INSERT in the next statement rather than
Terminating the first statement with a semi colon achieves it.

Insert into Sales “select * from customer”
; Insert into Loan “select * from customer”;

How to find duplicates in a table?
Group by those fields and select id, count(*) from table group by id having count
(*) > 1

How do you see a DDL for an existing table?
By using show table command.

Which is more efficient GROUP BY or DISTINCT to find duplicates?
With more duplicates GROUP BY is more efficient, if only a few duplicates exist
DISTINCT is more efficient.

Syntax for CASE WHEN statement?
CASE value_expression_1 WHEN value_expression_n THEN scalar_expression_n

Monday, March 21, 2011

Locking table for access?

Locking table for access?

We would have come across this statement in many queries which are run in sensitive environments like PROD, UAT.   They can be used with views or sometimes just for querying purpose.

I wanted to discuss how important this statement would be in real-time /active data warehouses where lot of users will be striking queries on same database at the time.

create  view Employee.view_employ_withLock
locking table Employee.Dept_emp  for access
select   *
from     Employee.Dept_emp ;

By using locking table for access, we make sure that normal "access" lock is applied on table which is required to fetch results. By doing so ,

- There is no waiting for other locks to release since access lock can be applied on table which has read/write lock applied to it
- This will cause the query to execute even when some lock is applied , but  accessing data using this lock might not be consistent as it might result in dirty read due to concurrent write on the same table.

It is always suggested to use “locking table for access" which since they will not block the other users from applying read/write lock on the table.

Fastload locks

Fastload locks are one of the major issues in fastload . I have seen this kind of scenarios in production environment and there are various ways by which we can handle this issue.

When do fastload locks happen?
- When the fastload script results in failure (error code 8 or 12)
- When fastload script is aborted by dba or some other session
- Fastload fails (Bad record, file)
- When you have forgot to add end loading statement in ur fload script

What can be done to release fload locks?

1. Finish the fastload, which means leave your hands off of the error tables and restart  fastload. When it finishes the table lock will be released.
 If the following error persists then do step 2.
Error is: RDBMS error 2652: Operation not allowed: _db_._table_ is being Loaded.

2.  Create a dummy fload job  as following without any content for loading  as  follows.

     logon tdpid/username,password;
     begin loading databasename.tablename;
/*  emtpy loading   to release fastload lock */
     end loading;

3.  If   second step also fails then, drop the empty locked table (because if you can't restart Fastload you most likely dropped the error tables)
   Please note that before using the DROP solution, try to close your already opened fastload session.

4. If still this  error persists and not able to drop the table  :( , then please kill the fload session from pmon/manager , drop the table and rerun actual     fload script .

I hope this solution works, if not then please contact ur DBA or contact me.