Wednesday, November 14, 2012

Teradatahelp reaches yet another milestone!!


I am very happy to say that my blog www.teradatahelp.com has reached yet another milestone since its launch. Today it is celebrating 300k visits.

I want to thank you all for your support and for the comments that pushed this blog to become one of the most searched on the Web.*

Keep coming back for more learning tips and suggestions that could help you in your day-to-day work and in making the best of Teradata to build a better data warehouse.

*(Alexa 2012 August ratings)

Sunday, November 4, 2012

COLLECTION OF STATISTICS USING SAMPLE


Here is the step-by-step process for collecting statistics using SAMPLE.

SAMPLE statistics are collected using random AMP sampling, and they are recommended when we don't already have statistics collected on an index or set of columns.

As a preparation step, we check whether the table is suitable for SAMPLE statistics collection using the following query:
/* suggested to use when data skew is low and the table has more rows than the number of AMPs */
            SEL
            TAB1.A AS TABLECOUNT,
            TAB2.B AS AMPCOUNT,
            CASE WHEN TABLECOUNT > AMPCOUNT THEN 'RANDOM AMP SAMPLING CAN BE SUGGESTED'
                 ELSE 'RANDOM AMP SAMPLING NOT NEEDED'
            END
            FROM
            (SEL COUNT (*) AS A FROM TABLENAME) TAB1,
            (SEL HASHAMP () + 1 AS B) TAB2;


Below is the step-by-step process for collecting statistics using SAMPLE.

1.         Check whether statistics are already collected on the column of the table for which random AMP sampling is to be considered, using
      HELP STATISTICS ON <YOUR_DB>.<YOUR_TB>;

If YES, then the situation is tricky; whether you still want to try out SAMPLING or look for other recommendations is up to you.
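
If you do decide to retry with sampling, a minimal sketch (using the same placeholder names as above) is to drop the existing statistics first and then recollect:
            DROP STATISTICS ON <YOUR_DB>.<YOUR_TB> COLUMN (<YOUR_COLUMN>);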

2.         If NO, then check whether the column is highly skewed using the following query.
      SELECT HASHAMP (HASHBUCKET (HASHROW (<YOUR_COLUMN>))) AS AMP_NO,
      COUNT (*)
      FROM <YOUR_DB>.<YOUR_TB>
      GROUP BY 1;
If you see that the data is distributed roughly equally among all the AMPs (a variance of +/- 5% is acceptable), proceed to the next step.

If there is a large amount of data skew on one AMP, then SAMPLING is not a good option.
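
A rough way to quantify that skew: here is a minimal sketch (placeholder names as above) that compares the busiest AMP against the per-AMP average; a SKEW_PCT much beyond the 5% mentioned earlier suggests the column is too skewed for SAMPLE:
            SEL MAX(AMP_ROWS) AS MAX_AMP_ROWS,
            AVG(AMP_ROWS * 1.0) AS AVG_AMP_ROWS,
            (MAX(AMP_ROWS) - AVG(AMP_ROWS * 1.0)) * 100.0 / AVG(AMP_ROWS * 1.0) AS SKEW_PCT
            FROM
            (SEL HASHAMP (HASHBUCKET (HASHROW (<YOUR_COLUMN>))) AS AMP_NO,
            COUNT (*) AS AMP_ROWS
            FROM <YOUR_DB>.<YOUR_TB>
            GROUP BY 1) T;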

3.         If you don't find data skew on any particular AMP, then
            run sample statistics on the column of the particular table as follows.
            COLLECT STATISTICS ON <YOUR_DB>.<YOUR_TB> COLUMN (<YOUR_COLUMN>) USING SAMPLE;
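
To verify the collection, you can re-run the command from step 1; the column should now be listed:
            HELP STATISTICS ON <YOUR_DB>.<YOUR_TB>;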

4.         Check the performance of the query after running sample stats; also note the time taken for collecting the sample stats.

5.         If not satisfied with the performance, try running full statistics on the columns and measure the performance and the time taken to collect full stats.
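
A minimal sketch of the full (non-sampled) collection, using the same placeholder names:
            COLLECT STATISTICS ON <YOUR_DB>.<YOUR_TB> COLUMN (<YOUR_COLUMN>);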

6.         Decide which is the best option, FULL STATS or SAMPLE, considering factors like:
  • Performance,
  • Time taken for statistics collection in each scenario,
  • Table size,
  • Data skew,
  • Frequency with which the table is loaded,
  • How many times the table would be used in your environment.


Advantages of Collecting Stats using Sample
1.         Only a sample of the table rows is scanned, the default being 2%.
            The estimate is based on random AMP sampling of the total rows. If you want to override the default value for a particular session, use the following (a complete session example is sketched after this list):
            DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=n" ON FOR SESSION;

2.         It uses less CPU and I/O resources compared to full statistics, hence saving a considerable amount of time and resources.
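
Putting the two together, a minimal sketch for one session (n = 10 is an arbitrary illustration, not a recommendation):
            /* override the default sample size for this session only, then collect */
            DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=10" ON FOR SESSION;
            COLLECT STATISTICS ON <YOUR_DB>.<YOUR_TB> COLUMN (<YOUR_COLUMN>) USING SAMPLE;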

It is not recommended for use on
1.         Columns which are not indexed
2.         Indexes that have many duplicates or non-unique combinations
3.         Small tables like dimension/key tables
4.         Tables that have greater data skew.

Please note that Sample statistics cannot be collected on
1.         Global temporary tables
2.         Join indexes

Comments and Suggestions are welcome!!!

Thursday, June 7, 2012

How to start SQL assistant using Batch Mode or using Command Prompt?

This is indeed interesting, as there is not much documentation on how you can use SQL Assistant in batch mode or from the Command Prompt in Windows.

For most batch-mode / Command Prompt (interactive mode) work we would use BTEQ, due to its flexibility and ease of use on either Windows or UNIX systems.

There are some cases where you would like to use SQL Assistant to run a set of queries over a period of time or for repetitive tasks. In such cases we could use the following commands.

Go to Windows Start --> Run -->
Enter the following command

Sqla -c Test1 -f "c:\my queries\emplst.sql" -e "c:\my queries\emplst_output.log"

Explanation
Sqla
It is used to invoke SQL Assistant from its home directory.

-c Test1
-c is used to establish a connection to the Test1 server.
Please make sure to set up the ODBC connection string using the ODBC Administrator, entering the IP address, username & password.

-f "c:\my queries\emplst.sql" 
The parameter –f is used to open a file in the given path directly on to Sql assistant.
In above scenario, we are trying to open emplst.sql which is present in c:\my queries

As a general rule I would enclose filenames in double quotes just in case they contain spaces or special characters.

-e "c:\my queries\emplst_output.log"
The parameter –e is used to export the results/ output into the given file.

Sometimes we need to use -p ODBC or -p TD, depending on the version of TD SQL Assistant.
I would suggest using the -p parameter whenever running the command, so as to identify the data source as either ODBC or Teradata.Net.


One could create a reusable batch file in Windows and run this command with our SQL file, recording the output in a log file for reporting or whatever you want to do!
Steps:
1.      Create a file named opensqla.bat
2.      Enter the following command
Start Sqla -c Test1 -f "c:\my queries\emplst.sql" -e "c:\my queries\emplst_output.log"
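
Putting it together with the -p suggestion above, a minimal sketch of what opensqla.bat could contain (the data source Test1 and the file paths are just the examples used earlier):

@echo off
rem Launch SQL Assistant against the Test1 ODBC data source,
rem run the queries in emplst.sql and export the results to a log file
start Sqla -p ODBC -c Test1 -f "c:\my queries\emplst.sql" -e "c:\my queries\emplst_output.log"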

Thursday, February 23, 2012

Difference between soft referential integrity and hard referential integrity in Teradata


One of my colleagues had come across the issue of soft RI vs hard RI. I was curious to find that although this concept is widely used in Teradata, many of us, including me, lack knowledge of how it works :-)

So I did some analysis of this concept with an example and wanted to share it with everyone over here.



Hard Referential Integrity -

It involves manual creation of referential links like "foreign key ... references". Please find an example below between the department & employee tables.

-- create the department table & employee table and have referential integrity set explicitly between them as follows

Create table department
(
Dept_no integer not null,
Dept_name varchar(30),
Dept_loc varchar(50)
)
Unique primary index(dept_no);

--- insert some sample records into  Department table

insert into department values (101,'Sales','mumbai');
insert into department values (102,'Accounts','mumbai');
insert into department values (103,'Human Resources','mumbai');


Create table Employee
(
Emp_num integer not null,
Emp_name varchar(30) not null,
Dept_no integer not null,
Constraint foreign_emp_dept foreign key ( Dept_no)  references department(dept_no)
)
unique primary index(emp_num);

Now, try to insert the following record into the employee table as below.

insert into Employee values (123456,'swapnil',104);
-- the above statement fails because the employee table refers to the dept_no column in the department table to check for the value 104. Since it won't be able to find the value, it will fail with a referential integrity error.

This type of referential integrity is called Hard referential integrity.



SOFT referential integrity.
( thanks to James Lee  for correcting me :) )

This is handled at the system level by the optimizer, and we need to specify "with no check option". The constraint is declared but never actually checked; the optimizer simply trusts it.
Create table department
(
Dept_no integer not null,
Dept_name varchar(30),
Dept_loc varchar(50)
)
Unique primary index(dept_no);

--- insert some sample records into  Department table

insert into department values (101,'Sales','mumbai');
insert into department values (102,'Accounts','mumbai');
insert into department values (103,'Human Resources','mumbai');


Create table Employee
(
Emp_num integer not null,
Emp_name varchar(30) not null,
Dept_no integer not null
,Constraint foreign_emp_dept foreign key ( Dept_no)  references WITH NO CHECK OPTION department(dept_no)  
)
unique primary index(emp_num);

insert into Employee values (123456,'vinay',101);
insert into Employee values (123457,'Sachin',104);
-- both inserts succeed, even though dept_no 104 does not exist in the department table
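
Because nothing is enforced, a periodic sanity check is useful. Here is a minimal sketch that finds the orphan rows (employees whose dept_no has no match in department):

sel emp.emp_num, emp.emp_name, emp.dept_no
from employee emp
left outer join department dept
on dept.dept_no = emp.dept_no
where dept.dept_no is null;   /* returns the 'Sachin' row pointing at dept_no 104 */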


At query time, we can get the filtering effect of referential integrity by using an inner join between the department & Employee tables:

sel
dept.dept_no,
emp.emp_num,
emp.emp_name
from
employee emp
inner join   /* here the inner join does the soft-RI filtering & picks only matching rows */
department dept
on
dept.dept_no = emp.dept_no;
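
The real payoff of soft RI is that the optimizer trusts it for join planning. Here is a hedged sketch (the view name emp_dept_v is my own, not from the example above): when a query against such a view selects only employee columns, the optimizer can eliminate the join to department entirely, since the constraint tells it every employee row has a match.

create view emp_dept_v as
sel emp.emp_num, emp.emp_name, emp.dept_no, dept.dept_name
from employee emp
inner join department dept
on dept.dept_no = emp.dept_no;

/* only employee columns are referenced, so the optimizer can skip reading department, relying on the soft RI constraint */
sel emp_num, emp_name from emp_dept_v;

Keep in mind the caveat: since the constraint is never checked, the orphan 'Sachin' row (dept_no 104) would still be returned by such a query.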

The advantage of Soft RI over Hard RI is that there is no overhead of maintaining referential integrity tables in the case of Soft RI; this helps performance in the case of tables with a large number of rows, and no constraints are placed on referenced columns when their values change.

Soft referential integrity is ideal in the case of large data warehouses, because the ETL mechanisms already cleanse unnecessary records.

Well, it's up to you to decide which is the better option, considering the restrictions and the performance concerns as these tables tend to grow.