## Sunday, November 4, 2012

### COLLECTION OF STATISTICS USING SAMPLE

Here is the step by step process on how to collect statistics using SAMPLE

SAMPLE statistics is collected using Random amp sampling  and it is  recommended to use when we dont have stats collected on  index or set of columns.

As a preparation step we check whether table is suitable for SAMPLE STATISTICS collection using following query
/* suggested to use when data skew is less and also when more rows are there in table than number of amps*/
SEL
TAB1.A AS TABLECOUNT,
TAB2.B AS AMPCOUNT,
CASE WHEN TABLECOUNT > AMPCOUNT THEN ' RANDOM AMP SAMPLING CAN BE SUGGESTED'
ELSE 'RANDOM AMP SAMPLING NOT NEEDED'
END
FROM
(SEL COUNT (*) AS A FROM TABLENAME) TAB1,
(SEL HASHAMP () +1 AS B) TAB2;

Below is step by step process on Collecting statistics using SAMPLING

1.         Check if Stats are already collected on the column of table for which Random AMP sampling is to be considered using
HELP STATISTICS ON <YOUR_DB>.<YOUR_TB>;

If YES, then this situation is tricky and do you still want to try out SAMPLING or look for other recommendations is up-to you..

2.         If NO then, check if column is highly skewed using following Query.
SELECT HASHAMP (HASHBUCKET (HASHROW (<YOUR_COLUMN>))) ,
COUNT (*)
FROM <YOUR_DB>.<YOUR_TB>
GROUP BY 1;
If you see that Data is equally distributed among all the amps (Variance of +-5 % is  accepted),

If there is large amount of DATASKEW in one AMP, then SAMPLING is not a good option

3.         If you don’t find data skew on any particular AMP then,
Run sample statistics on column of particular table as follows.
COLLECT STATICSTICS ON <YOUR_DB>.<YOUR_TB> COLUMN (<YOUR_COLUMN>) USING SAMPLE;

4.         Check the performance of query after running sample STATS, also note the time taken for collecting sample stats.

5.         If not satisfied with performance, try to run full statistics on columns and measure performance and time taken to collect full stats

6.         Decide which is the best option “FULL STATS or SAMPLE“considering factors like
• Performance,
• Time taken for statistics collection on scenarios,
• Table size,
• Data skew,
• Frequency of table being loaded
• How many times this table would be used in your environment.

Advantages of Collecting Stats using Sample
1.         Only a sample of table rows is scanned. Default being 2%.
It is based on random amp sampling estimate of total rows.  If you want to override the  default value for particular session then use,
DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=n" ON FOR SESSION;

2.         It uses less CPU and I/O resources compared to full statistics hence saving considerable  amount of time and resources.

It is not recommended to used for
1.         Columns which are not indexed
2.         Indexes which has lot of duplicates or non unique combinations
3.         for small tables like dimension /key tables
4.         for tables that have greater data skew.

Please note that Sample statistics cannot be collected on
1.         Global temporary tables
2.         Join indexes