Monday, November 25, 2013

How to use Netezza External Tables



You can use Netezza's external tables to access data in an external file as if it were a regular database table. When you create an external table, the actual data still sits in the physical external file, but you can query it from Netezza like a normal database table.

You can also use external tables to export data out of a Netezza table into a file.

From Netezza's Data Loading Manual  - An external table allows Netezza to treat an external file as a database table. An external table has a definition (a table schema), but the actual data exists outside of the Netezza appliance database. External tables can be used to access files which are stored on the Netezza host server or, in the case of a remote external table, Netezza can treat a file on a client system as an external table (see REMOTESOURCE option).

Below are 2 examples to show you how to do both data export and "import".

I am using Aginity Workbench for Netezza on my Windows machine (which comes with a Netezza ODBC driver), and I have a text file stored on my local drive at C:\temp\testfile.txt with the following column header and 3 rows:


employeeid,employeename,salary
1,'John Lee',100000
2,'Marty Short', 120000
3,'Jane Mars', 150000


CREATE EXTERNAL TABLE ext_tbl_employees(
employee_id integer, 
employee_name character varying(100), 
salary decimal (10,2))
USING (
dataobject('c:\temp\testfile.txt') 
remotesource 'odbc'
delimiter ','
skiprows 1);

Then in your Aginity Workbench object browser, expand the External Tables folder, and you will see your new external table listed there.

You can then query the table just like a normal database table:


select *
from ext_tbl_employees


You can also create a transient external table, which exists only for the duration of your query; the external table definition does not persist in the system catalog.


--transient external table
SELECT 
employee_id,
employee_name,
salary
FROM EXTERNAL 'c:\temp\testfile.txt' 
(
employee_id integer, 
employee_name character varying(100), 
salary decimal (10,2))
USING (
remotesource 'odbc'
delimiter ','
skiprows 1);

A transient external table is also a very useful way to export data from a Netezza database to a text file. You can export not just an entire table, but the output of any SQL statement. The beauty of it is that you don't have to specify the schema definition of the data, which can save you a lot of typing:

create table mytesttable (
studentid integer,
studentname character varying(100)
);

insert into mytesttable 
select 1, 'Jane Doe'
union
select 2, 'John Smith';

CREATE EXTERNAL TABLE 'c:\Temp\ExportDataTest.csv' USING (remotesource 'ODBC' DELIM ',') AS
SELECT *
from mytesttable;

drop table mytesttable;


Note: if the file already exists, any existing data will be deleted by Netezza before it inserts data.
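Since the source can be any SELECT statement, you can export query results directly. Here is a minimal sketch (the employees table, its columns, and the file path are hypothetical):

CREATE EXTERNAL TABLE 'c:\Temp\SalaryByDept.csv' USING (remotesource 'ODBC' DELIM ',') AS
SELECT dept_id, avg(salary) as avg_salary  -- hypothetical table and columns
FROM employees
GROUP BY dept_id;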

If there are commas in your data and you want to export to CSV, use the escapeChar option:

CREATE EXTERNAL TABLE 'c:\Temp\OfferAttributesNightlyFeed.csv' USING (remotesource 'ODBC' DELIM ',' ESCAPECHAR '\') AS
SELECT *
from COUPONATTRIBUTESTEST;

Netezza Performance Check Queries



--Active Query Monitor--
select current_timestamp, count(*) as active_queries
from _v_qrystat;

--Swap Space Monitor--
select current_timestamp,
       (sum(USED_KB) / 1024)::DECIMAL(9,1) as used_mb
from _vt_swapspace;

--NPS Memory Allocation--
select current_timestamp,
       (select sum(v1.bytes) / (1024*1024)
        from _vt_memory_usage v1
        where hwid = 0)::INTEGER as host_total,
       avg(a.tot)::INTEGER as spu_avg,
       max(a.tot)::INTEGER as spu_max
from (select sum(bytes) / (1024*1024) as tot
      from _vt_memory_usage
      where hwid <> 0
      group by hwid) a;

--Throughput Per Minute--
select current_timestamp,
       case when (max(qh_tsubmit) - min(qh_tsubmit) = 0) then 0
            else (count(*) * (60.0 / (max(qh_tsubmit) - min(qh_tsubmit))::float))::integer
       end as qpm
from _v_qryhist
where qh_tsubmit > current_timestamp - interval '1 minutes';

--Throughput Per Hour--
select current_timestamp,
       case when (max(qh_tsubmit) - min(qh_tsubmit) = 0) then 0
            else (count(*) * (3600.0 / (max(qh_tsubmit) - min(qh_tsubmit))::float))::integer
       end as qph
from _v_qryhist
where qh_tsubmit > current_timestamp - interval '60 minutes';

--Query Performance--
select current_timestamp,
       avg(qh_tend - qh_tsubmit) as avg_total_time,
       avg(qh_tstart - qh_tsubmit) as avg_queue_time
from _v_qryhist
where qh_tend::timestamp > current_timestamp - interval '1 minutes';

--Queue Monitor--
select current_timestamp,
       sn.sn_short, sn.sn_long,
       gra.gra_short, gra.gra_long
from (select sum(plans_waiting_short) as sn_short,
             sum(plans_waiting_long) as sn_long
      from _v_sched_sn
      where entry_ts = (select max(entry_ts) from _v_sched_sn)) sn
cross join
     (select sum(plans_waiting_short) as gra_short,
             sum(plans_waiting_long) as gra_long
      from _v_sched_gra
      where entry_ts = (select max(entry_ts) from _v_sched_gra)) gra;

--CPU Utilization--
select current_timestamp,
       avg(HOST_CPU) * 100.0 as host_cpu_percent,
       avg(SPU_CPU) * 100.0 as spu_cpu_percent,
       max(MAX_SPU_CPU) * 100.0 as max_spu_cpu_percent
from _vt_system_util
where entry_ts / 1000000 > ((select max(entry_ts) from _vt_system_util) / 1000000 - 180);

--Disk Utilization--
select current_timestamp,
       avg(HOST_DISK) * 100.0 as host_disk_percent,
       avg(SPU_DISK) * 100.0 as spu_disk_percent,
       max(MAX_SPU_DISK) * 100.0 as max_spu_disk_percent
from _vt_system_util
where entry_ts / 1000000 > ((select max(entry_ts) from _vt_system_util) / 1000000 - 180);

--Memory Utilization--
select current_timestamp,
       avg(HOST_MEMORY) * 100.0 as host_memory_percent,
       avg(SPU_MEMORY) * 100.0 as spu_memory_percent,
       max(MAX_SPU_MEMORY) * 100.0 as max_spu_memory_percent
from _vt_system_util
where entry_ts / 1000000 > ((select max(entry_ts) from _vt_system_util) / 1000000 - 180);


Check row count by distribution key

SELECT datasliceid,
       COUNT(*) Num_rows
FROM   IMPRINT_LOOKUP
GROUP BY datasliceid;
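
To make skew obvious at a glance, you can summarize the spread of those per-slice counts. A small sketch building on the query above (the ratio is only meaningful when every data slice has at least one row):

select min(num_rows) as min_rows,
       max(num_rows) as max_rows,
       max(num_rows)::float / min(num_rows) as skew_ratio  -- close to 1.0 means even distribution
from (select datasliceid, count(*) as num_rows
      from IMPRINT_LOOKUP
      group by datasliceid) t;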



Delete duplicate records

delete from TABLE_A
where rowid not in (
    select min(rowid)
    from TABLE_A
    group by COL1, COL2, COL3..
);

How to do UPDATE in Netezza SQL

Netezza doesn't allow an inner join in the UPDATE statement; the way to do it is with a FROM clause.

Let's say you have to update tablea from tableb. Here is the SQL:

update tablea T1
set COL = T2.COL,
    COL1 = T2.COL1
from tableb T2
where T1.ID = T2.ID;

How to do PIVOT in Netezza SQL

If you have a table like the one below

GROUP_NAME  GROUP_ID    PASS_FAIL   COUNT
    GROUP1        5            FAIL     382
    GROUP1        5            PASS     339

and you want the result to be like

GROUP_NAME  GROUP_ID      PASS      FAIL
    GROUP1         5       339       382

SELECT
    b.GROUP_NAME,
    a.GROUP_ID,
    SUM(CASE WHEN PASS_FAIL = 'PASS' THEN 1 ELSE 0 END) as PASS,
    SUM(CASE WHEN PASS_FAIL = 'FAIL' THEN 1 ELSE 0 END) as FAIL
FROM log a
JOIN group b
    ON a.group_id = b.group_id
GROUP BY
    b.group_name,
    a.group_id;

Friday, November 1, 2013

IBM Netezza Best Practices & Guidelines

I am sharing some of the best practices we followed when designing Netezza tables and databases.

In order to leverage the Netezza box to its fullest potential and achieve optimum performance, we recommend the best practices and guidelines listed below:
Distribution:
§  The distribution of the data across the various disks is the single most important factor that can impact performance. Consider the points below when choosing the distribution key:
§  Column with high cardinality.
§  Column(s) that will frequently be used in join conditions.
§  Avoid using a boolean column, which causes data skew.
§  Avoid distributing the tables on columns that are often used in the where clause, as it will cause processing skew. Date columns would be an example of where not to use this, especially in DW environments.
§  Good distribution is a fundamental element of performance!
§  If all data slices have the same amount of work to do, a query will be 94 times quicker than if one data slice were asked to do all the work.
§  Bad distribution is called data skew.
§  Skew to one data slice is the worst-case scenario.
§  Skew affects the query in hand and others, as the affected data slice has more to do.
§  Skew also means that the machine will fill up much quicker.
§  Simple rule: good distribution, good performance.

To create an explicit distribution key, the Netezza SQL syntax is:
                CREATE TABLE <tablename> [ ( <column> [, … ] ) ]
                DISTRIBUTE ON [HASH] ( <column> [, … ] );
The phrase DISTRIBUTE ON specifies the distribution key; HASH is optional.
You cannot update columns specified as the DISTRIBUTE ON key.
To create a random distribution, the Netezza SQL syntax is:
                CREATE TABLE <tablename> [ ( <column> [, … ] ) ]
                DISTRIBUTE ON RANDOM;
The phrase DISTRIBUTE ON RANDOM specifies round-robin distribution.
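
For example, a minimal sketch of both forms (table and column names are hypothetical):

-- Hash distribution on a high-cardinality column
create table web_events (
    event_id bigint not null,
    session_id bigint not null,
    event_ts timestamp
)
distribute on (session_id);

-- Round-robin distribution, for when no suitable key exists
create table stg_raw_feed (
    raw_line varchar(1000)
)
distribute on random;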



§  Select the common key between the Dim and Fact tables if possible; if not, select the key that ensures the larger (Fact) table is not redistributed (see the sketch after this list).
§  Choose random distribution only as a last resort, as it will more often lead to a table being redistributed or broadcast. This is okay for a small table but will impact performance if done to large tables.
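
Here is a sketch of co-locating a dimension and fact table on their common join key (all names are hypothetical):

-- Both tables are hashed on customer_id, so the join below is
-- processed slice-locally with no redistribution or broadcast
create table dim_customer (
    customer_id integer not null,
    customer_name varchar(100)
)
distribute on (customer_id);

create table fact_orders (
    order_id bigint not null,
    customer_id integer not null,
    order_total decimal(12,2)
)
distribute on (customer_id);

select c.customer_name, sum(o.order_total) as total_sales
from fact_orders o
join dim_customer c on o.customer_id = c.customer_id
group by c.customer_name;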


§  Use CHAR(x) instead of VARCHAR(x) when you expect the data to be a fixed length; this not only helps to save disk space but also helps performance due to reduced I/O (see the sketch after this list).
§  Where possible, use the NOT NULL constraint for columns in the tables, especially for columns that are used in where clauses (restrictions) or join conditions.
§  Use the same data type and length for columns that are often used for joining tables.
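
A sketch that combines these column guidelines (hypothetical table):

create table employee_badge (
    badge_code char(8) not null,   -- fixed-length code: char instead of varchar
    employee_id integer not null,  -- join column: not null, with the same type
                                   -- and length as in the table it joins to
    issued_date date
)
distribute on (employee_id);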

§  Use joins over subqueries.
§  Create materialized views for vertically partitioning small sets of columns that are often used in queries. The optimizer automatically decides when to use the MV or the underlying table.
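
For instance, a minimal sketch of such a materialized view (the orders table and its columns are hypothetical):

-- A thin vertical slice of a wide table; queries that touch only
-- these columns can be satisfied from the materialized view
create materialized view mv_orders_slim as
select order_id, customer_id, order_date
from orders;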