Wednesday, December 11, 2013

Netezza Importing Data from File to a Table



The following method will allow you to import data from a file (any delimited file) into a table.

The sample file is a tab-delimited file with two columns in it, one char and the other integer; the first row has the column names.

Here is the query to read data from the file and dump it into a table called TEMP_TABLE:
insert into TEMP_TABLE
SELECT
  COL1,
  COL2
FROM EXTERNAL 'C:\Sample.txt'
(
  COL1 varchar(255),
  COL2 integer
)
USING (
  remotesource 'ODBC'
  delimiter '\t'
  skiprows 1
);
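
The insert above assumes TEMP_TABLE already exists with a matching layout. If it does not, a minimal sketch to create it first (the table and column definitions simply mirror the external table definition above) would be:

create table TEMP_TABLE
(
  COL1 varchar(255),
  COL2 integer
);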



Tuesday, December 10, 2013

Netezza ISNUMERIC Data Check Logic

Netezza doesn't have an ISNUMERIC check function, so in order to find out whether a column has non-numeric data, use the following logic:

nvl(instr(translate(Column_Name,
        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'),'X'),0)

If the column contains non-numeric characters (letters), this returns the position of the first one, which is greater than zero; otherwise it returns 0. Note that the replacement string needs one 'X' for each letter in the first string (52 in total). I haven't included special characters in the logic; if your data can contain '%', '$', '£', etc., then include them in the translate part and it will work.



One of our visitors posted the following method:

istrue(length(translate(Column_Name,'0123456789',''))=0) as isnumeric

If the resulting length is zero, then the value is numeric, or the string was empty. If the latter is a possibility, then you could add another check:

istrue(length(translate(Column_Name,'0123456789',''))=0 and length(Column_Name)>0) as isnumeric
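
As a usage sketch built on that expression (CUSTOMER and PHONE_NO are hypothetical table and column names), the same check can be flipped around to list the rows that are not purely numeric:

select PHONE_NO
from CUSTOMER
where length(translate(PHONE_NO, '0123456789', '')) > 0   -- something other than digits remains
   or length(PHONE_NO) = 0;                               -- or the string was empty

Note that NULL values are not flagged by this check; add a separate IS NULL test if you need to catch them as well.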

Friday, December 6, 2013

Netezza Optimizer Parameters

Netezza has 3 internal planners:

Fact Relationship Planner (factrel_planner)
Snowflake Planner
Star Planner

In the course of planning a query, the optimizer may use multiple planners within a particular plan, choosing based on least cost.

Fact Relationship Planner
The fact relationship or factrel planner should be enabled for most star or snowflake schemas. It is enabled in /nz/data/postgresql.conf by adding the following line:
enable_factrel_planner = true
By default, the factrel planner will classify any table over 100,000,000 rows as a FACT table. The optimizer will try to avoid broadcasting or moving FACT tables, as it is almost always much less expensive to broadcast a smaller dimension table. Depending on the size of the tables involved, the default setting may be inappropriate. The default threshold can be adjusted by setting an override in postgresql.conf. In the example below, we set it to 50 million rows, which would cause the optimizer to avoid broadcasting tables of more than 50 million rows.
factrel_size_threshold = 50000000

Snowflake Planner
The snowflake planner is enabled in /nz/data/postgresql.conf by adding the following line:
enable_snowflake_planner = true
For the snowflake planner to classify a table as a FACT table, by default it must be at least 1 million rows and at least 10 times larger than the next smallest table. In most Netezza environments, the default value of 1 million is too low, and should be adjusted higher, e.g.:
snowflake_min_fact_size = 40000000
The snowflake_min_ratio parameter controls the ratio required for a table to be classified as a FACT table.

Star Planner
The star planner is enabled in /nz/data/postgresql.conf by adding the following line:
enable_star_planner = true
In most cases, no further adjustments are necessary. In some cases, adjusting the parameter num_star_planner_rels from its default value of 8 can be beneficial. Adjusting this number up slightly will cause the optimizer to search more deeply for a lower cost plan, which can help with some queries that are running out of system resources. Netezza does not recommend setting this value above 10; in most cases the default value of 8 is sufficient. Should a higher value be desired, it can be set as follows:

num_star_planner_rels = 10

Thursday, December 5, 2013

Informatica Job Interview Question & Answers

Source: dwh-ejtl.blogspot.com

Q1) Tell me, what exactly was your role?
A1) I worked as ETL Developer. I was also involved in requirement gathering, developing mappings, checking source data. I did Unit testing (using TOAD), helped in User Acceptance Testing.

Q2) What kind of challenges did you come across in your project?
A2) Mostly the challenges were to finalize the requirements in such a way that different stakeholders come to a common agreement about the scope and expectations of the project.

Q3) Tell me what the size of your database was?
A3) Around 3 TB. There were other separate systems, but the one I was mainly using was around 3 TB.

Q4) What was the daily volume of records?
A4) It used to vary. We processed around 100K-200K records on a daily basis; on weekends it used to be higher, sometimes around 1+ million records.

Q5) So tell me what your sources were?
A5) Our sources were mainly flat files and relational databases.

Q6) What tools did you use for FTP/UNIX?
A6) For UNIX, I used an open source tool called PuTTY, and for FTP, I used WinSCP and FileZilla.

Q7) Tell me how did you gather requirements?
A7) We used to have meetings and design sessions with end users. The users would give us sketchy requirements; after that we would do further analysis and create detailed Requirement Specification Documents (RSDs).

Q8) Did you follow any formal process or methodology for Requirement gathering?
A8) As such, we did not follow a strict SDLC approach because requirement gathering is an iterative process.
But after creating the detailed Requirement Specification Documents, we used to take user signoff.

Q9) How did you do Error handling in Informatica?
A9) Typically we can set the error flag in mapping based on business requirements and for each type of error, we can associate an error code and error description and write all errors to a separate error table so that we capture all rejects correctly.

Also, we need to capture all source fields in an ERR_DATA table so that we can correct the erroneous data fields and re-run the corrected data if needed.

Usually there could be a separate mapping to handle such error data file.

Typical errors that we come across are

1) Non Numeric data in Numeric fields.
2) Incorrect Year / Months in Date fields from Flat files or varchar2 fields.


Q10) Did you work in Team Based environments?
A10) Yes, we had versioning enabled in Repository.


Q11) Tell me what are the steps involved in Application Development?
A11) In Application Development, we usually follow the steps summarized by the acronym ADDTIP.

a) A - Analysis or User Requirement Gathering
b) D - Designing and Architecture
c) D - Development
d) T - Testing (which involves Unit Testing, System Integration Testing,
UAT - User Acceptance Testing )
e) I - Implementation (also called deployment to production)
f) P - Production Support / Warranty

Q12) What are the drawbacks of Waterfall Approach ?
A12) This approach assumes that all the user requirements will be perfect before the start of design and development. That is not the case most of the time. Users can change their minds to add a few more detailed
requirements, or worse, change the requirements drastically. In those cases this approach (waterfall) is likely to cause a delay in the project, which is a RISK to the project.

Q13) What is a mapping design document?
A13) In a mapping design document, we map source fields to target fields and also document any special business logic that needs to be implemented in the mapping.

Q14) What are different Data Warehousing Methodologies that you are familiar with?
A14) In Data Warehousing, two methodologies are popular: the 1st one is Ralph Kimball's and the 2nd one is Bill Inmon's.
We mainly followed Ralph Kimball's methodology in my last project.
In this methodology, we have fact tables in the middle, surrounded by dimension tables.
This is the basic STAR schema, which is the basic dimensional model.
There is also a snowflake schema; in a snowflake schema, we normalize one or more of the dimension tables.

Q15) What do you do in Bill Inmon Approach?
A15) In Bill Inmon's approach, we try to create an Enterprise Data Warehouse using 3rd NF, and then Data Marts are mainly STAR Schemas in 2nd NF.

Q16) How many mappings have you done?
A16) I did 35+ mappings; around 10+ of them were complex mappings.

Q17) What are Test cases or how did you do testing of Informatica Mappings?
A17) Basically we take the SQL from Source Qualifier and check the source / target data in Toad.

Then we try to spot check data for various conditions according to mapping document and look for any error in mappings.

For example, there may be a condition that if customer account does not exist then filter out that record and write it to a reject file.

Q18) What are the other error handlings that you did in mappings?
A18) I mainly looked for non-numeric data in numeric fields, and for the layout of a flat file being different from what was expected.
Also, dates from flat files come as strings.

Q19) How did you debug your mappings?
A19) I used the Informatica Debugger to check for any flags being set incorrectly. We see whether the logic / expressions are working or not, and whether the data we are expecting is actually coming through.
We use the wizard to configure the debugger.

Q20) Give me an example of a tough situation that you came across in Informatica Mappings and how did you handle it?
A20) Basically, one of our colleagues had created a mapping that was using a Joiner, and the mapping was taking a lot of time to run, but the join was such that we could do it at the database level (Oracle level).
So I suggested and implemented that change and it reduced the run time by 40%.

Q21) Tell me what various transformations that you have used are?
A21) I have used Lookup, Joiner, Update Strategy, Aggregator, Sorter etc.

Q22) How will you categorize various types of transformation?
A22) Transformations can be connected or unconnected. Active or passive.

Q23) What are the different types of Transformations?
A23) Transformations can be active or passive. If the number of output rows is different from the number of input rows, then the transformation is an active transformation.

Like a Filter / Aggregator Transformation. Filter Transformation can filter out some records based on condition defined in filter transformation.

Similarly, in an aggregator transformation, number of output rows can be less than input rows as after applying the aggregate function like SUM, we could have fewer records.

Q24) What is a lookup transformation?
A24) We can use a Lookup transformation to look up data in a flat file or a relational table, view, or synonym.
We can use multiple Lookup transformations in a mapping.
The Power Center Server queries the lookup source based on the lookup ports in the transformation. It compares Lookup transformation port values to lookup source column values based on the lookup condition.
We can use the Lookup transformation to perform many tasks, including:
1) Get a related value.
2) Perform a calculation.
3) Update slowly changing dimension tables.

Q25) Did you use unconnected Lookup Transformation? If yes, then explain.
A25) Yes. An Unconnected Lookup receives an input value as the result of a :LKP expression in another transformation. It is not connected to any other transformation. Instead, it has input ports, output ports and a Return Port.
An Unconnected Lookup can have ONLY ONE Return PORT.

Q26) What is Lookup Cache?
A26) The Power Center Server builds a cache in memory when it processes the first row of data in a cached Lookup transformation.
It allocates the memory based on amount configured in the session. Default is
2M Bytes for Data Cache and 1M bytes for Index Cache.
We can change the default Cache size if needed.
Condition values are stored in Index Cache and output values in Data cache.

Q27) What happens if the Lookup table is larger than the Lookup Cache?
A27) If the data does not fit in the memory cache, the Power Center Server stores the overflow values in the cache files.
To avoid writing the overflow values to cache files, we can increase the default cache size.
When the session completes, the Power Center Server releases cache memory and deletes the cache files.
If you use a flat file lookup, the Power Center Server always caches the lookup source.

Q28) What is meant by "Lookup caching enabled"?
A28) By checking "Lookup caching enabled" option, we are instructing Informatica Server to Cache lookup values during the session.

Q29) What are the different types of Lookup?
A29) When configuring a lookup cache, you can specify any of the following options:
a) Persistent cache. You can save the lookup cache files and reuse them the next time the Power Center Server processes a Lookup transformation configured to use the cache.
b) Recache from source. If the persistent cache is not synchronized with the lookup table, you can configure the Lookup transformation to rebuild the lookup cache.
c) Static cache. You can configure a static, or read-only, cache for any lookup source.
By default, the Power Center Server creates a static cache. It caches the lookup file or table and looks up values in the cache for each row that comes into the transformation.
When the lookup condition is true, the Power Center Server returns a value from the lookup cache. The Power Center Server does not update the cache while it processes the Lookup Transformation.
d) Dynamic cache. If you want to cache the target table and insert new rows or update existing rows in the cache and the target, you can create a Lookup transformation to use a dynamic cache.
The Power Center Server dynamically inserts or updates data in the lookup cache and passes data to the target table.
You cannot use a dynamic cache with a flat file lookup.
e) Shared cache. You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping. You can share a named cache between transformations in the same or different mappings.

Q30) What is a Router Transformation?
A30) A Router transformation is similar to a Filter transformation because both transformations allow you to use a condition to test data. A Filter transformation tests data for one condition and drops the rows of data that do not meet the condition.
However, a Router transformation tests data for one or more conditions and gives you the option to route rows of data that do not meet any of the conditions to a default output group.

Q31) What is a sorter transformation?
A31) The Sorter transformation allows you to sort data. You can sort data in ascending or descending order according to a specified sort key. You can also configure the Sorter transformation for case-sensitive sorting, and specify whether the output rows should be distinct. The Sorter transformation is an active transformation.
It must be connected to the data flow.

Q32) What is a UNION Transformation?
A32) The Union transformation is a multiple input group transformation that you can use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION ALL SQL statement to combine the results from two or more SQL statements. Similar to the UNION ALL statement, the Union transformation does not remove duplicate rows.
You can connect heterogeneous sources to a Union transformation. The Union
transformation merges sources with matching ports and outputs the data from one output group with the same ports as the input groups.

Q33) What is Update Strategy?
A33) Update strategy is used to decide on how you will handle updates in your project. When you design your data warehouse, you need to decide what type of information to store in targets. As part of your target table design, you need to determine whether to maintain all the historic data or just the most recent changes.
For example, you might have a target table, T_CUSTOMERS, that contains customer data. When a customer's address changes, you may want to save the original address in the table instead of updating that portion of the customer row. In this case, you would create a new row containing the updated address, and preserve the original row with the old customer address.
This illustrates how you might store historical information in a target table. However, if you want the T_CUSTOMERS table to be a snapshot of current customer data, you would update the existing customer row and lose the original address.

The model you choose determines how you handle changes to existing rows.

In Power Center, you set your update strategy at two different levels:
1) Within a session. When you configure a session, you can instruct the Power Center Server to either treat all rows in the same way (for example, treat all rows as inserts), or use instructions coded into the session mapping to flag rows for different database operations.
2) Within a mapping. Within a mapping, you use the Update Strategy transformation to flag rows for insert, delete, update, or reject.

Note: You can also use the Custom transformation to flag rows for insert, delete, update, or reject.

Q34) Joiner transformation?
A34) A Joiner transformation joins two related heterogeneous sources residing in different locations. The combination of sources can be varied, such as:
- two relational tables existing in separate databases.
- two flat files in potentially different file systems.
- two different ODBC sources.
- two instances of the same XML source.
- a relational table and a flat file source.
- a relational table and an XML source.

Q35) How many types of Joins can you use in a Joiner?
A35) There can be 4 types of joins

a) Normal Join (Equi Join)
b) Master Outer Join - In master outer join you get all rows from Detail table
c) Detail Outer Join - In Detail Outer Join you get all rows from Master table
d) FULL Outer Join

Q36) What are Mapping Parameter & variables ?
A36) We use mapping parameters and variables to make mappings more flexible.

The value of a parameter does not change during a session, whereas the value stored in a variable can change.


Q37) TELL ME ABOUT PERFORMANCE TUNING IN INFORMATICA?
A37) Basically, performance tuning is an iterative process. We can do a lot of tuning at the database level, and if the database queries are faster then the Informatica workflows will automatically be faster.

For performance tuning, first we try to identify the source / target bottlenecks. Meaning that first we see what can be done so that source data is retrieved as fast as possible.

We try to filter as much data in the SOURCE QUALIFIER as possible. If we have to use a filter, then filtering records should be done as early in the mapping as possible.

If we are using an aggregator transformation, then we can pass sorted input to the aggregator. We should ideally sort on the ports on which the GROUP BY is being done.

Depending on the data, an unconnected Lookup can be faster than a connected Lookup.

Also, there should be as few transformations as possible, and in the Source Qualifier we should bring in only the ports which are being used.

For optimizing the TARGET, we can disable the constraints in PRE-SESSION SQL and use BULK LOADING.

If the TARGET table has any indexes, like a primary key or any other indexes / constraints, then BULK loading
will fail. So in order to utilize BULK loading, we need to disable the indexes.

In case of Aggregator transformation, we can use incremental loading depending on requirements.


Q38) What kind of workflows or tasks have you used?
A38) I have used session, email task, command task, event wait tasks.

Q39) Explain the process that happens when a WORKFLOW Starts?
A39) When a workflow starts, the Informatica server retrieves mappings, workflows & session metadata from the repository to extract the data from the source, transform it & load it into the target.

- It also runs the tasks in the workflow.
- The Informatica server uses the Load Manager & Data Transformation Manager (DTM) processes to run the workflow.
- The Informatica server can combine data from different platforms & source types, for example joining data from a flat file & an Oracle source. It can also load data to different platforms & target types, for example loading transformed data to both a flat file target & a MS SQL Server DB in the same session.

Q40) What all tasks can we perform in a Repository Manager?
A40) The Repository Manager allows you to navigate through multiple folders & repositories & perform basic repository tasks.

Some examples of these tasks are:
- Add or remove a repository
- Work with repository connections: you can connect to one repository or multiple repositories.
- View object dependencies: before you remove or change an object, you can view dependencies to see the impact on other objects.
- Terminate user connections: you can use the Repository Manager to view & terminate residual user connections.
- Exchange metadata with other BI tools: you can export & import metadata from other BI tools like Cognos, BO, etc.

IN REPOSITORY MANAGER NAVIGATOR WINDOW, WE FIND OBJECTS LIKE:
- Repositories: can be standalone, local or global.
- Deployment groups: contain collections of objects for deployment to another repository in the domain.
- Folders: can be non-shared.
- Nodes: can include sessions, sources, targets, transformation, mapplets, workflows, tasks, worklets & mappings.
- Repository objects: the same as nodes, along with workflow logs & session logs.

Q41) Did you work on ETL strategy?
A41) Yes, my data modeler & ETL lead, along with the developers, analyzed & worked on dependencies between tasks (workflows).
Well, there are Push & Pull strategies which are used to determine how the data comes from the source systems to the ETL server.
Push strategy: with this strategy, the source system pushes data (or sends the data) to the ETL server.
Pull strategy: with this strategy, the ETL server pulls the data (or gets the data) from the source system.

Q42) How did you migrate from Dev environment to UAT / PROD Environment?
A42) We can do a folder copy or export the mapping in XML format and then import it into another repository or folder.
In my last project we used deployment groups.

Q43) External Scheduler?
A43) With external schedulers, we used to run Informatica jobs, like workflows, using the pmcmd command, in parallel with some Oracle jobs like stored procedures. There are various kinds of external schedulers available in the market, like Autosys, Maestro, and Control-M. So we can mix & match Informatica & Oracle jobs using external schedulers.

Q44) What is a Slowly Changing Dimension?
A44) In a Data Warehouse, usually the updates in Dimension tables don't happen frequently.

So if we want to capture changes to a dimension, we usually resolve it with Type 2 or
Type 3 SCD. So basically we keep historical data with SCD.

Q11) Explain SLOWLY CHANGING DIMENSION (SCD) Type, which one did you use?
A11) There are 3 ways to resolve SCD. The first one is Type 1, in which we overwrite the changes, so we lose history.

Type 1

OLD RECORD
==========

Surr Dim Key   Cust_Id (Natural Key)   Cust Name
============   =====================   ===========
1              C01                     ABC Roofing


NEW RECORD
==========

Surr Dim Key   Cust_Id (Natural Key)   Cust Name
============   =====================   ===========
1              C01                     XYZ Roofing


I mainly used Type 2 SCD.


In Type 2 SCD, we keep an effective date and an expiration date.

For the older record, we update the expiration date to the current date - 1 (if the change happened today).
For the current record, we keep the current date as the effective date and a far-future date as the expiration date:

Surr Dim Key   Cust_Id (Natural Key)   Cust Name     Eff Date    Exp Date
============   =====================   ===========   =========   ==========
1              C01                     ABC Roofing   1/1/0001    12/31/9999

Suppose on 1st Oct, 2007 a small business name changes from ABC Roofing to XYZ Roofing, so if we want to store the old name, we will store data as below:

Surr Dim Key   Cust_Id (Natural Key)   Cust Name     Eff Date    Exp Date
============   =====================   ===========   =========   ==========
1              C01                     ABC Roofing   1/1/0001    09/30/2007
101            C01                     XYZ Roofing   10/1/2007   12/31/9999
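
As a rough SQL sketch of that effective/expiration date handling (DIM_CUSTOMER and its columns are hypothetical names, and in practice the surrogate key would come from a sequence or the ETL tool rather than a literal):

-- Expire the old version of the customer row
update DIM_CUSTOMER
   set EXP_DATE = current_date - 1
 where CUST_ID = 'C01'
   and EXP_DATE = '9999-12-31';

-- Insert the new version with a fresh surrogate key
insert into DIM_CUSTOMER (SURR_KEY, CUST_ID, CUST_NAME, EFF_DATE, EXP_DATE)
values (101, 'C01', 'XYZ Roofing', current_date, '9999-12-31');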

We can also implement TYPE 2 with a CURRENT RECORD FLAG.
For the current record, we keep the flag set to 'Y':

Surr Dim Key   Cust_Id (Natural Key)   Cust Name     Current_Record Flag
============   =====================   ===========   ===================
1              C01                     ABC Roofing   Y


Suppose on 1st Oct, 2007 a small business name changes from ABC Roofing to XYZ Roofing, so if we want to store the old name, we will store data as below:



Surr Dim Key   Cust_Id (Natural Key)   Cust Name     Current_Record Flag
============   =====================   ===========   ===================
1              C01                     ABC Roofing   N
101            C01                     XYZ Roofing   Y

Q45) What is a Mapplet? Can you use an active transformation in a Mapplet?
A45) A mapplet has one Input and one Output transformation, and in between we can have various transformations.
A mapplet is a reusable object that you create in the Mapplet Designer. It contains a set of transformations and allows you to reuse that transformation logic in multiple mappings.
Yes, we can use an active transformation in a Mapplet.

1) A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Subject Oriented
Integrated
Nonvolatile
Time Variant

2) Surrogate Key
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.
There are actually two cases where the need for a "dummy" dimension key arises:
1) the fact row has no relationship to the dimension, and
2) the dimension key cannot be derived from the source system data.
3) Facts & Dimensions form the heart of a data warehouse. Facts are the metrics that business users would use for making business decisions. Generally, facts are mere numbers. The facts cannot be used without their dimensions. Dimensions are those attributes that qualify facts. They give structure to the facts. Dimensions give different views of the facts. In our example of employee expenses, the employee expense forms a fact. The Dimensions like department, employee, and location qualify it. This was mentioned so as to give an idea of what facts are.
Facts are like skeletons of a body.
Skin forms the dimensions. The dimensions give structure to the facts.
The fact tables are normalized to the maximum extent, whereas the dimension tables are de-normalized since their growth would be very small.

4) Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. Type 2 slowly changing dimensions should be used when it is necessary for the data warehouse to track historical changes.
SCD Type 2
Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear.
The fields 'effective date' and 'current indicator' are very often used in that dimension and the fact table usually stores dimension key and version number.

4) CRC Key
Cyclic redundancy check, or CRC, is a data encoding method (noncryptographic) originally developed for detecting errors or corruption in data that has been transmitted over a data communications line.
During ETL processing for the dimension table, all relevant columns needed to determine change of content from the source system(s) are combined and encoded through use of a CRC algorithm. The encoded CRC value is stored in a column on the dimension table as operational metadata. During subsequent ETL processing cycles, new source system records have their relevant data content values combined and encoded into CRC values. The source system CRC values are compared against CRC values already computed for the same production/natural key on the dimension table. If the production/natural key of an incoming source record is the same but the CRC values are different, the record is processed as a new SCD record on the dimension table. The advantage here is that CRC values are small, usually 16 or 32 bits in length, and easier to compare during ETL processing than the contents of numerous data columns or large variable-length columns.

5) Data partitioning, a new feature added to SQL Server 2005, provides a way to divide large tables and indexes into smaller parts. By doing so, it makes the life of a database administrator easier when doing backups, loading data, recovery and query processing.
Data partitioning improves the performance, reduces contention and increases availability of data.
Objects that may be partitioned are:

• Base tables
• Indexes (clustered and nonclustered)
• Indexed views

Q46) Why we use stored procedure transformation?
A46) Stored Procedure transformation is an important tool for populating and maintaining databases.
Database administrators create stored procedures to automate time-consuming tasks that are too complicated for standard SQL statements.

You might use stored procedures to do the following tasks:
Check the status of a target database before loading data into it.
Determine if enough space exists in a database.
Perform a specialized calculation.
Drop and recreate indexes.

Q47) What is source qualifier transformation?
A47) When you add a relational or a flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier represents the rows that the Informatica Server reads when it executes a session.
It is the transformation which converts the source (relational or flat file) data types to Informatica data types, so it works as an intermediary between the source and the Informatica server.

Tasks performed by the Source Qualifier transformation:
1. Join data originating from the same source database.
2. Filter records when the Informatica Server reads source data.
3. Specify an outer join rather than the default inner join.
4. Specify sorted ports.
5. Select only distinct values from the source.
6. Create a custom query to issue a special SELECT statement for the Informatica Server to read source data.

Q48) What is CDC, changed data capture?
A48) Whenever any source data is changed, we need to capture it in the target system as well. This can be done in basically 3 ways:
- The target record is completely replaced with the new record.
- Complete changes can be captured as different records & stored in the target table.
- Only the last change & present data can be captured.
CDC is generally done by using a timestamp or a version key.
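
A rough timestamp-based sketch of that idea (SRC_ORDERS, LAST_UPDATE_TS and ETL_AUDIT are hypothetical names):

-- Pull only the rows changed since the last successful load
select *
from SRC_ORDERS
where LAST_UPDATE_TS > (select max(LOAD_TS) from ETL_AUDIT);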
Q49) What is Load Manager and DTM (Data Transformation Manager)?
A49) The Load Manager and DTM are components of the Informatica server. The Load Manager manages the load on the server by maintaining a queue of sessions and releasing sessions on a first-come, first-served basis. When a session is released from the Load Manager, it initializes the master process called the DTM. The DTM modifies the data according to the instructions coded in the session mapping.
The Load Manager creates one DTM process for each session in the workflow. It performs the following tasks:
Reads session information from the repository.
Expands the server session and mapping variables and parameters.
Creates the session log file.
Validates source and target code pages.
Verifies connection object permissions.
Runs pre-session shell commands, stored procedures, and SQL.
Creates and runs mapping reader, writer, and transformation threads to extract, transform, and load data.
Runs post-session stored procedures, SQL, and shell commands.
Sends post-session email.

Netezza Job Interview Questions

Explain FPGA and how it is useful for query performance.
FPGA: A Field Programmable Gate Array (FPGA) is located on each SPU. Netezza is different from other architectures; Netezza can do a “hardware upgrade” through software by using the FPGA. The hardware is reconfigured during install.
While reading data from disk, the FPGA on each SPU also helps in ‘filtering’ out unnecessary data before it gets loaded into memory on each SPU. This way, the FPGA keeps the SPU's memory from being overwhelmed with all the data from disk.
What is a zone map?
A zone map in Netezza is similar (concept-wise) to partitions in Oracle. Netezza maintains a map of the data and relies on the zone map to pull only the range it is interested in. For example, if we need to pull out data from Jan 2009 till June 2009 from a table that is distributed on a date column, the zone map helps us achieve this. Zone maps are maintained by Netezza automagically; no user intervention is needed. Zone mapping is done at a block (extent) level. Netezza has zone maps for all columns (not just the distribution column), and they include information such as minimum, maximum, and total number of records.
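For example, a range query like the sketch below (SALES and SALE_DATE are hypothetical names) lets Netezza skip extents whose min/max dates fall outside the range:

select count(*)
from SALES
where SALE_DATE between '2009-01-01' and '2009-06-30';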
How do you deal with historical data, with respect to zone maps?
Sort the data first, based on the historical column (for example, date), and load it in using nzload.
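If the data is already in a staging table, a related approach (a sketch, assuming a staging table SALES_STAGE and a date column SALE_DATE) is to rebuild the table with a CTAS that orders by the date, which keeps the zone map ranges narrow:

create table SALES_SORTED as
select *
from SALES_STAGE
order by SALE_DATE;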
What are the different ways to load?
  1. nzload
  2. External tables
  3. Create table AS (aka, CTAS).
  4. Inserts (Eeeewee!!)
Does everything get cached in Netezza (or any other data appliance)?
Typically only the schema and other database objects are cached in appliances. Data is not cached, in general. In most cases, data is not saved anywhere (in any cache or on the host computer) and is streamed directly from the SPUs to the client software.
What is the best data appliance?
Obviously, it all depends. This is my (limited) view:
  1. From a features perspective, Greenplum.
  2. Popularity with a bit of hype, Netezza.
  3. Mature and well respected, Teradata.
  4. With existing database integration, Dataupia.
    Largest implementations:
* Teradata: 72 nodes (two quad-core CPUs, 32GB RAM, 104 / 300GB disks per node) and manages 2.4PB.
* Greenplum: Fox Interactive Media using a 40-node, Sun X4500 with two dual-core CPUs, 48 / 500GB disks, and 16 GB RAM (1PB total disk space)
Source: Vertica’s Michael Stonebraker!
How is a load achieved in Netezza and why is it quick / fast?
Loads bypass a few steps that a query typically goes through (a query goes through plan generation, optimization and transaction management). Loads are done in terms of “sets”, and a set is based on the underlying table structure (thus loads for two different tables are different, as their sets are based on the table structures). Data is processed to check its format, the distribution of records is calculated very quickly (in one step), and the data fills into the ‘set’ structure and is written to the storage structure. Storage also performs space-availability checks and other admin tasks; all these operations go pretty quickly (think of them as UNIX named pipes that stream data while the SPUs store the records).
When are we likely to receive incorrect (aggregate) results?
Very rarely, a driver may return aggregate results that are still being processed back to the client. In this case, the client may assume that the calculation is complete, instead of updating with the latest or final results. Obviously, the driver has to wait for Netezza to complete the operation on the host computer before delivering the results.
Explain how data gets stored in Netezza and how SPU failover takes place.
Data is stored based on a selected field (or fields) which are used for distribution.
==Data (A)==> Hash Function (B) ==> Logical SPU identifier list (C) ==> Physical SPU list (D) ==> Storage (E)
When data arrives, it is hashed based on the field(s), and a hash function (B) is used for this purpose. For example, for a hypothetical 32-node system, the logical SPU identifier list has 32 unique entries. If there are 1000 hashed data items from (B), there are 1000 entries in (C), all having only 32 SPU entries (a number of data items go to the same SPU, thus multiple (B) entries map to the same (C)). For instance, (C) has values [3,19,30,7,20,25,11,3,22,19....]. This way, 1000 data entries are mapped. (D) has the physical IP addresses of both the primary and failover SPUs. If there is a failover, this is the only place where Netezza needs to update its entries. The same goes for a system that has a new SPU added. It is a little more complicated than this, but in principle this is the concept.
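For reference, the distribution field(s) are chosen when the table is created; a minimal sketch (ORDERS and its columns are hypothetical names):

create table ORDERS
(
  ORDER_ID   integer not null,
  CUST_ID    integer not null,
  ORDER_DATE date,
  AMOUNT     numeric(12,2)
)
distribute on (ORDER_ID);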
What are the 4 environment variables that are required? What are the different states on Netezza?
Environment variables: NZ_HOST, NZ_DATABASE, NZ_USER and NZ_PASSWORD
* Online: Normal or usual state.
* Stopped: Netezza will shutdown after completing current queries, no new queries allowed.
* Offline: Waits for completion of current queries, new or queries in queue receive error.
* Paused: Same as above, but no error displayed. Typically caused during Netezza bootup or startup.
* Down: Just plain down, could be due to Netezza server problem or user initiated.
Does Netezza support concurrent updates of the same record? In case of a conflict where the same record is set for modification, Netezza rolls back the most recent transaction attempted on the same record, in fact on the same table. This is generally acceptable in DW environments. Netezza does support serializable transactions and does not permit dirty reads.

How does Netezza update records? Give an idea of how transactions are maintained and how read consistency is maintained.
Netezza does not update records in place; it marks records with a delete flag. In fact, each record contains two slots, one for the create xid and another for the delete xid. The delete xid allows Netezza to mark a record with the current transaction for deletion; up to 31 transactions are allowed in Netezza for all tables. As noted earlier, though, only one update at a time is allowed on the same table. Here, update refers to transactions that are not committed yet. Coming back to the delete xid, this is how Netezza maintains transaction rollback and recovery. Once a record is modified, its delete xid is given the transaction id; this is changed from the previous value of 0 (all records, when loaded, contain 0 for the delete xid). Note that the FPGA uses its intelligence to scan data before delivering it to the host or applications.
Sample data:
[Row id][Create xid][Delete xid]
[R1][T1][0]          // First time a record is loaded, record R1
                     // After some time, the same record is updated
[R1][T1][T33]        // Record R1 is updated; note T33
[R33][T33][0]        // The new updated record R33; like a new record, it has zero for Delete xid
If the record is deleted, the delete xid will simply contain that transaction id.
* Based on the above, how do you know a record is the latest? It has zero in the delete xid slot.
* Extending the same logic, how do we know a record is deleted? It has a non-zero value in the delete xid slot.
* How do you roll back a transaction? Following the listing above, we can roll back a transaction of interest.
Note that the transaction id is located in the create xid slot, and that is our point of interest in this case. From what I know, the row id and create xid are never modified by Netezza.

What happens to records that are loaded during an nzload process but were not committed? They are logically deleted; the administrator can run nzreclaim, or we may also truncate the table.
Can a group become a member of another group in Netezza user administration? Can we use the same name for a group and a database?
In Netezza, the public group is created automatically and everyone is a member of this group by default. We can create as many groups as we want, and any user can be a member of any group(s). A group cannot be a member of another group. Group names, user names and database names are unique; that is, we cannot have a database called sales and a group also called sales.
How can we give a global permission to user joe so that he can create tables in any database?
Log into the system database and give that permission to the user by saying “grant create table to joe;”.
What permission will you give to connect to a database? LIST. Grant list, select on table to public (if logged into the sales database, this allows all users to query tables in the sales database).
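Spelled out as SQL (a sketch; joe and sales come from the answers above, while EMP is a hypothetical table name):

-- Run from the SYSTEM database: let joe create tables anywhere
grant create table to joe;

-- Run while connected to the sales database: let everyone see and query a table
grant list, select on EMP to public;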
Do we need to drop all tables and objects in a database before dropping the database?
No, drop database will take care of it.
What constraints on a table are enforced? NOT NULL and DEFAULT. Netezza does not enforce PK and FK constraints.
Why is a NOT NULL specification better in Netezza?
Specifying NOT NULL results in better performance, as NULL values are tracked at the row-header level. Having nullable columns results in storing references to the NULL values in the header; if all columns are NOT NULL, then there is no record header.
Create Table AS (CTAS): does it distribute data randomly or based on the table from which it received data? A table newly created with CTAS gets its distribution from the original table.
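A quick sketch of both behaviors (SALES and CUST_ID are hypothetical names; the explicit DISTRIBUTE ON clause is only needed when you want to override the inherited distribution):

-- Inherits its distribution from SALES
create table SALES_COPY as
select * from SALES;

-- Explicitly distributes the new table on a different column
create table SALES_BY_CUST as
select * from SALES
distribute on (CUST_ID);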
Why do you prefer truncate instead of the drop table command? It just empties the data from the table, keeping the table structure and permissions intact.

When no distribution clause is used while creating a table, what distribution is used by Netezza? The first column (same as in Teradata).
Can we update all columns in a Netezza table? No, the column(s) used in the distribution clause cannot be updated. Remember, up to four columns can be used for distribution of data on the SPUs. From a practical sense, updating distribution columns would result in redistribution of data, the single biggest performance hit when a large table is involved, so this restriction makes sense.
What is a data slice and an SPU? For me, they are the same! Of course, this answer is not an accurate reply in your interview(s).
What data type works best for zone maps?
Zone maps work best for integer data types.
What feature in Netezza do you not like?
Of course, there is a large list, especially when compared to Oracle. The lack of PK and FK enforcement is a big drawback, though this is typically enforced in the ETL or ELT process [ELT: Extract, Load and Transform; note that 'Transform' and 'Load' can happen within Netezza].

Netezza Date Functions Use

SELECT date(current_Date) from dual;
--10/28/10 (if current date is Oct 28, 2010)
SELECT (current_timestamp) from dual;
/*
SELECT DATE_PART('MONTH', current_date) from dual;
SELECT current_time, DATE_PART('SECOND', current_time) from dual;
SELECT current_timestamp, DATE_PART('SECOND', current_timestamp) from dual;
*/
select add_months(current_date,-1) from dual;
-- DATE_TRUNC FUNCTION:
SELECT add_months(date_trunc('month', current_timestamp),-1) from dual;
select add_months(date_trunc('month', current_date),-1) from dual;
-- ADD_MONTHS FUNCTION:
select add_months(current_date,-1) - date_part('day', current_date) from dual;
--8/31/10
select add_months(current_date,1) - date_part('day', current_date) from dual;
--10/31/10
-- DATE_PART FUNCTION:
select date_part('day', current_date) from dual;
--28
select date_part('month', current_date) from dual;
--10
-- date_part will NOT work here:
--SELECT add_months(date_part('month', current_date),-1) from dual;
select add_months(current_date,0) from dual;
--10/28/10
select add_months(current_date,-1) from dual;
select add_months(current_date,-28) from dual;
select add_months(current_date,-1) - date_part('month', current_date) from dual;
--9/18/10
select add_months(current_date,0) - date_part('day', current_date) from dual;
--9/30/10
SELECT date(current_Date) from dual;
-- DOESN'T WORK:
--SELECT date(current_Date,-1) from dual;

SELECT (current_Date)-1 from dual;
select add_months(current_date,-1) - date_part('day', current_date) from dual;
select add_months(current_date,1) - date_part('day', current_date) from dual;
select date_part('day', current_date) from dual;
select date_part('month', current_date) from dual;
select add_months(current_date,0) from dual;
select add_months(current_date,-1) from dual;
select add_months(current_date,-1) - date_part('month', current_date) from dual;
select add_months(current_date,0) - date_part('day', current_date) from dual;
select extract(day from current_date)-7;

Wednesday, December 4, 2013

Steps to Create a Simple SSIS Package

I am sharing a simple post found on CodeProject.

Introduction

The purpose of this tutorial is to provide detailed steps on how to create a SSIS Package. The package given below reads the data from a source tab delimited text file and based on certain conditions, saves the output in two separate Excel Output Files. The output Excel files also have the same header as the input text file. The structure and data for the source input file is given below:
EmpID   EmpName   DOB         DOJ         Salary
1       User1     1/1/1976    1/4/2000    20000
2       User2     1/2/1976    1/5/2000    20000
3       User3     1/3/1976    1/6/2000    20000
4       User4     1/4/1976    1/7/2000    30000
5       User5     1/5/1976    1/8/2000    20000
6       User6     1/6/1976    1/9/2000    40000
7       User7     1/7/1976    1/10/2000   20000
8       User8     1/8/1976    1/11/2000   35000
9       User9     1/9/1976    1/12/2000   20000
10      User10    1/10/1976   1/6/2000    20000

Steps to Create Your First SSIS Package

1. Open business intelligence development studio.
2. Click on File-> New -> Project.
3. Select Integration Services Project in the New Project window, give an appropriate name and location for the project, and click OK.
4. The new project screen contains the following:
  1. Tool Box on left side bar
  2. Solution Explorer on upper right bar
  3. Property Window on lower right bar
  4. Control flow, data flow, event Handlers, Package Explorer in tab windows
  5. Connection Manager Window in the bottom
5. Right click on the Connection Manager Tab, click on new FLAT File Connection Menu Item.
6. Connection manager editor opens up which contains 4 tabs, General, Columns, Advanced and Preview.
  1. In General Tab, enter the connection manager name and description (optional). Select the source file, file format and delimiter. If the first row of the source file contains headers, then select the checkbox “Column names in the first data row”.
  2. Select Column tab and check whether all columns are properly mapped or not.
  3. Select advance tab. Here you can add, remove or modify columns as per output stream requirement.
  4. Select preview tab to check how your output will look like:
7. Click on OK. It will create a flat file connection manager for your source file.
8. Now Drag Data Flow Task from the Toolbox into the Control Flow Container.
9. Double Click on the Data Flow Task. It will show Data flow Container tab for selected Data Flow Task. You can see three item categories in Toolbox.
  1. Data flow sources - Source makes data from different external data sources available to the other components in the data flow.
  2. Data flow transformations - Transformations can perform tasks such as updating, summarizing, cleaning, merging, and distributing data.
  3. Data flow destinations - Destination writes the data from a data flow to a specific data store, or creates an in-memory dataset.
10. Drag a Flat file Source Component from Data Flow Sources into Data Flow Container window.
11. Double Click on the Flat File Source Component, it will display flat File source Editor. The window contain three tabs:
  1. Connection Manager - Here we will specify source connection manager which we created for source file. If source file contains null values, select “Retain null values from Source as null values in the data flow” checkbox.
  2. Columns -This tab allows the user to select required output columns and user can also change the output column names.
  3. Error Output - Using this tab, the user can decide the behavior of the component in case of failure. There are three options:
    1. Ignore Failure: Selecting this will ignore any failure while reading rows from source and the package will continue executing even any error occurred.
    2. Redirect Row: Selecting this will redirect the failed rows to other component which is connected with the error precedence constraints.
    3. Fail component: Selecting this will stop the execution of package in case of failure.
12. Drag and drop a Conditional Split Component from Data Flow Transformations Tab into Data flow Task Container window. Drag and Connect the success output (which is shown by Green arrow) of Flat File Source Component to Conditional Split Component.
13. Double click on the Conditional Split component, it will open Conditional Spilt Component editor window. Here user can specify the condition(s) as per the requirement and click ok. For example:
  • HigherSalary: [Salary] > 20000 (Redirect records if salary is greater than 20000)
  • LowerSalary: For rest of the records
14. Drag and drop an Excel Destination component from the Data Flow Destinations tab into the Data Flow Task Container. Connect the success arrow of the Conditional Split Component to the Excel Destination; the Input Output Selection window will pop up.
15. In the Input Output Selection popup window, select appropriate conditional output for example “HigherSalary” conditional output and click ok.
16. Double Click on the Excel Destination Component which will open Excel Destination Editor Window. Click on the new Ole DB Connection Manager Button.
17. Select Destination File location and the appropriate Excel Sheet Name where you want to insert the success output data with salary values higher than 20000.
18. Click on the mapping tab and map the appropriate input columns with output columns.
19. On click of OK, error icon is shown in the destination excel file component and it displays the message “Cannot Convert between Unicode and Non-Unicode string data types”.
20. To resolve this issue, we need to insert a Data Conversion Transformation Component between Conditional Split and Excel Destination Component.
21. Double Click on the Data Conversion Component, it will open Data Conversion Transformation editor. Using this component, convert input data types to required output data types.
22. Click OK and connect the success arrow of the Data Conversion component to the Excel Destination Component. Double click on the Excel Destination Component, click on the mapping tab, map the output of the Data Conversion Component to the input of the Excel Destination Component, and click OK.
23. Rename the Excel Destination Component as “Records with Salary > 20000”.
24. Now add one more Data Conversion Transformation Component and connect the second success output of the Conditional Split to it. Do the necessary data type conversions. Add one more Excel Destination Component and rename it as “Remaining Records”. Create a new connection manager and configure it to point to the second output file. Connect the output of the newly added Data Conversion Component to the new Excel Destination Component and do the mapping as required.
25. Now the package is ready to be executed. Go to the Solution Explorer, right click on the package and select “Execute Package”. If all components turn “GREEN”, it means the package has run successfully; if there is any error, the component which failed to execute will be shown in “RED” color. We can see the package execution steps in the “Progress” tab.
26. Once you run the package, data will be saved in the destination output files as per the condition specified in the Conditional Split Component.