Wednesday, September 29, 2010
DataStage Performance Tuning
Performance Tuning - Basics
Parallelism :
Parallelism in DataStage jobs should be optimized rather than maximized. The degree of parallelism of a DataStage job is determined by the number of nodes defined in the configuration file (for example, four-node or eight-node). A configuration file with a larger number of nodes generates a larger number of processes, which adds processing overhead compared with a configuration file with fewer nodes. Therefore, when choosing a configuration file, weigh the benefits of increased parallelism against the losses in processing efficiency (increased processing overhead and slower start-up time). Ideally, use a configuration file with fewer nodes when the amount of data to be processed is small, and a configuration file with more nodes when the data volume is large.
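For illustration, a minimal two-node configuration file might look like the following sketch; the host name and disk paths are hypothetical and would be replaced with values from your own environment:
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/pds1" {pools ""}
        resource scratchdisk "/data/ds/scratch1" {pools ""}
    }
    node "node2"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/pds2" {pools ""}
        resource scratchdisk "/data/ds/scratch2" {pools ""}
    }
}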
Partitioning :
Proper partitioning of data is another aspect of DataStage job design which significantly improves overall job performance. Partitioning should be set so that the data flow is balanced, i.e. data is divided nearly equally across partitions and data skew is minimized.
Memory :
In DataStage jobs where a high volume of data is processed, the virtual memory settings for the job should be optimised. Jobs often abort when a single lookup has multiple reference links; this happens due to low temporary memory space. In such jobs, $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
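As a rough sketch of how these might be raised, the variables can be exported in the project's dsenv file or defined as job parameters; the values below are purely illustrative and should be tuned for your own environment:
export APT_BUFFER_MAXIMUM_MEMORY=104857600    # about 100 MB per buffer; the default is only a few MB
export APT_MONITOR_SIZE=500000                # report job-monitor progress every 500,000 rows
export APT_MONITOR_TIME=60                    # or at most once every 60 seconds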
Performance Analysis of Various Stages in DataStage
Sequential File Stage -
The Sequential File Stage is a file Stage. It is the most common I/O Stage used in a DataStage job. It is used to read data from or write data to one or more flat files. It can have only one input link or one output link, and it can also have one reject link. When handling huge volumes of data, this Stage can itself become one of the major bottlenecks, as reading from and writing to it is slow. Sequential files should be used when reading a flat file (fixed-width or delimited) from a UNIX environment that has been FTPed from some external system, or when some UNIX operation has to be performed on the file. Do not use sequential files for intermediate storage between jobs: this causes performance overhead, as data has to be converted before it is written to and read from a UNIX file. To read from this Stage faster, the number of readers per node can be increased (the default value is one).
Data Set Stage :
The Data Set Stage is a file Stage which allows reading data from or writing data to a dataset. This Stage can have a single input link or a single output link. It does not support a reject link. It can be configured to operate in sequential or parallel mode. DataStage parallel extender jobs use datasets to store the data being operated on in a persistent form. Datasets are operating system files which, by convention, have the suffix .ds. Datasets are much faster than sequential files: data is spread across multiple nodes and is referenced by a control file. Datasets are not UNIX files, and no UNIX operation can be performed on them. Using datasets results in good performance in a set of linked jobs, and they help achieve end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
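Because ordinary UNIX commands cannot be used on datasets, the orchadmin utility supplied with the parallel engine is the usual way to inspect or remove them. The calls below are a sketch from memory and the dataset path is hypothetical; the exact sub-commands can vary by release, so check orchadmin's help on your installation:
orchadmin describe /data/ds/customers.ds
orchadmin delete /data/ds/customers.ds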
Lookup Stage –
A Lookup Stage is an active Stage. It is used to perform a lookup against any parallel job Stage that can output data. The Lookup Stage can have a reference link, a single input link, a single output link and a single reject link. The Lookup Stage is faster when the data volume is small. It can have multiple reference links (a sparse lookup can have only one reference link). The optional reject link carries source records that have no corresponding entry in the input lookup tables. The Lookup Stage and the type of lookup should be chosen depending on the functionality and the volume of data: the sparse lookup type should be chosen only if the primary input data volume is small, and if the reference data volume is large the Lookup Stage should be avoided altogether, because all reference data is pulled into local memory.
Join Stage :
The Join Stage performs a join operation on two or more datasets input to the Stage and produces one output dataset. It can have multiple input links and one output link. There are three types of join operations: inner join, left/right outer join and full outer join. Join should be used when the data volume is high; it is a good alternative to the Lookup Stage when handling huge volumes of data, because it uses the paging method for data matching rather than holding reference data in memory.
Merge Stage :
The Merge Stage is an active Stage. It can have multiple input links, a single output link, and one reject link for each update input link. The Merge Stage takes sorted input: it combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify; merge keys can be one or more columns that exist in both the master and update records. For a Merge Stage to work properly, the master and update datasets should contain unique records. The Merge Stage is generally used to combine datasets or files.
Sort Stage :
The Sort Stage is an active Stage used to sort an input dataset in either ascending or descending order. It offers a variety of options: retaining the first or last record when removing duplicates, stable sorting, specifying the algorithm used for sorting to improve performance, and so on. Even though data can be sorted on a link, the Sort Stage should be used when the volume of data to be sorted is huge. When data is sorted on a link (sort/unique option), I/O to disk takes place once the data size exceeds the fixed memory limit, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort Stage should be used instead of a sort on the link. The Sort Stage also provides an option to increase the buffer memory used for sorting, which means lower I/O and better performance.
Transformer Stage :
The Transformer Stage is an active Stage which can have a single input link and multiple output links. It is a very robust Stage with a lot of built-in functionality. The Transformer Stage always generates C++ code, which is then compiled into a parallel component, so the overhead of using a Transformer Stage is high. Therefore, in any job, the use of Transformers should be kept to a minimum and other Stages used instead: a Copy Stage can map an input link to multiple output links without any transformation, a Filter Stage can filter out data based on certain criteria, and a Switch Stage can map a single input link to multiple output links based on the value of a selector field. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
Funnel Stage –
The Funnel Stage is used to combine multiple inputs into a single output stream. However, the presence of a Funnel Stage reduces the performance of a job; in our observations it increased job runtime by around 30%. When a Funnel Stage is to be used in a large job, it is better to isolate it in a job of its own: write the outputs to datasets and funnel them in a new job. The Funnel Stage should be run in "continuous" mode, without hindrance.
Overall Job Design :
While designing DataStage jobs, care should be taken that a single job is not overloaded with Stages. Each extra Stage in a job leaves fewer resources available for every other Stage, which directly affects job performance. If possible, big jobs with a large number of Stages should be logically split into smaller units. Similarly, if a particular Stage has been identified as taking a lot of time in a job, such as a Transformer Stage with complex functionality, many Stage variables and many transformations, the design could place that Stage in a separate job altogether (giving it more resources).
While designing jobs, care must also be taken that unnecessary column propagation is avoided. Columns which are not needed in the job flow should not be propagated from one Stage to another or from one job to the next, and RCP (Runtime Column Propagation) should be disabled as far as possible.
Sorting in a job should also be kept in check: try to minimise the number of sorts in a job. Design the job so that operations around the same sort keys are combined and, if possible, the same hash keys are maintained. The most often neglected option is "don't sort (previously sorted)" in the Sort Stage; setting this option improves Sort Stage performance a great deal. In the Transformer Stage, "Preserve Sort Order" can be used to maintain the sort order of the data and reduce sorting in the job.
In a Transformer, a minimum number of Stage variables should be used: the more Stage variables there are, the lower the performance. An overloaded Transformer can choke the data flow and lead to bad performance or even job failure at some point. To minimise the load on a Transformer, avoid unnecessary function calls and implicit conversion of data types. For example, a varchar field holding a date value can be cast to the Date type by simply formatting the input value; the StringToDate function, which converts a string to a Date, is not needed.
Reduce the number of Stage variables used. In a previous project it was observed that removing 5 Stage variables and 6 function calls reduced the runtime of a job with 100 million input records from 2 hours to approximately 1 hour 10 minutes. Try to balance the load on Transformers by sharing the transformations across existing Transformers; this ensures a smooth flow of data.
If you require type casting, renaming of columns or the addition of new columns, use the Copy or Modify Stages to achieve this.
Whenever you have to use lookups against large tables, look at options such as unloading the lookup tables to datasets, or using user-defined join SQL with temporary tables to reduce the lookup volume.
The Copy Stage should be used instead of a Transformer for simple operations, including:
o acting as a job-design placeholder between Stages
o renaming columns
o dropping columns
o implicit (default) type conversions
The "upsert" option works well if the data is sorted on the primary key column of the table being loaded. Alternatively, determine whether the record already exists and handle "insert" and "update" separately.
It is sometimes possible to rearrange the order of business logic within a job flow to leverage the same sort order, partitioning and groupings.
Don't read from a Sequential File using SAME partitioning. Unless more than one source file is specified, this reads the entire file into a single partition, making the entire downstream flow run sequentially (unless it is repartitioned).
Tuesday, September 28, 2010
BusinessObjects Enterprise XI session timeout
Article refers to: BusinessObjects Enterprise XI
Symptom
In BusinessObjects Enterprise XI, the default session timeout for licenses is set to 20 minutes in the Java Central Management Console (CMC) and InfoView. How can you change the default session timeout value for licenses?
Resolution
To change the default session timeout value for licenses for the CMC and InfoView, use the following steps:
CMC
------
To change the default session timeout for the CMC:
1. Verify the following:
• The Java SDK is installed and is in the path. You need to be able to execute the 'jar' command
from the command prompt.
• You have access to the webcompadapter.war file from the Enterprise XI installation.
2. Stop the Java Application Server (JAS) on which the webcompadapter.war is deployed and navigate to the location of the webcompadapter.war file. The default location for this file is in the Enterprise XI installation directory:
Program Files\Business Objects\BusinessObjects Enterprise11\java\applications
3. From the directory from where the webcompadapter.war file is deployed, use the following command to extract the WEB-INF/web.xml file:
jar -xvf webcompadapter.war WEB-INF/web.xml
==============
NOTES:
• You need JDK 1.4 or higher to use the -u switch. Type "java -version" at the command prompt to confirm your JDK version.
• You can also use Winzip to zip the file at the command prompt by typing:
zip -u webcompadapter.war WEB-INF/web.xml
==============
4. Open web.xml in a text editor like Notepad and search for the following section:
<session-config>
<session-timeout>20</session-timeout>
</session-config>
5. Change the value between the <session-timeout> tags to the number of minutes after which the session should time out (see the example after these steps), then save web.xml.
6. Navigate back to the webcompadapter.war directory and update the webcompadapter.war file with the modified web.xml file. Use the following command:
jar -uvf webcompadapter.war WEB-INF/web.xml
7. Restart the JAS and redeploy webcompadapter.war for the update to be effective.
Upon completing these steps, the default session timeout value in the CMC is updated.
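As an illustration of step 5, setting the value to 60 would make sessions time out after one hour of inactivity (60 is used here purely as an example):
<session-config>
<session-timeout>60</session-timeout>
</session-config>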
InfoView
-----------
To change the default session timeout for InfoView:
1. Verify the following:
• The Java SDK is installed and is in the path. You need to be able to execute the 'jar' command
from the command prompt.
• You have access to the desktop.war file from the Enterprise XI installation.
2. Stop the Java Application Server (JAS) on which the desktop.war is deployed and navigate to the location of the desktop.war file. The default location for this file is in the Enterprise XI installation directory:
Program Files\Business Objects\BusinessObjects Enterprise11\java\applications
3. From the directory from where the desktop.war file is deployed, use the following command to
extract the web.xml file:
jar -xvf desktop.war WEB-INF/web.xml
4. Open web.xml in a text editor like Notepad and search for the following section:
<session-config>
<session-timeout>20</session-timeout>
</session-config>
5. Change the value between the <session-timeout> tags to the number of minutes after which the session should time out, then save web.xml.
6. Navigate back to the desktop.war directory and update the desktop.war file with the modified web.xml file. Use the following command:
jar -uvf desktop.war WEB-INF/web.xml
7. Restart the JAS and redeploy desktop.war for the update to be effective.
Upon completing these steps, the default session timeout value in InfoView is updated.
10 Don'ts on DataStage
1. Use the same configuration file for all your jobs.
You may have two nodes configured for each CPU on your DataStage server, and this allows your high volume jobs to run quickly, but it works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike: it takes a lot of time to crank up to full speed, but once you are there it is lightning fast. If you are processing a handful of rows, the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes, and a squadron of these jobs will slow down your entire DataStage batch run!
2. Use a sparse database lookup on high volumes.
This is a great way to slow down any ETL tool, and it works on server jobs or parallel jobs. The main difference is that server jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, so that for every row processed it sends a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!
3. Keep resorting your data.
Sorting is the Achilles heel of just about any ETL tool. The average ETL job is like a busy restaurant: it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people, it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food: you need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement but sorting it again anyway!
4. Design single threaded bottlenecks
This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality. A script FTP of a file that keeps an entire DataStage Parallel engine waiting. A bulk database load via a single update stream. Reading a large sequential file from a parallel job without using multiple readers per node.
5. Turn on debugging and forget that it's on
In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!
6. Let the disks look after themselves
Never look at what is happening on your disk I/O - that's a Pandora's Box of better performance! You can get some beautiful drag and slow down by ignoring your disk I/O as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times.
7. Keep Writing that Data to Disk
Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, and it can give you a modular job design. It can also slow down parallel jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a dumb ASCII file, and then the import and repartitioning to pick it up and process it again. Sequential files are the Forrest Gump of file storage, simple and practical but dumb as all hell. It costs time to write to one and time to read and parse one, so designing an end to end process that writes data to sequential files repeatedly will give you massive slow down times.
8. Validate every field
A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before being processed by DataStage. One way to slow down your job and slow down your ETL development and testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion - such as numbers in VARCHAR fields and dates in string fields and packed fields, but most don't. Adding a layer of validation you don't need should slow those jobs down.
9. Write extra steps in database code
The same phrase gets uttered on many an ETL project: "I can write that in SQL", or "I can write that in Java", or "I can do that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a surefire way to end up with a step in the end to end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".
10. Don't do Performance Testing
Do not take your highest volume jobs into performance testing, just keep the default settings, default partitioning and your first draft design and throw that into production and get the hell out of there.
Source: Vincent McBurney
The Deciding factors for ETL Tools
Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical warehousing project. ETL tools promise quick results, improved manageability and meta data integration with other common design and implementation tools. However, due to the potentially huge amounts of money involved in a tool decision, choosing the correct ETL tool for your project can present a daunting challenge. With a bit of internal questioning in advance followed by a careful review of your key needs against the choices available on the market, you should be able to choose the most effective ETL tool for your project.
ETL tools perform, as you may guess, at least three specific functions all of which focus around the movement of data from one place (file type, server, location, etc.) or system to another. More encompassing than a simple file copy process, this class of software generally reads data from an input source (flat file, relational table, message queue, etc.); passes the stream of information through either an engine- or code-based process to modify, enhance, or eliminate data elements based on the instructions of the job; and then writes the resultant data set back out to a flat file, relational table, etc. As you may have guessed, these three steps are known as extraction, transformation and loading, respectively.
You may also see these tools classified as data cleansing tools, though here we should make a careful distinction. While data cleansing definitively can encompass the ETL functions, most true data cleansing tools are not architected to perform true ETL and, vice versa, most ETL tools provide only limited true data cleansing functions.
A quick example can show the difference between the two types of tools. Suppose you have an input data file containing the full name of a customer on an account. Particularly in financial services, this data may contain any manner of formats; but for this example, let's use "John S. and Mary Smith." A true data-cleansing tool will possess integrated functions to parse this phrase into the two named account holders, "John S. Smith" and "Mary Smith." This type of complex string parsing is not generally a strong function of an ETL tool. On the other hand, the ETL tool will generally be better at efficiently looking up the name "Mary Smith" in a relational database customer table and returning the integer numeric key value related to the "Mary Smith" customer name.
There are more than 75 tools on the market that purport to have some ETL function. These may be classified according to function, engine type and development environment.
Function
The tools fall into several general categories:
EtL Tools – These "small t" tools emphasize the extraction and loading processes and should actually be referred to as data migration tools instead of full-function ETL processors.
eTL or ETl Tools – These "small e" or "small l" tools typically accept only a specific input or output type, such as a flat file source or a specific database format, but offer fairly robust transformation functions within the processing engine.
eTl Tools – These "capital T" tools perform the data transformation step relatively well, but may lack efficient connectivity to many of the common data formats that may be encountered.
ETL Tools – These "complete" ETL tools provide a rich mix of functionality and connectivity, but may be significantly more expensive than tools found in the other categories. For extremely complex projects or those attempting to process massive amounts of data, these tools may present the only true option for ensuring success in the ETL phase of the project. In other cases, this class of tool may offer features that simply are not required in the existing environment.
Engine Type
The engine type classification segments the tools by how the developed ETL processes are executed. Typically, the tools encountered fall into one of two categories: server engine or client engine. The server engine allows execution of multiple concurrent jobs from more than one developer. Server engines generally take advantage of multiple CPUs and are designed to coordinate and manage the execution of multiple simultaneous routines.
The client engine is simpler and assumes that the ETL routines are executed on the same machine as they are developed. Concurrency for multiple jobs is limited, if allowed at all, and client engine tools often do not support scaling across multiple CPU machines. Client engines are significantly cheaper to purchase due to their more limited functionality and scalability.
Development Environment
Development environments are typically split two ways: GUI-based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact*SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom scripting language developed explicitly for the optimization of ETL routines.
GUI-based ETL tools have been on the market for at least six years. The purpose of the GUI is to remove the coding layer for the developer and allow generation of ETL routines without requiring the mastery of any particular coding language. GUI tools are also useful in that they provide some self-documentation about the job flow just from the layout and positioning of the graphical elements.
Key Questions To Ask
Now that you have some background on the types of tools available, what criteria should you use to help decide which tool is best for you? We will focus the discussion around the five C's: complexity, concurrency, continuity, cost and conformity.
Complexity
Complexity is generally evaluated through the following series of questions:
How many distinct sources of information will need to be processed with this tool?
If using files, what is the expected size of the input files? Are they ASCII or EBCDIC formatted? For mainframe file sources, do they contain OCCURS, REDEFINES or packed fields? For other file sources, are the files delimited or fixed length? What character is used as a delimiter?
If using relational database sources, how many rows will need to be queried each time the job is run? Do primary and foreign key relationships exist within and/or between the tables of interest? Can the tool support the types of joins needed to execute against the tables of interest? Is it possible to see and tune/edit the SQL the tool is generating against the database?
What "magic values" exist within given data fields? Is it necessary to translate computer codes into more understandable terms?
Is there a need to join information together from dissimilar sources? Is there a requirement to match flat- file data with fields in a relational database table? Is there a requirement to join information from two different types of databases?
Does the data warehouse use the "slowly changing dimensions" concept? If so, how does the tool support appropriately updating dimensional attributes?
How well does the tool support testing and debugging while in development?
Concurrency
Concurrency evaluates both the number of developers and number of simultaneous processes that the tool must support:
How many developers will need access to the tool?
Is there a need to implement security to limit access for individuals or groups? How does the tool support this? Is it possible to leverage LDAP or other in-place security mechanisms for authentication purposes?
How are change management functions or "versioning" handled within the tool? What happens if two developers attempt to make changes to the same routine or transformation object at the same time? Can older incarnations of various routines be retained for reference? What complexities are introduced through the specific versioning or change management functions used by the tool?
What is the maximum number of simultaneous processes that are ex-pected to run? What size server would be required to support this number of concurrent processes with this tool? Can the tool support running twice that many simultaneous processes?
How flexible is the tool when executing dependent or sequential job streams? Can the tool recover from a failed job without manual intervention? How flexible is the error condition testing? How are errors logged?
Is the product multithreaded? If so, is the threading unlimited or set within the product? Are multiple jobs required to achieve high degrees of parallelization?
Continuity
Continuity involves two distinct subjects. Continuity establishes the scalability of the tool to meet future demands from system growth, and it evaluates how reliable the product is under unusual circumstances:
If data sources double in size, can this tool support that level of processing?
How easy is it to make changes to developed jobs or routines?
Can more than the "minimum" hardware and operating system requirements to support the product be deployed?
How well does the product respond to spurious failures such as network problems, server crashes or running out of disk space?
How often does the vendor upgrade the product? How painful are the upgrade installations?
Cost
Cost represents the real purchase cost and perhaps the hidden support costs associated with each product:
What is the quoted price for the product options needed to be successful with this product? Where will additional software licensing costs (adding data sources, developer seats, server CPUs, etc.) be incurred in the future?
What future software costs will be incurred to receive software support and upgrades from the vendor? Is the first year of support and maintenance included with product purchase?
Is it necessary to purchase additional hardware or software components in order to effectively utilize this tool?
Who will administer the tool? Will it be necessary to hire a dedicated person or can someone already on the team fulfill this duty?
How much training is required to learn the tool? How much additional training is required to administer it? What advanced topic courses are offered? What do the courses cost? How often and where are they offered?
How available are support resources outside of those provided by the vendor? What resellers or systems integrators support the product? Are there local users groups? Are there technical forums available on the Internet?
How long should it to take to implement our first production processes? How long is the typical learning curve on a tool like this (measure to the point of effective use of the tool, not just through the end of training!)?
Conformity
Conformity outlines how the tool behaves against existing architectural limitations or requirements:
Does this tool support the hardware platform and operating system that will be used? Are there any incompatibilities with any other software typically run on the servers (backup software, virus scanning software, etc.)?
Does this tool integrate with the network and user security protocols used? What, if any, work-arounds will be required?
Does this tool allow access to the type and location of data sources that will be used? Should mainframe files be accessed directly (and can the tool support this?) or is it necessary to extract mainframe data to files first?
Can the tool import and/or export meta data or other key information with the specific tools (data modeling tools, business intelligence tools, etc.) that are deployed in the data warehouse?
What form of documentation does the tool generate? Does this format and content meet internal documentation standards? What additional documentation will need to developed and maintained outside of the tool?
Is this purchase a departmental tool or will the tool be used across the enterprise? How would the tool architecture scale to grow from a departmental solution into one suitable for enterprise deployment?
Evaluating the ETL Purchase Decision
You will find that ETL tools can range in cost from free (freeware) to more than several hundred thousand dollars. As such, your decision delivers a widely ranging and potentially lasting impact on your systems architecture. If you are not already using a tool, you obviously first must compare the decision to purchase a tool against the "keep doing it like we're doing it today" option. Very few IT departments are able to meet the current and future development needs of the business users without substantial growth. One way to offset the need for additional headcount is to take advantage of the efficiency that modern ETL tools can bring if chosen and employed effectively.
To be blunt, you may encounter the feeling of buyer's remorse with almost any software purchase you make. Unfortunately, there is no single right answer in any tool choice today (despite what vendors tell you). As such, your goal should be to understand in advance where the expected weaknesses will occur with your chosen solution and be prepared to mitigate those consequences.
First Steps – "Deal Breakers"
Your first order of business in outlining your ETL selection is to determine if any critical business factors exist that might help limit the number of tools you will consider. Examples of these factors may include:
We must have a tool that runs on {specific operating system}.
We must be able to process {amount} GB of data nightly in a {number} hour batch window.
We must have connectivity to {type of data source}.
We must be able to run our ETL jobs on the mainframe.
We must have a tool that supports concurrent development by multiple sites.
We absolutely cannot spend more than ${x}.
The specific objective is to identify conditions that weed out solutions that simply will not be acceptable within your given architecture. Ideally, this first step should identify a simple list of the top six or seven tools that might meet the majority of your requirements. An exhaustive review of the tools is not required at this point.
Making the Final Decision
After winnowing out the clearly unacceptable choices, your next step in the process is to outline what criteria you will use in the evaluation. You may use the questions in the earlier sections of this article as a guideline; but you should be fairly specific and outline specific data sources with which you require connectivity, hardware expectations, operating system options, etc.
After building the list, generate a matrix with your decision criteria as rows and the tool choices as the columns of the matrix. Then rank each of the tools using publicly available information (Web sites, publications, etc.) as much as possible. If you do not feel you have enough information available to adequately answer your questions, contact a vendor representative to obtain the necessary information to complete the matrix. Note that managing more than two to three vendor presentations and handling appropriate follow-up can be an extremely time-consuming task. For this reason, you may wish not to begin onsite formal vendor presentations until you've created your "short list" of the two to three finalists for the selection. This approach will allow you to truly focus your energies on understanding why one tool is specifically better than just one or two others. With a wider field, you will generally find many "triangular" relationships (tool A is better than tool B at criteria 1, tool B is better than tool C at criteria 2, tool C is better than tool A at criteria 3, etc.) that confuse the overall picture.
Eventually, you should arrive at a reasonable conclusion about which tool will best meet your needs. You should definitely try to obtain a customer reference from each vendor, preferably another company doing approximately the same function as your organization (though for competitive reasons this may be hard to obtain). Focus on your key technical requirements to ensure that the customer reference you obtain is relevant and useful to you. A customer reference accessing similar systems and operating on similar hardware and operating systems is probably more applicable than another similar industry reference on a completely different platform.
In the end, of course, it all comes down to putting ink to paper and taking on the responsibility for this new project. Much like having a baby, you won't know for some time how decisions you make today turn out; and you'll inevitably have second thoughts along the way (Was I ready for this?). However, with a solid decision-making process behind you, you can expect to grow and succeed with your selection. At the final bell, it is the results that count!
Source by-->http://www.information-management.com/issues/20010301/3084-1.html
The Deciding Factors for ETL Tools
Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical warehousing project. ETL tools promise quick results, improved manageability and meta data integration with other common design and implementation tools. However, due to the potentially huge amounts of money involved in a tool decision, choosing the correct ETL tool for your project can present a daunting challenge. With a bit of internal questioning in advance followed by a careful review of your key needs against the choices available on the market, you should be able to choose the most effective ETL tool for your project.
ETL tools perform, as you may guess, at least three specific functions, all of which focus on the movement of data from one place (file type, server, location, etc.) or system to another. More encompassing than a simple file copy process, this class of software generally reads data from an input source (flat file, relational table, message queue, etc.); passes the stream of information through either an engine- or code-based process to modify, enhance or eliminate data elements based on the instructions of the job; and then writes the resultant data set back out to a flat file, relational table, etc. As you may have guessed, these three steps are known as extraction, transformation and loading, respectively.
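To make those three steps concrete, here is a minimal sketch in Python; the file name, column names and target table are invented for illustration, and a real ETL tool (or a DataStage job) would of course generate far more robust logic than this.

import csv
import sqlite3

def extract(path):
    # Extract: read records from a delimited flat-file source.
    with open(path, newline="") as source:
        for row in csv.DictReader(source):
            yield row

def transform(rows):
    # Transform: cleanse and standardize the stream, dropping unusable records.
    for row in rows:
        if not row.get("customer_id"):      # eliminate rows with no key
            continue
        country = row.get("country_code", "").strip().upper()
        yield (int(row["customer_id"]), country)

def load(records, db_path="warehouse.db"):
    # Load: write the resulting data set to a relational target table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customer_stage "
                "(customer_id INTEGER, country_code TEXT)")
    con.executemany("INSERT INTO customer_stage VALUES (?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("customers.csv")))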
You may also see these tools classified as data cleansing tools, though here we should make a careful distinction. While data cleansing can certainly encompass the ETL functions, most true data cleansing tools are not architected to perform true ETL and, vice versa, most ETL tools provide only limited true data cleansing functions.
A quick example can show the difference between the two types of tools. Suppose you have an input data file containing the full name of a customer on an account. Particularly in financial services, this data may contain any manner of formats; but for this example, let's use "John S. and Mary Smith." A true data-cleansing tool will possess integrated functions to parse this phrase into the two named account holders, "John S. Smith" and "Mary Smith." This type of complex string parsing is not generally a strong function of an ETL tool. On the other hand, the ETL tool will generally be better at efficiently looking up the name "Mary Smith" in a relational database customer table and returning the integer numeric key value related to the "Mary Smith" customer name.
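A rough sketch of that division of labor, again in Python: the splitting rule below is a deliberately crude stand-in for a cleansing tool's parsing logic, and the in-memory customer table is invented purely for illustration.

import sqlite3

def split_account_holders(full_name):
    # Crude stand-in for a cleansing tool's name parsing:
    # "John S. and Mary Smith" -> ["John S. Smith", "Mary Smith"]
    rest, surname = full_name.rsplit(" ", 1)
    holders = [part.strip() for part in rest.split(" and ")]
    return [h if h.endswith(surname) else h + " " + surname for h in holders]

def lookup_customer_key(con, name):
    # The kind of efficient key lookup an ETL tool is typically good at.
    row = con.execute("SELECT customer_key FROM customer WHERE customer_name = ?",
                      (name,)).fetchone()
    return row[0] if row else None

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_key INTEGER, customer_name TEXT)")
con.execute("INSERT INTO customer VALUES (42, 'Mary Smith')")

print(split_account_holders("John S. and Mary Smith"))   # ['John S. Smith', 'Mary Smith']
print(lookup_customer_key(con, "Mary Smith"))             # 42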
There are more than 75 tools on the market that purport to have some ETL function. These may be classified according to function, engine type and development environment.
Function
The tools fall into several general categories:
EtL Tools – These "small t" tools emphasize the extraction and loading processes and should actually be referred to as data migration tools instead of full-function ETL processors.
eTL or ETl Tools – These "small e" or "small l" tools typically accept only a specific input or output type, such as a flat file source or a specific database format, but offer fairly robust transformation functions within the processing engine.
eTl Tools – These "capital T" tools perform the data transformation step relatively well, but may lack efficient connectivity to many of the common data formats that may be encountered.
ETL Tools – These "complete" ETL tools provide a rich mix of functionality and connectivity, but may be significantly more expensive than tools found in the other categories. For extremely complex projects or those attempting to process massive amounts of data, these tools may present the only true option for ensuring success in the ETL phase of the project. In other cases, this class of tool may offer features that simply are not required in the existing environment.
Engine Type
The engine type classification segments the tools by how the developed ETL processes are executed. Typically, the tools encountered fall into one of two categories: server engine or client engine. The server engine allows execution of multiple concurrent jobs from more than one developer. Server engines generally take advantage of multiple CPUs and are designed to coordinate and manage the execution of multiple simultaneous routines.
The client engine is simpler and assumes that the ETL routines are executed on the same machine as they are developed. Concurrency for multiple jobs is limited, if allowed at all, and client engine tools often do not support scaling across multiple CPU machines. Client engines are significantly cheaper to purchase due to their more limited functionality and scalability.
Development Environment
Development environments are typically split two ways: GUI-based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact-SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom scripting language developed explicitly for the optimization of ETL routines.
GUI-based ETL tools have been on the market for at least six years. The purpose of the GUI is to remove the coding layer for the developer and allow generation of ETL routines without requiring the mastery of any particular coding language. GUI tools are also useful in that they provide some self-documentation about the job flow just from the layout and positioning of the graphical elements.
Key Questions To Ask
Now that you have some background on the types of tools available, what criteria should you use to help decide which tool is best for you? We will focus the discussion around the five C's: complexity, concurrency, continuity, cost and conformity.
Complexity
Complexity is generally evaluated through the following series of questions:
How many distinct sources of information will need to be processed with this tool?
If using files, what is the expected size of the input files? Are they ASCII or EBCDIC formatted? For mainframe file sources, do they contain OCCURS, REDEFINES or packed fields? For other file sources, are the files delimited or fixed length? What character is used as a delimiter?
If using relational database sources, how many rows will need to be queried each time the job is run? Do primary and foreign key relationships exist within and/or between the tables of interest? Can the tool support the types of joins needed to execute against the tables of interest? Is it possible to see and tune/edit the SQL the tool is generating against the database?
What "magic values" exist within given data fields? Is it necessary to translate computer codes into more understandable terms?
Is there a need to join information together from dissimilar sources? Is there a requirement to match flat-file data with fields in a relational database table? Is there a requirement to join information from two different types of databases?
Does the data warehouse use the "slowly changing dimensions" concept? If so, how does the tool support appropriately updating dimensional attributes? (A short illustrative sketch follows this list of questions.)
How well does the tool support testing and debugging while in development?
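For readers less familiar with slowly changing dimensions, the sketch below shows one common pattern (a Type 2 change, where the current dimension row is expired and a new version is inserted). The table layout and column names are invented, and any given ETL tool or warehouse design may implement this differently.

import sqlite3
from datetime import date

def apply_scd_type2(con, customer_id, new_city, effective=None):
    # Type 2 change: close out the current row and insert a new version,
    # preserving history. Table and column names are illustrative only.
    effective = effective or date.today().isoformat()
    current = con.execute(
        "SELECT city FROM customer_dim WHERE customer_id = ? AND current_flag = 1",
        (customer_id,)).fetchone()
    if current and current[0] != new_city:
        con.execute("UPDATE customer_dim SET current_flag = 0, end_date = ? "
                    "WHERE customer_id = ? AND current_flag = 1",
                    (effective, customer_id))
        con.execute("INSERT INTO customer_dim "
                    "(customer_id, city, start_date, end_date, current_flag) "
                    "VALUES (?, ?, ?, NULL, 1)",
                    (customer_id, new_city, effective))
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer_dim (customer_id INTEGER, city TEXT, "
            "start_date TEXT, end_date TEXT, current_flag INTEGER)")
con.execute("INSERT INTO customer_dim VALUES (7, 'Boston', '2000-01-01', NULL, 1)")
apply_scd_type2(con, 7, 'Philadelphia')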
Concurrency
Concurrency evaluates both the number of developers and number of simultaneous processes that the tool must support:
How many developers will need access to the tool?
Is there a need to implement security to limit access for individuals or groups? How does the tool support this? Is it possible to leverage LDAP or other in-place security mechanisms for authentication purposes?
How are change management functions or "versioning" handled within the tool? What happens if two developers attempt to make changes to the same routine or transformation object at the same time? Can older incarnations of various routines be retained for reference? What complexities are introduced through the specific versioning or change management functions used by the tool?
What is the maximum number of simultaneous processes that are expected to run? What size server would be required to support this number of concurrent processes with this tool? Can the tool support running twice that many simultaneous processes?
How flexible is the tool when executing dependent or sequential job streams? Can the tool recover from a failed job without manual intervention? How flexible is the error condition testing? How are errors logged?
Is the product multithreaded? If so, is the threading unlimited or set within the product? Are multiple jobs required to achieve high degrees of parallelization?
Continuity
Continuity involves two distinct subjects: the scalability of the tool to meet future demands from system growth, and the reliability of the product under unusual circumstances:
If data sources double in size, can this tool support that level of processing?
How easy is it to make changes to developed jobs or routines?
Can the product be deployed on hardware and operating system configurations that exceed its stated "minimum" requirements?
How well does the product respond to spurious failures such as network problems, server crashes or running out of disk space?
How often does the vendor upgrade the product? How painful are the upgrade installations?
Cost
Cost represents the real purchase cost and perhaps the hidden support costs associated with each product:
What is the quoted price for the product options needed to be successful with this product? Where will additional software licensing costs (adding data sources, developer seats, server CPUs, etc.) be incurred in the future?
What future software costs will be incurred to receive software support and upgrades from the vendor? Is the first year of support and maintenance included with product purchase?
Is it necessary to purchase additional hardware or software components in order to effectively utilize this tool?
Who will administer the tool? Will it be necessary to hire a dedicated person or can someone already on the team fulfill this duty?
How much training is required to learn the tool? How much additional training is required to administer it? What advanced topic courses are offered? What do the courses cost? How often and where are they offered?
How available are support resources outside of those provided by the vendor? What resellers or systems integrators support the product? Are there local users groups? Are there technical forums available on the Internet?
How long should it take to implement our first production processes? How long is the typical learning curve on a tool like this (measured to the point of effective use of the tool, not just through the end of training!)?
Conformity
Conformity outlines how the tool behaves against existing architectural limitations or requirements:
Does this tool support the hardware platform and operating system that will be used? Are there any incompatibilities with any other software typically run on the servers (backup software, virus scanning software, etc.)?
Does this tool integrate with the network and user security protocols used? What, if any, work-arounds will be required?
Does this tool allow access to the type and location of data sources that will be used? Should mainframe files be accessed directly (and can the tool support this?) or is it necessary to extract mainframe data to files first?
Can the tool import and/or export meta data or other key information with the specific tools (data modeling tools, business intelligence tools, etc.) that are deployed in the data warehouse?
What form of documentation does the tool generate? Does this format and content meet internal documentation standards? What additional documentation will need to be developed and maintained outside of the tool?
Is this purchase a departmental tool or will the tool be used across the enterprise? How would the tool architecture scale to grow from a departmental solution into one suitable for enterprise deployment?
Evaluating the ETL Purchase Decision
You will find that ETL tools can range in cost from free (freeware) to more than several hundred thousand dollars. As such, your decision delivers a wide-ranging and potentially lasting impact on your systems architecture. If you are not already using a tool, you obviously first must compare the decision to purchase a tool against the "keep doing it like we're doing it today" option. Very few IT departments are able to meet the current and future development needs of the business users without substantial growth. One way to offset the need for additional headcount is to take advantage of the efficiency that modern ETL tools can bring if chosen and employed effectively.
To be blunt, you may encounter the feeling of buyer's remorse with almost any software purchase you make. Unfortunately, there is no single right answer in any tool choice today (despite what vendors tell you). As such, your goal should be to understand in advance where the expected weaknesses will occur with your chosen solution and be prepared to mitigate those consequences.
First Steps – "Deal Breakers"
Your first order of business in outlining your ETL selection is to determine if any critical business factors exist that might help limit the number of tools you will consider. Examples of these factors may include:
We must have a tool that runs on {specific operating system}.
We must be able to process {amount} GB of data nightly in a {number} hour batch window.
We must have connectivity to {type of data source}.
We must be able to run our ETL jobs on the mainframe.
We must have a tool that supports concurrent development by multiple sites.
We absolutely cannot spend more than ${x}.
The specific objective is to identify conditions that weed out solutions that simply will not be acceptable within your given architecture. Ideally, this first step should identify a simple list of the top six or seven tools that might meet the majority of your requirements. An exhaustive review of the tools is not required at this point.
Making the Final Decision
After winnowing out the clearly unacceptable choices, your next step in the process is to outline what criteria you will use in the evaluation. You may use the questions in the earlier sections of this article as a guideline; but you should be fairly specific and outline specific data sources with which you require connectivity, hardware expectations, operating system options, etc.
After building the list, generate a matrix with your decision criteria as rows and the tool choices as the columns of the matrix. Then rank each of the tools using publicly available information (Web sites, publications, etc.) as much as possible. If you do not feel you have enough information available to adequately answer your questions, contact a vendor representative to obtain the necessary information to complete the matrix. Note that managing more than two to three vendor presentations and handling appropriate follow-up can be an extremely time-consuming task. For this reason, you may wish not to begin onsite formal vendor presentations until you've created your "short list" of the two to three finalists for the selection. This approach will allow you to truly focus your energies on understanding why one tool is specifically better than just one or two others. With a wider field, you will generally find many "triangular" relationships (tool A is better than tool B at criteria 1, tool B is better than tool C at criteria 2, tool C is better than tool A at criteria 3, etc.) that confuse the overall picture.
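If it helps to keep the ranking objective, the matrix can be reduced to a simple weighted-score calculation; the criteria, weights and ratings below are placeholders for whatever your own evaluation produces, not a recommendation of any particular tool.

# Hypothetical decision matrix: criteria as rows, candidate tools as columns.
weights = {                      # criterion -> weight (placeholder values)
    "connectivity to our sources": 5,
    "concurrent development support": 3,
    "scalability headroom": 4,
    "total cost of ownership": 4,
    "fit with existing architecture": 3,
}
ratings = {                      # tool -> criterion -> rating on a 1-5 scale (placeholders)
    "Tool A": {"connectivity to our sources": 4, "concurrent development support": 3,
               "scalability headroom": 5, "total cost of ownership": 2,
               "fit with existing architecture": 4},
    "Tool B": {"connectivity to our sources": 5, "concurrent development support": 4,
               "scalability headroom": 3, "total cost of ownership": 3,
               "fit with existing architecture": 3},
}
for tool, scores in ratings.items():
    total = sum(weights[c] * scores[c] for c in weights)
    print(tool, "weighted score:", total)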
Eventually, you should arrive at a reasonable conclusion about which tool will best meet your needs. You should definitely try to obtain a customer reference from each vendor, preferably another company doing approximately the same function as your organization (though for competitive reasons this may be hard to obtain). Focus on your key technical requirements to ensure that the customer reference you obtain is relevant and useful to you. A customer reference accessing similar systems and operating on similar hardware and operating systems is probably more applicable than another similar industry reference on a completely different platform.
In the end, of course, it all comes down to putting ink to paper and taking on the responsibility for this new project. Much like having a baby, you won't know for some time how decisions you make today turn out; and you'll inevitably have second thoughts along the way (Was I ready for this?). However, with a solid decision-making process behind you, you can expect to grow and succeed with your selection. At the final bell, it is the results that count!
Source: http://www.information-management.com/issues/20010301/3084-1.html
ETL tools perform, as you may guess, at least three specific functions all of which focus around the movement of data from one place (file type, server, location, etc.) or system to another. More encompassing than a simple file copy process, this class of software generally reads data from an input source (flat file, relational table, message queue, etc.); passes the stream of information through either an engine- or code-based process to modify, enhance, or eliminate data elements based on the instructions of the job; and then writes the resultant data set back out to a flat file, relational table, etc. As you may have guessed, these three steps are known as extraction, transformation and loading, respectively.
You may also see these tools classified as data cleansing tools, though here we should make a careful distinction. While data cleansing definitively can encompass the ETL functions, most true data cleansing tools are not architected to perform true ETL and, vice versa, most ETL tools provide only limited true data cleansing functions.
A quick example can show the difference between the two types of tools. Suppose you have an input data file containing the full name of a customer on an account. Particularly in financial services, this data may contain any manner of formats; but for this example, let's use "John S. and Mary Smith." A true data-cleansing tool will possess integrated functions to parse this phrase into the two named account holders, "John S. Smith" and "Mary Smith." This type of complex string parsing is not generally a strong function of an ETL tool. On the other hand, the ETL tool will generally be better at efficiently looking up the name "Mary Smith" in a relational database customer table and returning the integer numeric key value related to the "Mary Smith" customer name.
There are more than 75 tools on the market that purport to have some ETL function. These may be classified according to function, engine type and development environment.
Function
The tools fall into several general categories:
EtL Tools – These "small t" tools emphasize the extraction and loading processes and should actually be referred to as data migration tools instead of full-function ETL processors.
eTL or ETl Tools – These "small e" or "small l" tools typically accept only a specific input or output type, such as a flat file source or a specific database format, but offer fairly robust transformation functions within the processing engine.
eTl Tools – These "capital T" tools perform the data transformation step relatively well, but may lack efficient connectivity to many of the common data formats that may be encountered.
ETL Tools – These "complete" ETL tools provide a rich mix of functionality and connectivity, but may be significantly more expensive than tools found in the other categories. For extremely complex projects or those attempting to process massive amounts of data, these tools may present the only true option for ensuring success in the ETL phase of the project. In other cases, this class of tool may offer features that simply are not required in the existing environment.
Engine Type
The engine type classification segments the tools by how the developed ETL processes are executed. Typically, the tools encountered fall into one of two categories: server engine or client engine. The server engine allows execution of multiple concurrent jobs from more than one developer. Server engines generally take advantage of multiple CPUs and are designed to coordinate and manage the execution of multiple simultaneous routines.
The client engine is simpler and assumes that the ETL routines are executed on the same machine as they are developed. Concurrency for multiple jobs is limited, if allowed at all, and client engine tools often do not support scaling across multiple CPU machines. Client engines are significantly cheaper to purchase due to their more limited functionality and scalability.
Development Environment
Development environments are typically split two ways: GUI- based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact*SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom- scripting language developed explicitly for the optimization of ETL routines.
GUI-based ETL tools have been on the market for at least six years. The purpose of the GUI is to remove the coding layer for the developer and allow generation of ETL routines without requiring the mastery of any particular coding language. GUI tools are also useful in that they provide some self-documentation about the job flow just from the layout and positioning of the graphical elements.
Key Questions To Ask
Now that you have some background on the types of tools available, what criteria should you use to help decide which tool is best for you? We will focus the discussion around the five C's: complexity, concurrency, continuity, cost and conformity.
Complexity
Complexity is generally evaluated through the following series of questions:
How many distinct sources of information will need to be processed with this tool?
If using files, what is the expected size of the input files? Are they ASCII or EBCDIC formatted? For mainframe file sources, do they contain OCCURS, REDEFINES or packed fields? For other file sources, are the files delimited or fixed length? What character is used as a delimiter?
If using relational database sources, how many rows will need to be queried each time the job is run? Do primary and foreign key relationships exist within and/or between the tables of interest? Can the tool support the types of joins needed to execute against the tables of interest? Is it possible to see and tune/edit the SQL the tool is generating against the database?
What "magic values" exist within given data fields? Is it necessary to translate computer codes into more understandable terms?
Is there a need to join information together from dissimilar sources? Is there a requirement to match flat- file data with fields in a relational database table? Is there a requirement to join information from two different types of databases?
Does the data warehouse use the "slowly changing dimensions" concept? If so, how does the tool support appropriately updating dimensional attributes?
How well does the tool support testing and debugging while in development?
Concurrency
Concurrency evaluates both the number of developers and number of simultaneous processes that the tool must support:
How many developers will need access to the tool?
Is there a need to implement security to limit access for individuals or groups? How does the tool support this? Is it possible to leverage LDAP or other in-place security mechanisms for authentication purposes?
How are change management functions or "versioning" handled within the tool? What happens if two developers attempt to make changes to the same routine or transformation object at the same time? Can older incarnations of various routines be retained for reference? What complexities are introduced through the specific versioning or change management functions used by the tool?
What is the maximum number of simultaneous processes that are ex-pected to run? What size server would be required to support this number of concurrent processes with this tool? Can the tool support running twice that many simultaneous processes?
How flexible is the tool when executing dependent or sequential job streams? Can the tool recover from a failed job without manual intervention? How flexible is the error condition testing? How are errors logged?
Is the product multithreaded? If so, is the threading unlimited or set within the product? Are multiple jobs required to achieve high degrees of parallelization?
Continuity
Continuity involves two distinct subjects. Continuity establishes the scalability of the tool to meet future demands from system growth, and it evaluates how reliable the product is under unusual circumstances:
If data sources double in size, can this tool support that level of processing?
How easy is it to make changes to developed jobs or routines?
Can more than the "minimum" hardware and operating system requirements to support the product be deployed?
How well does the product respond to spurious failures such as network problems, server crashes or running out of disk space?
How often does the vendor upgrade the product? How painful are the upgrade installations?
Cost
Cost represents the real purchase cost and perhaps the hidden support costs associated with each product:
What is the quoted price for the product options needed to be successful with this product? Where will additional software licensing costs (adding data sources, developer seats, server CPUs, etc.) be incurred in the future?
What future software costs will be incurred to receive software support and upgrades from the vendor? Is the first year of support and maintenance included with product purchase?
Is it necessary to purchase additional hardware or software components in order to effectively utilize this tool?
Who will administer the tool? Will it be necessary to hire a dedicated person or can someone already on the team fulfill this duty?
How much training is required to learn the tool? How much additional training is required to administer it? What advanced topic courses are offered? What do the courses cost? How often and where are they offered?
How available are support resources outside of those provided by the vendor? What resellers or systems integrators support the product? Are there local users groups? Are there technical forums available on the Internet?
How long should it to take to implement our first production processes? How long is the typical learning curve on a tool like this (measure to the point of effective use of the tool, not just through the end of training!)?
Conformity
Conformity outlines how the tool behaves against existing architectural limitations or requirements:
Does this tool support the hardware platform and operating system that will be used? Are there any incompatibilities with any other software typically run on the servers (backup software, virus scanning software, etc.)?
Does this tool integrate with the network and user security protocols used? What, if any, work-arounds will be required?
Does this tool allow access to the type and location of data sources that will be used? Should mainframe files be accessed directly (and can the tool support this?) or is it necessary to extract mainframe data to files first?
Can the tool import and/or export meta data or other key information with the specific tools (data modeling tools, business intelligence tools, etc.) that are deployed in the data warehouse?
What form of documentation does the tool generate? Does this format and content meet internal documentation standards? What additional documentation will need to developed and maintained outside of the tool?
Is this purchase a departmental tool or will the tool be used across the enterprise? How would the tool architecture scale to grow from a departmental solution into one suitable for enterprise deployment?
Evaluating the ETL Purchase Decision
You will find that ETL tools can range in cost from free (freeware) to more than several hundred thousand dollars. As such, your decision delivers a widely ranging and potentially lasting impact on your systems architecture. If you are not already using a tool, you obviously first must compare the decision to purchase a tool against the "keep doing it like we're doing it today" option. Very few IT departments are able to meet the current and future development needs of the business users without substantial growth. One way to offset the need for additional headcount is to take advantage of the efficiency that modern ETL tools can bring if chosen and employed effectively.
To be blunt, you may encounter the feeling of buyer's remorse with almost any software purchase you make. Unfortunately, there is no single right answer in any tool choice today (despite what vendors tell you). As such, your goal should be to understand in advance where the expected weaknesses will occur with your chosen solution and be prepared to mitigate those consequences.
First Steps – "Deal Breakers"
Your first order of business in outlining your ETL selection is to determine if any critical business factors exist that might help limit the number of tools you will consider. Examples of these factors may include:
We must have a tool that runs on {specific operating system}.
We must be able to process {amount} GB of data nightly in a {number} hour batch window.
We must have connectivity to {type of data source}.
We must be able to run our ETL jobs on the mainframe.
We must have a tool that supports concurrent development by multiple sites.
We absolutely cannot spend more than ${x}.
The specific objective is to identify conditions that weed out solutions that simply will not be acceptable within your given architecture. Ideally, this first step should identify a simple list of the top six or seven tools that might meet the majority of your requirements. An exhaustive review of the tools is not required at this point.
Making the Final Decision
After winnowing out the clearly unacceptable choices, your next step in the process is to outline what criteria you will use in the evaluation. You may use the questions in the earlier sections of this article as a guideline; but you should be fairly specific and outline specific data sources with which you require connectivity, hardware expectations, operating system options, etc.
After building the list, generate a matrix with your decision criteria as rows and the tool choices as the columns of the matrix. Then rank each of the tools using publicly available information (Web sites, publications, etc.) as much as possible. If you do not feel you have enough information available to adequately answer your questions, contact a vendor representative to obtain the necessary information to complete the matrix. Note that managing more than two to three vendor presentations and handling appropriate follow-up can be an extremely time-consuming task. For this reason, you may wish not to begin formal onsite vendor presentations until you've created your "short list" of the two to three finalists for the selection. This approach will allow you to truly focus your energies on understanding why one tool is specifically better than just one or two others. With a wider field, you will generally find many "triangular" relationships (tool A is better than tool B on criterion 1, tool B is better than tool C on criterion 2, tool C is better than tool A on criterion 3, etc.) that confuse the overall picture.
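As an illustration of the matrix approach, here is a minimal Python sketch that weights each criterion and totals the scores per tool. The criteria, weights, tool names and scores are invented placeholders for the sake of the example, not an assessment of any real product.

# Hypothetical weighted decision matrix: criteria as rows, candidate tools as columns.
criteria_weights = {                 # criterion: weight (importance to your project)
    "Mainframe connectivity": 5,
    "Batch-window throughput": 4,
    "Metadata exchange": 3,
    "Documentation output": 2,
}
tool_scores = {                      # scores of 1 (poor) to 5 (excellent) per criterion
    "Tool A": [5, 3, 4, 2],
    "Tool B": [2, 5, 3, 4],
    "Tool C": [4, 4, 2, 5],
}

weights = list(criteria_weights.values())
ranking = sorted(
    ((sum(w * s for w, s in zip(weights, scores)), tool)
     for tool, scores in tool_scores.items()),
    reverse=True,
)
for total, tool in ranking:
    print(f"{tool}: weighted score {total}")

A spreadsheet does the same job; the point is simply to make the weighting explicit so the "triangular" comparisons described above resolve into a single ranked list.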
Eventually, you should arrive at a reasonable conclusion about which tool will best meet your needs. You should definitely try to obtain a customer reference from each vendor, preferably another company doing approximately the same function as your organization (though for competitive reasons this may be hard to obtain). Focus on your key technical requirements to ensure that the customer reference you obtain is relevant and useful to you. A customer reference accessing similar systems and operating on similar hardware and operating systems is probably more applicable than another similar industry reference on a completely different platform.
In the end, of course, it all comes down to putting ink to paper and taking on the responsibility for this new project. Much like having a baby, you won't know for some time how decisions you make today turn out; and you'll inevitably have second thoughts along the way (Was I ready for this?). However, with a solid decision-making process behind you, you can expect to grow and succeed with your selection. At the final bell, it is the results that count!
Source: http://www.information-management.com/issues/20010301/3084-1.html
Business Objects Tips and Tricks
Source: www.upenn.edu
Retrieving Data
- When using the "in list" feature in Deski, if you are typing items in yourself (rather than picking them from a list of values), the items in your list should be separated by commas. Do not type in any quote marks -- Desktop Intelligence will add them itself -- and do not put a space after a comma. However, if you are filling in a list of values in response to a prompt, you should separate the values with a semicolon, rather than a comma (but still no space between values).
- If you notice that the objects in your universe aren't refreshed, be sure that the box labeled "Use in Offline Mode" is not checked when you first log on to Desktop Intelligence.
- If you get a "No data to fetch" message, and you're pretty sure you should be getting something, check to make sure you have configured your conditions correctly. For example: everything in the student data collection uses upper case letters. Another example: if you are entering a condition for a date in any of the financial data collections, use the full year, in other words, 06-01-2008 (instead of 06-01-08)
- If you get a "Table or View does not exist" message, check to make sure the data collection is available on the warehouse status page.
- To determine how many rows your query retrieved, choose "View" from the Data menu, then click on the Definition tab of the Data Manager window. The most recent execution of the query will be listed first, by date and time the query was executed, along with the number of rows retrieved, and the amount of time the database took to execute the query (please note that this time is not the time elapsed on your desktop machine).
- When first testing a query you may wish to limit the number of rows retrieved to determine if your results are as expected. You can do this by clicking the Options button at the lower left corner of the Query Panel. Select "10 Rows", "20 Rows" or enter another value by clicking "Other" in the Partial Results area of the window. Remember to return the setting to "Default Value" when you are ready to retrieve all rows matching your query criteria.
- You can use wildcards in conjunction with the "Matches Pattern" operator to retrieve data that is like a value, rather than exactly equal to it. For example, using a percent sign ( % ) in the condition COA_Fund Matches Pattern 5% will retrieve all funds (to which you have access) from 500000-599999. Similarly, to match exactly one character in a given position, you can use an underscore ( _ ). For example, COA_ORG Matches Pattern '91_2' will retrieve ORG values 9132, 9142 and 9152.
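If it helps to reason about what a pattern will return before running a query, the short Python sketch below mimics the two wildcards (% for any run of characters, _ for exactly one character) against a hypothetical list of values. It is only an offline illustration of the matching rules, not part of Desktop Intelligence.

import re

def matches_pattern(value, pattern):
    # '%' matches any run of characters, '_' matches exactly one character;
    # all other characters are treated literally.
    parts = [".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern]
    return re.fullmatch("".join(parts), value) is not None

orgs = ["9132", "9142", "9152", "9212"]                   # hypothetical ORG values
funds = ["500123", "512345", "600001"]                    # hypothetical fund values
print([o for o in orgs if matches_pattern(o, "91_2")])    # ['9132', '9142', '9152']
print([f for f in funds if matches_pattern(f, "5%")])     # ['500123', '512345']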
- If you have multiple queries in one report (one .rep file), you may want to rename your queries to better describe their use, so that when you go to edit or refresh them, you'll have a better idea of what data you'll be retrieving. To do this, go to the Data menu, and select "View Data" to display the Data Manager window. The General section of the Definition tab contains a field called "Name", which you can use to rename the query from the default "Query x with Universe" (e.g., "Query 1 with FINQUERY").
- If you are trying to create a User Defined Object (UDO), but the universe you are in won't let you (in the Query Panel, clicking on User Objects does nothing, or Deski suddenly quits) try this:
Log out, delete the .udo file for that universe, log back in and try creating the User Objects again. (Caution! doing this will remove any user objects you previously created for this universe; you will have to re-create them.)
User Defined Objects all live on your local computer. They reside within one file per universe, in the following path:
C:\Documents and Settings\<user_name>\Application Data\Business Objects\Business Objects 11.5\Universes\
(where <user_name> is your user name on your computer)
The file names will be <universe_name>.udo
(where <universe_name> is the name of the universe, for example, "FINQUERY.udo" holds your local User Defined Objects for the FINQUERY universe.)
- You can re-use queries between similar universes, so long as all the result objects and conditions from your query exist in the universe to which you'd like to switch. To do this, go to the Data menu, and select "View Data" to display the Data Manager window. The General section of the Definition tab contains a field called "Universe", which has a small button with "..." directly to the right. When you first display this window, the Universe field will display the universe against which the query is currently directed. Click the "..." button to display a list of all other universes to which you have access. Choose the one to switch your query to, and click OK. As long as all the objects in your query are available in the new target universe, that universe name will now appear in the Universe field. (This is convenient for switching from the FINQUERY to the FINQUERY Template universe, for example.)
- If you have taken the time to create a query, but want to be able to save your work without running it, you have two options. The most convenient is to click the "Save and Close" button in the Query Panel to simply save the structure of your report. If you do this, be sure to then use "Save As" to save it with a distinctive name, in the location you wish. You can then open the query and simply click the Refresh button to retrieve your data. Alternatively, once you've built your query, you can click the Options button at the bottom left of the Query Panel, click the "Do Not Retrieve Data" checkbox, click OK, and then click Run. Again, only the structure of your report will be displayed, which you can then save. Keep in mind, though, that you must uncheck the Do Not Retrieve Data checkbox before you choose to run the query in the future.
- By default, Desktop Intelligence saves queries (.rep files) in \My Documents\My Business Objects Documents\userDocs. If you work with several universes, you may find it convenient to create subfolders within that directory to better organize your work.
- If you want to export the data you've retrieved for use in another application, you have a couple of options: Save As: Desktop Intelligence allows you to use the "Save As" feature to save documents in Excel, Adobe Acrobat PDF or CSV formats. Once you've refreshed a report, click on the File menu and select Save As. Select the file format you'd like from the "Save as type" drop-down list. Desktop Intelligence will then save your file with the data as it appears on the screen. This means the output will be saved and reflect section breaks, filters, special formatting, etc. The results of any variables or calculations will be saved as text, not underlying formulae. If your report has multiple tabs, and you choose to save as Excel, each tab will appear as a separate worksheet within one workbook. Similarly, multiple tabs will be accessible individually in Adobe Acrobat (PDF) files. InfoView users can perform a similar action. If you experience undesired changes in Excel color formatting when overwriting an Excel file previously saved from Deski, save it instead to a different file name.
(Note: This feature was not available in Business Objects 5.1.4 - the work around was to use the Business Objects Edit menu and select "Copy All", then open a blank worksheet in Excel and choose "Paste Special" from the Excel Edit menu. Choose the "Unformatted Text" option and click Ok to paste your data into the worksheet.)
Export: You can also export the raw data retrieved in your query (retaining no Business Objects post-query formatting or calculations), in a variety of file formats. From the Data menu, select "View Data" to display the Data Manager window. Click the Export button at the bottom of the Results tab, and designate file name and path, and file format (options include text formats, and .xls for Excel). If you choose the All files (*.*) format, you can also control the field delimiters used. [A word of caution about using the .xls format: some versions of Excel object to the field names used in the Business Objects classes. If you encounter an error when exporting/importing using the Excel format, try it again with either the All Files or the Text Files format, and then use the Excel Import Wizard to bring the data into your spreadsheet.]
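If an export saved with the All files (*.*) option will later be processed outside of Excel, the minimal Python sketch below shows one way to read it. The file name, the pipe delimiter and the assumption that the first row carries the object names are hypothetical choices for illustration; adjust them to match the options you picked in the Export dialog.

import csv

# Hypothetical export produced via Data Manager > Export with a pipe delimiter.
with open("finquery_export.txt", newline="", encoding="utf-8") as handle:
    reader = csv.reader(handle, delimiter="|")
    header = next(reader)                  # assumes the first row holds the object names
    for row in reader:
        print(dict(zip(header, row)))      # one dictionary per exported data row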
Report Formatting
- If you're having a problem displaying or printing pages after the first page of a report, and you have an image of some sort as part of your report layout, check to see whether the image is part of a table with other elements. If it is, move it outside the table and see if this fixes your problem.
- To minimize the processing time for your query, consider using the Desktop Intelligence toolbar icons or the Slice and Dice panel (rather than the query itself) to perform the following operations on the desktop once your query has finished:
Sort the data: You can place sorts on multiple objects using the Slice and Dice panel. You may also add breaks or report sections based on objects, and then sort those as well.
Filter the data: Once you have retrieved all the data meeting your conditions you may wish to further filter the data. Filtering via the Slice and Dice Panel enables you to include or exclude specific values retrieved. This is particularly useful when working with a large data set that you wish to manipulate many ways before deciding upon the final report format. You may set, change and remove filters as often as you wish without having to rerun the original query.
Perform calculations: You may add calculations to result objects, such as Sum, Count, Average, Minimum, Maximum and Percentage. This allows you to both view the data at the level of detail retrieved by your query and calculate at break levels you have designated. Available calculations are dependent upon the datatype of the object. More advanced calculations are also available using report Variables.
- Duplicate rows - to show or not to show
Unless you intentionally go to the Options in the Query Panel and select “no duplicate rows”, Business Objects will return duplicate rows if they exist in the database.
However, Desktop Intelligence will by default show the aggregate of those rows in the report. In other words, the report may not be showing you all the underlying data. You can see all the rows if you go to the “View Data” tool, but if everything on two rows is identical in every column, then Deski will only display one row in the resulting table on the report.
If this isn’t what you want it to do, to force Deski to show all rows, do the following:
- Right click on the table in your report, and select Format Table.
- On the General tab, check the box that says “Avoid Duplicate Rows Aggregation” and then click OK.
Password Issues
Desktop Intelligence and InfoView do check your password when you log in. However, you must be sure to use the password change application to synchronize your Data Warehouse, Business Objects and Business Objects DB Credentials each time you change your Data Warehouse password. (The same applies if you are using Business Objects to query other Oracle databases, such as Penn Community.) If your passwords are not synchronized, your query attempts will return errors:
- If you check only the Business Objects boxes in the PassWord Changer application, and not the Data Warehouse and other database boxes, when you submit a query you will get a message saying that your access is denied because of password errors.
- If you check the database boxes (i.e., Data Warehouse) and only one of the Business Objects boxes, you will get this error:
Connection or SQL sentence error: (DA0005)
Exception: CS, Unexpected behavior
The resolution for all of the above situations is to go to the password change application and reset your password, and be sure to check ALL of the boxes.
Installation Issues
- Symptoms are: You retrieve the installation files and unzip them, double-click the installer.bat, and get a message: "Please go to the control panel to install and configure system components." This can happen if you downloaded the installation files to a drive that is different from the one on which you are trying to install the BusinessObjects Deski client. Try downloading and unzipping to the user's desktop, and run the installer.bat from there.
- Symptoms are: You either 1) get a DA0005 error, and when you click on the error message "details" it says that the DBDriver failed to load, or 2) you are installing Deski on a remote computer and you get an "openSessionLogon" error telling you it cannot establish a CMS connection.
Resolution: You must log in to InfoView to activate Desktop Intelligence the first time you use it following installation. See the "Configure the Client" step in the Installation instructions for details.
- You get an "Operation TimeOut" error and the message that "Your internet server is not responsive." This probably means you are trying to use an older version of Business Objects (for example, this can happen when your desktop shortcut is still pointing to the old version). When you launch, make sure you are actually using Business Objects Enterprise XI release 2 / Desktop Intelligence.
Other Error messages and error conditions
- Reports with date prompts returning no data: If you're not getting data back when running reports with date prompts (and the same values worked in Business Objects 5), make sure you're entering the date in 4-digit year format. For example, rather than entering "7/1/07" (for July 1, 2007), enter "7/1/2007".
- If you attempt to open a report and you get an error:
You are not authorized to use this document. (FRM0008)
try one of the following solutions:
1) Is this a Corporate Document? If so, do not try to open the local copy, but instead go to File->Import from Repository...
and use the new one from the repository.
2) If this is a document you created yourself on your local computer using the old version (or modified and saved on your local computer using the old version), then, if possible, use the old version -- version 5.1.4 -- to open it. Go to File->Save As... and before you save it, click the box next to "Save for all users" in the lower left corner of the screen. Then click on Save, and close it. You should now be able to open it using Deski.
3) You can also get this error if you created this document yourself on your local computer, then exported it to the Repository (for example, to your "Favorites" box in InfoView), and then subsequently deleted the one on the Repository. The local one will no longer be available to you -- or anyone else! To avoid this problem, always make a backup copy of your local document before you delete a report on the Repository.
- If you've run a query, but are not seeing any results on screen other than the column header cells, you may want to check to see if any of the following features are set:
- You are viewing the Structure of the report, rather than the actual data retrieved. (Check the Structure setting on the View menu.)
- There are Filters in your report. (Click on any cell that you can see, and choose Filters from the Format menu.)
- The report is folded, so that only headers appear. (First check the Fold checkbox on the General tab of the Standard Report Styles window, accessible from the Tools menu. If Fold is unchecked there, you may also want to check the same setting in Format/Table to see if the report is folded there.)
- You log in to Deski and try to refresh a report, but get: "Error During SQL Execution: (DA0003) - CS, Job already in use".
One possible cause of this error message is that Internet Explorer 7 has a default time-out of 30 seconds, compared with the Internet Explorer 6 time-out of 60 seconds. For some queries, this is not enough time to refresh the entire query, since the client is connecting to the server using an HTTP protocol.
It is necessary to extend the time-out limit on Internet Explorer 7. This can be done by adding a key to the registry.
(CAUTION! The following resolution involves editing the registry. Using the Registry Editor incorrectly can cause serious problems that may require you to reinstall the Microsoft Windows operating system. Use the Registry Editor at your own risk. It is strongly recommended that you make a backup copy of the registry files before you edit the registry. End users should NOT attempt this without first consulting with their Local Support Providers.)
To resolve the error message (if you are using IE 7)
1. Click Start > Run. The Run dialog box appears.
2. Type “regedit” in the Open: text box. Click OK. The Registry Editor appears.
3. Navigate to HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings.
4. Add a ReceiveTimeout DWORD value with a data value of (<number of seconds>)*1000. For example, if the required time-out duration is eight minutes, set the ReceiveTimeout data value to 480000 (480*1000). (A scripted version of this change is sketched after these steps.)
5. Exit the Registry Editor.
6. Restart the computer.
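For support staff who prefer to script this change, the following is a minimal Python sketch of the same registry edit, assuming the standard winreg module on Windows. The eight-minute value matches the example in step 4, and the registry-editing caution above applies equally here.

import winreg

# Set the ReceiveTimeout value described in step 4: (number of seconds) * 1000, as a DWORD.
timeout_minutes = 8                                   # hypothetical value from the example above
key_path = r"Software\Microsoft\Windows\CurrentVersion\Internet Settings"

with winreg.OpenKey(winreg.HKEY_CURRENT_USER, key_path, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "ReceiveTimeout", 0, winreg.REG_DWORD,
                      timeout_minutes * 60 * 1000)    # 480000 for an eight-minute time-out
# A restart (step 6) is still required for the new time-out to take effect.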
- If your Business Objects menu bar disappears, try the following steps:
- From the Windows 'Start' menu, select 'Run'.
- Type "regedit" in the 'Run' dialog box.
- Locate and delete this folder in the registry:
HKEY_CURRENT_USER\Software\Business Objects\Suite 11.5\default\BusinessObjects\Application Preferences\BusinessReporter\Desktop Intelligence
- Restart Desktop Intelligence
- If the steps above do not resolve the issue, try deleting the Desktop Intelligence folder under the user's SID in HKEY_USERS: HKEY_USERS\<user's SID>\Software\Business Objects\Suite 11.5\default\BusinessObjects\Application Preferences\BusinessReporter\Desktop Intelligence
- Restart Desktop Intelligence
- If the Classes and Object pane disappears from the Query Panel, try the following steps:
- From the Windows 'Start' menu, select 'Run'.
- Type "regedit" in the 'Run' dialog box.
- Locate and delete these two registry keys:
- HKEY_CURRENT_USER\Software\Business Objects\Suite 11.5\default\BusinessObjects\Administrator User Prefs
- HKEY_CURRENT_USER\Software\Business Objects\Suite 11.5\default\BusinessObjects\Application Preferences\BusinessDesigner
- Restart Desktop Intelligence
- If you're having problems accessing the Slice and Dice window, click the Slice and Dice button again, and, when the window doesn't appear, hit Alt+space, and then Maximize the window.