
Thursday, July 3, 2014

ETL Rules of Thumb for Loading Dimension and Fact Tables

by Unknown  |  in DW at  6:47 AM
I recommend the following simple rules for loading dimension and fact tables:

Dimensions:

1) No duplicate values in natural keys.
2) A dummy record (-1) has to be present in the table; if the table is truncate-and-load, inserting the dummy row should be the first step (see the sketch after this list).
3) The natural key should not be blank, except for the dummy record.
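
A minimal SQL sketch of these rules, assuming a hypothetical dim_customer table with surrogate key customer_key and natural key customer_id (the names are illustrative only; adapt them to your schema):

-- Rule 2: in a truncate-and-load dimension, seed the dummy (-1) row first
TRUNCATE TABLE dim_customer;
INSERT INTO dim_customer (customer_key, customer_id, customer_name)
VALUES (-1, 'UNKNOWN', 'Unknown');

-- Rule 1: the natural key must not contain duplicates
SELECT customer_id, COUNT(*) AS dup_count
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Rule 3: the natural key must not be blank, except on the dummy row
SELECT *
FROM dim_customer
WHERE (customer_id IS NULL OR customer_id = '')
  AND customer_key <> -1;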

Facts:

1) Any fact field that is not mapped from the source should always be 0.
2) Any fact field that is mapped from the source should always hold a non-zero value.
3) All mapped fact key values should tie up with dimensions.
4) If no date value is received from the source, it should be resolved to 12/31/4444 or any other sentinel date you prefer.
5) If no values for measures are received from the source, they should be resolved to 0 (rules 3-5 are sketched below).
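
A hedged SQL sketch of rules 3-5, assuming a hypothetical fact_sales load from a staging table stg_sales and an integer YYYYMMDD date-key encoding for the 12/31/4444 sentinel; COALESCE supplies the defaults, and an unresolved dimension lookup falls back to the -1 dummy key:

INSERT INTO fact_sales (customer_key, order_date_key, sales_amount, quantity)
SELECT
    COALESCE(d.customer_key, -1)         AS customer_key,   -- rule 3: tie to the dimension, or the dummy row
    COALESCE(s.order_date_key, 44441231) AS order_date_key, -- rule 4: sentinel date key for missing dates
    COALESCE(s.sales_amount, 0)          AS sales_amount,   -- rule 5: missing measures default to 0
    COALESCE(s.quantity, 0)              AS quantity
FROM stg_sales s
LEFT JOIN dim_customer d
    ON d.customer_id = s.customer_id;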


There are many rules you could apply; I consider these the most important ones.

Wednesday, July 2, 2014

SAP BO Information Design Tool UNX universes

by Unknown  |  in SAP BO at  5:59 AM
The 4.0 release of SAP BusinessObjects introduced a new universe format called UNX. This new format will eventually replace the existing UNV universe format. UNX universes also add a bunch of new features, perhaps most notably the ability to create multi-source universes. These UNX universes are created using a new tool called the Information Design Tool (IDT).
Universe Consumption in BI 4.0
Web Intelligence
The Web Intelligence tool supports both UNV and UNX universes; you can create reports from either format.
Crystal Reports for Enterprise
Crystal Reports for Enterprise is the new Java-based rewrite of the Crystal client. Because of this, CR4E can only consume the newer UNX universe format and is completely incompatible with the older UNV universes.
Dashboards (Xcelsius)
The Dashboards tool supports both the UNV and UNX formats, but with a caveat. UNX universes must be consumed through the new integrated Query Browser, whereas UNV universes can only be consumed via the traditional Query as a Web Service tool. Staying on UNV therefore prevents you from using some new features of Dashboards, such as exporting to a mobile-compatible HTML5 dashboard.
Explorer
The Explorer tool can only access UNX universes in BI 4.0.
Live Office
 Live Office only supports the legacy UNV universe format.

Tuesday, June 24, 2014

File Archive using Hadoop Archive

by Unknown  |  in Big Data at  5:28 AM

 Archiving small files
The Hadoop Archive's data format is called har, with the following layout:
foo.har/_masterindex   // stores hashes and offsets
foo.har/_index         // stores file statuses
foo.har/part-[1..n]    // stores actual file data
The file data is stored in multiple part files, which are indexed to keep the original separation of data intact. Moreover, the part files can be accessed by MapReduce programs in parallel. The index files also record the original directory tree structures and the file statuses. The net effect is that a directory containing many small files is archived into a directory with a few large files and indexes.
HarFileSystem – A first-class FileSystem providing transparent access
Most archival systems, such as tar, are tools for archiving and de-archiving. Generally, they do not fit into the actual file system layer and hence are not transparent to the application writer, in that the user has to de-archive the archive before use.
Hadoop Archive is integrated into Hadoop's FileSystem interface. The HarFileSystem class implements the FileSystem interface and provides access via the har:// scheme. This exposes the archived files and directory tree structures transparently to users. Files in a har can be accessed directly without expanding it. For example, the following command copies an HDFS file to a local directory:
hadoop fs -get hdfs://namenode/foo/file-1 localdir
Suppose an archive bar.har is created from the foo directory. Then, the command to copy the original file becomes
hadoop fs -get har://namenode/bar.har#foo/file-1 localdir
Users only have to change the URI paths. Alternatively, users may choose to create a symbolic link (from hdfs://namenode/foo to har://namenode/bar.har#foo in the example above); then even the URIs do not need to be changed. In either case, HarFileSystem will be invoked automatically to provide access to the files in the har. Because of this transparent layer, har is compatible with the Hadoop APIs, MapReduce, the shell command-line interface, and higher-level applications like Pig, Zebra, Streaming, Pipes, and DistCp.

Monday, June 9, 2014

SAP BO Open Document in 4.1

by Unknown  |  in SAP BO at  1:31 AM

The OpenDocument functionality in XI 3.x and BI 4.x is exactly the same, except for a change in the default URL.

The default URL to the OpenDocument web application bundle has changed in SAP BusinessObjects Business Intelligence platform 4.0. New absolute OpenDocument links need to use the new default URL:
http://<servername>:<port>/BOE/OpenDocument/opendoc/openDocument.jsp?<parameter1>&<parameter2>&...&<parameterN>
If you are migrating reports with existing links from an XI 3.x release platform, resolve the issue by setting up the following redirect in your web server:
• Redirect: ../OpenDocument/opendoc/openDocument.jsp
• To: ../BOE/OpenDocument/opendoc/openDocument.jsp
Note:
• Ensure that all URL request parameters are forwarded correctly by your redirect. Refer to your web server documentation for detailed steps on implementing a redirect.
• SAP BusinessObjects Business Intelligence platform 4.0 only supports a Java deployment of OpenDocument. The OpenDocument web bundle is part of the BOE.war file.
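
As an illustration only, a common pattern built on the template above identifies the target document by its CUID via the standard sIDType and iDocID OpenDocument parameters (server name, port, and CUID remain placeholders):

http://<servername>:<port>/BOE/OpenDocument/opendoc/openDocument.jsp?sIDType=CUID&iDocID=<documentCUID>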

Friday, June 6, 2014

SAP BO 4.1 New Features

by Unknown  |  in SAP BO at  2:54 AM

Source: SAP BO


SSIS 2012 New features

by Unknown  |  in Other at  2:48 AM

Source: MSDN Blog

#1 – Change Data Capture

We’ve partnered with Attunity to provide some great CDC functionality out of the box. This includes a CDC Control Task, a CDC Source component, and a CDC Splitter transform (that splits the output based on the CDC operation – insert/update/delete). It also includes CDC support for Oracle. More details to follow.
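
For a SQL Server source, the CDC components assume change data capture is already enabled on the database and on each tracked table. A minimal T-SQL sketch of that prerequisite (the dbo.Orders table is a placeholder):

-- Enable change data capture at the database level (requires sysadmin)
EXEC sys.sp_cdc_enable_db;

-- Enable change data capture for one source table (requires db_owner)
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;  -- NULL: do not gate access to the change data behind a role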

#2 – ODBC Support

ODBC Source and Destination components, also from Attunity, are included in the box.

#3 – Connection Manager Changes

RC0 makes some minor improvements to Shared Connection Managers (they are now expressionable), and changes the icons used to designate connection managers that are shared, offline, or have expressions on them. We also added a neat feature for the Cache Connection Manager – it can now share its in-memory cache across package executions (i.e. create a shared connection manager, load the cache with a master package, and the remaining child packages will all share the same in-memory cache).

#4 – Flat File Source Improvements

Another feature that was added in CTP3, but worth calling out again. The Flat File Source now supports a varying number of columns, and embedded qualifiers.

#5 – Package Format Changes

Ok, another CTP3 feature – but when I demo’d it at PASS, I did a live merge of two data flows up on stage. And it worked. Impressive, no?

#6 – Visual Studio Configurations

You can now externalize parameter values, storing them in a Visual Studio configuration. You can switch between VS configurations from the toolbar (like you can with other project types, such as C# or VB.NET), and your parameter values will automatically change to the value within the configuration.

#7 - Scripting Improvements

We upgraded the scripting engine to VSTA 3.0, which gives us a Visual Studio 2010 shell, and support for .NET 4.
Oh… and we also added Script Component Debugging. More about that to follow.

#8 – Troubleshooting & Logging

More improvements to SSIS Catalog-based logging. You can now set a server-wide default logging level, and capture data flow component timing information and row counts for all paths within a data flow.
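
For example, the server-wide default can be changed through the SSISDB catalog with the catalog.configure_catalog stored procedure (a sketch; the value 1 corresponds to the Basic logging level):

-- Logging levels: 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose
EXEC SSISDB.catalog.configure_catalog
    @property_name  = N'SERVER_LOGGING_LEVEL',
    @property_value = 1;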

#9 – Data Taps

Another CTP3 feature that didn't get enough attention. This feature allows you to programmatically (using T-SQL) add a "tap" to any data flow path of a package deployed to the SSIS Catalog. When the package is run, data flowing through the path will be saved out to disk in CSV format. The feature was designed to make it easier to debug data issues occurring in a production environment that the developer doesn't have access to.
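
A hedged T-SQL sketch of adding a data tap to a catalog execution; the folder, project, package, and path strings below are placeholders (the path ID string comes from the data flow path's IdentificationString property), and the tapped rows are written to the catalog's DataDumps folder on the server:

DECLARE @execution_id BIGINT;

-- Create an execution for a package deployed to the SSIS Catalog
EXEC SSISDB.catalog.create_execution
    @folder_name     = N'ETL',
    @project_name    = N'SalesLoad',
    @package_name    = N'LoadFact.dtsx',
    @use32bitruntime = 0,
    @reference_id    = NULL,
    @execution_id    = @execution_id OUTPUT;

-- Tap the rows flowing down one data flow path into a CSV file
EXEC SSISDB.catalog.add_data_tap
    @execution_id            = @execution_id,
    @task_package_path       = N'\Package\Data Flow Task',
    @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',
    @data_filename           = N'tap_LoadFact.csv';

-- Start the execution; the tap is captured while the package runs
EXEC SSISDB.catalog.start_execution @execution_id;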

#10 – Server Management with PowerShell

We've added PowerShell support for the SSIS Catalog in RC0. See the follow-up post for API examples.

Other Changes

  • Updated look for the Control Flow and Data Flow
  • Pivot UI
  • Row Count UI
  • New Expression:
    • REPLACENULL
  • BIDS is now SQL Server Data Tools
  • Many small fixes and improvements based on CTP feedback – thank you!!

Tuesday, May 13, 2014

Big Data Technologies

by Unknown  |  in Big Data at  2:08 AM
Column-oriented databases
Traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns, instead of rows, allowing for huge data compression and very fast query times. The downside to these databases is that they will generally only allow batch updates, having a much slower update time than traditional models.
Schema-less databases, or NoSQL databases
There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.
MapReduce
This is a programming paradigm that allows for massive job execution scalability against thousands of servers or clusters of servers. Any MapReduce implementation consists of two tasks:
  • The "Map" task, where an input dataset is converted into a different set of key/value pairs, or tuples;
  • The "Reduce" task, where several of the outputs of the "Map" task are combined to form a reduced set of tuples (hence the name). A short worked illustration follows.
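As a concrete, hedged illustration, here is the canonical MapReduce example, word count, written in HiveQL against a hypothetical docs table with a single line column; Hive (described below) compiles the query into exactly this map/reduce pair:

-- "Map": split each line into words and emit one row per word
-- "Reduce": group the emitted words and count occurrences per key
SELECT word, COUNT(1) AS occurrences
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;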
Hadoop
Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.
Hive
Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been made open source for some time now, and it's a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.
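A small sketch of that bridge, assuming delimited page-view logs already sitting in HDFS under /data/page_views (table and column names are hypothetical):

-- Define a Hive table over existing HDFS files without moving them
CREATE EXTERNAL TABLE page_views (
    view_time STRING,
    user_id   BIGINT,
    page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Query it with ordinary SQL; Hive turns this into MapReduce jobs on the cluster
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;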
PIG
PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-like" language that allows for query execution over data stored on a Hadoop cluster, instead of a "SQL-like" language. PIG was developed by Yahoo!, and, just like Hive, has also been made fully open source.
WibiData
WibiData is a combination of web analytics with Hadoop, being built on top of HBase, which is itself a database layer on top of Hadoop. It allows web sites to better explore and work with their user data, enabling real-time responses to user behavior, such as serving personalized content, recommendations and decisions.
PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with conventional databases. PLATFORA is a platform that turns users' queries into Hadoop jobs automatically, thus creating an abstraction layer that anyone can exploit to simplify and organize datasets stored in Hadoop.
Storage Technologies
As the data volumes grow, so does the need for efficient and effective storage techniques. The main evolutions in this space are related to data compression and storage virtualization.
SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused specifically on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods unfeasible or too expensive.

Big Data in the cloud

As we can see from Dr. Kaur's roundup above, most, if not all, of these technologies are closely associated with the cloud. Most cloud vendors are already offering hosted Hadoop clusters that can be scaled on demand according to their users' needs. Also, many of the products and platforms mentioned are either entirely cloud-based or have cloud versions themselves.
Big Data and cloud computing go hand-in-hand. Cloud computing enables companies of all sizes to get more value from their data than ever before, by enabling blazing-fast analytics at a fraction of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.
