ETL Overview
- ETL is a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and depends on the skills of business analysts, database designers, and application developers
- It is not a one-time process, since new data is added to the data warehouse periodically – monthly, daily, hourly, and sometimes on an ad hoc basis
- The ETL process converts the relevant data from the source systems into useful information to be stored in a data warehouse
Extraction
The extraction process determines the correct subset of source data that is used for further processing. It usually takes place at idle times of the business, preferably at night
- Data is extracted from heterogeneous sources. Every data source has its own set of characteristics that the ETL system must handle in order to extract the data
- Extraction is mostly performed by COBOL routines, but this is not recommended due to the high cost of program maintenance
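For example, if the source system keeps a last-modified timestamp, the relevant subset could be selected with a query along the lines of the sketch below; the table, columns, and the :last_extract_time parameter are assumptions used only for illustration.

```sql
-- Extract only the rows that have changed since the previous extraction run.
SELECT order_id, customer_id, order_date, amount, last_updated
FROM   src_orders
WHERE  last_updated > :last_extract_time;  -- timestamp recorded by the prior ETL run
```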
| Target     |             |           | Source     |             |           | Transformation |
| Table Name | Column Name | Data Type | Table Name | Column Name | Data Type |                |
· The logical data mapping illustrated above shows the source and target columns. The content of the logical data mapping document is a critical component of the ETL process
· The transformation column contains the conversion rule that must be applied to the source columns to derive the target column. It is typically expressed as SQL, as in the sketch below
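A minimal sketch of what one such conversion rule could look like when expressed as SQL; the table and column names (src_customer, dw_dim_customer, gender_code, and so on) are hypothetical and only illustrate the idea.

```sql
-- Hypothetical mapping rule: populate DW_DIM_CUSTOMER from SRC_CUSTOMER,
-- converting the source gender code into the target's descriptive value.
INSERT INTO dw_dim_customer (customer_key, customer_name, gender)
SELECT customer_id,
       TRIM(first_name) || ' ' || TRIM(last_name),   -- concatenate and trim the source name parts
       CASE gender_code
            WHEN 'M' THEN 'Male'
            WHEN 'F' THEN 'Female'
            ELSE 'Unknown'
       END                                           -- the conversion rule from the mapping document
FROM   src_customer;
```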
Types of sources
Cooperative Sources
· Replicated sources – changes are propagated to subscribers through a publish (replication) mechanism
· Call back sources – call external code when changes happen
· Internal action sources – perform only internal actions (e.g. triggers) when changes occur
Non-cooperative Sources
- Snapshot sources – offer only a full copy of the source
- Specific sources – legacy systems
- Logged sources – changes are captured in a change log
- Queryable sources – provide a query interface
Transformation
The data extracted from the source systems is not in a usable format. It has to undergo a number of conversions, since the source systems differ in nature. The data is transformed into a format that adheres to the business rules. Transformation deals with
Schema-level problems
- Structural conflicts, where the same objects have different representations or data types
- Naming conflicts resulting from homonyms
Record-level problems
- Duplicate records, e.g. "John" and "Jhon"
- Conflicting information, e.g. different birth dates for the same person
Value-level problems
- Different formats, e.g. "female" vs. "f"
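A minimal sketch of how such value-level cleanup might be expressed in SQL; the staging table stg_customer and its gender column are assumed names.

```sql
-- Standardize inconsistent gender values ('female', 'f', 'F', ...) to a single format.
UPDATE stg_customer
SET    gender = CASE
                    WHEN LOWER(gender) IN ('f', 'female') THEN 'F'
                    WHEN LOWER(gender) IN ('m', 'male')   THEN 'M'
                    ELSE 'U'   -- unknown or unparsable values
                END;
```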
Functions handled in transformations
- Normalizing
- Denormalizing
- Reformatting
- Recalculating
- Summarizing
- Merging data
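As an illustration of two of these functions, the sketch below shows summarizing and merging in SQL; all table names (stg_sales, stg_daily_sales_summary, stg_crm_customer, stg_erp_customer, stg_customer_merged) are assumptions.

```sql
-- Summarizing: roll detailed sales rows up to one row per product per day.
INSERT INTO stg_daily_sales_summary (product_id, sale_date, total_qty, total_amount)
SELECT product_id, sale_date, SUM(quantity), SUM(amount)
FROM   stg_sales
GROUP  BY product_id, sale_date;

-- Merging: combine customer rows extracted from two different source systems.
INSERT INTO stg_customer_merged (customer_id, customer_name, source_system)
SELECT customer_id, customer_name, 'CRM' FROM stg_crm_customer
UNION
SELECT customer_id, customer_name, 'ERP' FROM stg_erp_customer;
```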
Staging area
- The data staging area, owned by the ETL team, is the intermediate area in a data warehouse where the transformation activities take place.
- The source data snapshots are compared with the previous snapshots to identify newly inserted or updated data (see the sketch after this list).
- The data then undergoes several filters and transformations.
- The data can be reloaded from the staging tables without going back to the sources in case of environmental failures
- Users are not allowed access to the staging areas
- Reports cannot be accessed in the staging area
- Only the ETL processes can read or write the staging area
- The staging area consists of both RDBMS tables and data files
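The snapshot comparison mentioned above could be done with a query along the following lines; the two staging tables stg_customer_current and stg_customer_previous, and their columns, are assumed names.

```sql
-- A current-snapshot row with no identical row in the previous snapshot is either
-- a newly inserted record or an updated version of an existing record.
SELECT cur.customer_id, cur.customer_name, cur.city
FROM   stg_customer_current cur
LEFT JOIN stg_customer_previous prev
       ON  prev.customer_id   = cur.customer_id
       AND prev.customer_name = cur.customer_name
       AND prev.city          = cur.city
WHERE  prev.customer_id IS NULL;  -- no exact match found in the previous snapshot
```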
Data structures in the staging area
- Flat files: fast to write, append to, sort, and filter; slow to update, access, or join
- Relational tables: Metadata, DBA support, SQL interface
- Dimensional Model: Facts, Dimensions, OLAP cubes
Loading
After the data has been cleansed and converted into a structure that conforms to the data warehouse requirements, it is ready to be loaded into the data warehouse. The loaded data is then made presentable to the users.
Checks to be performed after loading data
- Referential integrity between the dimensions and facts is verified. This ensures that every record in the fact table is related to records in the dimension tables.
For example, if a fact table of product sales is related to dimension tables for products, time, and customers, then for each sale record there must be a record in each dimension table that relates to it through the dimension's primary key. This verifies that for every sale, the customer who purchased the product, the product sold, and the time of the sale can all be identified
- To check referential integrity in a star schema, we can use a simple query that counts the rows returned when all the required dimension tables are joined to the fact table using inner joins (a sketch follows this list)
- To check referential integrity in a snowflake schema, we also need to check the referential integrity between the dimension tables and the subordinate tables to which they are linked
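A sketch of the row-count check described for the star schema; sales_fact and the dimension table and key names are hypothetical. If the two counts differ, some fact rows reference dimension keys that have no matching dimension record.

```sql
-- Count of all fact rows.
SELECT COUNT(*) AS fact_rows
FROM   sales_fact;

-- Count of fact rows that join to every required dimension; inner joins silently
-- drop any fact row whose dimension key has no matching dimension record.
SELECT COUNT(*) AS joined_rows
FROM   sales_fact   f
JOIN   product_dim  p ON p.product_key  = f.product_key
JOIN   customer_dim c ON c.customer_key = f.customer_key
JOIN   time_dim     t ON t.time_key     = f.time_key;
```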
Features of loading
- SQL-based updating is slow because of the large overhead caused by locking and query optimization
- Bulk loading using a DBMS-specific utility is faster (see the sketch after this list)
- Indexes on the tables slow down the load
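To illustrate the difference, the sketch below contrasts a conventional SQL insert with a bulk load, assuming PostgreSQL and a hypothetical staging file; other DBMSs offer equivalent utilities (e.g. Oracle SQL*Loader, SQL Server bcp).

```sql
-- Slow path: conventional row-by-row inserts pay locking, logging, and
-- optimization overhead for every single statement.
INSERT INTO sales_fact (product_key, customer_key, time_key, quantity, amount)
VALUES (101, 2001, 20240101, 2, 59.98);

-- Fast path (PostgreSQL): bulk-load a prepared flat file with COPY,
-- which avoids most of the per-row overhead.
COPY sales_fact (product_key, customer_key, time_key, quantity, amount)
FROM '/data/staging/sales_fact.csv'
WITH (FORMAT csv, HEADER true);
```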
Problems faced by ETL
- Technical challenges in integrating, moving, and transforming data across disparate environments
- Short load windows, long load times
- Inconsistent and difficult-to-maintain business rules
- Source systems missing certain critical data
Effective loading process
- Drop all the indexes before loading the tables and rebuild them after the load completes
- Partitioning the tables improves loading. Partitioning divides a table into smaller tables for administration purposes and to improve query performance on large fact tables. The most recommended approach is to partition by a date interval such as year, month, or quarter (see the sketch after this list)
- Incremental loads enhance the performance of the system
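A minimal sketch of these loading practices, assuming PostgreSQL and hypothetical table, index, and partition names.

```sql
-- Drop indexes before the load and rebuild them once the load has finished.
DROP INDEX IF EXISTS idx_sales_fact_product;
-- ... bulk load runs here ...
CREATE INDEX idx_sales_fact_product ON sales_fact (product_key);

-- Partition a large fact table by a date interval (monthly range partitions).
CREATE TABLE sales_fact_part (
    product_key  INT,
    customer_key INT,
    sale_date    DATE,
    amount       NUMERIC(12,2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact_part
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```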
Desired Features in ETL tools
- Expandable infrastructure
- Reusability of custom-made functions
- Efficient use of the available hardware resources
- Consolidated metadata management
ETL Tools
Commercial ETL Tools
- IBM Infosphere DataStage
- Informatica PowerCenter
- Oracle Warehouse Builder (OWB)
- Oracle Data Integrator (ODI)
- SAS ETL Studio
- Business Objects Data Integrator (BODI)
- Microsoft SQL Server Integration Services (SSIS)
- Ab Initio
Freeware, open source ETL tools:
- Pentaho Data Integration (Kettle)
- Talend Integrator Suite
- CloverETL
- Jasper ETL
Reporting Tools
- OBIEE
- SAP Crystal Reports
- Jaspersoft
- Business Objects
- Pentaho Reporting
- BIRT
- Cognos
- Microstrategy
- Actuate
- Qlikview
- Proclarity
- Excel
- Crystal Reports
- Pervasive Data Integrator 8.12
- IBM DataMirror Transformation Server 5.4
- ETL Solutions Transformation Manager 5.2.2
Data Integration
Data integration is the process of consolidating data from different sources, stored using different technologies, and providing a unified representation of the data. Data integration becomes critical when two companies merge or when applications are combined within a company
The sub-areas of data integration are
- Data warehousing
- Data migration
- Enterprise application/information integration
- Master data management
Data Integration techniques
- Manual integration: Users work directly with all the relevant data, accessing all the source systems or their web page interfaces
- Application-Based Integration: Requires the individual applications to implement all the integration effort. This method is feasible only for a limited number of applications
- Middleware Data Integration: Moves the integration logic out of the individual applications into a new middleware layer.
- Common Data Storage or Physical Data Integration: A new system is created that keeps a copy of the data from the source systems in order to store and manage it independently of the original systems