07 - Data Extraction, Transformation and Loading (ETL)

ETL Overview

  • ETL is a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and depends on the skills of business analysts, database designers, and application developers
  • It is not a one-time process, since new data is added to the data warehouse periodically – monthly, daily, hourly, and sometimes on an ad hoc basis
  • The ETL process converts the relevant data from the source systems into useful information to be stored in a data warehouse

Extraction

The extraction process determines the correct subset of source data to be used in the subsequent stages of the ETL process. It takes place at idle times of the business, preferably at night

  • Data is extracted from heterogeneous sources. Every data source has its own set of characteristics that need to be managed and consolidated into the ETL system in order to extract data
  • Extraction is still mostly performed by COBOL routines, but this is not recommended because of the high program-maintenance overhead

Logical data mapping (column layout)

    Target:          Table Name | Column Name | Data Type
    Source:          Table Name | Column Name | Data Type
    Transformation

  • The logical data mapping illustrated above shows the source and target columns. The logical data mapping document is a critical component of the ETL process
  • The transformation column contains the conversion rule that must be applied to the source column to produce the target column. Typically it is expressed in SQL
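
A minimal sketch of such a rule for a single mapping row, using made-up names (source table CUST, source column GNDR_CD, target column GENDER):

    -- Conversion rule mapping source CUST.GNDR_CD to the target GENDER column
    SELECT CASE GNDR_CD
             WHEN 'M' THEN 'Male'
             WHEN 'F' THEN 'Female'
             ELSE 'Unknown'
           END AS GENDER
    FROM   CUST;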

Types of sources

Cooperative Sources

  • Replicated sources – changes are published through a replication mechanism
  • Call-back sources – call external code when changes happen
  • Internal action sources – perform only internal actions when changes occur

Non-cooperative Sources

  • Snapshot sources – offer only a full copy of the source
  • Specific sources – e.g. legacy systems
  • Logged sources – changes are captured in a log
  • Queryable sources – provide a query interface

Transformation

The data extracted from the source systems is not in a usable format. It has to undergo a number of conversions, since the source systems are not similar in nature. The data is transformed into a format that adheres to the business rules. Transformation deals with

Schema-level problems

  • Structural conflicts, where the same objects have different representations or data types
  • Naming conflicts resulting from homonyms

Record-level problems

  • Duplicate records, e.g. "John" and "Jhon"
  • Conflicting info, e.g. different birth dates for same person

Value-level problems

  • Different formats, e.g. "female" vs "f"

Functions handled in transformations

  • Normalizing
  • Denormalizing
  • Reformatting
  • Recalculating
  • Summarizing
  • Merging data
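
A minimal sketch combining several of these functions; the staging tables stg_orders and stg_customers and their columns are assumptions made for illustration:

    -- Merge data from two staging tables, reformat a code column,
    -- recalculate a derived amount and summarize it per country
    SELECT UPPER(TRIM(c.country_code))    AS country,         -- reformatting
           SUM(o.quantity * o.unit_price) AS total_revenue    -- recalculating + summarizing
    FROM   stg_orders    o
    JOIN   stg_customers c ON c.customer_id = o.customer_id   -- merging data
    GROUP  BY UPPER(TRIM(c.country_code));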

Staging area

  • The data staging area, owned by the ETL team, is the intermediate area in a data warehouse where the transformation activities take place.
  • The source data snapshots are compared with the previous snapshots to identify the newly inserted or updated data (see the sketch after this list).
  • The data then undergoes several filters and transformations.
  • The data can be reloaded from the staging tables without going back to the sources in case of environmental failures
  • Users are not allowed access to the staging area
  • Reports cannot be run against the staging area
  • Only the ETL processes can read or write the staging area
  • The staging area consists of both RDBMS tables and data files
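
A minimal sketch of the snapshot comparison mentioned above, assuming hypothetical staging tables stg_customer_current and stg_customer_previous keyed on customer_id:

    -- Rows that are new or changed since the previous snapshot:
    -- no row in the previous snapshot matches on all compared columns
    SELECT cur.*
    FROM   stg_customer_current cur
    LEFT JOIN stg_customer_previous prev
           ON  prev.customer_id = cur.customer_id
           AND prev.name        = cur.name
           AND prev.email       = cur.email
    WHERE  prev.customer_id IS NULL;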

Data structures in the staging area

  • Flat files: fast to write, append to, sort and filter; slow to update, access or join

  • Relational tables: Metadata, DBA support, SQL interface
  • Dimensional Model: Facts, Dimensions, OLAP cubes

Loading

After the data has been cleansed and converted into a structure that adheres to the data warehouse requirements, it is ready to be loaded into the data warehouse. The data loaded into the data warehouse is made presentable to the users.

Checks to be performed after loading data

  • Referential integrity between the dimensions and the facts is verified. This is done to ensure that all the records in the fact table are related to records in the dimension tables.

For example, if a fact table of product sales is related to the dimension tables for products, time, and customers, then for each sale record there must be a record in each dimension table that relates to it through the primary and foreign keys. This verifies that for every sale, the customer who purchased the product, the product sold, and the time and location of the sale are all identified

  • To check for referential integrity in a star schema we can use a simple query that counts the rows returned when all the required dimension tables are joined to the fact table using inner joins (see the sketch after this list)
  • To check for referential integrity in a snowflake schema we also need to check the referential integrity between the dimension tables and the subordinate tables to which they are linked
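
A minimal sketch of such a check against a hypothetical star schema (fact table sales_fact; dimensions product_dim, customer_dim, time_dim; all names are assumptions):

    -- If this count is lower than SELECT COUNT(*) FROM sales_fact alone,
    -- some fact rows reference dimension keys that do not exist
    SELECT COUNT(*)
    FROM   sales_fact   f
    JOIN   product_dim  p ON f.product_key  = p.product_key
    JOIN   customer_dim c ON f.customer_key = c.customer_key
    JOIN   time_dim     t ON f.date_key     = t.date_key;

    -- Orphan fact rows for a single dimension can also be listed directly
    SELECT f.*
    FROM   sales_fact f
    LEFT JOIN product_dim p ON f.product_key = p.product_key
    WHERE  p.product_key IS NULL;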

Features of loading

  • SQL-based updates are slow because of the large overhead caused by locking and optimization
  • Bulk loading using a DBMS-specific utility is much faster (see the sketch after this list)
  • Indexes on the tables slow down the load
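
As an illustration, assuming a PostgreSQL-style environment with a target table sales_fact and a staging file (both made up; other engines offer similar utilities such as SQL*Loader, bcp or LOAD DATA INFILE):

    -- SQL-based loading: every INSERT pays locking and logging overhead
    INSERT INTO sales_fact (date_key, product_key, customer_key, amount)
    VALUES ('2024-01-15', 42, 1001, 99.90);   -- repeated once per row

    -- Bulk load through the DBMS-specific utility (PostgreSQL COPY shown here)
    COPY sales_fact FROM '/staging/sales_fact.csv' WITH (FORMAT csv, HEADER true);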

Problems faced by ETL

  • Technical challenges in integrating, moving and transforming data across disparate environments
  • Short load windows and long load times
  • Inconsistent and difficult-to-maintain business rules
  • Source systems missing certain critical data

Effective loading process

  • Drop all the indexes before loading the tables and rebuild them after the load completes (see the sketch after this list)
  • Partitioning the tables improves loading. Partitioning divides a table into smaller tables for administration purposes and improves query performance on large fact tables. The recommended approach is to partition the tables by a date interval such as year, month or quarter
  • Incremental loading, i.e. loading only the new or changed data, enhances the performance of the system
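
A minimal sketch of dropping and rebuilding an index around the load, and of date-based partitioning, in PostgreSQL-style SQL (the table sales_fact, its columns and the index name are assumptions):

    -- Drop the index before the load and rebuild it afterwards
    DROP INDEX IF EXISTS idx_sales_fact_date;
    INSERT INTO sales_fact SELECT * FROM staging_sales;
    CREATE INDEX idx_sales_fact_date ON sales_fact (date_key);

    -- Separate example: creating a fact table range-partitioned by month
    CREATE TABLE sales_fact_partitioned (
        date_key     date    NOT NULL,
        product_key  integer NOT NULL,
        customer_key integer NOT NULL,
        amount       numeric
    ) PARTITION BY RANGE (date_key);

    CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact_partitioned
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');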

Desired Features in ETL tools

  • Expandable infrastructure
  • Reusability of custom-made functions
  • Effective use of the available hardware resources
  • Consolidated metadata management

ETL Tools

Commercial ETL Tools

  • IBM Infosphere DataStage
  • Informatica PowerCenter
  • Oracle Warehouse Builder (OWB)
  • Oracle Data Integrator (ODI)
  • SAS ETL Studio
  • Business Objects Data Integrator (BODI)
  • Microsoft SQL Server Integration Services (SSIS)
  • Ab Initio

Freeware, open source ETL tools:

  • Pentaho Data Integration (Kettle)
  • Talend Integrator Suite
  • CloverETL
  • Jasper ETL

Reporting Tools

  • OBIEE
  • SAP Crystal Reports
  • Jaspersoft
  • Business Objects
  • Pentaho Reporting
  • BIRT
  • Cognos
  • Microstrategy
  • Actuate
  • Qlikview
  • Proclarity
  • Excel
  • Crystal Reports
  • Pervasive Data Integrator 8.12
  • IBM DataMirror Transformation Server 5.4
  • ETL Solutions Transformation Manager 5.2.2

Data Integration

Data integration is the process of consolidating data from different sources, which are stored using different technologies, and providing a unified representation of the data. Data integration becomes critical when two companies merge or when applications are combined within a company

The sub-areas of data integration are

  • Data warehousing
  • Data migration
  • Enterprise application/information integration
  • Master data management

Data Integration techniques

  • Manual integration: Users work with all the relevant data directly, accessing all the source systems or web page interfaces themselves
  • Application-based integration: Requires the individual applications to implement all the integration effort. This method is feasible only when the number of applications is limited
  • Middleware data integration: Moves the integration logic from the individual applications into a new middleware layer.
  • Common data storage or physical data integration: A new system is created which keeps a copy of the data from the source systems and stores and manages it independently of the original systems
