ETL Overview
- ETL is a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and depends on the skills of business analysts, database designers, and application developers
- It is not a one-time process, since new data is added to the data warehouse periodically – monthly, daily, hourly, and sometimes on an ad hoc basis
- The ETL process converts the relevant data from the source systems into useful information to be stored in a data warehouse
Extraction
The extraction process determines the correct subset of source data that is used for further processing. It usually takes place at idle times of the business, preferably at night
- Data is extracted from heterogeneous sources. Every data source has its own set of characteristics that the ETL system must handle in order to extract the data
- Extraction is mostly performed by COBOL routines, but this is not recommended due to the high cost of program maintenance
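For example, if the source system keeps a last-modified timestamp, the relevant subset could be selected with a query along the lines of the sketch below; the table, columns, and the :last_extract_time parameter are assumptions used only for illustration.

```sql
-- Extract only the rows that have changed since the previous extraction run.
SELECT order_id, customer_id, order_date, amount, last_updated
FROM   src_orders
WHERE  last_updated > :last_extract_time;  -- timestamp recorded by the prior ETL run
```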
| Target     |             |           | Source     |             |           | Transformation |
| Table Name | Column Name | Data Type | Table Name | Column Name | Data Type |                |
· The logical data mapping illustrated above shows the source and target columns. The content of the logical data mapping document is a critical component of the ETL process
· The transformation column contains the conversion rule that must be applied to the source columns to derive the target column. It is typically expressed as SQL, as in the sketch below
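A minimal sketch of what one such conversion rule could look like when expressed as SQL; the table and column names (src_customer, dw_dim_customer, gender_code, and so on) are hypothetical and only illustrate the idea.

```sql
-- Hypothetical mapping rule: populate DW_DIM_CUSTOMER from SRC_CUSTOMER,
-- converting the source gender code into the target's descriptive value.
INSERT INTO dw_dim_customer (customer_key, customer_name, gender)
SELECT customer_id,
       TRIM(first_name) || ' ' || TRIM(last_name),   -- concatenate and trim the source name parts
       CASE gender_code
            WHEN 'M' THEN 'Male'
            WHEN 'F' THEN 'Female'
            ELSE 'Unknown'
       END                                           -- the conversion rule from the mapping document
FROM   src_customer;
```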
Types of sources
Cooperative Sources
· Replicated sources – changes are propagated to subscribers through a publish (replication) mechanism
· Call back sources – call external code when changes happen
· Internal action sources – perform only internal actions (e.g. triggers) when changes occur
Non-cooperative Sources
- Snapshot sources – offer only a full copy of the source
- Specific sources – legacy systems
- Logged sources – changes are captured in a change log
- Queryable sources – provide a query interface
Transformation
The data extracted from the source systems is not in a usable format. It has to undergo a number of conversions, since the source systems differ in nature. The data is transformed into a format that adheres to the business rules. Transformation deals with
Schema-level problems
- Structural conflicts, where the same objects have different representations or data types
- Naming conflicts resulting from homonyms
Record-level problems
- Duplicate records, e.g. "John" and "Jhon"
- Conflicting information, e.g. different birth dates for the same person
Value-level problems
- Different formats, e.g. "female" vs. "f"
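A minimal sketch of how such value-level cleanup might be expressed in SQL; the staging table stg_customer and its gender column are assumed names.

```sql
-- Standardize inconsistent gender values ('female', 'f', 'F', ...) to a single format.
UPDATE stg_customer
SET    gender = CASE
                    WHEN LOWER(gender) IN ('f', 'female') THEN 'F'
                    WHEN LOWER(gender) IN ('m', 'male')   THEN 'M'
                    ELSE 'U'   -- unknown or unparsable values
                END;
```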
Functions handled in transformations
- Normalizing
- Denormalizing
- Reformatting
- Recalculating
- Summarizing
- Merging data
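As an illustration of two of these functions, the sketch below shows summarizing and merging in SQL; all table names (stg_sales, stg_daily_sales_summary, stg_crm_customer, stg_erp_customer, stg_customer_merged) are assumptions.

```sql
-- Summarizing: roll detailed sales rows up to one row per product per day.
INSERT INTO stg_daily_sales_summary (product_id, sale_date, total_qty, total_amount)
SELECT product_id, sale_date, SUM(quantity), SUM(amount)
FROM   stg_sales
GROUP  BY product_id, sale_date;

-- Merging: combine customer rows extracted from two different source systems.
INSERT INTO stg_customer_merged (customer_id, customer_name, source_system)
SELECT customer_id, customer_name, 'CRM' FROM stg_crm_customer
UNION
SELECT customer_id, customer_name, 'ERP' FROM stg_erp_customer;
```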
Staging area
- The data staging area, owned by the ETL team, is the intermediate area in a data warehouse where the transformation activities take place.
- The source data snapshots are compared with the previous snapshots to identify newly inserted or updated data (see the sketch after this list).
- The data then undergoes several filters and transformations.
- The data can be reloaded from the staging tables without going back to the sources in case of environmental failures
- Users are not allowed access to the staging areas
- Reports cannot be accessed in the staging area
- Only the ETL processes can read or write the staging area
- The staging area consists of both RDBMS tables and data files
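The snapshot comparison mentioned above could be done with a query along the following lines; the two staging tables stg_customer_current and stg_customer_previous, and their columns, are assumed names.

```sql
-- A current-snapshot row with no identical row in the previous snapshot is either
-- a newly inserted record or an updated version of an existing record.
SELECT cur.customer_id, cur.customer_name, cur.city
FROM   stg_customer_current cur
LEFT JOIN stg_customer_previous prev
       ON  prev.customer_id   = cur.customer_id
       AND prev.customer_name = cur.customer_name
       AND prev.city          = cur.city
WHERE  prev.customer_id IS NULL;  -- no exact match found in the previous snapshot
```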
Data structures in the staging area
- Flat files: fast to write, append to, sort, and filter; slow to update, access, or join
- Relational tables: Metadata, DBA support, SQL interface
- Dimensional Model: Facts, Dimensions, OLAP cubes
Loading
After the data has been cleansed and converted into a structure that conforms to the data warehouse requirements, it is ready to be loaded into the data warehouse. The loaded data is then made presentable to the users.
Checks to be performed after loading data
- Referential integrity between the dimensions and facts is verified. This ensures that every record in the fact table is related to records in the dimension tables.
For example, if a fact table of product sales is related to dimension tables for products, time, and customers, then for each sale record there must be a record in each dimension table that relates to it through the dimension's primary key. This verifies that for every sale, the customer who purchased the product, the product sold, and the time of the sale can all be identified
- To check referential integrity in a star schema, we can use a simple query that counts the rows returned when all the required dimension tables are joined to the fact table using inner joins (a sketch follows this list)
- To check referential integrity in a snowflake schema, we also need to check the referential integrity between the dimension tables and the subordinate tables to which they are linked
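A sketch of the row-count check described for the star schema; sales_fact and the dimension table and key names are hypothetical. If the two counts differ, some fact rows reference dimension keys that have no matching dimension record.

```sql
-- Count of all fact rows.
SELECT COUNT(*) AS fact_rows
FROM   sales_fact;

-- Count of fact rows that join to every required dimension; inner joins silently
-- drop any fact row whose dimension key has no matching dimension record.
SELECT COUNT(*) AS joined_rows
FROM   sales_fact   f
JOIN   product_dim  p ON p.product_key  = f.product_key
JOIN   customer_dim c ON c.customer_key = f.customer_key
JOIN   time_dim     t ON t.time_key     = f.time_key;
```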
Features of loading
- SQL-based updating is slow because of the large overhead caused by locking and query optimization
- Bulk loading using a DBMS-specific utility is faster (see the sketch after this list)
- Indexes on the tables slow down the load
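To illustrate the difference, the sketch below contrasts a conventional SQL insert with a bulk load, assuming PostgreSQL and a hypothetical staging file; other DBMSs offer equivalent utilities (e.g. Oracle SQL*Loader, SQL Server bcp).

```sql
-- Slow path: conventional row-by-row inserts pay locking, logging, and
-- optimization overhead for every single statement.
INSERT INTO sales_fact (product_key, customer_key, time_key, quantity, amount)
VALUES (101, 2001, 20240101, 2, 59.98);

-- Fast path (PostgreSQL): bulk-load a prepared flat file with COPY,
-- which avoids most of the per-row overhead.
COPY sales_fact (product_key, customer_key, time_key, quantity, amount)
FROM '/data/staging/sales_fact.csv'
WITH (FORMAT csv, HEADER true);
```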
Problems faced by ETL
- Technical challenges in integrating, moving, and transforming data across disparate environments
- Short load windows, long load times
- Inconsistent and difficult-to-maintain business rules
- Source systems missing certain critical data
Effective loading process
- Drop all the indexes before loading the tables and rebuild them after the load completes
- Partitioning the tables improves loading. Partitioning divides a table into smaller tables for administration purposes and to improve query performance on large fact tables. The most recommended approach is to partition by a date interval such as year, month, or quarter (see the sketch after this list)
- Incremental loads enhance the performance of the system
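A minimal sketch of these loading practices, assuming PostgreSQL and hypothetical table, index, and partition names.

```sql
-- Drop indexes before the load and rebuild them once the load has finished.
DROP INDEX IF EXISTS idx_sales_fact_product;
-- ... bulk load runs here ...
CREATE INDEX idx_sales_fact_product ON sales_fact (product_key);

-- Partition a large fact table by a date interval (monthly range partitions).
CREATE TABLE sales_fact_part (
    product_key  INT,
    customer_key INT,
    sale_date    DATE,
    amount       NUMERIC(12,2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact_part
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```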
Desired Features in ETL tools
- Expandable infrastructure
- Reusability of custom-made functions
- Efficient use of the available hardware resources
- Consolidated metadata management
ETL Tools
Commercial ETL Tools
- IBM Infosphere DataStage
- Informatica PowerCenter
- Oracle Warehouse Builder (OWB)
- Oracle Data Integrator (ODI)
- SAS ETL Studio
- Business Objects Data Integrator (BODI)
- Microsoft SQL Server Integration Services (SSIS)
- Ab Initio
Freeware, open source ETL tools:
- Pentaho Data Integration (Kettle)
- Talend Integrator Suite
- CloverETL
- Jasper ETL
Reporting Tools
- OBIEE
- SAP Crystal Reports
- Jaspersoft
- Business Objects
- Pentaho Reporting
- BIRT
- Cognos
- Microstrategy
- Actuate
- Qlikview
- Proclarity
- Excel
- Crystal Reports
- Pervasive Data Integrator 8.12
- IBM DataMirror Transformation Server 5.4
- ETL Solutions Transformation Manager 5.2.2
Data Integration
Data integration is the process of consolidating data from different sources, stored using different technologies, and providing a unified representation of the data. Data integration becomes critical when two companies merge or when applications are combined within a company
The sub-areas of data integration are
- Data warehousing
- Data migration
- Enterprise application/information integration
- Master data management
Data Integration techniques
- Manual integration: Users work directly with all the relevant data, accessing all the source systems or their web page interfaces
- Application-Based Integration: Requires the individual applications to implement all the integration effort. This method is feasible only for a limited number of applications
- Middleware Data Integration: Moves the integration logic out of the individual applications into a new middleware layer.
- Common Data Storage or Physical Data Integration: A new system is created that keeps a copy of the data from the source systems in order to store and manage it independently of the original systems