08 - Data Performance, Cleansing and Quality

Data Cleansing

Data cleansing or scrubbing is needed when consolidating heterogeneous data sources and should be handled along with schema-related data transformations. Data cleansing is a major part of the ETL process.

Significant features of data cleansing

  • It identifies and cleans errors and discrepancies in the data in order to improve data quality
  • When multiple data sources are integrated, e.g. federated database systems or global web-based information systems, the data from each source needs to be cleansed accordingly
  • Data cleansing occurs in a separate data staging area before loading the transformed data into the data warehouse
  • The data cleansing approach should be supported by tools to limit manual intervention and programming effort
  • It should not be performed in isolation but together with schema-related data transformations, based on detailed metadata

Dirty Data

The data to be cleansed by the data cleansing process is referred to as dirty data. Common reasons for dirty data are:

  • Dummy values
  • No data
  • Multipurpose fields
  • Ambiguous data
  • Conflicting data
  • Negligence of business rules
  • Non-unique identifiers
  • Data integration problems

Steps in Data Cleansing

Parsing

Parsing detects and analyzes individual data elements in the source files and then separates these data elements into the target files

For example, parsing a full name field into its first, middle, and last name components
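A minimal sketch of the parsing step in Python, assuming a whitespace-delimited full-name field; the field and function names are illustrative, not from any particular tool:

    def parse_full_name(full_name):
        """Split a raw full-name field into first, middle, and last name elements."""
        parts = full_name.strip().split()
        if len(parts) == 1:
            return {"first": parts[0], "middle": "", "last": ""}
        if len(parts) == 2:
            return {"first": parts[0], "middle": "", "last": parts[1]}
        # Everything between the first and last token is treated as the middle name
        return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}

    print(parse_full_name("Mary Jane Watson"))
    # {'first': 'Mary', 'middle': 'Jane', 'last': 'Watson'}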

Correcting

The parsed data is corrected using refined data algorithms and secondary data sources

For example, adding a missing ZIP code to a street address
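A sketch of the correcting step, assuming a hypothetical secondary data source that maps city and state to ZIP code; a real implementation would consult a postal reference database:

    # Hypothetical secondary data source: (city, state) -> ZIP code
    ZIP_LOOKUP = {("Springfield", "IL"): "62701", ("Portland", "OR"): "97201"}

    def correct_record(record):
        """Fill in a missing ZIP code from the secondary lookup table."""
        if not record.get("zip"):
            record["zip"] = ZIP_LOOKUP.get((record["city"], record["state"]), "")
        return record

    print(correct_record({"city": "Springfield", "state": "IL", "zip": ""}))
    # {'city': 'Springfield', 'state': 'IL', 'zip': '62701'}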

Standardizing

Standardizing is the process of converting the transformed data into its required format using both standard and custom business rules

For example, replacing a nickname with a standard name
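A sketch of the standardizing step using a hypothetical custom business rule that maps nicknames to standard first names:

    # Hypothetical custom business rule: nickname -> standard first name
    NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}

    def standardize_first_name(name):
        """Replace a nickname with its standard form and normalize capitalization."""
        return NICKNAMES.get(name.strip().lower(), name.strip().title())

    print(standardize_first_name("bob"))    # Robert
    print(standardize_first_name("alice"))  # Alice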

Matching

Finding and matching records within and across the parsed, corrected, and standardized data based on standard business rules to eliminate duplicates

For example, identifying and removing similar names and addresses
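A sketch of a simple matching rule that groups records sharing a normalized name-and-address key; production tools typically layer fuzzy matching on top of exact keys like this:

    from collections import defaultdict

    def match_key(record):
        """Build a normalized key from name and address for duplicate detection."""
        return (record["name"].strip().lower(), record["address"].strip().lower())

    def find_duplicates(records):
        """Group records that share the same normalized key."""
        groups = defaultdict(list)
        for rec in records:
            groups[match_key(rec)].append(rec)
        return [recs for recs in groups.values() if len(recs) > 1]

    records = [
        {"name": "Robert Smith", "address": "12 Oak St"},
        {"name": "robert smith ", "address": "12 OAK ST"},
    ]
    print(find_duplicates(records))  # both records fall into one duplicate group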

Consolidating

Analyzing and recognizing the relationships between the matched records and combining them into a single representation
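A sketch of the consolidating step that merges a group of matched records into a single representation; the survivorship rule used here (keep the first non-empty value of each field) is only an illustrative assumption:

    def consolidate(matched_records):
        """Merge matched records, keeping the first non-empty value of each field."""
        merged = {}
        for rec in matched_records:
            for field, value in rec.items():
                if not merged.get(field):
                    merged[field] = value
        return merged

    group = [
        {"name": "Robert Smith", "phone": "", "email": "rsmith@example.com"},
        {"name": "Robert Smith", "phone": "555-0100", "email": ""},
    ]
    print(consolidate(group))
    # {'name': 'Robert Smith', 'phone': '555-0100', 'email': 'rsmith@example.com'}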

Data Quality

Data quality is achieved by combining the data cleansing process with the following steps:

Data Enrichment

The process of improving the value of internally stored data by adding related attributes from external sources

For example, consumer demographic attributes or geographic descriptors
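A sketch of enrichment that attaches attributes from a hypothetical external demographics feed, keyed by ZIP code, to an internal customer record:

    # Hypothetical external source: demographic attributes keyed by ZIP code
    DEMOGRAPHICS = {"62701": {"median_income": 54000, "region": "Midwest"}}

    def enrich(customer):
        """Add external demographic attributes to an internally stored record."""
        extra = DEMOGRAPHICS.get(customer.get("zip"), {})
        return {**customer, **extra}

    print(enrich({"id": 1, "zip": "62701"}))
    # {'id': 1, 'zip': '62701', 'median_income': 54000, 'region': 'Midwest'}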

Data Profiling

The process of analyzing the data to acquire statistics (metadata) that offer insight into the quality of the data and help identify data quality issues. Summary data should be systematically collected for the database, covering the items below (a small profiling sketch follows the list):

  • Number of rows in each table
  • Number of unique, null values of each field
  • Skewness of distribution of field values
  • Data type, length of field
  • Functional dependencies, keys
  • Join paths
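A minimal profiling sketch over a table held as a list of dictionaries; it collects the row count plus null, distinct, and type statistics per field (skewness, functional dependencies, keys, and join paths would need additional analysis):

    def profile(rows):
        """Collect simple profiling statistics (metadata) for a table of rows."""
        stats = {"row_count": len(rows), "fields": {}}
        fields = {field for row in rows for field in row}
        for field in fields:
            values = [row.get(field) for row in rows]  # missing fields count as nulls
            stats["fields"][field] = {
                "nulls": sum(v is None for v in values),
                "distinct": len({v for v in values if v is not None}),
                "types": sorted({type(v).__name__ for v in values if v is not None}),
            }
        return stats

    rows = [{"id": 1, "state": "IL"}, {"id": 2, "state": None}]
    print(profile(rows))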

Data Monitoring

The process of setting up controls to ensure that the data continues to conform to the business rules that define data quality for the business. It is achieved by the means below (a small monitoring sketch follows the list):

  • Data stewards responsible for the data quality
  • Data tracking and auditing
  • Reconciliation of incrementally updated databases with original sources
  • DW-controlled methods
  • Source-controlled methods
  • Purpose-built programs that check the data quality
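A sketch of a monitoring control that evaluates illustrative business rules against incoming rows and counts violations for data stewards to review:

    # Illustrative business rules: rule name -> row-level predicate
    RULES = {
        "zip_is_5_digits": lambda row: len(str(row.get("zip", ""))) == 5,
        "amount_non_negative": lambda row: row.get("amount", 0) >= 0,
    }

    def monitor(rows):
        """Return the number of violations of each business rule."""
        violations = {name: 0 for name in RULES}
        for row in rows:
            for name, rule in RULES.items():
                if not rule(row):
                    violations[name] += 1
        return violations

    print(monitor([{"zip": "62701", "amount": 10}, {"zip": "627", "amount": -5}]))
    # {'zip_is_5_digits': 1, 'amount_non_negative': 1}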

Data Quality Attributes

The data quality attributes that collectively characterize the quality of data are:

  • Accuracy: Does the data represent the reality or a verifiable source?
  • Integrity: Is the structure of the data and relationships among the entities and attributes maintained consistently?
  • Consistency: Are data elements consistently defined?
  • Completeness: Is all the required data available?
  • Validity: Do all the data values fall within the acceptable ranges as defined by the business?
  • Timeliness: Is the data present when required?
  • Accessibility: Is the data easily obtainable, applicable and understandable?

Technical Challenges faced in Data Quality

  • Discrepancies in data format, structure, and values
  • Missing data, fields filled with nulls or default values
  • Spelling errors
  • Data in incorrect fields
  • Hidden information
  • Data deviations
  • Manual entry of data
  • Parallel entry of data

Measures of Data Quality

  • Reports should be correct and credible
  • Data-driven business processes work correctly
  • Invoices are accurate
  • Data quality should be a precursor to successful ERP, CRM or DSS implementations

Solutions for Data Quality

Preemptive Solutions

  • Process architecture (built-in integrity checks)
  • Process management (data sharing, rewarding correct data entry)

Retrospective Solutions

  • Cleaning focus (duplicate removal, merge/purge, name and address matching, field value standardization)
  • Diagnostic focus (automated identification of problems)

Verifications

  • Checksums and verification parsers
  • Do the uploaded files match an expected pattern? (see the sketch after this list)
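A sketch of two verification checks on an uploaded file: an MD5 checksum comparison and a filename pattern test (the sales_YYYYMMDD.csv pattern is an assumed convention):

    import hashlib
    import re

    def checksum_ok(path, expected_md5):
        """Verify the file's MD5 checksum against the value supplied with the feed."""
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest() == expected_md5

    def filename_ok(filename):
        """Verify the upload matches the expected naming pattern."""
        return re.fullmatch(r"sales_\d{8}\.csv", filename) is not None

    print(filename_ok("sales_20240101.csv"))  # True
    print(filename_ok("sales_jan.csv"))       # False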

Relationships

Checking the relationships between the data flows and the processing steps

Interface Agreements

Data quality commitment from the data stream supplier

Data Quality Metrics

  • Metrics are used to identify problems and the solutions to eradicate them
  • Metrics should be directionally correct: an improvement in the metric should reflect improved use of the data
  • Conformance to the schema and its constraints should be evaluated on a snapshot of the data
  • The business rules should be followed accurately

Data Performance

Over the last decade, the biggest data warehouses have expanded from 5 to 100 terabytes, according to Winter Corp. Data warehouses have grown in size to accommodate regulatory requirements and competitive demands. There are limits to the performance of an individual processor (CPU); hence, all high-performance computers combine multiple CPUs and multiple disks. The three main approaches for parallelizing work over this additional hardware are:

Shared memory

  • All the CPUs share a single memory and a single collection of disks
  • This approach is relatively easy to program: complex distributed locking and commit protocols are not required, because the lock manager and buffer pool are both stored in the shared memory system where all processors can access them
  • The disadvantage of shared memory is its limited fundamental scalability, since all I/O and memory requests are transferred over the same bus
  • They also require customized hardware in order to keep their L2 data caches consistent

Shared disk

  • A shared disk architecture consists of a large number of independent processor nodes, each with its own memory
  • These nodes have access to a single collection of disks, typically in the form of a storage area network (SAN) or a network-attached storage (NAS) system
  • The drawback of shared disk is that the interconnection network connecting the CPUs to the shared-disk subsystem can become a bottleneck
  • There is also no pool of memory shared by all the processors, so there is no obvious place for the lock table or the buffer pool to reside

Shared nothing

  • A shared-nothing approach has each processor with its own memory and its own set of data
  • Data is horizontally partitioned across nodes, so that each node has a subset of the rows from each table in the database; each node then processes only the rows on its own disks (see the partitioning sketch after this list)
  • These architectures are well suited for the star-schema queries of a data warehouse, since only a limited amount of communication bandwidth is required to join one or more small dimension tables with the larger fact table
  • Every node also has its own lock table and buffer pool, eliminating the need for complex distributed locking and software or hardware consistency mechanisms
  • The shared-nothing architecture does not suffer from the severe bus or resource contention of shared-memory or shared-disk machines, which allows it to scale to hundreds or even thousands of machines
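A sketch of horizontal hash partitioning in the shared-nothing style: each row is routed to a node by hashing its partitioning key, so every node owns a disjoint subset of the fact table (the node count and key choice are illustrative):

    def partition(rows, key, num_nodes):
        """Assign each row to a node by hashing its partitioning key."""
        nodes = [[] for _ in range(num_nodes)]
        for row in rows:
            # A production system would use a stable hash; Python's hash() suffices for a sketch
            nodes[hash(row[key]) % num_nodes].append(row)
        return nodes

    fact_rows = [{"order_id": i, "amount": i * 10} for i in range(8)]
    for node_id, local_rows in enumerate(partition(fact_rows, "order_id", 3)):
        print(node_id, [r["order_id"] for r in local_rows])
    # Each node joins its local fact rows with small, replicated dimension tables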

Features to implement for a high-performance data warehouse

  • Use a shared-nothing architecture
  • Build the architecture from commodity parts. The cost of a grid should never exceed $700 per CPU
  • Obtain a DBMS with compression
  • Use a column-store database; column stores can be 10 to 100 times faster than row-store databases on star-schema warehouse queries
  • The column-store database should have an executor that runs on compressed data
