10 - Advanced Data Warehousing Concepts

Objective of Data warehouse Deployment

Till the year 2011, the architecture of the data warehouses was built to enable the existence of vendor’s specific technologies. The new architectures paved the path for the new products. The companies invested in the vendor’s data warehouses architectures and an entire process of standardization was developed where different choices could be combined together

The organization should be able to move from wherever it is present now into a position where its data warehouse architecture should be able to support the analytics and business framework of a given choice. The technology, data, and the BI applications assemble at deployment

Data warehouse Deployment Lifecycle

Project scope and plan

  • Resolving the scope of the project and the accomplishment to be performed. The number of OLTP sources required
  • The target date for the data to be available in the system for the users
  • The budget of the project, the role and profile requirements of the resources required to make the project happen

Requirement

  • What are the business questions? How can these questions alter the business decisions or initiate actions
  • What is the role of the users? How frequently do they use the system? Do they use any interactive reporting or just view the defined reports in guided navigation?
  • How do the users measure? Do they use metrics?

Front-end design

  • The front-end design needs for both the interactive analysis and the analytic workflows
  • How does the user interact with the system?
  • What is the process of analysis?

Warehouse schema design

  • Dimensional modeling – define the dimensions, facts, and the granularity of the schemas
  • Define the physical schema – depending on the technology design

OLTP to data warehouse mapping

  • Logical mapping – table to table and column to column mapping
  • Defining the transformation rules
  • Perform the data profiling
  • ETL design – include data staging and the detail ETL process flow

Deployment

  • Installation of the analytics and reporting and the ETL tools
  • Specific setup and configuration for OLTP, ETL, and the data warehouse
  • Sizing of the system and the database
  • Performance tuning and optimization

Once the deployment is done the following activities are to be performed as and when required:

Operation

  • Data warehouses contain two or more servers. Tasks involving backups, bug fixes, software updates, hardware maintenance, upgrades, and security patches must be performed regularly
  • Operation and maintenance of such services requires an operations staff. The current trend is to out-source these activities.
  • Outsourcing can save substantial amounts and provide efficient security requirements

Enhancement

  • Requirements always tend to occur frequently even after the data warehouse is deployed. These requirements tend to be adhoc in nature
  • As the users become efficient in handling the data warehouse, they would tend to require a system with more functionalities
  • Always ensuring that the system is built in  a way to accommodate the adhoc requirements goes a long way in building a successful data warehouse
  • It is therefore advised to have a flexible and scalable data warehouse

Help Desk

  • Good manuals and good training are always enough to use a data warehouse, but along with that a knowledgeable, available, responsive help desk is always critical for a successful data warehouse
  • Without a  help desk a data warehouse eventually becomes dated and under-utilized
  • Deployment of an excellent help desk with telephone, fax, e-mail is the single most important feature that ensures the continued success of  a data warehouse

Real-time Data warehousing

Traditionally data warehouses do not include today’s data. They contain the historical data. They utmost contain data from operational systems that are weekly or in some cases nightly, but are in any case something which is past information. The quick pace of business today is causing these historical systems less valuable to the business in the real world. As the current decisions in the business world become more real-time, the systems that support these decisions must be able to incorporate the changes at a greater speed.

The following are the challenges and approaches for making the data warehouse a real-time process:

Enabling the real-time ETL

Challenge

  •  The most difficult part in building a data warehouse is the extraction, transformation, and loading the data from the source system. Almost all the ETL tools and systems, whether based on off-the-shelf or custom coded, operate in a batch-mode. The assumption is made that the data is available as a sort of extract file that is nightly, weekly, or monthly
  • This process typically involves downtime of the data warehouse, so no users can access it when the load occurs. When data is loaded in real-time there can be no downtime of the system
  • The heaviest periods of data warehouse usage occurs along with the peak incoming data. The requirements for continuous updates, with no warehouse downtime are generally inconsistent with the traditional ETL tools

Solution

  •  The cheapest and easiest way to solve the real-time ETL problem is done by simply increasing the frequency of the existing data load. A daily data load can be made an hourly data load
  • The other option is to store the real-time data in an external real-time data cache outside the traditional data warehouse
  • The RTDC (real-time data cache) can be a dedicated database server involved in loading, storing, and processing the real-time data. There is no risk of introducing scalability or performance problems on the existing data warehouse when using an RTDC

Scalability

Challenge

  •  The issue of scalability is the most difficult problem faced by organizations when deploying real-time data warehouse solutions. The scalability of a data warehouse is directly proportional to the amount of data being queried  and the number of users simultaneously running the queries

Solution

  •  Simplifying the real-time reporting. The reports need to be confined to simple and quick single-pass queries
  • The addition of more hardware to deal with the expandability problems is another solution. More nodes can be added to a high-end database system or a stand-alone data warehouse box can be upgraded with faster processors and additional memory

Query Tools

A querying and reporting tool helps in executing regular reports, organize listings, and perform cross-tabular reporting and querying

Significance of SQL

  • SQL is the official database query language that is used to access and update the data involved within a relational data base management system or RDBMS
  • The importance of SQL for querying an reporting is that the language has represented a mostly standard way of accessing different RDBMS products
  • Each RDBMS product has a minor different SQL dialect but the basic syntax is same

Technical Query Tools

  • SQL forms the basis for the technical query tools. Many product-to-product matchups are available in data warehousing environments that offers tools provided by RDBMS vendors and other third-party vendors
  • A series of query tools allows the users to type in and edit the SQL queries. These tools are not designed for the end-users but for the developers involved in the coding and testing process

User Query Tools

User query tools offer visually-oriented, painting, environments that enables users to create screens for report-layouts, the data columns desired for the report, or the rows of data that the user wants to select

Reporting Tools

  • When the requirements of the end-users is complex user interaction or sophisticated formats, a tool with more reporting features is included
  • The reporting tools provide data accessing, filtering, and simple formatting
  • Reporting tools offer an environment that enables the users to create refined layouts that concentrate on formatting the data obtained from the database query

Data warehousing Analytics

Data warehousing analytics administers a framework of database, reports, and data objects that are created to interface with one or more Commerce Server runtime databases. The data warehouse analytics system is incorporated with a SQL server database, an analysis services databases, a set of functionalities that a system administrator uses to import and maintain data

Use of data warehouse analytics

  • Using the SQL Server 2005 or SQL Server 2000 reporting services we can create reports. SQL Server Reporting Services (SSRS) is used to create simplified custom reports
  • Addition of custom objects can be performed to the Commerce Server Data Warehouse resulting in extensibility
  • Tasks can be set up to import data from Internet Information Services(IIS) logs and from the Commerce Server Profiles, Catalogs, and Marketing Databases. Data can also be imported using the Import Wizard

Types of data representation using analytics

Logical Representation

The logical representation consists of the following entities

  • Classes: A class signifies a set of logically related information
  • Data members: A data member represents an attribute of a class
  • Keys: A key is an exclusive identifier for a class
  • Relations: A relation classifies a connection between two classes

Physical Representation

  • The physical representation composes of database tables, columns, and keys
  • A table correlates to a class and a column in turn correlates to a data member
  • Metadata stored in a data warehouse is used to define the conversion of the logical data into its physical representation

Analysis Services Representation

  • Data warehouse analytics system transforms data from the SQL server tables to Online Analytical Processing(OLAP) cubes in Analysis Server
  • The OLAP represents data consisting of cubes, measures, and dimensions

Data Visualization

Data visualization is the process that defines any effort to assist people to understand the importance of data by placing it in a visual context. Patterns, trends, and correlations that might be missed in text-based data can be represented and identified with data visualization software. It is a graphical representation of numerical data.

Types of data visualization

Visual Reporting

  • Visual reporting uses charts and graphics to represent the business performance, usually defined by metrics and time-series information.
  • The best dashboards and scorecards enables the users to drill down one or more levels to view more detailed information about a metric
  • A dashboard is a visual exception report that signifies the ambiguities in performances using visualization techniques

Visual Analysis

  • Visual analysis allows users to visually explore the data to observe the data and discover new insights
  • Visual analysis offers a higher degree of data interactivity
  • Users can visually filter, compare, and correlate the data at the speed of thought incorporating forecasting, modeling, and statistical analysis

Data Visualization Representations

Business Intelligence Dashboard

  • A business intelligence dashboard is a data visualization tool that represents the current status of metrics and key performance indicators for an enterprise
  • Dashboards combine and arrange numbers, metrics and sometimes performance scorecards on a single screen. They can be customized for a specific role and display metrics targeted for a single point of view or department
  • Microsoft and Oracle  are some of the  vendors for business intelligence dashboards. BI dashboards can also be created through other business applications, such as Excel. They are sometimes referred to as enterprise dashboards.

Performance Scorecard

  • It is a graphical representation of the progress over time of some entity, such as an enterprise, an employee or a business unit that functions towards some specified goal
  • The important factors of performance scorecards are targets and key performance indicators (KPIs). KPIs are the metrics that are used to evaluate factors that are essential for the success of an organization

The main difference between a business intelligence dashboard and a performance scorecard is that a business intelligence dashboard, like the dashboard of a car, indicates the status at a particular point in time. A performance scorecard displays the progress over time towards specific goals

Metadata

  • Metadata is the data in a data warehouse that is not typically the data itself but it’s the data about the data.
  • It can termed as the encyclopedia of the data warehouse
  • It consists of information on the database objects used in a data warehouse, system tables, indexes, views, database security levels, roles, and grants
  • It also contains data about the ETL transformations that load data from the staging area to the data warehouse. It also contains aggregate table functions
  • The importance of metadata is mostly compromised. In order to have a control over the data warehouse development and maintenance it’s a best practice to create a repository to store metadata and update it on a daily basis
  • The clarity of relationships, discrepancies , and uncertainty between the entities stored in a data warehouse is resolved using the metadata
  • The information contained in metadata helps in combining the data that has differences such as ‘’clients’’ and ‘’customers’’
  • Metadata contains information about the origin and granularity of the data at the attribute level

Three categories of metadata

  • Business metadata comprises the definitions of the data files and attributes in terms of business. It also consists of data quality metrics and similar information for the business users to navigate through the repository of information
  • Technical metadata is the most common form of metadata. This metadata is designed and used by the tools and applications that create, manage, and use data
  • Operational metadata consists of information available in operational systems and run-time environments. It also contains data file size, date and time of data load, updates and backups

Big Data

What is Big Data

Big data involves the disparate, volatile, and enormous data utilized in the various technologies. Using the big data database the different enterprises can recover their financial prospects, improve their economy and incorporate various business aspirations.

Need for Big Data

Constitute newer applications

Big data is used by enterprises to accumulate a vast amount of data from its real-time processes that might be product-related, customer-related, or resource-related. The data that is collected is modified to suit the business needs which in turn improve the optimal use of the resources and the customer satisfaction. This creates newer applications by recreating the smaller sets of data from the big data

Diminish the application costs

Big data can reinstate the huge applications that are tailor-made, costly legacy systems with a definitive solution that adheres to commodity hardware. Many of the big data technologies run on open source (free of cost) hence they can be enforced on a cheaper scale thus reducing the overall cost of maintaining an application

Adheres the customer-organization relationship

The quantity of data stored and the frequency of update on the data ensures that the businesses to meticulously respond to the customers. This also involves a faster turn out time to the customers which improves the trust factor between the customers and the enterprises

Types of Big Data

Online Big Data

Online Big Data consists of the data that is established and modified in real-time applications to backup the operational applications and the users. The intermission time for these applications should be low and the availability high since it’s predominantly used by the users to meet their SLAs (Service Level Agreements). MongoDB and NoSQL databases are the examples for online Big Data applications

Offline Big Data

Online Big Data include applications that modify the Big Data in a from a batch perspective. They do not constitute new data. The response time for this data is usually low. This data is utilized to provide reports or dashboards due to which even if these applications go offline there is no need for panic among the users. ETL and BI tools are examples for offline Big Data

Enterprise Data Warehouse

An enterprise data warehouse is a collective database that contains all the business information an organization and makes it accessible across the company. It offers access to all the data in an organization without compromising the security and integrity of the data

Advantages of EDW

  • It was conceptualized for analytical and strategic business decisions and to help enhance the client’s competitive positioning
  • A consolidated data mart offers an enterprise view of the organization’s business for advanced analytics and reporting
  • It is designed to simplify information access and standardize it across several source systems that are disparate
  • The lesser the processes and systems to support, the maintenance of systems would be less demanding

OLAP

OLAP is a term used to describe the analysis of complex data from the data warehouse. The three layers of OLAP servers are

ROLAP

  • Relational OLAP is a relational and specialized DBMS to store and manage the data in a data warehouse
  • Involves the optimization of DBMS backend, implementation of greater aggregation navigation logic, and additional tools and services
  • Greater expandability

MOLAP

  • Multi-dimensional OLAP is a array-based storage structure
  • It has a direct access to array data structures
  • Fast-indexing to pre-computed summarized data

HOLAP

  • Hybrid OLAP stores detailed data in RDBMS
  • Also stores aggregated data in MDBMS
  • User access via MOLAP tools

Like us on Facebook