03 - Datawarehouse Infrastructure

Data Warehouse Hardware

Data warehouse designers and administrators should always have forethought about the Input/Output performance while implementing a data warehouse. The data warehouse operations mainly consist of huge data loads and index builds, generation of materialized views, and queries over large volumes of data. The elemental I/O system of a data warehouse should be built to meet these heavy requirements.

Number of CPUs

  • CPUs are responsible for the calculation abilities of a data warehouse
  • Parallel operations are CPU- intensive when compared to the serial operations
  • The number of CPUs is based on the highest throughput. The number of CPUs is calculated roughly based on the below formula:

No of CPUs = Maximum throughput in Mb/s /200

200 Mb/sec is the amount of data a CPU can sustain

Memory of data warehouse

  • Large sorts is an example of a memory-intensive operation
  • The memory requirements of a data warehouse are not the same when compared to mission-critical OLTP applications
  • The amount of memory is calculated based on the number of CPUs as stated below:

Amount of memory in GB = 2 * Number of CPUs

Number of Disks

  • The number of disks is based on the maximum throughput. The storage provider’s specifications should be used to find out the throughput a disk array can withstand
  • The number of disks is calculated as stated below:

Number of disks = Throughput in MB/s / Individual controller throughput in MB/s

Disk Redundancy

  • Data warehouses are the largest storage systems used widely across several enterprises. They have many disks which are liable to failure
  • Disk redundancy is important to avoid the failure of the entire system in case of hardware malfunctioning
  • Redundancy should be achieved based on the cost constraints and performance of a data warehouse. In case of failure of one disk the data is stored in another disk is always critical for a data warehouse

Plan for expansion

  • The data in a data warehouse keeps on growing. The data warehouse designer should focus on the growth of the I/O system without hindering the I/O bandwidth
  • In order to prevent unwanted data from burdening the systems businesses should pay attention to the age and overall quality of the archived data. Depending on the needs of the businesses the archiving methods should be effectively performed periodically
  • An example is IBM InfoSphere Balanced Warehouse available in all configurations and sizes that helps designers discover, model and standardize the data and the IBM Optim software helps in automated archival and storage of historical records

Data Warehouse Software

Data warehousing demands the following prominent features when deciding a platform for functioning that are listed below:

  • A prospect of combining various management systems
  • A possibility of enhancing the structure of the queries
  • A possibility of improving the load processes

The below table states the software with the versions considered for a data warehouse

Vendor Name

Products and Version

IBM

InfoSphere Balanced Warehouse 9.5

SAP

NetWeaver BI (Business Warehouse) 7.0

Teradata

Active Enterprise Data Warehouse 5550, Data Warehouse Appliance 2550, Data Mart Appliance 550, 12

Microsoft

SQL Server 2008

Oracle

Optimized Warehouses, Database 11g, Warehouse Builder 11g

Sybase

Analytic Appliance, IQ 12.7

Netezza

Performance Server 1000 Series Data Warehousing Appliance 4.5

 

IBM

IBM is extensively used for its customer care, offering solutions according to the client’s needs. The distinguishing features of IBM are

  • DB2 a product of IBM is well-known for its efficient enterprise database
  • Complete integration with Webshpere, Cognos and Infosphere
  • IBM appliances are adaptable for supporting mixed workloads - OLAP, ETL, and

Ad-hoc queries

SAP

SAP is not famous among new customers but focuses on its already existing users. The features of Sap are

  • Row-oriented storage capabilities
  • Wide range of EDW services by SAP itself or significant partnerships
  • Complete integration with service-based architecture capabilities, BI, middleware, and desktop software

Teradata

Among all the vendors Teradata wins hands down. Teradata’s scalability makes this possible. The following are the other significant reasons

  • EDW packaging, pricing, and licensing options
  • Solutions prepared by the teradata designers are well understood and supportive for the external applications and middle ware
  • Extensive range of products and services
  • Extraordinary scalability allows the Teradata customers to expand data warehouses through Massively Parallel Processors(MPP)

Microsoft

Magnificent acquisitions achieved by Microsoft have made it a well known platform among different enterprises. Its features are

  • SQL Server 2008 is a platform worth observing. It can be easily deployed in different BI and EDW technologies
  • Cost-based query optimization, indexing, partitioning and caching
  • Efficiently adjustable database that can be used by businesses for various size and trade

Oracle

Oracle Database 11g and Oracle Warehouse Builder are the tools that have let Oracle achieve its position among the vendors. Its features are

  • The EDWs can be expanded to various nodes that can sustain hundreds of terabytes
  • Provides its customers a range of choice among different Optimized Warehouse EDW appliances
  • Oracle solutions can be deployed on different hardware and software platforms, they are not restrained by a standard configuration

Sybase

Sybase offers low-cost solutions to the mid-market and less budget enterprises. The features of Sybase are

  • Provides scalability options
  • Supports integration and modeling tools
  • Compatibility with various platforms

Netezza

The constant growth of Netezza’s portfolio has placed it among the top data warehousing platforms. Its features are

  • Exclusive model of physical data storage implementing.
  • Powerful in-database analytic framework
  • Efficient BI, OLAP, query, and advanced analytic frameworks

Data warehouse Appliances

Large and small enterprises need to meet the demands of the ever-expanding workload requirements and continuous delivery of ROI while reducing the budgets. Appliances that are purpose-built devices that pre-integrate hardware and software to address particular workloads are promptly being considered

Whole Technology Stack Appliances

  • Netezza was the first vendor to provide a data warehouse appliance in 2002. The Netezza performance server consolidates database and operating system software with server and storage hardware in a complete data warehouse platform
  • DATAllegro started in mid-2005 was similar to Netezza. Microsoft acquired DATAllegro in 2008 combined DATAllegro’s MPP architecture into SQL Server, which runs on commodity hardware.

Partial-Technology Stack Appliances

  • Launched in 2006, new vendors evolved with database management systems. These include DBMS based on the relational model(Greenplum & Kognitio)
  • These appliances typically call themselves software appliances since they focus on the database software

Data warehouse Tools

Data warehousing tools include

  • Dominant database vendors (IBM, Microsoft, Oracle)
  • Independent companies (Informatica – the current leader in the market)
  • Open source (Clover ETL)

IBM

  • Product Name: InfoSphere DataStage
  • Supports various data sources and ability to handle data arriving in real-time
  • Scalability
  • Includes data quality tools, business process management, change data capture, and online marketing tools

Microsoft

  • Integration Services – part of SQL Server
  • Supports different data types
  • Data and text mining tools available

Oracle

  • Oracle Warehouse Builder
  • Migrating data from legacy systems
  • Data profiling
  • Data cleaning

Oracle

  • Oracle Warehouse Builder
  • Migrating data from legacy systems
  • Data profiling
  • Data cleaning
  • Supports functions, procedures, and packages

Informatica

  • Market leader – Informatica PowerCenter
  • Supports migration, synchronization, warehousing
  • Supports multiple data formats
  • Real-time capabilities
  • Data cleaning
  • Supports functions, procedures, and packages

Like us on Facebook