03 - Normalizing and denormalizing data

3.1 Data Redundancy

Redundancy Indicates the data is repeated. It increases the time for updating, adding and deleting data. The disk space and disk I/O operations are increased due to the redundancy.

Consider the following table STUDENT. 

 

STUDID

STUDENT

NAME

STUDENT

CITY

STUDENT

CLASS

STUDENT

TEST SCORE

101

Mary

Wales

MS

50

101

Mary

Wales

MS

89

102

Mark

Chicago

MBA

60

102

Mark

Chicago

MBA

83

103

Peter

Wales

BE

70

103

Peter

Wales

BE

58

104

Joe

Perth

Literature

65

104

Joe

Perth

Literature

90

105

Harry

New York

Science

44

105

Harry

New York

Science

50

 In the above table, the details of the students like STUDID, STUDENTNAME, and STUDENTADDRESS are repeated while adding the marks for each semester. If the user needs to modify the address for a student, it has to be modified in multiple rows for that student. If it is not done it could lead to data inconsistency across rows. 

Redundancy can cause the following issues:

1) Insertion, deletion and modification of data that can cause inconsistencies.
2) Errors can occur when the data is repeated.
3) Unnecessary utilization of the extra disk space.

The concept of normalization can be used to reduce redundancy.

3.2 Normalization Types

Normalization is a process of breaking the complex table structures into simpler table structures by using rules. Using the approach user can reduce the redundancy in a table and eliminate the problem of inconsistency. There is no loss of information.

Normalization has the following benefits:

1) It helps in maintaining data integrity
2) It helps in simplifying the structure of the table and thus makes the database more compact
3) It helps in reducing the null values which reduces the complexity if the data operations

With normalization, tables are formed that follow certain specified rules and represent certain normal forms. The forms are used to check that no error or inconsistency in data is introduced in the database.  A table structure is always in certain normal form. The widely used normal forms are as mentioned below:

1) First Normal Form (1NF)
2) Second Normal Form (2NF)
3) Third Normal Form (3NF)
4) Boyce Codd Normal Form (BCNF)

1) First Normal Form (1NF)

A table is said to be in 1NF when each cell of the table contains only one value.

The rules for converting a table to 1NF are as follows:

1) Place the related data values in a table. Define similar data values with the column name.
2) There should be no repeating group in the table
3) Every table must have a unique primary key

Consider the PROJECT table.

EMPLID

DEPT

DEPTHEAD

PROJECTCODE

Time

E001

Sales

E111

P40

P70

P50

200

181

90

E281

Marketing

E122

P51

P35

190

50

E351

HR

E133

P67

P90

NULL

30

In the above table, the values are related and similar values are defined by common attribute. The table has a primary key as EMPLID. But the columns PROJECTCODE and TIME have multiple values and hence it is not in the normal form.

By applying the 1NF to the table, the new PROJECT table is mentioned below:

EMPLID

DEPT

DEPTHEAD

PROJCODE

HOURS

E001

Sales

E111

P40

200

E001

Sales

E111

P70

181

E001

Sales

E111

P50

90

E281

Marketing

E122

P51

190

E281

Marketing

E122

P35

50

E351

HR

E133

P67

NULL

E351

HR

E133

P90

20

2) Second Normal Form (2NF)

A table is said to be in 2NF when 

1) It is in 1NF
2) No partial dependency exists between the non key and key attributes

The rules for converting a table into 2NF are as follows:

1) Remove the attributes that are functionally dependent on only a part of the key and not on the whole key. Place them in a different table.
2) Group the remaining attributes.

Consider the STUD_SUB_DETAILS table as shown below:

STUD_ID

STUD_FIRST_NAME

STUD_LAST_NAME

SUB_CODE

SUB_NAME

ST32

Mark

Steven

SC345

Java

ST34

Peter

Hanks

SC534

HTML

ST65

Joey

Carol

SC345

Java

ST44

Lee

Jones

SC564

JavaScript

ST34

Peter

Hanks

SC564

JavaScript

ST66

Tom

Sean

SC655

XML

Consider the following points about the table.

1) The primary key of the preceding table is STUD_ID + SUB_CODE.
2) There are repeated values in the column STUD_FIRST_NAME, STUD_LAST_NAME and SUB_NAME. These attributes depend only on the part of the primary key.
3) The values in the column STUD_FIRST_NAME and STUD_LAST_NAME are dependent on STUD_ID.
4) The values in the SUB_NAME are only dependent on SUB_CODE and all non primary fields are not functionally dependent on the whole primary key.

Hence, to normalize the table in 2NF, user needs to break the table into smaller tables.

STUD_DETAILS

STUD_ID

STUD_FIRST_NAME

STUD_LAST_NAME

ST32

Mark

Steven

ST34

Peter

Hanks

ST65

Joey

Carol

ST44

Lee

Jones

ST66

Tom

Sean

SUBJECT_DETAILS

SUB_CODE

SUB_NAME

SC345

Java

SC534

HTML

SC564

JavaScript

SC655

XML

STUD_SUBJECT_DETAILS

STUD_ID

SUB_CODE

ST32

SC345

ST34

SC534

ST65

SC345

ST44

SC564

ST34

SC564

ST66

SC655

3) Third Normal Form (3NF)

A relation is said to be in third normal form if and only if: 

1) It is in 2NF
2) No transitive dependency exists between the non key attributes and key attributes.

The guidelines to convert the table into 3NF are as follows:

1) Find and remove all the non key attributes that are functionally dependent on attributes that are not in the primary key. Place them in different table.
2) Group the remaining attributes

Consider the following EMPLOYEE table.

EMPLID

DEPTNAME

DEPTHEAD

E801

Sales

E880

E231

Marketing

E909

E350

Production

E910

E481

HR

E266

E550

Testing

E714

E271

Accounts

E945

In the above table, there is a single value in each cell. Hence it is in 1NF. 

The primary key in the table is EMPLID. For each value of EMPLID there is only one value of DEPTNAME. Hence, DEPTNAME is functionally dependent on the primary key EMPLID. For each value of EMPLID there is one DEPTHEAD. Hence all the attributes are functionally dependent on the whole key, EMPLID. Hence it is in 2NF.

To convert the table into 3NF, user must remove the DEPTHEAD column from the table as it is not functionally dependent on the primary key, EMPLID. 

The following table shows the 3NF form.

EMPLOYEE

EMPLID

DEPTNMAE

E801

Sales

E231

Marketing

E350

Production

E481

HR

E550

Testing

E271

Accounts

DEPARTMENT

DEPTNAME

DEPTHEAD

Sales

E880

Marketing

E909

Production

E910

HR

E266

TestingE714
AccountsE945

4) Boyce – Codd Normal Form (BCNF)

A relation is said to be in BCNF if and only if every determinant is a candidate key. 

The guidelines for converting a table into BCNF are as follows:

1) Find and remove the overlapping candidate keys. Place the part of the candidate key and the attribute, it is functionally dependent on, in a different table.
2) Group the remaining items in a table.

Consider the following PROJECT table.

EMPLID

EMPLNAME

PROJCTCODE

TIME

E201

Nick

P272

101

E201

Harry

P111

80

E302

Sam

P134

150

E401

Adam

P169

15

E503

Nisha

P170

25

E909

Perry

P180

60

The above table has redundancy. The PROJECT table has the following problems.

1) Multiple candidate keys that are ECODE+PROJCODE and NAME+PROJCODE.
2) Composite candidate keys
3) Candidate keys that overlap since the attribute PROJCODE is common between the two candidate keys.

The tables in BCNF are as shown below:

EMPLID

EMPNAME

E201

Nick

E201

Harry

E302

Sam

E401

Adam

E503Nisha
E909Perry

PROJECT

EMPLID

PROJCTCODE

TIME

E201

P272

101

E201

P111

80

E302

P134

150

E401

P169

15

E503

P170

25

E909

P180

60

3.3 Understanding denormalization

The intentional introduction of redundancy in a table in order to improve performance is called denormalization.

Consider the following tables as ORDERS and PRODUCTS.

ORDERS

ORDERNO

PRODUCTID

QTY

101

P1

2

102

P3

1

103

P1

1

104

P2

3

105

P2

2

PRODUCTS

PRODUCTID

PRODUCT DESC

COST

P1

Toy Dog

20

P2

Toy Tiger

10

P3

Toy Lion

12

If a store keeper wants to calculate the total sale of a toy, there are thousands of rows. The server will take a lot of time to process the query and return the result. Hence, to speed up the query, the user can store the values of TAX, ORDERCOST, and PRODUCTCOST in one table as shown below.

ORDERS

ORDERNO

PRODUCTID

QTY

PRODUCTCOST

TAX

ORDERCOST

101

P1

2

40

8

88

102

P3

1

12

1.2

13.2

103

P1

1

20

2

22

104

P2

3

30

9

99

105

P2

2

20

4

44

Like us on Facebook