June 05, 2011

Data Normalization

The process of modeling data into relational tables is known as normalization.
Common normal forms: 1NF, 2NF and 3NF.
Data stored in a relational database and accessed with SQL should be normalized to 3NF.
PK (primary key): An attribute or set of attributes that uniquely identifies a record.

Books Table:
ISBN (PK)  Title       Authors      Publisher
123        Oracle 10g  Auth1,Auth2  Pub1, Street1, City1, State1
124        MS SQL      Auth1,Auth3  Pub2, Street2, City2, State1

Problems with the above schema:
1. Insertion anomaly: It is impossible to enter details of authors whose books are not yet published, because there is no ISBN to use as the key.
2. Deletion anomaly: It is impossible to delete a book without also losing its publisher details.
3. Update anomaly: If a publisher's address changes, every book that publisher has published must be updated.
4. Data inconsistency: An incorrect entry of the publisher address in any one row leaves the rows disagreeing with each other.
5. The Authors value is multi-valued, so finding the books written by Auth1 requires searching inside every value of the Authors column.
6. Storage inefficiency: Each publisher's address is repeated in every row for that publisher.
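
Two of these problems can be reproduced with a small sketch. It uses Python's built-in sqlite3 module and an in-memory database; the table name books_flat, the NOT NULL constraint on the key, and the author Auth4 are assumptions made for illustration only and are not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unnormalized table: authors and publisher details are packed into single columns.
conn.execute(
    "CREATE TABLE books_flat ("
    " isbn TEXT PRIMARY KEY NOT NULL,"
    " title TEXT, authors TEXT, publisher TEXT)"
)
conn.executemany(
    "INSERT INTO books_flat VALUES (?, ?, ?, ?)",
    [
        ("123", "Oracle 10g", "Auth1,Auth2", "Pub1,Street1,City1,State1"),
        ("124", "MS SQL", "Auth1,Auth3", "Pub2,Street2,City2,State1"),
    ],
)

# Problem 5: finding Auth1's books means pattern matching inside every Authors value.
rows = conn.execute(
    "SELECT isbn FROM books_flat"
    " WHERE ',' || authors || ',' LIKE '%,Auth1,%'"
    " ORDER BY isbn"
).fetchall()
isbns_by_auth1 = [row[0] for row in rows]

# Problem 1: an author without a published book has no ISBN, so the row is rejected.
# Auth4 is a hypothetical author used only for this demonstration.
try:
    conn.execute("INSERT INTO books_flat VALUES (NULL, NULL, 'Auth4', NULL)")
    insert_failed = False
except sqlite3.IntegrityError:
    insert_failed = True

print(isbns_by_auth1, insert_failed)
```

The comma-wrapping in the LIKE pattern is needed so that a search for Auth1 does not also match a longer name that merely starts with the same characters, which is exactly the kind of workaround a multi-valued column forces.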


1NF:
Remove the repeating groups: here, the multi-valued Authors column is pulled out into a separate table, Authors, with one row per book and author.
Books Table:
ISBN (PK)  Title       Publisher
123        Oracle 10g  Pub1, Street1, City1, State1
124        MS SQL      Pub2, Street2, City2, State1

Authors Table:
ISBN  Author Name
123   Auth1
123   Auth2
124   Auth1
124   Auth3

Problems solved by 1NF:
1. Insertion anomaly: Author records are now kept in the Authors table, so there is no need to insert an unpublished book into the Books table just to record an author.
2. Queries such as counting the books written by Auth1, or finding the author who has written the most books, are now simple.
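
The second point can be sketched with the 1NF tables above. This is a minimal illustration using Python's sqlite3 module; the lowercase table and column names (books, authors, author_name) are assumed spellings of the names in the tables, not something the original fixes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (
    isbn TEXT PRIMARY KEY NOT NULL,
    title TEXT,
    publisher TEXT
);
CREATE TABLE authors (
    isbn TEXT NOT NULL,
    author_name TEXT NOT NULL,
    PRIMARY KEY (isbn, author_name)
);
""")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("123", "Oracle 10g", "Pub1,Street1,City1,State1"),
        ("124", "MS SQL", "Pub2,Street2,City2,State1"),
    ],
)
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [("123", "Auth1"), ("123", "Auth2"), ("124", "Auth1"), ("124", "Auth3")],
)

# Number of books written by Auth1: a plain equality filter, no string searching.
auth1_count = conn.execute(
    "SELECT COUNT(*) FROM authors WHERE author_name = ?", ("Auth1",)
).fetchone()[0]

# Author with the most books: group by author and take the largest count.
top_author = conn.execute(
    "SELECT author_name, COUNT(*) AS n FROM authors"
    " GROUP BY author_name ORDER BY n DESC, author_name LIMIT 1"
).fetchone()

print(auth1_count, top_author)
```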

2NF:
Remove columns that do not depend on the PK of the table. Here the publisher address depends on the publisher, not on the ISBN of the book, so it is moved out of the Books table.
Books Table:
ISBN (PK)  Title       Publisher
123        Oracle 10g  Pub1
124        MS SQL      Pub2

Publishers Table:
Publisher (PK)  Street   City   State
Pub1            Street1  City1  State1
Pub2            Street2  City2  State1

Problems solved by 2NF:
1. Update anomaly: Only one record needs to be updated when a publisher's details change.
2. Storage efficiency: Publisher details are no longer stored once per book.
3. Data consistency: Each publisher's details are entered once, so rows can no longer disagree because of an incorrect entry in the Publisher column.
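
A short sketch of the first point, again with Python's sqlite3 module. The new street value Street9 and the lowercase table and column names are hypothetical, chosen only to show that a change of address touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publishers (
    publisher TEXT PRIMARY KEY NOT NULL,
    street TEXT, city TEXT, state TEXT
);
CREATE TABLE books (
    isbn TEXT PRIMARY KEY NOT NULL,
    title TEXT,
    publisher TEXT
);
""")
conn.executemany(
    "INSERT INTO publishers VALUES (?, ?, ?, ?)",
    [("Pub1", "Street1", "City1", "State1"), ("Pub2", "Street2", "City2", "State1")],
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("123", "Oracle 10g", "Pub1"), ("124", "MS SQL", "Pub2")],
)

# Pub1 moves to a (hypothetical) new street: exactly one row changes,
# no matter how many books Pub1 has published.
cur = conn.execute(
    "UPDATE publishers SET street = ? WHERE publisher = ?", ("Street9", "Pub1")
)
rows_updated = cur.rowcount

# Every book picks up the new address through the join.
book_streets = conn.execute(
    "SELECT b.isbn, p.street FROM books b"
    " JOIN publishers p ON p.publisher = b.publisher"
    " ORDER BY b.isbn"
).fetchall()

print(rows_updated, book_streets)
```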

3NF:
Remove columns that depend on other non-key columns. In the Publishers table, Street --> City and City --> State, i.e. a street belongs to only one city and a city belongs to only one state. This is resolved by adding an address code column to the Publishers table and moving street, city and state into a separate Address table.
Publishers Table:
Publisher (PK)  Address code
Pub1            AC1
Pub2            AC2

Address Table:
Address code (PK)  Street   City   State
AC1                Street1  City1  State1
AC2                Street2  City2  State1

Normalized Books database:
Table       Primary Key         Foreign Key
Books       ISBN                Publisher
Authors     ISBN + Author Name  ISBN
Publishers  Publisher           Address code
Address     Address code        -
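
The summary table can be written out as a schema. The following is a sketch in Python's sqlite3 module, not a definitive design: the lowercase names, the TEXT column types, and the rejected test row (ISBN 125, publisher Pub9) are assumptions for illustration. SQLite only enforces foreign keys after PRAGMA foreign_keys = ON is issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign key checks are off by default in SQLite
conn.executescript("""
CREATE TABLE address (
    address_code TEXT PRIMARY KEY NOT NULL,
    street TEXT, city TEXT, state TEXT
);
CREATE TABLE publishers (
    publisher TEXT PRIMARY KEY NOT NULL,
    address_code TEXT REFERENCES address(address_code)
);
CREATE TABLE books (
    isbn TEXT PRIMARY KEY NOT NULL,
    title TEXT,
    publisher TEXT REFERENCES publishers(publisher)
);
CREATE TABLE authors (
    isbn TEXT NOT NULL REFERENCES books(isbn),
    author_name TEXT NOT NULL,
    PRIMARY KEY (isbn, author_name)
);
""")
conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?)",
    [("AC1", "Street1", "City1", "State1"), ("AC2", "Street2", "City2", "State1")],
)
conn.executemany("INSERT INTO publishers VALUES (?, ?)", [("Pub1", "AC1"), ("Pub2", "AC2")])
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("123", "Oracle 10g", "Pub1"), ("124", "MS SQL", "Pub2")],
)
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)",
    [("123", "Auth1"), ("123", "Auth2"), ("124", "Auth1"), ("124", "Auth3")],
)

# Reassemble the original wide row for ISBN 123 by joining the four tables.
book_123 = conn.execute(
    "SELECT b.title, a.author_name, p.publisher, ad.city"
    " FROM books b"
    " JOIN authors a ON a.isbn = b.isbn"
    " JOIN publishers p ON p.publisher = b.publisher"
    " JOIN address ad ON ad.address_code = p.address_code"
    " WHERE b.isbn = ? ORDER BY a.author_name",
    ("123",),
).fetchall()

# A book that names an unknown publisher (hypothetical Pub9) is rejected.
try:
    conn.execute("INSERT INTO books VALUES ('125', 'Hypothetical title', 'Pub9')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(book_123, fk_rejected)
```

The tables are created in dependency order (Address, Publishers, Books, Authors) so that each foreign key refers to a table that already exists when the rows are inserted.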
