June 06, 2011

Allocation units in SQL server

In Microsoft SQL Server, the database is basically divided into different allocation units, viz. Index Allocation Map or IAM pages, Global Allocation Map or GAM pages, Shared Global Allocation Map or SGAM, and or Page Free Space or PFS pages.
SQL Server uses the IAM pages to find the extents allocated to the object.
For each extent, SQL Server searches the PFS pages to see if there is a page
with enough free space to hold the row.If SQL has to allocate new space, the GAM and/or SGAM is used. This would
occur AFTER SQL has determined that there is not enough space to insert the
new row on pages already allocated to the object.
All of these allocation units are required to be in healthy condition for the successful functioning of your MS SQL database. However, corruption in any of these allocation units or pages can result in stopping SQL database mounting; thereby, resulting in the inaccessibility of databases elements like tables, views, stored procedures, indexes, triggers, etc.
Various factors can be responsible for such corruption in SQL Server allocation pages. Naming a few of them – wrong system or application shutdown, MS SQL Server damage, Trojan infections or virus attacks on the system, software malfunctioning, human errors, media errors, etc. However, in the above situation, the major possibility of damage is due to the corruption in any of GAM, SGAM, or PFS pages of SQL database Server Meta structure.

For resolving the issue and for regaining access to your SQL database elements again, you can do the following measures:
* If corruption has occurred due to physical damage, then you must replace that system component which is damaged by a new component
* However, if the problem is as a result of some logical damage, then you can run the ‘DBCC CHECKDB' command with correct repair clause
* In case above resolutions fail, try to restore your damaged or deleted SQL database from an updated backup

June 05, 2011

Data Normalization

Data Normalization
The process of modeling data into relational tables is known as normalization.
Common normal forms: 1NF, 2NF and 3NF.
Data stored in a relational database and accessed with SQL should be normalized to 3NF.
PK: Attribute of set of attributes which uniquely identify a record.

Books Table:
ISBN(PK) Title Authors Publisher
123 Oracle 10g Auth1,Auth2 Pub1,Stree1,City1, State1
124 MS SQL Auth1,Auth3 Pub2,Stree2,City2, State1

Problem with above schema:
1. Insertion anomaly: It is impossible to enter details of authors whose books are not published as there is no ISBN .
2. Deletion anomaly: It is impossible to delete book without losing publisher details.
3. Update anomaly: If some publisher address changes, it is necessary to update every book that publisher has published.
4. Data inconsistency errors due to incorrect entry of publisher address is any of the row.
5. Author value is multi valued, so finding books written by Auth1 need to search all values in Authors column.
6. Storage inefficiency as replication of address of each publisher across rows.


1NF:
Remove the repeating groups, i.e. Author column, pull them out into a separate table, Authors.
Books Table:
ISBN(PK) Title Publisher
123 Oracle 10g Pub1,Stree1,City1, State1
124 MS SQL Pub2,Stree2,City2, State1

Authors:
ISBN Author Name
123 Auth1
123 Auth2
124 Auth1
124 Auth3

Problems solved by 1NF:
1. Insertion anomaly: No need to insert unpublished book in Books table and can insert Author record in Authors table.
2. Query to find no. of books written by Auth1 or find author who written maximum books is now simplified.
2NF:
Remove columns which are not dependent on PK of table. Here publisher address is not dependent on ISBN in books table.
Books Table:
ISBN(PK) Title Publisher
123 Oracle 10g Pub1
124 MS SQL Pub2

Publishers Table:
Publisher (PK) Street City State
Pub1 Stree1 City1 State1
Pub2 Stree2 City2 State1

Problems solved by 2NF:
1. Update anomaly: Need to update only 1 record if publisher details changed.
2. Storage efficiency: Solves problem of storing publisher details many times.
3. Data consistency errors caused by incorrect entry in publisher column.
3NF:
Remove all columns that are interdependent. In Publishers table, street -- > City and City --> State, i.e. street can belongs to only 1 city and 1 city belongs to only 1 state. This can be achieved by adding address code column in Publishers table and have separate table for city, state and street.
Publisher (PK) Address code
Pub1 AC1
Pub2 AC2

Address Table:
Address code (PK) Street City State
AC1 Stree1 City1 State1
AC2 Stree2 City2 State1

Normalized Books database:
Table Primary Key Foreign Key
Books ISBN Publisher
Authors ISBN + Author Name ISBN
Publishers Publisher Address code
Address Address Code

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...