April 17, 2008

Database Fundamentals

1. What is a database?
The term "database" is used to describe both the collection of data and the software tool used to manage the data (usually called Database Management System).

2. What is RDBMS ?

RDBMS stands for Relational Database Management System. RDBMS data is structured in database tables, fields and records. Each RDBMS table consists of database table rows. Each database table row consists of one or more database table fields.

RDBMS store the data into collection of tables, which might be related by common fields (database table columns). RDBMS also provide relational operators to manipulate the data stored into the database tables. Most RDBMS use SQL as database query language.

Edgar Codd introduced the relational database model. Many modern DBMS do not conform to the Codd’s definition of a RDBMS, but nonetheless they are still considered to be RDBMS.

The most popular RDBMS are MS SQL Server, DB2, Oracle and MySQL.

3. What is SQL?
SQL stands for Structured Query Language. SQL language is used to create, transform and retrieve information from RDBMS (Relational Database Management Systems). SQL is pronounced SEQUEL. SQL was developed during the early 70’s at IBM.

4. Codd's laws :
The concept of relational databases was first described by Edgar Frank Codd (almost exclusively referenced as E. F. Codd in technical literature) in the IBM research report RJ599, dated August 19th, 1969.1 However, the article that is usually considered the cornerstone of this technology is "A Relational Model of Data for Large Shared Data Banks," published in Communications of the ACM(Vol. 13, No. 6, June 1970, pp. 377-87). Only the first part of the article is available online.
Additional articles by E. F. Codd throughout the 1970s and 80s are still considered gospel for relational database implementations. His famous "Twelve Rules for Relational Databases"2 were published in two Computerworld articles "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" on October 14, 1985, and October 21, 1985, respectively. He has since expanded on the 12 rules, and they now number 333, as published in his book "The Relational Model for Database Management, Version 2" (Addison -Wesley, 1990).
In 1985 Codd produced the following set of 'rules' by which systems should be judged:
4.1. Information.
All information is represented at the logical level and in exactly one way - by values in tables.
4.2. Guaranteed access.
Each datum (atomic value) in a relational database is guaranteed to be logically accessible through a combination of table-name, primary key value and column-name.
4.3. Systematic treatment of null values.
Null values (distinct from the empty character string or a string of blank characters, and distinct from zero or any other number) are supported in a fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.
4.4. Database description.
The database description is represented at the logical level in the same way as ordinary data, so that authorised users can apply the same relational language to its interrogation as they apply to the regular data.
4.5. Comprehensive sub-language.
A relational system may support several languages and various modes of terminal use (for example the fill-in-the-blanks mode). However there must be at least one language whose statements are expressible through some well defined syntax as character strings, and that is comprehensive in supporting all the following items:
1. Data definition
2. View definition
3. Data manipulation (interactive and by program)
4. Integrity constraints
5. Authorisation
6. Transaction boundaries (begin, commit and rollback)
4.6. View updating.
All theoretically-updatable views are also updatable by the system. (A view is theoretically updatable if there is a time-independent algorithm for unambiguously determining a single series of changes to the base tables, having as their effect precisely the requested changes in the view.)
4.7. Insert and update.
The capability of handling a base table or a derived table as a single operand applies not only to retrieval of data but also to insertion, updating, and deletion. (This allows the system to optimise its execution sequence by determining the best access paths. It may be important in obtaining efficient handling of transactions across a distributed database, avoiding the communications costs of transmitting separate requests for each record obtained from remote sites.)
4.8. Physical data independence.
Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods. (There must be a clear distinction between logical and physical design levels.)
4.9. Logical data independence.
Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables. (This rule permits logical database design to be changed dynamically, e.g. by splitting or joining base tables in ways which do not entail loss of information.)
4.10. Integrity independence.
Integrity constraints must be definable in the relational data sub-language and storable in the catalogue, not in the applications program. Certain integrity constraints hold for every relational database, further application-specific rules may be added. The general rules relate to:
1. Entity integrity : no component of a primary key may have a null value.
2. Referential integrity : for each distinct non-null 'foreign key' value in the database, there must exist a matching primary key value from the same domain.
4.11. Distribution independence.
A relational DBMS has distributional independence - i.e. if a distributed database is used it must be possible to execute all relational operations upon it without knowing or being constrained by the physical locations of data. This must apply both when distribution is originally introduced, and when data is redistributed.
4.12. Non-subversion.
A low-level (single record-at-a-time) language cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level (multiple record-at-a-time) language.

1 comment:

Anonymous said...

get the how to repair word document utility to fix data corruption issues in the files of specified format

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...