A Laymans Approach to Relational Database Normalization

 

 

Application of the Relational Database model to a data set involves the removal of duplication. Removal of duplication is performed using a process called Normalization. Normalization is comprised of a set of rules called Normal Forms. Normalization is applied to subsets of data or tables in a database. Tables are for placing directly associated data into. Tables can be related or linked to each other through the use of index identifiers. An index identifier identifies a row of data in a table much like an index is used in a book. The index is used to locate an item of interest without having to read the whole book.

 

There are five levels or layers of Normalization called 1st, 2nd, 3rd, 4th and 5th Normal Forms. Each Normal Form is a refinement of the previous Normal Form. 4th and 5th Normal Forms are rarely applied. In designing tables for performance it is common practice to ignore the steps of Normalization and jump directly to 2nd Normal Form. 3rd Normal Form is often not applied either; unless many-to-many joins cause an absolute need for unique values at the application level.

i  Over-Normalization can lead to poor performance in both OLTP and Data Warehouse type databases. Over-Normalization is common in top-down designed Java object applications. In this situation an object structure is imposed onto a relational database. Object and relational data structures are completely different methodologies.

 

That is far too much jargon. Let us make the understanding of Normalization very simple. Forget about it! Normalization is for Academics and in its strictest form is generally impractical due to its adverse effect on performance in a commercial environment, especially 3rd, 4th and 5th Normal Forms. The simplest way to describe what Normalization attempts to achieve can be explained in three ways.

 

1.       Divide the whole into smaller more manageable parts.

2.       Removal of duplicated data into related subsets.

3.       Linking of two indirectly related tables by the creation of a new table. The new table contains indexes from the two indirectly related tables. This is commonly known as a many-to-many join.

 

These three points are meaningless without further explanation of Normalization. So let us go through the rules and try to explain it in a non-Academic fashion. Let us start with some relational database buzzwords.

 

     A table contains many repetitions of the same row. A table defines the structure for a row. An example of a table is a list of customer names and addresses.

     A row is a line of data. Many rows make up the data in a table. An example of a row is a single customer name and address within a table of many customers. A row is also known as a record or a tuple.

     The structure of a row in a table is divided up into columns. Each column contains a single item of data such as a name or address. A column can also be called a field or attribute.

     Referential Integrity is a process of validation between related tables where references between different tables are checked against each other. A primary key is placed on a parent or superset table as the primary identifier or key to each row in the table. The primary key will always point to a single row only and it is unique within the table. A foreign key is a copy of a primary key value in a subset table. An example of a function of Referential Integrity is that it will not allow the deletion of a subset record where a foreign key value exists in a parent table. Primary keys in this document are referred to as PK and foreign keys as FK. Note that both primary and foreign keys can consist of more than one column. A key consisting of more than one column is known as a composite key.

     An index is used to gain fast access to a table and to enforce relationships between tables. An index allows direct access to rows by duplicating a small part of each row to an additional (index) file. An index is a copy of the contents of a small number of columns in a table. The most efficient indexes are made up of single columns containing integers.

i  Primary and foreign keys are special types of indexes, applying referential integrity.

 

1st Normal Form

 

1st Normal Form removes repetition by creating one-to-many relationships. Data repeated many times in one table is removed to a subset table. The subset table becomes the container for the removed repeating data. Each row in the subset table will contain a single reference to each row in the original table. The original table will then contain only non-duplicated data.

 

In the example in Figure 1 a 1st Normal Form transformation is shown. The purchase order table on the left contains customer details, purchase order details and descriptions of multiple items on the purchase order. Application of 1st Normal Form removes the multiple items from the purchase order table by creating a one-to-many relationship between the purchase order and the purchase order item tables. This has three benefits.

 

     Saves space.

     Reduces complexity.

     Ensures that every purchase order item will belong to a purchase order.

 

In Figure 1 the crows-foot pointing to the purchase order item table indicates that for a purchase order to exist, the purchase order has to have at least one purchase order item. The line across the pointer to the purchase order table signifies that at least one purchase order is required in this relationship. The crows-foot is used to denote an inter-entity relationship.

i  Inter-entity relationships can be zero, one or many to zero, one or many.

 

The relationship shown in Figure 1 between the purchase order and purchase order item table is that of one-and-only-one to one-or-many.

 

Figure 1 1st Normal Form

2nd Normal Form

 

2nd Normal Form creates not one-to-many relationships but many-to-one relationships, effectively separating static from dynamic information. Static information is potentially repeatable. This repeatable static information is moved into separate tables. In Figure 2 the customer information is removed from the purchase order table. Customer information can be duplicated for multiple purchase orders or have no purchase orders; thus the one-and-only-one to zero-one-or-many relationship between customer and purchase order.

Figure 2 2nd Normal Form

3rd Normal Form

 

My version of 3rd Normal Form is used to resolve many-to-many relationships into unique values.

i  This is not actually 3rd Normal Form but this simplified interpretation of Normalization could possibly be twisted to allow it.

In Figure 3 a student can be enrolled in many courses and a course can have many students enrolled. The point to note is that it is impossible to find a unique course-student item without joining every student with every course. Therefore each unique item can be found with the combination of values. Thus the coursestudent entity in Figure 3 is a many-to-many join resolution entity. In a commercial environment it is very unlikely that an application will ever need to find this unique item, especially not a modern-day Java object web application where the tendency is to drill-down through list collections rather than display individual items. Many-to-many join resolutions should only be created when they are specifically required by the application. It can sometimes be better to resolve these joins in the application to improve database performance.

i  Be very careful using 3rd Normal Form and beyond.

 

Figure 3 3rd Normal Form

4th Normal Form

 

4th Normal Form is intended to separate multi-valued facts in a single table into multiple tables. In Figure 4 employee skill and certification lists or collections are removed into separate entities. An employee could have skills or certifications, or both. The point to note is that there is no connection between the attributes of the employees table, other than the employee number, and the details of skills or certifications for each employee.

Figure 4 4th Normal Form

5th Normal Form

 

5th Normal Form divides related columns into separate tables based on those relationships. In Figure 5 product, manager and employee are all related to each other. Thus three separate entities can be created to explicitly define those inter-relationships. The result is information that can be reconstructed from smaller parts. An additional purpose of 5th Normal Form is to remove redundancy or duplication not covered by 1st to 4th Normal Forms of Normalization.

Figure 5 5th Normal Form

A Summary of Normalization

 

     1st Normal Form removes repetition by creating one-to-many relationships.

 

     2nd Normal Form creates not one-to-many relationships but many-to-one relationships, effectively separating static from dynamic information. 2nd NF removes items from tables independent of the primary key.

 

     3rd Normal Form is used to resolve many-to-many relationships into unique values. 3rd NF allows for uniqueness of information by creation of additional many-to-many join resolution tables. These tables are rarely required in modern day applications.

i  This is not actually 3rd Normal Form but this simplified interpretation of Normalization could possibly be twisted to allow it.

 

     4th Normal Form is intended to separate multi-valued facts in a single table into multiple tables. 5th Normal Form divides related columns into separate tables based on those relationships. 4th and 5th NF minimize nulls and composite primary keys by removing null capable fields and subset composite primary key dependent fields to new tables. 4th and 5th Normal Forms are rarely useful.

 

Disclaimer Notice: This information is available AS IS. I am in no way responsible or liable for any mishaps as a result of using this information.

 

More available on Denormalization in my latest book Oracle High Performance Tuning for 9i and 10g