Introduction
As a database developer, we might often come across terms like normalization and denormalization of a database. Database normalization is a technique that helps to efficiently organize data in a given database. Essentially, it is a systematic approach to decompose a larger table into smaller tables that would help to get rid of data redundancy and other inserts/ update anomalies.
According to the definition in Wikipedia -
"Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as part of his relational model.
Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design)."
Database Normalization
Database Normalization is a set of rules that are applied to a database, such that the schema of the database ensures that all the rules are being followed. These rules are also known as Normal Forms and are widely used while designing database solutions.
The database normalization process can be divided into following types:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce-Codd Normal Form or Fourth Normal Form (BCNF of 4NF)
- Fifth Normal Form (5NF)
- Sixth Normal Form (6NF)
In this article, we will only understand the concepts of 1NF, 2NF and 3NF with examples. The next normal forms are out of scope for this article and will not be discussed here.
Now, let us understand the rules that needs to be applied for each normal form.
First Normal Form (1NF)
- Data is stored in tables with rows that can be uniquely identified by a Primary Key.
- Data within each table is stored in individual columns in its most reduced form.
- There are no repeating groups.
Second Normal Form (2NF)
- All the rules from 1NF must be satisfied.
- Only those data that relates to a table’s primary key is stored in each table.
Third Normal Form (3NF)
- All the rules from 2NF must be satisfied.
- There should be no intra-table dependencies between the columns in each table.
Example
In this tutorial, we will be taking an already existing database sample and re-design it so that it supports all the three Normal Forms.
Let us consider the following database schema. As you can see in Fig 1, there are four tables (Existing Database) - Projects, Employees, ProjectEmployees, and JobOrders. Recently, the Customers table has also been added to the database to store the customers' information. As you can see in the diagram below, the Customers table has not been designed in a proper way to support the normal forms, let's go ahead and fix it.
I have provided the script to create all the tables if you want to try it out on your local system.
CREATE TABLE Projects([ID] INT PRIMARY KEY IDENTITY,[Name] VARCHAR(100),[Value] DECIMAL(5,2),StartDate DATE,EndDate DATE)GOCREATE TABLE Employees([ID] INT PRIMARY KEY IDENTITY,[FirstName] VARCHAR(50),[LastName] VARCHAR(50),[HourlyWage] DECIMAL(5,2),[HireDate] DATE)GOCREATE TABLE ProjectEmployees([ID] INT PRIMARY KEY IDENTITY,[ProjectID] INT,[EmployeeID] INT,[Hours] DECIMAL(5,2),CONSTRAINT FK_ProjectEmployees_Projects FOREIGN KEY ([ProjectID]) REFERENCES [Projects] ([ID]),CONSTRAINT FK_ProjectEmployees_Employees FOREIGN KEY ([EmployeeID]) REFERENCES [Employees] ([ID]))GOCREATE TABLE JobOrders([ID] INT PRIMARY KEY IDENTITY,[EmployeeID] INT,[ProjectID] INT,[Description] TEXT,[OrderDateTime] DATETIME,[Quantity] INT,[Price] DECIMAL(5,2),CONSTRAINT FK_JobOrders_Projects FOREIGN KEY ([ProjectID]) REFERENCES [Projects] ([ID]),CONSTRAINT FK_JobOrders_Employees FOREIGN KEY ([EmployeeID]) REFERENCES [Employees] ([ID]))GOCREATE TABLE Customers ( [Name] VARCHAR(100), [Industry] VARCHAR(100), [Project1_ID] INT, [Project1_Feedback] TEXT, [Project2_ID] INT, [Project2_Feedback] TEXT, [ContactPersonID] INT, [ContactPersonAndRole] VARCHAR(255), [PhoneNumber] VARCHAR(12), [Address] VARCHAR(255), [City] VARCHAR(255), [Zip] VARCHAR(5) ) GO
First Normal Form
As we have already discussed above, the first normal form relates to the duplication of data and also over-grouping of data in the columns.
The Customers table in the diagram violates all the three rules of the first normal form.
- We do not see any Primary Key in the table.
- The data is not found in its most reduced form. For example, the column ContactPersonAndRole can be divided further into two individual columns - ContactPerson and ContactPersonRole.
- Also, we can see there are two repeating groups of columns in this table - (Project1_ID, Project1_FeedBack) and (Project2_ID, Project2_Feedback). We need to get these removed from this table.
The diagram below shows dummy data stored in the Customers table.
Let us now get our hands dirty and start modifying the table, so that it satisfies the first normal form.
The first thing that we need to do is to add a primary key to this table. For this, we can add a new column ID with datatype as INT and also assign it as an Identity column. The script is given below.
ALTER TABLE [Customers]ADD [ID] INT IDENTITY PRIMARY KEYGO
When you execute this script, a new column gets added at the end of all the columns. This is the primary key of the table and now it satisfies the first rule of the First Normal Form.
Secondly, we need to split the column ContactPersonAndRole into two individual columns. This can be done in two steps as follows:
- Rename the original column from ContactPersonAndRole to ContactPerson.
- Add a new column for ContactPersonRole.
The script below, when executed, will rename the original column and add a new column to store the ContactRole information.
sp_rename 'Customers.[ContactPersonAndRole]', 'ContactPerson', 'COLUMN'GOALTER TABLE [Customers]ADD [ContactPersonRole] VARCHAR(20)GO
Finally, in order to satisfy the third rule of the First Normal Form, we need to move the columns Project1_ID, Project1_Feedback, Project2_ID, and Project2_Feedback into a new table. This can be done by creating a new table ProjectFeedbacks and link it back with the Customers and the Projects table.
When the script below is executed, it will remove the above-mentioned columns from the Customers table and create a new table ProjectFeedbacks with Foreign Key references to the Customers and Projects table.
ALTER TABLE [Customers]DROP COLUMN Project1_IDALTER TABLE [Customers]DROP COLUMN Project1_FeedbackALTER TABLE [Customers]DROP COLUMN Project2_IDALTER TABLE [Customers]DROP COLUMN Project2_FeedbackGOCREATE TABLE ProjectFeedback([ID] INT PRIMARY KEY IDENTITY,[ProjectID] INT,[CustomerID] INT,[Feedback] TEXT,CONSTRAINT FK_ProjectFeedbacks_Projects FOREIGN KEY ([ProjectID]) REFERENCES [Projects] ([ID]),CONSTRAINT FK_ProjectFeedbacks_Customers FOREIGN KEY ([CustomerID]) REFERENCES [Customers] ([ID]))GO
The database schema after applying all the rules of the first normal form is as below.
As you can see, the Customers table has been altered and a new table ProjectFeedbacks has been added into the schema. Thus, there are no repeating groups in the Customers or the ProjectFeedbacks table. We can also know about the feedbacks as it refers to both the Customers and the Projects table.
Now, that the Customers table supports 1NF, let's go ahead and apply the second normal form.
Second Normal Form
To satisfy the conditions of the second normal form, all the rules of the first normal form should satisfy. And along with that, we also need to ensure that all the columns in the table relate directly to the primary key of the record in the table.
However, if you see the database schema diagram above (Fig 3), you can see that the ContactPerson, ContactPersonRole and the PhoneNumber do not directly relate to the ID of the Customers table. That is because the primary key refers to a customer and not to any person or role or the phone number of the contact person. If ever, the contact person for a customer changes, we would have to update all of these columns, running the risk that we will update the values in one of the columns but forget to modify the other.
So, in order to satisfy this rule, we need to remove these three columns from the Customers table and put them in a separate table. This table should contain data that is related only to the contact person and not the customer.
Let us remove all these columns from the Customers table which do not relate to the primary key of the table directly. The script below removes the three columns from the table as these are not related to the customer, instead of to the contact person only.
ALTER TABLE [Customers]DROP COLUMN ContactPersonALTER TABLE [Customers]DROP COLUMN ContactPersonRoleALTER TABLE [Customers]DROP COLUMN PhoneNumberGO
Once, the columns are removed from the Customers table, we need to create a new table that'll store the data for the contact persons. Let us create a new table ContactPersons and relate it to the Customers table with a foreign key relation. The script is provided below.
CREATE TABLE ContactPersons([ID] INT PRIMARY KEY IDENTITY,[ContactPerson] VARCHAR(100),[ContactPersonRole] VARCHAR(20),[PhoneNumber] VARCHAR(12))GOALTER TABLE [Customers]ADD CONSTRAINT FK_Customers_ContactPersons FOREIGN KEY ([ContactPersonID])REFERENCES ContactPersons([ID])GO
Once this script is executed, you can see in the diagram below (Fig 4) that a new table has been added to the schema and now it satisfies the second normal form of the database.
Now, if the contact person for customer changes, we just need to insert a record into the ContactPersons table and change the ContactPersonID in the Customers table.
Third Normal Form
To satisfy the conditions of the third normal form, all the rules of the second normal form must satisfy. And with that, we also need to ensure that each column must be non-transitively dependent on the primary key of the table. This means that all columns in a table should rely only on the primary key and no other column. If ColumnA relies on the primary key and also on the ColumnB, then ColumnA is known to be transitively dependent on the primary key and it violates the third normal form.
After applying 1NF and 2NF, below is what the Customers table looks like now (Fig 5).
If you look carefully, there are transitive dependent columns in this table and it violates the 3NF. The transitive dependent relationship is between the columns - City and Zip. The city in which a customer is situated relates to the primary key of the customer, so this satisfies the second normal form. However, the city also depends on the zip code. If a customer changes its location, there may be a chance we update one column but not the other. Because of this relationship between the City and Zip, the database is not in 3NF.
In order to fix this and bring the table to satisfy the third normal form, we need to remove the City from the Customers table and create a new table ZipCodes to store the Zip and City. This new table can be related to the Customers table via a foreign key relation. The script is provided below.
ALTER TABLE [Customers]DROP COLUMN CityGOCREATE TABLE ZipCodes([ZipID] VARCHAR(5) PRIMARY KEY,[City] VARCHAR(255))GOALTER TABLE [Customers]ADD CONSTRAINT FK_Customers_ZipCodes FOREIGN KEY ([Zip])REFERENCES ZipCodes([ZipID])GO
Now that all the changes are performed, lets look at the schema after the third normal form has also been satisfied (Fig 6). As you can see, the new table ZipCodes has been added and it relates to the Customers table.
That's all for the third normal form. The Customers table now supports all the three normal forms and can be used as required. It is always tricky to find issues that are caused by a violation of the third normal form. However, for good database design, these are quite essential that all the normal forms are satisfied.
Summary
In this article, we have seen what is database normalization and how can we implement it in a SQL Server database. To learn more please follow the link below.
FAQs
What is normalization in SQL Server with example? ›
Normalization is the process to eliminate data redundancy and enhance data integrity in the table. Normalization also helps to organize the data in the database. It is a multi-step process that sets the data into tabular form and removes the duplicated data from the relational tables.
What are the four 4 types of database normalization? ›First Normal Form (1 NF) Second Normal Form (2 NF) Third Normal Form (3 NF) Boyce Codd Normal Form or Fourth Normal Form ( BCNF or 4 NF)
What is normalization explain about 1NF 2NF 3NF and Bcnf with examples? ›1NF: A relation is in 1NF if all its attributes have an atomic value. 2NF: A relation is in 2NF if it is in 1NF and all non-key attributes are fully functional dependent on the candidate key in DBMS. 3NF: A relation is in 3NF if it is in 2NF and there is no transitive dependency.
What is normalization in database with example? ›Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
What is normalization explain its 4 types with examples? ›Normalization is the process of organizing the data in the database. Normalization is used to minimize the redundancy from a relation or set of relations. It is also used to eliminate undesirable characteristics like Insertion, Update, and Deletion Anomalies.
What is 1NF 2NF and 3NF? ›What is 1NF 2NF and 3NF? 1NF, 2NF, and 3NF are the first three types of database normalization. They stand for first normal form, second normal form, and third normal form, respectively. There are also 4NF (fourth normal form) and 5NF (fifth normal form).
Which normalization method is best? ›The best normalization technique is one that empirically works well, so try new ideas if you think they'll work well on your feature distribution. When the feature is more-or-less uniformly distributed across a fixed range. When the feature contains some extreme outliers.
What is the most commonly used normalization technique? ›Perhaps the most common type of normalization is z-scores. In simple terms, a z-score normalizes each data point to the standard deviation. The formula is the following: where X is the data value, μ is the mean of the dataset, and σ is the standard deviation.
What is the difference between 2NF and 3NF use example to explain? ›...
Difference between 2NF and 3NF :
S.NO. | 2NF(Second Normal Form) | 3NF(Third Normal Form) |
---|---|---|
4. | Stronger normal form than 1NF but lesser than 3NF | Stronger normal form than 1NF and 2NF. |
BCNF is an extension of 3NF and it is has more strict rules than 3NF. Also, it is considered to be more stronger than 3NF. This relation is in BCNF as it is already in 3Nf (there is no prime attribute deriving no prime attribute) and on the left hand side of the functional dependency there is a candidate key.
How 2NF is different than 1NF describe with an example? ›
1NF: This is the First Normal Form in which a relation contains an atomic value. 2NF: The second normal form used for the normalization process. A relation in 2NF must be in 1NF, and all the non-key attributes depend on the primary key in the Second Normal Form.
What are the 5 rules of data normalization? ›This pdf document, created by Marc Rettig, details the five rules as: Eliminate Repeating Groups, Eliminate Redundant Data, Eliminate Columns Not Dependent on Key, Isolate Independent Multiple Relationships, and Isolate Semantically Related Multiple Relationships.
What are the three goals of normalization? ›A properly normalised design allows you to: Use storage space efficiently. Eliminate redundant data. Reduce or eliminate inconsistent data.
How do you normalize an existing database? ›- Remove any repeating groups of data (i.e. beware of duplicative columns or rows within the same table)
- Create separate tables for each group of related data.
- Each table should have a primary key (i.e. a field that identifies each row with a non-null, unique value)
Normalize and normalise are both English terms. Normalize is predominantly used in 🇺🇸 American (US) English ( en-US ) while normalise is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) ( en-GB ). In the United States, there is a preference for "normalize" over "normalise" (99 to 1).
What is denormalized vs normalized? ›Normalization is the technique of dividing the data into multiple tables to reduce data redundancy and inconsistency and to achieve data integrity. On the other hand, Denormalization is the technique of combining the data into a single table to make data retrieval faster.
How do you normalize a table to 3NF? ›To normalize a table from 1NF to 3NF, you need to normalize it to 2NF first then to 3NF. In the normalization process, you decompose a table into multiple tables that contain the same information as the original table. The normalization process usually removes many problems related to data modification.
What are the keys in SQL? ›In SQL, keys are the set of attributes that used to identify the specific row in a table and to find or create the relation between two or more tables i.e keys identify the rows by combining one or more columns.
How do I normalize 2NF to 3NF? ›To normalize a 2NF to 3NF, we have to determine if we have a transitive dependency in the table. In case a transitive dependency exists, then we remove those attributes that are transitively dependent from the relations. We do this by placing these attributes in a separate, new relation.
How do you explain normalization in an interview? ›Answer: Normalization is a process of organizing database by splitting larger tables into smaller ones that are easier to maintain and linking them using relationships. It reduces redundant data and optimize data dependencies.
What is meant by Normalisation in SQL? ›
Normalization or normalisation refers to a process that makes something more normal or regular. Most commonly it refers to: Normalization (sociology) or social normalization, the process through which ideas and behaviors that may fall outside of social norms come to be regarded as "normal"
Why is normalization needed? ›Normalization is an essential part of product information management, preventing data from being replicated in two tables at the same time or unrelated product data being gathered together in the same table. In addition, normalization helps to streamline your data, simplifying your database and making it more concise.
What are the problems of normalization in database? ›A poorly normalized database and poorly normalized tables can cause problems ranging from excessive disk I/O and subsequent poor system performance to inaccurate data. An improperly normalized condition can result in extensive data redundancy, which puts a burden on all programs that modify the data.
How do you normalize sample data? ›The data can be normalized by subtracting the mean (µ) of each feature and a division by the standard deviation (σ). This way, each feature has a mean of 0 and a standard deviation of 1. This results in faster convergence.
What are the different types of normalization in SQL? ›- 1st Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce Codd Normal Form (BCNF)
Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into other small tables, applying relationships between them, retrieving the data using all possible unions, subqueries, joins etc.?
What are the advantages of normalization in SQL? ›- Reduces redundant data.
- Provides data consistency within the database.
- More flexible database design.
- Higher database security.
- Better and quicker execution.
- Greater overall database organization.
Normalization rules divides larger tables into smaller tables and links them using relationships. The purpose of Normalization in SQL is to eliminate redundant (repetitive) data and ensure data is stored logically. Each table cell should contain a single value. Each record needs to be unique.
How do you know when the database is fully normalized? ›For a database to be in third normal form, it must first satisfy all the criteria for 2NF (and therefore, also 1NF). Then, each column must be non-transitively dependent on the table's primary key. This means that all columns in a table should rely on the primary key and no other column.
What are the disadvantages of normalization? ›There are a few drawbacks in normalization : Creating a longer task, because there are more tables to join, the need to join those tables increases and the task become more tedious (longer and slower). The database become harder to realize as well.
When should you not normalize data? ›
- Joins are expensive. Normalizing your database often involves creating lots of tables. ...
- Normalized design is difficult. ...
- Quick and dirty should be quick and dirty. ...
- If you're using a NoSQL database, traditional normalization is not desirable.
The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. For machine learning, every dataset does not require normalization. It is required only when features have different ranges.