1



Syllabus Focus: Unit 2 Module 1 Content 9

Specific Objective 9: Explain the concept of normalization;

Content: Definition of normalisation; attribute redundancy and anomalies; normal forms: including First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF); keys: primary, Foreign and Composite (or compound or concatenated); Partial and Non-key Dependencies; relationships, use of entity-relationship diagrams (ERD).

|explain the concept of normalization; | |Definition of normalisation; attribute redundancy and anomalies; normal forms: |

| | |including first normal form (1NF), second normal form (2NF), third normal form |

| | |(3NF); keys: primary, foreign and composite (or compound or concatenated); |

| | |partial and non-key dependencies; relationships, use of entity-relationship |

| | |diagrams (ERD). |

Database Normalisation, sometimes referred to as Canonical Synthesis, is a technique for designing Relational Database tables to minimize duplication of information and, in so doing, to safeguard the Database against certain types of logical or structural problems, namely Data Anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only.

Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an automated teller machine), while less normalized tables tend to be used in database applications that need to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).

Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well; but the reverse is not necessarily the case.

Problems addressed by Normalisation

Although the Normal Forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalised.

A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and from anomalies involving data operations. In such a table:

• The same information can be expressed on multiple records; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each of his skills). If the update is not carried through successfully—if, that is, the employee's address is updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an Update Anomaly.

• There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly-hired faculty member who has not yet been assigned to teach any courses. This phenomenon is known as an Insertion Anomaly.

• There are circumstances in which the deletion of data representing certain facts necessitates the deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears. This phenomenon is known as a Deletion Anomaly.

Ideally, a relational database table should be designed in such a way as to exclude the possibility of Update, Insertion, and Deletion Anomalies. The Normal Forms of Relational Database theory provides guidelines for deciding whether a particular design will be vulnerable to such anomalies. It is possible to correct an unnormalised design so as to make it adhere to the demands of the Normal Forms: this is called normalization. Removal of redundancies of the tables will lead to several tables, with Referential Integrity restrictions between them.

Normalisation typically involves decomposing an unnormalized table into two or more tables that were they are to be combined (joined), would convey exactly the same information as the original table.

Referential Integrity in a Relational Database is consistency between coupled tables. Referential Integrity is usually enforced by the combination of a primary key or candidate key (alternate key) and a foreign key. For Referential Integrity to hold, any field in a table that is declared a foreign key can contain only values from a parent table's primary key or a candidate key. For instance, deleting a record that contains a value referred to by a foreign key in another table would break referential integrity. The Relational Database Management System (RDBMS) enforces referential integrity, normally either by deleting the foreign key rows as well to maintain integrity, or by returning an error and not performing the delete. Which method is used would be determined by the referential integrity constraint, as defined in the Data Dictionary.

Background to Normalisation

Functional Dependency: Attribute B has a functional dependency on attribute A i.e. A → B if, for each value of attribute A, there is exactly one value of attribute B. If value of A is repeating in tuples then value of B will also repeat. In our example, Employee Address has a functional dependency on Employee ID, because a particular Employee ID value corresponds to one and only one Employee Address value. (Note that the reverse need not be true: several employees could live at the same address and therefore one Employee Address value could correspond to more than one Employee ID. Employee ID is therefore not functionally dependent on Employee Address.) An attribute may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to determine the extent to which a design is normalized without understanding what functional dependencies apply to the attributes within its tables; understanding this, in turn, requires knowledge of the problem domain. For example, an Employer may require certain employees to split their time between two locations, such as Kingston and Spanish Town, and therefore want to allow Employees to have more than one Employee Address. In this case, Employee Address would no longer be functionally dependent on Employee ID.

• Primary key: Most DBMSs require a table to be defined as having a single unique key, rather than a number of possible unique keys. A primary key is a key which the database designer has designated for this purpose.

• Super key: A super key is an attribute or set of attributes that uniquely identifies rows within a table; in other words, two distinct rows are always guaranteed to have distinct super keys. {Employee ID, Employee Address, Skill} would be a super key for the "Employees' Skills" table; {Employee ID, Skill} would also be a super key.

• Candidate key: A candidate key is a minimal super key, that is, a super key for which we can say that no proper subset of it is also a super key. {Employee Id, Skill} would be a candidate key for the "Employees' Skills" table.

• Non-prime attribute: A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the "Employees' Skills" table.



NORMAL FORMS

The Normal Forms (abbrev. NF) of Relational Database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.

The Normal Forms are applicable to individual tables; to say that an entire database is in Normal Form n is to say that all of its tables are in Normal Form n.

First Normal Form

First normal form (1NF or Minimal Form) is a normal form used in database normalization. A relational database table that adheres to 1NF is one that meets a certain minimum set of criteria. These criteria are basically concerned with ensuring that the table is a faithful representation of a relation and that it is free of repeating groups.

The concept of a "repeating group" is, however, understood in different ways by different theorists. As a consequence, there is not universal agreement as to which features would disqualify a table from being in 1NF. Most notably, 1NF as defined by some authors (for example, Ramez Elmasri and Shamkant B. Navathe, following the precedent established by Edgar F. Codd) excludes relation-valued attributes (tables within tables); whereas 1NF as defined by other authors (for example, Chris Date) permits them.

According to Date's definition of 1NF, a table is in 1NF if and only if it is "isomorphic to some relation", which means, specifically, that it satisfies the following five conditions:

1. There's no top-to-bottom ordering to the rows.

2. There's no left-to-right ordering to the columns.

3. There are no duplicate rows.

4. Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).

5. All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].

Violation of any of these conditions would mean that the table is not strictly relational, and therefore that it is not in 1NF.

Examples of tables (or views) that would not meet this definition of 1NF are:

• A table that lacks a unique key. Such a table would be able to accommodate duplicate rows, in violation of condition 3.

• A table whose definition mandates that results be returned in a particular order, so that the row-ordering is an intrinsic and meaningful aspect of the view. This violates condition 1. The tuples in true relations are not ordered with respect to each other.

• A table with at least one null able attribute. A nullable attribute would be in violation of condition 4, which requires every field to contain exactly one value from its column's domain. It should be noted, however, that this aspect of condition 4 is controversial. It marks an important departure from Codd's original vision of the relational model, which made explicit provision for nulls.

Date's fourth condition, which expresses "what most people think of as the defining feature of 1NF", is concerned with repeating groups. The following example illustrates how a database design might incorporate repeating groups, in violation of 1NF.

Example 1: Domains and values

|CUSTOMER |

|Customer ID |First Name |Surname |Telephone Number |

|123 |Robert |Ingram |555-861-2025 |

|456 |Jane |Wright |555-403-1659 |

|789 |Maria |Fernandez |555-808-9633 |

Suppose novice designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

|CUSTOMER |

|Customer ID |First Name |Surname |Telephone Number |

|123 |Robert |Ingram |555-861-2025 |

|456 |Jane |Wright |555-403-1659 |

| | | |555-776-4100 |

|789 |Maria |Fernandez |555-808-9633 |

The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

Assuming, however, that the Telephone Number column is defined on some Telephone Number-like domain (e.g. the domain of strings 12 characters in length), the representation above is not in 1NF. 1NF (and, for that matter, the RDBMS) prohibits a field from containing more than one value from its column's domain.

Example 2: Repeating groups across columns

The designer might attempt to get around this restriction by defining multiple Telephone Number columns:

|CUSTOMER |

|Customer ID |First Name |Surname |Tel. No. 1 |Tel. No. 2 |Tel. No. 3 |

|123 |Robert |Ingram |555-861-2025 | | |

|456 |Jane |Wright |555-403-1659 |555-776-4100 | |

|789 |Maria |Fernandez |555-808-9633 | | |

This representation, however, makes use of null able columns, and therefore does not conform to Date's definition of 1NF. Even if the table is taken that null able columns are allowed, the design is not in keeping with the spirit of 1NF. Tel. No. 1, Tel. No. 2., and Tel. No. 3. share exactly the same domain and exactly the same meaning; the splitting of Telephone Number into three headings is artificial and causes logical problems. These problems include:

• Difficulty in querying the table. Answering such questions as "Which customers have telephone number X?" and "Which pairs of customers share a telephone number?" is awkward.

• Inability to enforce uniqueness of Customer-to-Telephone Number links through the RDBMS. Customer 789 might mistakenly be given a Tel. No. 2 value that is exactly the same as her Tel. No. 1 value.

• Restriction of the number of telephone numbers per customer to three. If a customer with four telephone numbers comes along, we are constrained to record only three and leave the fourth unrecorded. This means that the database design is imposing constraints on the business process, rather than (as should ideally be the case) vice-versa.

Example 3: Repeating groups within columns

The designer might, alternatively, retain the single Telephone Number column but alter its domain, making it a string of sufficient length to accommodate multiple telephone numbers:

|CUSTOMER |

|Customer ID |First Name |Surname |Telephone Number |

|123 |Robert |Ingram |555-861-2025 |

|456 |Jane |Wright |555-403-1659, 555-776-4100 |

|789 |Maria |Fernandez |555-808-9633 |

This design is not consistent with 1NF, and presents several design issues. The Telephone Number heading becomes semantically woolly, as it can now represent either a telephone number, a list of telephone numbers, or indeed anything at all. A query such as "Which pairs of customers share a telephone number?" is more difficult to formulate, given the necessity to cater for lists of telephone numbers as well as individual telephone numbers. Meaningful constraints on telephone numbers are also very difficult to define in the RDBMS with this design.

A design that complies with 1NF

A design that is unambiguously in 1NF makes use of two tables: a Customer Name table and a Customer Telephone Number table.

|CUSTOMER NAME |CUSTOMER TELEPHONE NUMBER |

| | |

|Customer ID |Customer ID |

|First Name |Telephone Number |

|Surname | |

| |123 |

|123 |555-861-2025 |

|Robert | |

|Ingram |456 |

| |555-403-1659 |

|456 | |

|Jane |456 |

|Wright |555-776-4100 |

| | |

|789 |789 |

|Maria |555-808-9633 |

|Fernandez | |

| | |

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record.

Newcomers to database design sometimes suppose that normalisation proceeds in an iterative fashion, i.e. a 1NF design are first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms.

Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These Normal Forms have been summarized as requiring that all non-key attributes be dependent on "the key, the whole key and nothing but the key".

A table is in first normal form (1NF) if and only if it represents a relation. Given that database tables embody a relation-like form, the defining characteristic of one in First Normal Form is that it does not allow duplicate rows or nulls. Simply put, a table with a unique key (which, by definition, prevents duplicate rows) and without any null able columns is in 1NF.

By redefining 1NF to exclude null able columns in 1NF, no level of normalization can ever be achieved unless all null able columns are completely eliminated from the entire database.

One requirement of a relation is that every table contains exactly one value for each attribute. This is sometimes expressed as "no repeating groups". While that statement itself is axiomatic, experts disagree about what qualifies as a "repeating group", in particular whether a value may be a relation value; thus the precise definition of 1NF is the subject of some controversy. Notwithstanding, this theoretical uncertainty applies to relations, not tables. Table manifestations are intrinsically free of variable repeating groups because they are structurally constrained to the same number of columns in all rows.

Put at its simplest; when applying 1NF to a database, every record must be the same length. This means that each record has the same number of fields, and none of them contain a null value.

Second Normal Form

The criteria for second normal form (2NF) are:

• The table must be in 1NF.

• None of the non-prime attributes of the table are functionally dependent on a part (proper subset) of a candidate key; in other words, all functional dependencies of non-prime attributes on candidate keys are full functional dependencies. For example, consider an "Employees' Skills" table whose attributes are Employee ID, Employee Name, and Skill; and suppose that the combination of Employee ID and Skill uniquely identifies records within the table. Given that Employee Name depends on only one of those attributes – namely, Employee ID – the table is not in 2NF.

• In simple terms, a table is 2NF if it is in 1NF and all fields are dependent on the whole of the primary key, or a relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each candidate key of the relation.

• Note that if none of a 1NF table's candidate keys are composite – i.e. every candidate key consists of just one attribute – then we can say immediately that the table is in 2NF.

• All columns must be a fact about the entire key, and not a subset of the key.

Consider a table describing employees' skills:

|EMPLOYEES' SKILLS |

|Employee |Skill |Current Work Location |

|Jones |Typing |114 Main Street |

|Jones |Shorthand |114 Main Street |

|Jones |Whittling |114 Main Street |

|Roberts |Light Cleaning |73 Industrial Way |

|Ellis |Alchemy |73 Industrial Way |

|Ellis |Juggling |73 Industrial Way |

|Harrison |Light Cleaning |73 Industrial Way |

The table's only candidate key is {Employee, Skill}.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Ellis works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Typing" and "Shorthand" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"

A 2NF alternative to this design would represent the same information in two tables:

|EMPLOYEES |EMPLOYEES' SKILLS |

| | |

|Employee |Employee |

|Current Work Location |Skill |

| | |

|Jones |Jones |

|114 Main Street |Typing |

| | |

|Roberts |Jones |

|73 Industrial Way |Shorthand |

| | |

|Ellis |Jones |

|73 Industrial Way |Whittling |

| | |

|Harrison |Roberts |

|73 Industrial Way |Light Cleaning |

| | |

| |Ellis |

| |Alchemy |

| | |

| |Ellis |

| |Juggling |

| | |

| |Harrison |

| |Light Cleaning |

| | |

Update anomalies cannot occur in these tables, which are both in 2NF.

Not all 2NF tables are free from update anomalies, however. An example of a 2NF table which suffers from update anomalies is:

|TOURNAMENT WINNERS |

|Tournament |Year |Winner |Winner Date of Birth |

|Des Moines Masters |1998 |Chip Masterson |14 March 1977 |

|Indiana Invitational |1998 |Al Fredrickson |21 July 1975 |

|Cleveland Open |1999 |Bob Albertson |28 September 1968 |

|Des Moines Masters |1999 |Al Fredrickson |21 July 1975 |

|Indiana Invitational |1999 |Chip Masterson |14 March 1977 |

Even though Winner and Winner Date of Birth are determined by the whole key {Tournament, Year} and not part of it, particular Winner / Winner Date of Birth combinations are shown redundantly on multiple records. This problem is addressed by third normal form (3NF).

Third Normal Form

A table for which there are no partial functional dependencies on the primary key is typically, but not always, in 2NF. In addition to the primary key, the table may contain other candidate keys; it is necessary to establish that no non-prime attributes have part-key dependencies on any of these candidate keys.

Multiple candidate keys occur in the following table:

|ELECTRIC TOOTHBRUSH MODELS |

|Manufacturer |Model |Model Full Name |Manufacturer Country |

|Forte |X-Prime |Forte X-Prime |Italy |

|Forte |Ultraclean |Forte Ultraclean |Italy |

|Dent-o-Fresh |EZBrush |Dent-o-Fresh EZBrush |USA |

|Kobayashi |ST-60 |Kobayashi ST-60 |Japan |

|Hoch |Toothmaster |Hoch Toothmaster |Germany |

|Hoch |Contender |Hoch Contender |Germany |

Even if the designer has specified the primary key as {Model Full Name}, the table is not in 2NF. {Manufacturer, Model} is also a candidate key, and Manufacturer Country is dependent on a proper subset of it: Manufacturer.

Third Normal Form

The Third Normal Form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd in 1971. Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:

• The relation R (table) is in second normal form (2NF)

• Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every key of R.

An example of a 2NF table that fails to meet the requirements of 3NF is:

|TOURNAMENT WINNERS |

|Tournament |Year |Winner |Winner Date of Birth |

|Indiana Invitational |1998 |Al Fredrickson |21 July 1975 |

|Cleveland Open |1999 |Bob Albertson |28 September 1968 |

|Des Moines Masters |1999 |Al Fredrickson |21 July 1975 |

|Indiana Invitational |1999 |Chip Masterson |14 March 1977 |

The only candidate key is {Tournament, Year}.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

|TOURNAMENT WINNERS |PLAYER DATES OF BIRTH |

| | |

|Tournament |Player |

|Year |Date of Birth |

|Winner | |

| |Chip Masterson |

|Indiana Invitational |14 March 1977 |

|1998 | |

|Al Fredrickson |Al Fredrickson |

| |21 July 1975 |

|Cleveland Open | |

|1999 |Bob Albertson |

|Bob Albertson |28 September 1968 |

| | |

|Des Moines Masters | |

|1999 | |

|Al Fredrickson | |

| | |

|Indiana Invitational | |

|1999 | |

|Chip Masterson | |

| | |

Update anomalies cannot occur in these tables, which are both in 3NF.

Additional Reading

An Introduction to Database Normalization

By Mike Hillyer

Table of contents:

• Introduction

• Mike's Bookstore

• First Normal Form

• Defining Relationships

• Second Normal Form

• Third Normal Form

• Joining Tables

• Conclusion

• Resources

Introduction

Over the years I, like many of you, have had the experience of taking over responsibility for an existing application and its associated schema (sometimes frustratingly woven together as part of a Microsoft Access solution).

The most challenging of these to maintain and rework suffer from what one author described as the Spreadsheet Syndrome: a tendency for the developer to lump every possible piece of information into as few table as possible, often into a single table.

A schema that suffers from the Spreadsheet Syndrome is subject to data redundancies, data anomalies, and various inefficiencies. The cure for Spreadsheet Syndrome is database normalization.

Database normalization is a process by which an existing schema is modified to bring its component tables into compliance with a series of progressive normal forms. The concept of database normalization was first introduced by Edgar Frank Codd in his paper A Relational Model of Data for Large Shared Data Banks, section 4.

The goal of database normalization is to ensure that every non-key column in every table is directly dependent on the key, the whole key and nothing but the key and with this goal come benefits in the form of reduced redundancies, fewer anomalies, and improved efficiencies. While normalization is not the be-all and end-all of good design, a normalized schema provides a good starting point for further development.

This article will take a practical look at database normalization, focusing on the first three of seven generally recognized normal forms. Additional resources that look at the theory of database normalization and the additional normal forms can be found in the Resources section at the end of this article.

Note:

This article has been updated from an original version published in 2003.

Mike's Bookstore

Let's say you were looking to start an online bookstore. You would need to track certain information about the books available to your site viewers, such as:

|Title |

|Author |

|Author Biography |

|ISBN |

|Price |

|Subject |

|Number of Pages |

|Publisher |

|Publisher Address |

|Description |

|Review |

|Reviewer Name |

Let's start by adding the book that coined the term ?Spreadsheet Syndrome?. Because this book has two authors, we are going to need to accommodate both in our table. Lets take a look at a typical approach

Table 1. Two Books

|Title |Author |Bio |ISBN |Subject |Pages |Publisher |

|Beginning MySQL Database |Chad Russell, |Chad Russell is a programmer and network |1590593324 |MySQL, Database |520 |Apress |

|Design and Optimization |Jon Stephens |administrator who owns his own Internet hosting | |Design | | |

| | |company., Jon Stephens is a member of the MySQL | | | | |

| | |AB documentation team. | | | | |

Lets take a look at some issues involved in this design:

First, this table is subject to several anomalies: we cannot list publishers or authors without having a book because the ISBN is a primary key which cannot be NULL (referred to as an insertion anomaly). Similarly, we cannot delete a book without losing information on the authors and publisher (a deletion anomaly). Finally, when updating information, such as an author's name, we must change the data in every row, potentially corrupting data (an update anomaly).

Note:

Normalization is a part of relational theory, which requires that each relation (AKA table) has a primary key. As a result, this article assumes that all tables have primary keys, without which a table cannot even be considered to be in first normal form.

Second, this table is not very efficient with storage. Lets imagine for a second that our publisher is extremely busy and managed to produce 5000 books for our database. Across 5000 rows we would need to store information such as a publisher name, address, phone number, URL, contact email, etc. All that information repeated over 5000 rows is a serious waste of storage resources.

Third, this design does not protect data consistency. Lets once again imagine that Jon Stephens has written 20 books. Someone has had to type his name into the database 20 times, and it is possible that his name will be misspelled at least once (i.e.. John Stevens instead of Jon Stephens). Our data is now in an inconsistent state, and anyone searching for a book by author name will find some of the results missing. This also contributes to the update anomalies mentioned earlier.

First Normal Form

The normalization process involves getting our data to conform to progressive normal forms, and a higher level of normalization cannot be achieved unless the previous levels have been satisfied (though many experienced designers can create normalized tables directly without iterating through the lower forms). The first normal form (or 1NF) requires that the values in each column of a table are atomic. By atomic we mean that there are no sets of values within a column.

In our example table, we have a set of values in our author and subject columns. With more than one value in a single column, it is difficult to search for all books on a given subject or by a specific author. In addition, the author names themselves are non-atomic: first name and last name are in fact different values. Without separating first and last names it becomes difficult to sort on last name.

One method for bringing a table into first normal form is to separate the entities contained in the table into separate tables. In our case this would result in Book, Author, Subject and Publisher tables.

Table 2. Book Table

|ISBN |Title |Pages |

|1590593324 |Beginning MySQL Database Design and Optimization |520 |

Table 3. Author Table

|Author_ID |First_Name |Last_name |

|1 |Chad |Russell |

|2 |Jon |Stephens |

|3 |Mike |Hillyer |

Table 4. Subject Table

|Subject_ID |Name |

|1 |MySQL |

|2 |Database Design |

Table 5. Publisher Table

|Publisher_ID |Name |Address |City |State |Zip |

|1 |Apress |2560 Ninth Street, Station 219 |Berkeley |California |94710 |

Note

The Author, Subject, and Publisher tables use what is known as a surrogate primary key -- an artificial primary key used when a natural primary key is either unavailable or impractical. In the case of author we cannot use the combination of first and last name as a primary key because there is no guarantee that each author's name will be unique, and we cannot assume to have the author's government ID number (such as SIN or SSN), so we use a surrogate key.

Some developers use surrogate primary keys as a rule, others use them only in the absence of a natural candidate for the primary key. From a performance point of view, an integer used as a surrogate primary key can often provide better performance in a join than a composite primary key across several columns. However, when using a surrogate primary key it is still important to create a UNIQUE key to ensure that duplicate records are not created inadvertently (but some would argue that if you need a UNIQUE key it would be better to stick to a composite primary key).

By separating the data into different tables according to the entities each piece of data represents, we can now overcome some of the anomalies mentioned earlier: we can add authors who have not yet written books, we can delete books without losing author or publisher information, and information such as author names are only recoded once, preventing potential inconsistencies when updating.

Depending on your point of view, the Publisher table may or may not meet the 1NF requirements because of the Address column: on the one hand it represents a single address, on the other hand it is a concatenation of a building number, street number, and street name.

The decision on whether to further break down the address will depend on how you intend to use the data: if you need to query all publishers on a given street, you may want to have separate columns. If you only need the address for mailings, having a single address column should be acceptable (but keep potential future needs in mind).

Defining Relationships

As you can see, while our data is now split up, relationships between the tables have not been defined. There are various types of relationships that can exist between two tables:

• One to (Zero or) One

• One to (Zero or) Many

• Many to Many

The relationship between the Book table and the Author table is a many-to-many relationship: A book can have more than one author, and an author can write more than one book. To represent a many-to-many relationship in a relational database we need a third table to serve as a link between the two. By naming the table appropriately, it becomes instantly clear which tables it connects in a many-to-many relationship (in the following example, between the Book and the Author table).

Table 6. Book_Author Table

|ISBN |Author_ID |

|1590593324 |1 |

|1590593324 |2 |

Similarly, the Subject table also has a many-to-many relationship with the Book table, as a book can cover multiple subjects, and a subject can be explained by multiple books:

Table 7. Book_Subject Table

|ISBN |Subject_ID |

|1590593324 |1 |

|1590593324 |2 |

As you can see, we now have established the relationships between the Book, Author, and Subject tables. A book can have an unlimited number of authors, and can refer to an unlimited number of subjects. We can also easily search for books by a given author or referring to a given subject.

The case of a one-to-many relationship exists between the Book table and the Publisher table. A given book has only one publisher (for our purposes), and a publisher will publish many books. When we have a one-to-many relationship, we place a foreign key in the table representing the ?many?, pointing to the primary key of the table representing the ?one?. Here is the new Book table:

Table 8. Book Table

|ISBN |Title |Pages |Publisher_ID |

|1590593324 |Beginning MySQL Database Design and Optimization |520 |1 |

Since the Book table represents the ?many? portion of our one-to-many relationship, we have placed the primary key value of the Publisher as in aPublisher_ID column as a foreign key.

In the tables above the values stored refer to primary key values from the Book, Author, Subject and Publisher tables. Columns in a table that refer to primary keys from another table are known as foreign keys, and serve the purpose of defining data relationships.

In database systems (DBMS) which support referential integrity constraints, such as the InnoDB storage engine for MySQL, defining a column as a foreign key will allow the DBMS to enforce the relationships you define. For example, with foreign keys defined, the InnoDB storage engine will not allow you to insert a row into the Book_Subject table unless the book and subject in question already exist in the Book and Subject tables or if you're inserting NULL values. Such systems will also prevent the deletion of books from the book table that have ?child? entries in the Book_Subject or Book_Author tables.

Second Normal Form

Where the First Normal Form deals with atomicity of data, the Second Normal Form (or 2NF) deals with relationships between composite key columns and non-key columns. As stated earlier, the normal forms are progressive, so to achieve Second Normal Form, your tables must already be in First Normal Form.

The second normal form (or 2NF) any non-key columns must depend on the entire primary key. In the case of a composite primary key, this means that a non-key column cannot depend on only part of the composite key.

Let's introduce a Review table as an example.

Table 9. Review Table

|ISBN |Author_ID |Summary |Author_URL |

|1590593324 |3 |A great book! | |

In this situation, the URL for the author of the review depends on the Author_ID, and not to the combination of Author_ID and ISBN, which form the composite primary key. To bring the Review table into compliance with 2NF, the Author_URL must be moved to the Author table.

Third Normal Form

Third Normal Form (3NF) requires that all columns depend directly on the primary key. Tables violate the Third Normal Form when one column depends on another column, which in turn depends on the primary key (a transitive dependency).

One way to identify transitive dependencies is to look at your table and see if any columns would require updating if another column in the table was updated. If such a column exists, it probably violates 3NF.

In the Publisher table the City and State fields are really dependent on the Zip column and not the Publisher_ID. To bring this table into compliance with Third Normal Form, we would need a table based on zip code:

Table 10. Zip Table

|Zip |City |State |

|94710 |Berkeley |California |

In addition, you may wish to instead have separate City and State tables, with the City_ID in the Zip table and the State_ID in the City table.

A complete normalization of tables is desirable, but you may find that in practice that full normalization can introduce complexity to your design and application. More tables often means more JOIN operations, and in most database management systems (DBMSs) such JOIN operations can be costly, leading to decreased performance. The key lies in finding a balance where the first three normal forms are generally met without creating an exceedingly complicated schema.

Joining Tables

With our tables now separated by entity, we join the tables together in our SELECT queries and other statements to retrieve and manipulate related data. When joining tables, there are a variety of JOIN syntaxes available, but typically developers use the INNER JOIN and OUTER JOIN syntaxes.

An INNER JOIN query returns one row for each pair or matching rows in the tables being joined. Take our Author and Book_Author tables as an example:

mysql> SELECT First_Name, Last_Name, ISBN

-> FROM Author INNER JOIN Book_Author ON Author.Author_ID = Book_Author.Author_ID;

+------------+-----------+------------+

| First_Name | Last_Name | ISBN |

+------------+-----------+------------+

| Chad | Russell | 1590593324 |

| Jon | Stephens | 1590593324 |

+------------+-----------+------------+

2 rows in set (0.05 sec)

The third author in the Author table is missing because there are no corresponding rows in the Book_Author table. When we need at least one row in the result set for every row in a given table, regardless of matching rows, we use an OUTER JOIN query.

There are three variations of the OUTER JOIN syntax: LEFT OUTER JOIN, RIGHT OUTER JOIN and FULL OUTER JOIN. The syntax used determines which table will be fully represented. A LEFT OUTER JOIN returns one row for each row in the table specified on the left side of the LEFT OUTER JOIN clause. The opposite is true for the RIGHT OUTER JOIN clause. A FULL OUTER JOIN returns one row for each row in both tables.

In each case, a row of NULL values is substituted when a matching row is not present. The following is an example of a LEFT OUTER JOIN:

mysql> SELECT First_Name, Last_Name, ISBN

-> FROM Author LEFT OUTER JOIN Book_Author ON Author.Author_ID = Book_Author.Author_ID;

+------------+-----------+------------+

| First_Name | Last_Name | ISBN |

+------------+-----------+------------+

| Chad | Russell | 1590593324 |

| Jon | Stephens | 1590593324 |

| Mike | Hillyer | NULL |

+------------+-----------+------------+

3 rows in set (0.00 sec)

The third author is returned in this example, with a NULL value for the ISBN column, indicating that there are no matching rows in the Book_Author table.

Conclusion

Through the process of database normalization we bring our schema's tables into conformance with progressive normal forms. As a result our tables each represent a single entity (a book, an author, a subject, etc) and we benefit from decreased redundancy, fewer anomalies and improved efficiency.

Resources

The following resources were either used in the development of this article or are considered to be of interest by the author.

• A Relational Model of Data for Large Shared Data Banks - E.F. Codd

• Database Normalization - Wikipedia

• A Simple Guide to Five Normal Forms in Relational Database Theory - William Kent

• Normal Form Definitions and Examples

• MySQL Database Design and Optimization - Jon Stephens & Chad Russell



Entity-Relationship Diagram (ERD)

The purpose of Entity Relationship Diagram (ERD) is to organize information into manageable record in a database. ERD illustrate the logical structure of database. They are used to:

• Identify the data that must be captured, stored and retrieved for the particular application

• Identify the data required to derive and report on the performance measures that an organization should be monitoring.

ERDs have three different components:

• Entities

• Attributes

• Relationship

Entity

Entities are people, places, things and events. In short, anything you need to store data about. Entities also represent collection of things. For example a STUDENT entity might represent a collection of all the things that can be said about a student. Entities are represented on the ERD by labeled boxes.

Attributes are the properties or characteristics of an entity. These are the smallest unit of data that can be described in a meaningful manner. For example STUDENT entity may have attributes such as:

STUDENT

StudentNumber

FirstName

Surname

TelephoneNumber

DOB

Form

The relationship is the interaction between the entities. A relationship may be represented by a diamond shape, or more simply, by the line connecting the entities. In either case, verbs are used to label the relationships.

The cardinality defines the relationship between the entities in terms of numbers. An entity may be optional: for example, a sales rep could have no customers or could have one or many customers; or mandatory: for example, there must be at least one product listed in an order. There are several different types of cardinality notation; crow's foot notation, used here, is a common one. In crow's foot notation, a single bar indicates one, a double bar indicates one and only one (for example, a single instance of a product can only be stored in one warehouse), a circle indicates zero, and a crow's foot indicates many. The three main cardinal relationships are: one-to-one, expressed as 1:1; one-to-many, expressed as 1:M; and many-to-many, expressed as M:N.

[pic]

adapted from

The steps involved in creating an ERD are:

• Identify the entities.

• Determine all significant interactions.

• Analyze the nature of the interactions.

• Draw the ERD.

EXAMPLE of an ERD

WOD 1: Warehouse

This WOD was a simple ERD diagram:

[pic]

This diagram shows that there is a many to many relationship between tag and product which means there can be many tags for one product and many products for a single tag. Product has a many to one relationship with stock item which means that there can be only one product associated with a stock item but there can be many stock items with the same product. Finally, there is a many to one relationship between stock item and warehouse meaning there can only be one warehouse per stock item but a warehouse can have many stock items.

Taken from: , retrieved on July 27, 2015

[pic][pic][pic][pic][pic][pic]

-----------------------

STUDENT

StudentNumber

FirstName

Surname

TelephoneNumber

DOB

Form

STUDENT

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download