# Normalization

## Overview

- In this topic, we will learn about a mathematical way to argue why
  one data model is better than the other.
- We will learn principled methods and some algorithms
  (i.e. decomposition) to improve a "bad" data model.
- We will talk about importance of integrity constraints in database
  modeling.

## Why some data models can be bad?

Suppose we are given the following MusicGroups relation (some values
fictitious):

| Group   | Artist        | Genre    | DateFounded | DateJoined |
| ------- | ------------- | -------- | ----------- | ---------- |
| Queen   | F. Mercury    | Pop/Rock | 1971        | 1971       |
| Queen   | B. May        | Pop/Rock | 1971        | 1972       |
| Queen   | R. Taylor     | Pop/Rock | 1971        | 1971       |
| Queen   | J. Deacon     | Pop/Rock | 1971        | 1972       |
| Beatles | J. Lennon     | Pop/Rock | 1960        | 1960       |
| Beatles | P. McCartney  | Pop/Rock | 1960        | 1960       |
| Beatles | R. Starr      | Pop/Rock | 1960        | 1960       |
| Beatles | G. Harrison   | Pop/Rock | 1960        | 1960       |


- What is the key of this relation? Which combination of attributes must
  be unique for each tuple?

  - Group, Artist
  - Remember that you cannot have missing values for the key
    attributes.

- What types of issues do we see with this database?

  > - We repeat information: Genre for a group is repeated for each
  >   artist.
  >
  >   - It is possible that we make an error and change the value for one
  >     tuple. Then, we are left with not knowing what the correct genre
  >     was.
  >
  > - What happens we delete all the artists in a group? We end up loosing
  >   information related to the group. Should they be coupled?
  >
  > - What if a new group is formed with no artists? We had not foreseen that
  >   such a group exists. But, we cannot enter this as our data model does not
  >   allow it.

## Normalization Overview

- Given a relation, we will examine whether all the attributes in the
  relation should be in the same relation or should be split into multiple
  relations.

- Given the meaning of these attributes, we need to write down what real
  life conditions/constraints must be true.

  - Can two music groups have the same name?
  - Can two artists have the same name?
  - Can an artist join a music group more than once?

- Often the database designer will make some assumptions regarding these.

  - Having constraints will help discovery of incorrect data.
  - Choose the important constraints to track.

## Functional Dependency

- A function dependency is a logical expression of the form X -> Y such that

  - X, Y are sets of attributes from the database schema
  - X -> Y means that whenever two tuples in a relation have the same
    values for attributes in X, then they must also have the same
    values for attributes in Y.

- Functional dependencies are defined for a relation as constraints by
  the database designer based on the meaning of the attributes and the
  properties they represent

- Example:

  ```
  MusicGroups(name, artist, genre, dateFounded, dateJoined)

  name -> genre dateFounded
  name artist -> dateJoined
  ```

  - name -> genre dateFounded

    Means that given group name, there is a single genre and a single
    value for the dateFounded attribute

  - name artist -> dateJoined

    Means that given a group name and an artist, there is a single
    dateJoined corresponding to the date the artist joined the group.

  - What if an artist joins the group multiple times? According to our
    data model, we are not going to store that. Remember, you define the
    rules for your database.

- Example

  ```
  People(Name, Phone, InsCompName, InsCompAddress)

  Name -> Phone, InsCompName
  InsCompName -> InsCompAddress
  ```

- Example

  ```
  Companies(company_name, company_address, date_founded, owner_id, owner_name, owner_title, #shares)

  company_name  -> company_address
  company_name  -> date_founded
  company_name, owner_id -> owner_title
  company_name, owner_id -> #shares
  company_name, owner_title -> owner_id
  owner_id -> owner_name
  ```

  - Rules of this database according to these functional dependencies:

  > - A single owner can only hold any specific title
  >
  >   company_name, owner_title -> owner_id
  >
  > - An owner can have a single title within the company
  >
  >   company_name, owner_id -> owner_title
  >
  > - Each title is given to a single person within the company
  >
  >   company_name, owner_title -> owner_id

## Keys and functional dependencies

- Functional dependencies tell us what the keys are!

- For example

  ```
  Student(id, name, address)

  id -> name, address  (FD1)
  ```

  means that all tuples with the same id must have a unique name and address.

- Then id is the key! No two tuples can have the same id, but
  different name or address. So, id must be unique.

- But if the above is true, then we must also have that:

  ```
  id name -> name, address  (FD2)
  ```

  Then, id, name is a key too. Which one is better? The smaller one!

- Also, the functional dependency FD1 is stronger than FD2, i.e. you can
  obtain FD2 from FD1.

## Functional Dependencies

- For a given relation, we look at all valid set of functional
  dependencies together.

- We will use F to denote the set of all functional dependencies.

  - All functional dependencies in F together tell us all the
    relevant rules for our relation, such as the type of tuples can we
    store.
  - Given F, we can also infer new rules using the following inference
    rules.

- Some functional dependencies may be removed if they are not needed
  and some may be simplified based on others. To do this, we will see
  some rules first.

## Superkey: Unique Attributes[^sn1]
[^sn1]: Superkeys may or may not
  be minimal, but they are unique. Uniqueness means that if X is a
  superkey, then whenever two tuples have the same values for X, then
  it must have the same value for all the remaining attributes.


- Given a relation R(A1,...,An) and a set of functional dependencies
  F, a superkey (superset of keys) is a set of attributes B1,...,Bm
  such that

  - B1,...,Bm imply all the attributes in relation R. This is called
    uniqueness constraint.

    No two tuples may have the same values for the attributes B1,...,Bm.

    In other words,

    ```
    Given F, if

    B1,...,Bm -> A1,...,An

    is implied by the functional dependencies in F (where A1,...An)
    are all the attributes in R), then B1,...,Bm is a superkey.
    ```


## Keys[^sn2]
[^sn2]: Keys are superkeys that are also minimal, i.e. if X is a key,
then whenever two tuples have the same values for X, then
  it must have the same value for all the remaining attributes and
no subset of X is also a key.

- Given a relation R(A1,...,An) and a set of functional dependencies F,
  a key is a set of attributes B1,...,Bm such that

  - B1,...,Bm is a superkey! This is called the uniqueness constraint.
  - No subset of B1,...,Bm is a superkey. This is called the
    minimality constraint.

- From now on we will use X, Y to represent a set of attributes.

## Functional Dependency Inference

- Given a set of F functional dependencies, the closure F+ of the set
  is the set of all functional dependencies implied by F.

- A functional dependency X->Y is implied by a set F of functional
  dependencies if X->Y is in the closure.

- The rules below allow us to find whether a functional dependency
  is implied the others.

  The closure F+ is simply the set of all functional dependencies that
  can be found by applying all possible rules.

## Functional Dependency Inference Rules

- **Trivial Functional Dependencies (Reflexivity)**

  ```
  If Y ⊆ X, then  X ->Y is true for all relations.
  ```

  Proof. If two tuples have the same values for attributes in X, then
  it is already given that they have the same values for the
  attributes of Y.

- **Splitting Rule**

  ```
  Given a functional dependency

  A1,...,An -> B1,...,Bm

  we can split it into singletons on the right hand side:

  A1,...,An -> B1
  ....
  ....
  A1,...,An -> Bm
  ```

- **Combining Rule**

  ```
  Given functional dependencies

  A1,...,An -> B1
  ....
  ....
  A1,...,An -> Bm

  we can combine the attributes on the right hand side:

  A1,...,An -> B1,...,Bm
  ```

- **Transitivity**

  ```
  Given functional dependencies X -> Y and Y -> Z
  we can conclude that: X -> Z
  ```

  for sets of attributes X, Y and Z.

- **Augmentation**

  ```
  If A1,...,An -> B1,...,Bm and given attributes C1,...,Ck
  we can conclude that A1,...,An,C1,...,Ck -> B1,...,Bm,C1,...,Ck
  ```

- Note that reflexivity, augmentation and transitivity are sufficient
  and necessary rules. The rest of the rules can be driven from these
  three.

### Inference example using the above rules

> ```
> Given: F = {A -> BC, CD -> F, BE -> D} for R(A,B,C,D,E,F)
>
> Is  AE -> CDF true?
> ```
>
> Apply all the inference rules to get this:
>
> ```
> BE -> D    (given)
> BCE -> CD  (accumulation of BE -> D)
> BCE -> D   (decomposition)
> A -> BC    (given)
> AE -> BCE  (acculumulation)
> AE -> D    (transitivity)
> AE -> BCDE (combining rule)
> AE -> CD   (decomposition)
> AE -> F    (transitivity)
> AE -> CDF  (combining rule)
> ```

- Method 2: compute the closure of AE to find all that is implied by
  AE. See method below.

## Closure of a set of attributes

- The closure of a set X of attributes given by X+ with respect to
  a set F of functional dependencies is the set of all attributes implied
  by X with respect to F.

  In other words, if Y=X+, then X->Y is in F+.

- Given a set F of functional dependencies, the closure X+ of a set X
  of attributes is given by the following algorithm.

  1. Initialize, X+ = X
  2. Find any functional dependency Y -> Z in F such that Y ⊆ X+, then
     X+ = X+ ∪ Z.
  3. Continue until X+ does not change.

- Inference rule:

  ```
  If Y⊆X+ with respect to F then X->Y is implied by F (is in F+).
  ```

## Closure Example

- Given A -> BC, CD -> E, BE -> C

  - AE closure must include AE (given AE, we know AE).

  - Now given we know AE, what else can we conclude based on the above fds?

    - AE + = \{AE}, use A -> BC to get:

    - AE + = \{ABCE} nothing else we can use. So, we are finished.

    - This means: AE -> ABCE.

    - We can also write it as:

      AE -> A, AE -> B, AE->C, AE->E

## Keys revisited

- A key for a relation R is a minimal set of attributes A1,..,An such
  that {A1,...,An}+ is the set of all attributes in R.

- Given:

  name -> genre dateFounded

  name artist -> dateJoined

  key: name artist

  {name artist}+ = {name, artist, genre, dateFounded, dateJoined}

- Find the keys for the following relations:

  ```
  R1(A,B,C,D,E,F,G)  F = {A->BC, AD->EF, AC->G}

  R2(A,B,C,D,E) F = {A->BC, D->E}

  R3(A,B,C,D) F = {A->BC}

  R4(A,B,C,D,E,F,G)  F = {A->BCD, BE->FG, CG->A}
  ```

## Simplifying the set of functional dependencies

- Given a key for a relation, we would like to argue whether the
  underlying model is a good model or not (using normal forms).
- If the model is not good, i.e. does not satisfy a certain normal
  form, we can improve it by decomposition.
- To properly apply decomposition, we need to make sure that the functional
  dependencies do not have any redundant information that may result in
  non-optimal decompositions.
- We will revisit the closure of all functional dependencies to
  accomplish this.

## Closure of a set of functional dependencies (revisited)

- Closure F+ of a set F of function dependencies is the set of all functional
  dependencies that can be inferred by the given set of inference rules.

  > - Closure F+ includes all dependencies in F.
  >
  >   ```
  >   F ⊆ F+
  >   ```
  >
  > - Closure includes all trivial functional dependencies.
  >
  > - Closure also includes all other non-trivial dependencies that can be
  >   inferred by using the above rules.

- Two sets of functional dependencies F1 and F2 over the same relation
  are equivalent, if they have the same closure:

  ```
  F1 equivalent F2 iff F1+ = F2+
  ```

  In essence, two functional dependencies with the same closure have
  the same meaning.

- For example F1 and F2 below are equivalent:

> ```
> F1 = {A->BC, A->DE, E->F, AB->DF}
> F2 = {A->BC, AB->DE, E->F}
> ```

- How can you check two sets F1 and F2 are equivalent?

  - First check that each functional dependency in F1 is implied by F2.

    ```
    For each X->Y in F1:
        compute X+ using F2 and check if Y ⊆ X+
    ```

    This proves that F1+ ⊆ F2+.

  - Then check that each functional dependency in F2 is implied by F1.

    ```
    For each X->Y in F2:
        compute X+ using F1 and check if Y ⊆ X+
    ```

    This proves that F2+ ⊆ F1+.

  - If both are true, then the closures must be the same: F1+ = F2+.

## Basis of Functional Dependencies

- Basis: A basis is a set of functional dependencies such that
  there is only one attribute on the right hand side of a functional
  dependency.

  It is easy to convert any functional dependency set to a basis by
  using the splitting rule.

  ```
  F = {A->BC, CD->EF}
  ```

  convert to:

  ```
  F = {A->B, A->C, CD->E, CD->F}
  ```

- Remember, split the right hand side. The left hand side should not
  be changed unless it is implied by other functional dependencies
  as we will see next.

## Minimal Set of Functional Dependencies

- A basis F is minimal if there is no other basis F1 that can be obtained
  from F by

  - either removing some functional dependencies
  - or by removing some attributes from the left hand side of functional
    dependencies

  such that F and F1 are equivalent (i.e. they have the same closure).

## Finding a minimal basis

- Input a basis F.

  - Step 1.

    For each fd X -> Y in the set F, check if F - { X -> Y } implies X -> Y.
    If so, remove X -> Y.

    Note: all trivial functional dependencies follow this rule.

  - Step 2.

    For each fd XW -> Y in the set F, check if X+ is the same
    with respect F and (F - { XW -> Y }) ∪ { X -> Y }.

### Example

- F = {ABC -> D, AB -> E, E -> C, ABE -> C, E -> F, A -> A}

  - Step 1. Can we remove ABE -> C ?

    Given F, ABE+ = {A,B,C,D,E,F}

    Given F1 = {ABC -> D, AB -> E, E -> C, E -> F},

    ABE+ = {A,B,C,D,E,F}.

    As they are equivalent, we can remove ABE -> C.

    We can also remove any functional dependency that is trivial, i.e. A -> A.

- F = {ABC -> D, AB -> E, E -> C, E -> F}

  - Step 2. Can we replace ABC ->D with AB -> D ?

    Given F:

    AB+ = {A,B,C,D,E,F}

    Given F2={AB -> D, AB -> E, E -> C, E -> F}:

    AB+ = {A,B,C,D,E,F}.

    As they are equivalent, we can replace with F with F2.

## Normalization

- Given a schema, are there any problems with the way the attributes
  are grouped into the relations?

- Example:

  ```
  Person(Name, Phone, InsCompName, InsCompAddress)
  ```

- When two people have the same insurance company (InsCompName), then
  the insurance company address of these two people should also be the
  same.

  - Possible problems:

    1. If we insert a new person with the same insurance company but
       different address, the database becomes inconsistent. Addition
       of an atomic unit of information should not cause inconsistency.
    2. If we delete a person with a specific insurance company X and
       if there are no other people with this insurance company, we
       loose the information about where this company is located.
    3. If we update the insurance information for one person but not
       the others with the same insurance, then the database becomes
       inconsistent -similar to insertion above.

## Normal Forms

- Normal forms are introduced as rules that help avoid the above
  problems, based on the functional dependencies true for this
  relation.
- We have already seen first normal form: all attributes should
  have simple values.

## Boyce-Codd Normal Form (BCNF)

- Given a set F of functional dependencies, a set R is said to be in
  Boyce-Codd normal form if all functional dependencies X -> Y in F
  are

  - either trivial, i.e. Y ⊆ X
  - or has a superkey on the left side (i.e. X is a superkey).

## BCNF

- Example:

  ```
  Person(Name, Phone, InsCompName, InsCompAddress)

  Name -> Phone, InsCompName
  InsCompName -> InsCompAddress

  Key: Name
  ```

  Not in BCNF since the second f.d. does not have a superkey on the
  left (InsCompName is not a superkey).

- Example:

  ```
  MusicGroup(name, artist, genre, dateFounded, dateJoined)

  name -> genre dateFounded
  name artist -> dateJoined

  Key: name, artist
  ```

  Not in BCNF since the first f.d. violates it (name is not a superkey)

- Example:

  ```
  BankAccount(routingno, accountno, userid, bankname, bankaddress)

  userid, routingno, accountno -> bankname, bankaddress
  userid, bankname, accountno -> routingno

  Key: userid, routingno, accountno  or userid, bankname, accountno
  ```

  In BCNF because both keys have a superkey on the left hand side.

- Example:

  ```
  Student(Id, lastname, firstname, SSN, firstmajor, year, email)

  id -> lastname firstname ssn firstmajor year email
  ssn -> id lastname firstname firstmajor year email
  email -> lastname firstname id ssn firstmajor year

  Key: id or ssn or email
  ```

  In BCNF because all functional dependencies have a superkey on the left.

## Prime attribute

- Given a relation R and a set of functional dependencies, a prime
  attribute is an attribute that appears in a key for R.

## Third Normal Form (3NF)

- Given a set F of functional dependencies, a set R is said to be in
  third normal form if all functional dependencies X -> Y in F are

  - either trivial, i.e. Y ⊆ X,
  - or has a superkey on the left side (i.e. X is a superkey),
  - or has only prime attributes on the right hand side (i.e. all
    attributes in Y are prime attributes).

- Example:

  ```
  Address(city, state, street, zip)

  city, state, street -> zip
  zip -> city, state

  Key:  city, state, street  or zip, street
  Prime attributes: city, street, state, zip
  ```

  This is not in BCNF because the second functional dependency does not
  have a superkey on the left. However, as city and state are both prime
  attributes, this relation is in 3NF.

  - First functional dependency is not trivial but has a superkey on the left.
  - The second functional dependency is not trivial, does not have a
    superkey on the left, but all attributes on the right are prime
    attributes.

## BCNF and 3NF relations

- All relations that are in BCNF are also in 3NF. The reverse is not true.

  - In the above example, Address is in 3NF but not in BCNF.

- BCNF is the golden standard, it is desirable to put all relations in
  BCNF.

- If a relation is not in BCNF, we need to decompose the it to get relations
  that are in BCNF.

  - There is no single way to do this. You need to run the algorithm
    multiple ways and choose the most reasonable solution.

- If a relation is not in 3NF, there is a simple algorithm to
  decompose it into relations that are in 3NF.

  - It will produce a single result.

- We will now see how to do decompositions!

## Decomposition

- For example, we know that the following relation is not in BCNF:

  ```
  Person(Name, Phone, InsCompName, InsCompAddress)
  ```

  Because insurance information belongs in a different relation than people.
  A better model would be:

  ```
  Person(Name, Phone, InsCompName),
  Insurance(InsCompName, InsCompAddress)
  ```

This act of splitting the relation is called decomposition.

The objective is to have relations that are atomic, contain only
information relevant to the key as a whole, have fever null values.

- Given a relation R and functional dependencies F, a decomposition is
  given by

  ```
  R1(A1,..,An), F1
  R2(B1,...,Bm), F2
  ```

  such that

  - Atttributes in R1 and R2 together are equivalent to attributes in R

  - F1 only involves attributes in R1, and F1 ⊆ F+

  - F2 only involves attributes in R2, and F2 ⊆ F+

  - Tuples in R1 and R2 are obtained from R:

    $$
    R1 = \Pi_{A1,...,An}\, R

    R2 = \Pi_{B1,...,BM}\, R
    $$

## Decomposition Properties

- Suppose we are given a decomposition

  R1(A1,..,An), F1 and R2(B1,...,Bm), F2

  of a relation R with functional dependencies F.

- The decomposition is dependency preserving iff

  $$
  (F1 \cup F2)+ = F+
  $$

- The decomposition is lossless if we are guaranteed that

  $$
  R1 * R2 = R
  $$

  The natural join of tuples is guaranteed to return the same relation
  as R.

- We will now see how to create decompositions that satisfy these
  conditions.

## Projecting functional dependencies

- The functional dependencies for a decomposition are
  found by projecting the original set F of functional dependencies
  on the decomposed relation.

- Given:

  ```
  R(A,B,C,D), F = {A -> B, B -> CD}

  Decompose into: R1(A,B) and R2(A,C,D)
  ```

  What are the maximal set of functional dependencies implied by
  original set, but only involve attributes in R1 or R2?

- First find all (non-trivial) functional dependencies implied by the
  above:

  A -> B, B -> C, A -> C, AB -> C, etc.

- Now, for R1(A,B): choose the subset of these that have only A and B
  (A -> B)

- For R2(A,C,D): choose the subset that have only A, C and D (A -> C)

- Simplify each set to be a minimal basis!

- Is this decomposition after choosing best functional dependencies
  preserve them all?

  In other words, did we loose something by decomposing:

  ```
  R(A,B,C), F = {A -> B, B -> C}

  into:

  R1(A,B)  F1 = {A -> B}
  R2(A,C,D) F2 = {A -> C}
  ```

- We check if this decomposition is dependency preserving.

  ```
  F1 ⋃ F2 and F are equivalent

  F1 ⋃ F2 = {A -> B, B -> C}
  F = {A -> B, A -> C }
  ```

- Are the two equivalent?

  F1: A -> B, B -> C

  F1 implies everything in F2

  F2: A -> B, A -> C

  F2 does not imply everything in F1 (i.e. B -> C)

  B+ = \{B}, does not include C.

  Hence, F1 and F2 are not equivalent.
  Actually, F1 is more strict than F2.

  This means that this decomposition is not dependency preserving as B
  -> C is lost!

## Projecting functional dependencies (Algorithm)

- Suppose we are given a set of functional dependencies F for a
  relation R. We decompose the relation to a subset of the attributes
  L, R1 = ΠL R.

- The functional dependencies that hold for R1 are computed as follows:

  - For all possible subsets X of L, compute X+ with respect to F.
  - For all attributes A in X+ that are in L, add the dependency X -> A
    to the projection.
  - Simplify the basis to find a minimal basis.

- For example, for R2 above, we need to compute: A+,C+,D+,AC+,AD+,CD+.

## Dependency Preserving

- Given a set F of functional dependencies for R, a decomposition R
  into R1, R2 is dependency preserving if

  - Let F1 be the projection of F onto R1

  - Let F2 be the projection of F onto R2

  - If F1 ∪ F2 ≡ F then the decomposition is dependency preserving.

    This is the same as checking: (F1 ∪ F2)+ = F+

## Chase test for lossless decomposition

- Recall that a decomposition is lossless if we are guaranteed that

  $$
  R1 * R2 = R
  $$

  The natural join of tuples is guaranteed to return the same relation
  as R.

- The chase test for lossless decomposition works as follows:

  - Given a decomposition R1,...,Rn we construct a relation R with a
    tuple for each decomposition. For each attribute, we will use a
    symbol (usually a lowercase letter).
  - The tuple for relation Ri has the specific symbol for each
    attribute in the relation. But, the symbols for the attributes in
    Ri have no subscript. The remaining symbols have subscript i.

- Given a functional dependency X -> Y, we find two tuples t1, t2 such
  that the attributes in X are the same for these tuples. Then, we set
  attributes in Y for these two tuples to be the same as follows:

  - If the attributes have both subscript, then set one to be equivalent
    to the other arbitrarily.
  - If one of the attributes have no subscript, then change the other so
    that it has no subscript as well.

- We continue to apply the functional dependencies until we find a
  tuple with no subscripts, which is a proof that the decomposition is
  lossless, and stop.

- If after we apply all the functional dependencies, there are no
  tuples without a subscript, then this decomposition is
  lossy. Furthermore, the resulting table is a counter example: if you
  decompose this table and join back, you will get more tuples than the
  original table.

## Chase Algorithm Examples

- Example:

  ```
  Given R(A,B,C) and F = {C -> B}
  Decompositions R1(A,C) and R2(B,C)
  ```

  We have following starting relation:

| A   | B   | C   |
| --- | --- | --- |
| a   | b1  | c   |
| a2  | b   | c   |

- Apply C->B to get:

| A   | B   | C   |
| --- | --- | --- |
| a   | b   | c   |
| a2  | b   | c   |

- This is the point we stop. This decomposition is losless as the
  first tuple is without a subscript.

- Example:

  ```
  R(A,B,C,D,E,F)   F= {B->E, EF->C, BC->A, AD->E}
  R1(A,B,C,F)
  R2(A,D,E)
  R3(B,D,F)
  ```

| A   | B   | C   | D   | E   | F   |
| --- | --- | --- | --- | --- | --- |
| a   | b   | c   | d1  | e1  | f   |
| a   | b2  | c2  | d   | e   | f2  |
| a3  | b   | c3  | d   | e3  | f   |

- Apply B->E to rows 1 and 3.

| A   | B   | C   | D   | E   | F   |
| --- | --- | --- | --- | --- | --- |
| a   | b   | c   | d1  | e1  | f   |
| a   | b2  | c2  | d   | e   | f2  |
| a3  | b   | c3  | d   | e1  | f   |

- Apply EF->C to rows 1 and 3.

| A   | B   | C   | D   | E   | F   |
| --- | --- | --- | --- | --- | --- |
| a   | b   | c   | d1  | e1  | f   |
| a   | b2  | c2  | d   | e   | f2  |
| a3  | b   | c   | d   | e1  | f   |

- Apply BC->A to rows 1 and 3.

| A   | B   | C   | D   | E   | F   |
| --- | --- | --- | --- | --- | --- |
| a   | b   | c   | d1  | e1  | f   |
| a   | b2  | c2  | d   | e   | f2  |
| a   | b   | c   | d   | e1  | f   |

- Apply AD->E to rows 2 and 3.

| A   | B   | C   | D   | E   | F   |
| --- | --- | --- | --- | --- | --- |
| a   | b   | c   | d1  | e1  | f   |
| a   | b2  | c2  | d   | e   | f2  |
| a   | b   | c   | d   | e   | f   |

- The last row has no subscripted letters, hence this is a lossless
  decomposition.

- Example:

  ::

  : R(A,B,C) {A->C}
    R1(A,B)
    R2(B,C)

| A   | B   | C   |
| --- | --- | --- |
| a   | b   | c1  |
| a2  | b   | c   |

> Cannot apply any functional dependencies and this relation is a
> lossy.
>
> We can use the above relation as proof of this:
>
> R1:

| A   | B   |
| --- | --- |
| a   | b   |
| a2  | b   |

> R2:

| B   | C   |
| --- | --- |
| b   | c1  |
| b   | c   |

> R1\*R2 (natural join) results in a different relation than R:

| A   | B   | C   |
| --- | --- | --- |
| a   | b   | c1  |
| a   | b   | c   |
| a2  | b   | c   |
| a2  | b   | c1  |

## Objectives for Decomposition Methods

- If we are decomposing relations to obtain BCNF or 3NF relations:

  - All our decompositions must be lossless

  - It is highly desirable that decompositions are dependency
    preserving, but that may not always be possible.

    In this case, choose the least important dependencies to let go.

## Decomposition into BCNF

- Given a set of functional dependencies F for a relation R(A1,...,An)
  that violate BCNF,

  1. Convert F to minimal basis

  2. Find a functional dependency X -> Y that violate BCNF

     - Compute X+
     - Decompose R into two: a relation R1 containing attributes
       in X+ and a relation containing attributes {A1,...,An} - (X+ - \{X}).
     - Find projection of f.d.s into R1 and R2, now recursively apply
       BCNF decomposition to R1 and R2 (with corresponding set of
       functional dependencies) until all the relations are in BCNF.

- Example

  ```
  MusicGroup(name, artist, genre, dateFounded, dateJoined)

  name -> genre dateFounded

  name artist -> dateJoined

  Key: name, artist

  Given name -> genre dateFounded violates BCNF, decompose into:

  R1(name, genre, dateFounded), name -> genre dateFounded

  R2(name, artist, dateJoined), name artist -> dateJoined
  ```

## Notes about BCNF Decomposition

- BCNF decomposition algorithm is non-deterministic. Depending on the
  choice of functional dependency you choose in each step, you may get
  a different output.

  You must use judgment of which decomposition results in a better data
  model.

- While BCNF decomposition is lossless, it is not always guaranteed
  to be dependency preserving.

- Also, sometimes BCNF may result in unintuitive results.

- Example:

  ```
  tracks(album, trackNo, song, version)

  album trackNo -> song

  song version album -> trackNo

  Key:  song, version, album   or album, trackNo, version

  Not in BCNF due to album trackNo -> song
  ```

  BCNF Decomposition requires us to create relations:

  ```
  tracks1(album, trackNo, song)   album trackNo -> song

  tracks2(song, version)
  ```

- This decomposition is not desirable for two reasons:

  - The second dependency is lost: there is no way now to check for
    it after decomposition.
  - These four attributes semantically belong together, but BCNF
    decomposition is forcing us to divide them into different relations.

## 3NF decomposition algorithm

- Given a set F of functional dependencies that form a minimal basis
  for a relation R, use the combining rule for F to combine all f.d.s
  with the same left hand side. (this is an optional step as discussed
  in the example below)

  - Decomp = {}

  - For each functional dependency X -> Y in F:

    - If there is no relation in Decomp that contains all the
      attributes in X and Y then add a relation with attributes X∪Y to
      Decomp.

  - If there are no relations in Decomp that have all the attributes
    of one of the keys of R then, add a new relation to Decomp that
    contain all the attributes in one of the keys of R.

  - Simplify Decomp: If R1,R2 are in Decomp but R2 contains all the
    attributes in R1 then remove R1.

- The algorithm guarantees that

  - the resulting relations are in 3NF,
  - the decomposition is dependency preserving, and
  - the decomposition is lossless

## 3NF Decomposition Examples

- Given

  ```
  R(A,B,C,D,E,F)  F = {AB -> C, AB -> F, C -> A, BC -> D}

  Key: ABE or BCE (not in 3NF or BCNF)
  ```

  Before decomposing, join the right hand sides for F to get:

  ```
  F = {AB -> CF, C -> A, BC -> D}

  R1(A,B,C,F)    {AB -> CF, C -> A}
  Key: AB

  R2(B,C,D) {BC -> D}
  Key: BC

  R3(A,B,E)
  Key: ABE
  ```

- Note: no need to create a new relation for C -> A if we already
  created R1. If R1 is created afterwards, we will remove the relation
  with attributes (A,C) in the simplification step.

- Note: R3 is needed as no relation has all the attributes of a key
  (either ABE or BCE is enough)

- Note: What happens if we do not use the combining rule? We would
  get the following decomposition:

  ```
  R1(A,B,C)  {AB -> C, C->A}
  R2(A,B,F)  {AB -> F}
  R3(B,C,D)  {BC -> D}
  R4(A,B,E)
  ```

  Attributes A,B,C,F are now distributed to two relations. This may
  be desirable in some applications, allowing us to separate frequently
  queried attributes from the other. However, it may also be very undesirable
  if we need to join R1 and R2 all the time. It will really depend on the
  use of these relations.

## Fourth Normal Form (4NF)

- Suppose we have

  ```
  People(id, hobby, phone)
  ```

  As hobby and phone can take multiple values, there are no functional
  dependencies for this relation. So, it is in BCNF.

- But, there is still a problem:

  Hobby and phone take multiple values but they are not related. Why
  put them in the same relation?

- We must separate them.

  Decompose into

  ```
  PersonHobby(id, hobby)
  PersonPhone(id, phone)
  ```

- Multivalued dependencies shows the technical way to accomplish this.

## Multi-valued dependencies

- A multi-valued dependency of the form

  ```
  A1 ...  An =>> B1 ... Bm
  ```

  means that for all pairs of tuples t1 and t2 that agree on A, we can
  find a tuple v in R such that

  - v agrees with t1 and t2 on As
  - v agrees with t1 on B’s
  - v agrees with t2 on the remaining attributes (not As or Bs)

- For example, suppose we have that

  ```
  hero(name, weaponsused, hobby)
  name =>> hobby
  ```

  This means that we would expect to see tuples as follows:

| hero | weaponsused  | hobby            | tuple |
| ---- | ------------ | ---------------- | ----- |
| lara | ice axe      | relic collecting | t1    |
| lara | compound bow | motorcycling     | t2    |
| lara | ice axe      | motorcycling     |       |
| lara | compound bow | relic collecting | v     |

- where for t1 and t2 given as above, we need to have v that agrees
  with the hobby for t1 and weaponsused for t2.

  In other words, unless we store all possible combination of values
  for Bs and the remaining attributes, we claim that there is a
  relation between them. But, the multi-valued functional dependency

  ```
  A1 ...  An =>> B1 ... Bm
  ```

  claims that Bs and the remaining attributes are unrelated.

## Inference rules

- Trivial MDs, A1 ... An =>> B1 ... Bm is true if {B1, ...,Bm } ⊆
  {A1,...,An}
- Trivial MDs, A1 ... An =>> B1 ... Bm is trivial if {A1,...,An,B1,...,Bm } consist of all the attributes in R.
- The transitive rule: X =>> Y and Y =>> Z implies X =>> Z (where
  X,Y,Z are sets of attributes).
- If A1 ... An -> B1 ... Bm then A1 ... An =>> B1 ... Bm is also true.
- A1 ... An =>> B1 ... Bm is true and C1 ... Ck are all the
  attributes in R that are not As or Bs then A1 ... An =>> C1 ... Ck
  is also true.
- Note that the splitting/combining rule does not apply to MVDs.

## Fourth Normal Form

- A relation is fourth normal form iff whenever
  A1 ... An =>> B1 ... Bm is a non-trivial MVD, then A1 ... An is a superkey.
- To decompose a relation into Fourth normal form, use an algorithm
  similar to BCNF decomposition algorithm using MVDs.
- Relations in 4NF ⊆ Relations in BCNF ⊆ Relations in 3NF.

## 4NF decomposition

- Given a relation R where A1 ... An =>> B1 ... Bm violates the 4NF,
  decompose R into:

  R1(A1,...,An,B1,..,Bm)

  R2 contains all attributes except for B1,...,Bm.

  If the resulting relations are not in 4NF, then continue decomposing
  until they are.

## Summary

- Make sure that all tables in your database are in 3NF and 4NF in the
  least, and BCNF if possible.

- All decompositions you perform should be lossless.

- Not all decompositions can be necessarily dependency preserving.

  Choose which dependencies are important to enforce for your application
  and find ways to enforce them, either through integrity constraints
  or other application logic.

- Normalization and decomposition is one way of thinking of improving
  a data model, but may not necessily help design the database from
  scratch.

  - It does not help when you have already normalized the database too
    far, splitting related information across different tables.
  - It does not help you see the big picture.

- We will learn E-R modeling methodology to design a full database,
  but with normalization in our minds.