For all but the very simplest database applications, there are many, many different relational database "designs" (schemas) that can be used to store the relevant data.

- Q: Can one schema be much better than another?
- A: Often, yes

- In reality, for big applications higher-level design tools are often used (we'll cover that topic soon), but sometimes designers go straight to relations.
- Relational design is one of the more theoretical topics of the course, but it is still of practical use.

- ID and name
- Applying to one or more campuses
- Played zero or more sports in high school
- Attended one or more high schools

Proposed relation to capture this information:

Apply(ID, name, campus, sport, HScode, HScity)

Consider "Mary" (ID 123) who is applying to Berkeley and Santa Cruz. She played tennis, basketball, and volleyball at both of her high schools, Gunn (code 26) and Menlo-Atherton (code 28).

*Question: How many tuples in the relation?*

`Apply`

relation exhibits three types of anomalies:
- Redundancy
- Update Anomaly
- Deletion Anomaly

Good designs:

- No anomalies
- Can reconstruct all original information

- Start with "mega" relations containing all information
- Then decompose them into smaller, better relations containing the same information
- Do the decomposition automatically

- Designer specifies mega-relations plus
*properties of the data* - System decomposes based on properties
- Final set of relations satisfies
*normal form*-- guarantees no anomalies, no lost information

- If designer specifies
**functional dependencies**(**FDs**) => system produces**Boyce-Codd Normal Form**(**BCNF**) - If designer also specifies
**multivalued dependencies**(**MVDs**) => system produces**Fourth Normal Form**(**4NF**) - 4NF is
*stronger*than BCNF: Any design that is in 4NF is also in BCNF, but not vice-versa.

`Apply(ID,name,campus)`

Students can apply to multiple campuses

*Question: Is this a good design or should we decompose?*

Can have many duplicates of each

`(ID,name)`

- Redundant: a given
`ID`

always has the same`name`

(but not necessarily vice-versa) - Better to store each
`ID-name`

connection once

`Apply(ID,campus,sport)`

*Question: Is this a good design or should we decompose?*

Multiplicative effect: C campuses and S sports yields CxS tuples

*Question: captured by FDs and BCNF?*

*4NF decomposition* says make a separate relation (A,B) and take
B out of the original.

- Functional dependencies
- Boyce-Codd Normal Form
- BCNF decomposition algorithm
- Multivalued dependencies
- Fourth Normal Form
- 4NF decomposition algorithm

Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority) Apply(ID, campus, date, major)

`Student.priority`

is determined by
`Student.cumGPA`

.
*(show example)*

Formally:

For every pair of tuples t and u in Student: if t[cumGPA] = u[cumGPA] then t[priority] = u[priority]This is a "functional dependency" (FD): a constraint specified with the schema of a relation.

- Based on knowledge of the real world
- All data must adhere to it

A1, A2, ..., Am -> B1, B2, ..., Bn (commas may be omitted)states that for every pair of tuples t and u in R: if

```
t[A1,...,Am]
= u[A1,...,Am]
```

then `t[B1,..,Bn] = u[B1,..,Bn]`

.
We will abbreviate `A1, A2, ..., Am`

as `AA`

(or
"A-bar") and `B1, B2, ..., Bn`

as `BB`

(or "B-bar").

*(simple abstract example)*

`Student`

besides `cumGPA -> priority`

`Apply?`

*(show abstract example)*

Note:

- If R can have duplicates, FD <> key
- Book requires "keys" to be minimal (introduces "superkeys")

*(example)*

*(example)*

*(example)*

=> Most of the time we're interested in completely nontrivial FD's.

*Question: Can we also split the left-hand side?*

If AA -> BB then AA -> (BB U AA)

Find all attributes B in R such that A1, A2, ..., Am -> BThis set of attributes is the "closure" and is denoted {A1, A2, ..., Am}+

**Algorithm** for computing closure:

start with {A1, A2, ..., Am} repeat until no change: if current set of attributes includes LHS of a dependency, add RHS attributes to the set(Effectively applies combining and transitive rules until there's no more change.)

*Example:* closure of `{ID, HScode}`

in `Student`

given FD's:

ID -> name, address, cumGPA cumGPA -> priority HScode -> HSname, HScity

*Example:* apply rules above

When specifying FD's for a relation, we would like to specify a

*Question: How can we tell if one FD follows from others?*

*Example:* Does AA -> BB follow from a set S of FD's?

- Designer specifies mega-relations plus
*properties of the data*: functional dependencies - System decomposes based on FDs
- Final set of relations satisfies
*normal form*(BCNF) -- guarantees no anomalies, no lost information

*(diagram)*

Idea of decomposition:

- R1 = PROJECT[A1, ..., Am] (R), eliminating duplicates
- R2 = PROJECT[B1, ..., Bn] (R), eliminating duplicates
- R1 NATURAL-JOIN R2 = R

*Example decomposition:*

Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority)decomposed to

Student1(ID, name, address, HScode, cumGPA, priority) Student2(HScode, HSname, HScity)

*Another example decomposition:*

Student1(ID, name, address, HScode, HSname, HScity) Student2(name, HSname, cumGPA, priority)

**Given:** relation R and set of FD's for R

**Definition:** R is in BCNF with respect to its FD's if for every
nontrivial FD AA -> BB, AA contains a key.

*Question: Why does violating this requirement produce a "bad" relation?*

Student(ID, name, address, HScode, HSname, HScity, cumGPA, priority) FD's: ID -> name, address, cumGPA cumGPA -> priority HScode -> HSname, HScity Key: ID, HScode

*Question: Is the relation in BCNF?*

Each violation produces anomalies.

Apply(ID, campus, date, major)Can apply to campus multiple times for different majors, but can only apply to a campus once per day

FD's: ID, campus, date -> major Key: ID, campus, date

*Question: Is the relation in BCNF?*

Input: relation schema R and set of FD's for R (1) compute keys for R based on FD's (2) repeat until no more BCNF violations: (2a) pick any R' with AA -> BB that violates BCNF (2b) decompose R' into R1(AA,BB) and R2(AA,CC) where CC is all attributes in R' except (AA U BB) (2c) compute FD's for R1 and R2 (2d) compute keys for R1 and R2 based on FD's

*Question: How can we compute keys in steps (1) and (2d)?*

*Question: How do we compute FD's in step (2c)?*

`Student`

example)=> Final decomposed relations may be different depending on which violating FD is chosen in each iteration (step 2(a)), but all decompositions will be in BCNF.

Does BCNF guarantee a good decomposition?

- Removes anomalies? Yes
- Reconstructs original relation?
- Can we get too many tuples?
Consider R(A,B,C) with tuples (1,2,3) and (4,2,5)

Decompose into R1(A,B) with tuples (1,2) and (4,2) R2(B,C) with tuples (2,3) and (2,5)

"Reassembled" relation has four tuples => wrong

But BCNF would not decompose this way unless B -> C or B -> A

- Can we get too few tuples? No

- Can we get too many tuples?

- Set of relations without anomalies
- Can reconstruct original relation
- But some (relatively uncommon) shortcomings to be discussed

Apply(ID, campus, major) // removed date from earlier exampleSuppose students can apply to several campuses and different majors, but must apply to the same set of majors at every campus.

*Question: What are FD's?*

*Question: What is key?*

*Question: Is it in BCNF?*

*Question: Is it a "good" design?*

**MVD Definition:**

AA ->> BB is an MVD for relation R if: For all tuples t,u in R: If t[AA] = u[AA] then there exists a v in R such that: (1) v[AA] = t[AA] (2) v[BB] = t[BB] (3) v[CC] = u[CC] where CC is all attributes in R except (AA U BB)'

MVD's are also called "tuple-generating dependencies."

*Back to example: What are MVD's?*

*(show example data to verify)*

Prove by showing (1), (2), (3) in MVD definition.

*Question: Are there any rules for FD's that do not apply for MVD's?*

**Given:** relation R and set of MVD's for R

**Definition:** R is in 4NF with respect to its MVD's if for
every nontrivial MVD AA ->> BB, AA contains a key.

**Note:** Since every FD is also an MVD, 4NF implies BCNF

*Question: What happens in the MVD definition if AA contains a key?*

Input: relation schema R and set of FD's and MVD's for R (1) compute keys for R based on FD's (2) repeat until no more 4NF violations: (2a) pick any R' with AA ->> BB that violates 4NF (2b) decompose R' into R1(AA,BB) and R2(AA,CC) where CC is all attributes in R' except (AA U BB) (2c) compute FD's and MVD's for R1 and R2 (2d) compute keys for R1 and R2 based on FD's

*Question: How do we compute MVD's in step (2c)?*

*(run algorithm on Apply example)*

Apply(ID, campus, date, major)

*Question: What are FDs and keys?*

`Apply`

as above to remain undecomposed.
=> Material on 3NF (Section 3.6.6) is required reading for the course, not covered in lecture.

Student(ID, HScode, cumGPA, priority) ID -> cumGPA cumGPA -> priority ID -> priorityNote:

`ID`

is Apply BCNF decomposition splitting first on `ID -> cumGPA`

Heuristic: "close" each FD before beginning decomposition

=> Overall, BCNF/4NF decomposition does not guarantee that all of the original FDs can be enforced on the individual decomposed relations.