## CS145 Lecture Notes (3) -- Relational Algebra

Steps in creating and using a (relational) database:
1. Design schema

2. Create schema (using DDL)

=> Now have populated database

4. Repeat: execute queries and updates on the database

### Database Query Languages

• Get all students with GPA > 3.7 who applied to Berkeley and San Diego and nowhere else.
• Get all humanities departments at campuses in southern California with < 500 applicants.
• Get the campus with highest average accept rate over the last five years.

• Some questions are easy to pose, some are not.
• Some questions are easy for DBMS to answer, some are not.
• "Query language" also used to update the database.

### Relational Query Languages

• Formal: relational algebra, relational calculus, Datalog

• Actual: SQL (also visual "query builders")

• In all languages, a query is executed over a set of relations, get a relation as the result.

### Relational Algebra

Example schema:
```   Student(ID, name, address, GPA, sizeHS)     // ID is key
Campus(location, enrollment, rank)          // location is key
Apply(ID, location, date, major, decision)  // (ID,location) is key

```

#### Simplest query: relation name

Query "`Campus`" returns all tuples in Campus relation.

Relational algebra operators used to combine, filter, etc. the relations:

#### `SELECT` operator

(Example: Students with GPA > 3.7)
```

```
(Example: Students with GPA > 3.7 and sizeHS < 1000)
```

```
(Example: Applications to S.C. geology major)
```

```
** General case: `SELECT[C] (R)`

where `C` can include attribute names, constants, comparisons (`=`, `<`, etc.), and connectives (`AND, OR, NOT`)

#### `PROJECT` operator

(Example: ID and date of all applications)
```

```
`PROJECT` eliminates columns while `SELECT` eliminates rows.

To do both, combine ("compose") operators:

(Example: name and address of Students with GPA > 3.7 and sizeHS < 1000)

```

```
`SELECT` produces a relation, `PROJECT` operates on that relation.

** General case: `PROJECT[A1, A2, ..., Am] (E)`

where `E` is any expression producing a relation

Question: Is it ever useful to compose two projection operators?

Example: `PROJECT[enrollment] (PROJECT[location, enrollment] (Campus))`

```

```

Question: Is it ever useful to compose two selection operators?

Example: `SELECT[sizeHS < 1000] (SELECT[GPA > 3.7] (Student))`

```

```

#### Duplicates

Example: `PROJECT[date,decision] (Apply)` can produce many tuples with same value
• Relational algebra semantics says remove duplicates.
• SQL does not - one difference between formal and actual query languages

#### Cartesian Product (cross-product)

`Campus X Apply`

Schema of result is `schema(Campus)` union `schema(Apply)`

```

```
• Naming conflicts: prepend relation name
• One tuple in result for each pair of tuples in Campus, Apply

Formally: R1 `X` R2 = {t | t = <t1,t2> and t1 in R1 and t2 in R2}

Question: Looks odd to glue unrelated tuples together. Why use it?

```

```
(Example: Names and addresses of all students with GPA > 3.7 who applied to CS major and were rejected)
```

```
Question: Can we write it in a different way?
```

```

#### Natural Join

• Enforces equality on all attributes with same name
• Eliminates one copy of duplicate attributes

(Show schema of Campus JOIN Apply)

```

```
(Example: Names and addresses of all students with GPA > 3.7 who applied to CS major and were rejected)
```

```
(Example: Names and addresses of all students with GPA > 3.7 who applied to CS major at campus with enrollment < 15,000 and were rejected)
```

```
** General case:
`E1 JOIN E2 = PROJECT[schema(E1) U schema(E2)] (SELECT[E1.A = E2.A and E1.B = E2.B and ...] (E1 X E2))`
```

```
Need to be careful -- suppose we have:
```   Student(ID, name, address, GPA, sizeHS)
Campus(name, enrollment, rank)
```
"`Student JOIN Campus`" doesn't make sense.

#### Theta Join

`E1 JOIN[C] E2 = SELECT[C] (E1 X E2)`
```

```
• DBMS often implements theta-join as basic operation.
• Use of term "join" in implementation circles usually refers to theta-join or sometimes to cross-product.

#### Union

Example schema:
```   Student(ID, name, address, GPA, sizeHS)
OutofStateStudent(ID, name, address, GPA, sizeHS, state)
```

(Example: List name and address of all students)

Question: Can we do it with cross-products or joins?

```

```
For union operator:
• Schemas must match exactly.
• Duplicates eliminated (as usual).
(Example: List name and address of all students)
```

```

#### Difference

Back to original schema:
```   Student(ID, name, address, GPA, sizeHS)     // ID is key
Campus(location, enrollment, rank)          // location is key
Apply(ID, location, date, major, decision)  // (ID,location) is key
```
(Example: Find ID's of all students who didn't apply anywhere)
```

```
Question: What if want name of students?
```

```
For difference operator:
• Schemas must match exactly, duplicates eliminated

#### Intersection

As expected. Note `E1 INTERSECT E2 = E1 - (E1 - E2)`

#### Renaming

Suppose we have:
```   Student(ID, name, address)
Parent(child-ID, parent-name)
```
(Example: Print list of all student and parent names)
```

```
(Example: Find names of all pairs of students who live at same address)
(Example: no dups)
```

```
** General case: `RENAME[R(A1, A2, ..., Am)] (E) or RENAME[R] (E)`

#### Assigning to Temporaries

• We've been writing relational algebra queries as expressions.
• Can equivalently use relational algebra expression trees (see textbook)
• Can equivalently use sequences of algebraic assignments (called linear notation in textbook)
(Example: previous query using sequence of assignments)
```

```

### Summary of Relational Algebra

Basic:
```   E ::= R
|  SELECT[C] (E)
|  PROJECT[A1, A2, ..., An] (E)
|  E1 X E2
|  E1 U E2
|  E1 - E2
|  RENAME[R(A1, A2, ..., Am)] (E)

```
Abbreviations:
```      |  E1 JOIN E2
|  E1 JOIN[C] E2
|  E1 INTERSECT E2
|  TempName(A1, A2, ..., Ak) := E
|  E1 ; E2
```