CS145 Lecture Notes (3) -- Relational Algebra

Steps in creating and using a (relational) database:

Design schema
Create schema (using DDL)
"Bulk load" initial data
=> Now have populated database
Repeat: execute queries and updates on the database

Database Query Languages

Given a database, ask questions, get data as answers

Get all students with GPA > 3.7 who applied to Berkeley and San Diego and nowhere else.
Get all humanities departments at campuses in southern California with < 500 applicants.
Get the campus with highest average accept rate over the last five years.
Some questions are easy to pose, some are not.
Some questions are easy for DBMS to answer, some are not.
"Query language" also used to update the database.

Relational Query Languages

Formal: relational algebra, relational calculus, Datalog
Actual: SQL (also visual "query builders")
In all languages, a query is executed over a set of relations, get a relation as the result.

Relational Algebra

Example schema:

   Student(ID, name, address, GPA, sizeHS)     // ID is key
   Campus(location, enrollment, rank)          // location is key
   Apply(ID, location, date, major, decision)  // (ID,location) is key

Simplest query: relation name

Query "Campus" returns all tuples in Campus relation.

Relational algebra operators used to combine, filter, etc. the relations:

`SELECT` operator

(Example: Students with GPA > 3.7)

(Example: Students with GPA > 3.7 and sizeHS < 1000)

(Example: Applications to S.C. geology major)

** General case: SELECT[C] (R)

where C can include attribute names, constants, comparisons (=, <, etc.), and connectives (AND, OR, NOT)

`PROJECT` operator

(Example: ID and date of all applications)

PROJECT eliminates columns while SELECT eliminates rows.

To do both, combine ("compose") operators:

(Example: name and address of Students with GPA > 3.7 and sizeHS < 1000)

SELECT produces a relation, PROJECT operates on that relation.

** General case: PROJECT[A1, A2, ..., Am] (E)

where E is any expression producing a relation

Question: Is it ever useful to compose two projection operators?

Example: PROJECT[enrollment] (PROJECT[location, enrollment] (Campus))

Question: Is it ever useful to compose two selection operators?

Example: SELECT[sizeHS < 1000] (SELECT[GPA > 3.7] (Student))

Duplicates

Example: PROJECT[date,decision] (Apply) can produce many tuples with same value

Relational algebra semantics says remove duplicates.
SQL does not - one difference between formal and actual query languages

Cartesian Product (cross-product)

Campus X Apply

Schema of result is schema(Campus) union schema(Apply)

Naming conflicts: prepend relation name

One tuple in result for each pair of tuples in Campus, Apply

Formally: R1 X R2 = {t | t = <t1,t2> and t1 in R1 and t2 in R2}

Question: Looks odd to glue unrelated tuples together. Why use it?

(Example: Names and addresses of all students with GPA > 3.7 who applied to CS major and were rejected)

Question: Can we write it in a different way?

Natural Join

Enforces equality on all attributes with same name

Eliminates one copy of duplicate attributes

(Show schema of Campus JOIN Apply)

(Example: Names and addresses of all students with GPA > 3.7 who applied to CS major and were rejected)

(Example: Names and addresses of all students with GPA > 3.7 who applied to CS major at campus with enrollment < 15,000 and were rejected)

** General case:
E1 JOIN E2 = PROJECT[schema(E1) U schema(E2)] (SELECT[E1.A = E2.A and E1.B = E2.B and ...] (E1 X E2))

Need to be careful -- suppose we have:

   Student(ID, name, address, GPA, sizeHS)
   Campus(name, enrollment, rank)

"Student JOIN Campus" doesn't make sense.

Theta Join

E1 JOIN[C] E2 = SELECT[C] (E1 X E2)

DBMS often implements theta-join as basic operation.

Use of term "join" in implementation circles usually refers to theta-join or sometimes to cross-product.

Union

Example schema:

   Student(ID, name, address, GPA, sizeHS)
   OutofStateStudent(ID, name, address, GPA, sizeHS, state)
   ForeignStudent(ID, name, address, GPA, country)

(Example: List name and address of all students)

Question: Can we do it with cross-products or joins?

For union operator:

Schemas must match exactly.
Duplicates eliminated (as usual).

(Example: List name and address of all students)

Difference

Back to original schema:

   Student(ID, name, address, GPA, sizeHS)     // ID is key
   Campus(location, enrollment, rank)          // location is key
   Apply(ID, location, date, major, decision)  // (ID,location) is key

(Example: Find ID's of all students who didn't apply anywhere)

Question: What if want name of students?

For difference operator:

Schemas must match exactly, duplicates eliminated

Intersection

As expected. Note E1 INTERSECT E2 = E1 - (E1 - E2)

Renaming

Suppose we have:

   Student(ID, name, address)
   Parent(child-ID, parent-name)

(Example: Print list of all student and parent names)

(Example: Find names of all pairs of students who live at same address)
(Example: no dups)

** General case: RENAME[R(A1, A2, ..., Am)] (E) or RENAME[R] (E)

Assigning to Temporaries

We've been writing relational algebra queries as expressions.
Can equivalently use relational algebra expression trees (see textbook)
Can equivalently use sequences of algebraic assignments (called linear notation in textbook)

(Example: previous query using sequence of assignments)

Summary of Relational Algebra

Basic:

   E ::= R
      |  SELECT[C] (E)
      |  PROJECT[A1, A2, ..., An] (E)
      |  E1 X E2
      |  E1 U E2
      |  E1 - E2
      |  RENAME[R(A1, A2, ..., Am)] (E)

Abbreviations:

      |  E1 JOIN E2
      |  E1 JOIN[C] E2
      |  E1 INTERSECT E2
      |  TempName(A1, A2, ..., Ak) := E
      |  E1 ; E2