CS145 Assignment 3

CS145 - Introduction to Databases
Spring 2001, Prof. Widom

Assignment #3
Due Wednesday May 2

LOGISTICS

Lots of work: Some of the written exercises on this assignment are even more challenging than last time. In addition, the project part this week can be fairly time-consuming. In total this may be your most demanding assignment of the course, and you should plan accordingly. Note that we've also included several particularly challenging optional problems.
Relationship to lectures: Much of the material in this assignment was covered in the April 23 and 25 lectures, however some of it will not be covered until the April 30 lecture.
Scope of assignment: There are numerous exercises in the book covering basic relational algebra (Sections 4.1 and 4.6.1-4.6.6) and basic SQL (Sections 5.1-5.7 and 5.9). We are not assigning any of these problems explicitly, but we strongly suggest that you work a good number of them in order to "warm up" for the problems included in this assignment. Also note that the problems included in this assignment are not comprehensive, i.e., they do not cover every bit of material you are expected to know in relational algebra and SQL.
Turning in written exercises: The procedure for turning in the written exercises on this assignment is the same as for the first two assignments, with the same deadline and late policy.
Turning in PDA: For this assignment and all subsequent ones with projects parts, your PDA work (and only your PDA work) will be turned in electronically. Details for electronic submission are provided in submit-README.txt. Remember that a completely separate deadline and late policy applies for the programming part of this and subsequent assignments:
For both on-campus and SITN students: Programming work must be submitted electronically by midnight on the Wednesday that it is due. Programming work submitted after the deadline but less than 24 hours late (i.e., by Thursday midnight) will be accepted but penalized 10%, and programming work submitted more than 24 hours but less than 48 hours late (i.e., by Friday midnight) will be penalized 30%. No programming work will be accepted more than 48 hours late. One emergency chit is will be deducted automatically for programming work turned in up to 24 hours late, and two chits for programming work turned in up to 48 hours late. These chits are in addition to any chits deducted for late written work. For further details please see the official course late policy.
Oracle passwords: On May 2 (the due date for this assignment) all Oracle accounts for which the password has not been changed will be deactivated automatically by ITSS for security reasons. Note that it may take several days to reactivate an account.
Oracle (lack of) backups: Oracle is not backed up, so anything you need long-term should be saved in the leland file system. Some suggestions for general management of your PDA are given as part of Problem 7 below.
SQL on Oracle: If you decide to test your solutions to the SQL written exercises using Oracle, or experiment with running some SQL queries on your PDA, note that Oracle's implementation of SQL differs somewhat from the SQL standard as covered in the textbook and class. These differences are outlined in the document Oracle 8i SQL.

THE ASSIGNMENT

Problem 1

Consider the following relational database schema:

  Student(ID, name, dept, status)  // status = "grad" or "undergrad"
                                   // ID is a key
  RA(ID, advisor, dept)            // (ID,advisor) together are a key
  TA(ID, course, dept)             // (ID,course) together are a key

and the following query: Find the names of all graduate students who are neither an RA nor a TA.

(a) Write this query in relational algebra.

(b) Write this query in SQL.

Now consider the following query: Find the names of all graduate students who are an RA or a TA in a department other than their own.

(c) Write this query in relational algebra.

(d) Write this query in SQL.

Problem 2

Consider the following relation:

  Emp(ID, salary)  // ID is a key, salary is a key

and the following two queries:

Find the highest employee salary.
Find the top third of salaries. You may assume that the number of employees (and therefore salaries) is divisible by three.

(a) Express Query 1 in relational algebra.
Hint: Join Emp with itself to find salaries that aren't the highest, then use the difference operator.

(b) Express Query 1 in SQL using the max aggregation operator.

(c) Express Query 1 in SQL without using the max aggregation operator.

(d) Express Query 2 in relational algebra. If needed, you may use the following programming constructs in addition to relational algebra:

Statements of the form "R := relational algebra expression", where R can be an existing relation that gets replaced or a new one that gets created
Loops of the form "WHILE not-empty(R) DO { sequence of statements }"
Sequences of statements and/or loops as above

(e) (optional) Express Query 2 in SQL. Do not use the programming constructs from part (d).

Problem 3

Consider a relation Flight(city1,city2,cost,airline) containing nonstop flights from one city to another. Note that the flights from city1 to city2 are independent of the flights from city2 to city1. For example:

city1 city2 cost airline
SF Denver 300 Frontier
SF Denver 350 United
Denver SF 250 United
Denver SF 250 Frontier
Denver Chicago 250 American
Chicago NY 250 Delta
Denver NY 500 American
Denver NY 400 TWA
SF NY 750 United

city1	city2	cost	airline
`SF`	`Denver`	`300`	`Frontier`
`SF`	`Denver`	`350`	`United`
`Denver`	`SF`	`250`	`United`
`Denver`	`SF`	`250`	`Frontier`
`Denver`	`Chicago`	`250`	`American`
`Chicago`	`NY`	`250`	`Delta`
`Denver`	`NY`	`500`	`American`
`Denver`	`NY`	`400`	`TWA`
`SF`	`NY`	`750`	`United`

(a) Give a single SQL query that returns the cost of the cheapest nonstop flight between each pair of cities. For example, the result over the above relation instance should be:

city1 city2 cost
SF Denver 300
Denver SF 250
Denver Chicago 250
Chicago NY 250
Denver NY 400
SF NY 750

city1	city2	cost
`SF`	`Denver`	`300`
`Denver`	`SF`	`250`
`Denver`	`Chicago`	`250`
`Chicago`	`NY`	`250`
`Denver`	`NY`	`400`
`SF`	`NY`	`750`

(b) Give a single SQL query that returns the cheapest cost of flying between each pair of cities assuming we are willing to stop up to two times en-route. For example, by stopping once (in Denver), we can get from SF to NY for $700 instead of $750. With this example data, we could stop twice (in Denver and Chicago), but that would be more expensive ($300+$250+$250 = $800).

(c) Is it possible to write a single SQL query that returns the cheapest cost of flying between each pair of cities regardless of the number of stops? If so, give the query. If not, briefly explain why.

Problem 4

Consider the following relation:

  Advised(Advisor, Student, Year)

A tuple (A,S,Y) in Advised specifies that advisor A advised student S who graduated in year Y. Assume that Student is a key for this relation.

(a) Consider the following SQL query, which finds all advisors who advised a student who graduated in the same year that Hector Garcia-Molina (HGM) or Jennifer Widom (JW) graduated.

  SELECT Advisor
  FROM Advised
  WHERE Year IN
    (SELECT Year FROM Advised
     WHERE Student = "HGM" OR Student = "JW")

Try to write an equivalent SQL query that does not use any subqueries. If there are any circumstances in which your "equivalent" query can produce a different answer from the query above, please explain.

(b) (optional) Write a SQL query that finds the advisor(s) with the longest advising span, i.e., with the longest period from their earliest advisee to their latest advisee.

Problem 5

Consider a relation R(A,B) that may contain duplicate tuples, for example:

A B
1 2
1 2
3 4
5 6
5 6
5 6

A	B
1	2
1	2
3	4
5	6
5	6
5	6

(a) Suppose we are interested in selecting all but one copy of each tuple appearing in R. For example, on the above data the query result would be:

A B
1 2
5 6
5 6

A	B
1	2
5	6
5	6

Can you write this query in SQL? If so, give the query. If not, briefly explain why.

(b) Suppose instead we are interested in actually deleting one copy of each tuple in R. That is, we want to write a SQL DELETE statement such that after the statement executes, table R has the contents:

A B
1 2
5 6
5 6

A	B
1	2
5	6
5	6

Can you write such a DELETE statement in SQL? If so, give the statement. If not, briefly explain why.

Problem 6

(optional) Consider a relation R(A,B,C). Write a relational algebra expression that always returns empty if and only if the multivalued dependency A->>B holds on R.

Problem 7 (Personal Database Application, Part 3)

In this part of the project, you will create a relational schema for your PDA in the Oracle database system, and you will populate the tables in your database with initial data sets.

(a) Familiarize yourself with the Oracle relational DBMS by reading the document Getting Started With Oracle (make sure to read the most recent version), logging into Oracle, changing your password, trying some of the examples in the document, and experimenting with the help command. You don't need to turn anything in for this part.

(b) Create relations for your PDA based on your final relational schema from PDA Part 2. In addition to creating the appropriate attributes and types, please declare keys for your relations; see Creating a Table With a Primary Key from Getting Started With Oracle. Many of the attribute types supported by Oracle are listed on page 286 of the textbook. If you have an attribute that represents a date and/or time, you may want to look at our help page on Oracle Dates and Times.

Turn in a script showing an Oracle session in which your relations are created successfully. Also show, for each relation, the result of the sqlplus DESCRIBE command once the relation has been created: for a relation T, type "DESCRIBE T;". Please see Creating scripts and what to turn in below for details.

(c) For each relation in your PDA, create a file containing a few (approximately 5-10) records of "realistic" data. Then use the Oracle bulk-loading facilities to insert those records as tuples into your relations. Refer to the document Using the Oracle Bulk Loader for file format and how to load records into Oracle. (Here too make sure to look at the most recent version.)

Turn in a listing showing the contents of the files you created, the successful loading of the data into Oracle, and the execution of "SELECT *" commands to show the contents of each relation. (Again, information is available in Getting Started With Oracle.)

(d) Write a program in any programming language you like that creates large files of records for each of your PDA relations. If you have available real data for your PDA, then your program will need to transform the data into files of records conforming to your PDA schema and to Oracle's load format. The rest of you will need to write a program to fabricate data: your program will generate either random or nonrandom (e.g., sequential) records conforming to your schema. Note that it is both fine and expected for your data values - strings especially - to be meaningless gibberish. The point of generating large amounts of data is so that you can experiment with a database of realistic size, rather than the small "toy" databases often used in classes. The data you generate and load should be on the order of:

At least two relations with thousands of tuples
At least one relation with hundreds of tuples

If your application naturally includes relations that are expected to be relatively small (e.g., schools within a university), then it is fine to use some small relations, but please ensure that you have relations of the sizes prescribed above as well. When writing a program to fabricate data, there are two important points to keep in mind:

Make sure not to generate duplicate values for key attributes.
Your PDA almost certainly includes relations that are expected to join with each other. For example, you may have a Student relation with attribute courseNum that's expected to join with attribute number in relation Course. When generating data, be sure to generate values that actually do join - otherwise all of your interesting queries will have empty results! There are a couple of ways to properly generate joining values. One way is to generate records for multiple relations (e.g., Course and Student) at the same time. Another way is to generate the records for one relation first, and then use the joining values for the other relation. For example, you could generate records for relation Course first, then use the Course.number values when creating values for Student.courseNum.

Turn in your program code for generating or transforming data, a small sample of the records generated for each relation (5 or so records per relation), and a script showing the successful loading of your data into Oracle.

Creating scripts and what to turn in

Please see Recording Your Session from the Getting Started With Oracle document for a guide to preparing output to be submitted for this and subsequent project parts.

Components (b), (c), and (d) of this project part each tell you what should be recorded in the script that you turn in. In this and all subsequent project parts, the material you turn in should be clearly formatted and delineated, and should include comments for any aspects that are not crystal clear. Poorly assembled or documented material will not receive full credit, even if it is correct. You also will not receive full credit if you turn in your entire large data files (or large query results in later assignments) when we ask for small samples. Other than comments, truncation, and simple formatting, it is an Honor Code violation to edit scripts before turning them in. Please see http://www.stanford.edu/class/cs145/submit-README.txt for details on electronic submission.

Maintaining your databases

You will be using both your small (part c) and large (part d) databases for the rest of the course. The idea is to use the small database to experiment on meaningful-looking data, and the large one to experiment on data of more realistic size. We suggest that you keep the two databases going in parallel by creating two sets of files and two sets of relations with slightly different names.

For the duration of the project, we suggest that you establish some kind of routine that includes reloading your database from the files created in this project part each time you want to get a "fresh" start with Oracle. Remember to delete the contents of each relation (or destroy and recreate the relations) before reloading. Otherwise, unless there is a declared key (or you take APPEND out of your control file), Oracle will happily append the new data to your old relation, causing your relation size to double, triple, quadruple, etc. To get rid of a table called T, issue the command:

  drop table T;

If you want to get rid of all tuples in T without deleting the table itself, issue the command:

  delete from T;

CS145 - Introduction to DatabasesSpring 2001, Prof. Widom

Assignment #3 Due Wednesday May 2