CS145 - Fall 2011
Introduction to Databases
Assignment #7
Due Thursday December 8

This assignment consists of four automated quizzes, one set of automated view-modification exercises, SQL recursion query exercises using the PostgreSQL system, and a few challenge problems. It's your grand finale assignment for the course; enjoy.

Automated Quizzes for Basic Material

Log into the db-class portal for CS145. Click on the Quizzes & Exercises link. Complete four quizzes:
You can take each quiz as many times as you like, with different instances of the problems delivered each time, but with a 10-minute break enforced between each try. Although all scores are recorded, we will only use the highest score you achieve as of the due date. We strongly suggest that you start early and continue quizzing until you've achieved a perfect score at least once, and you feel you understand the material thoroughly. And remember: Automated quizzes receive no credit if submitted late.

Automated View-Modification Exercises

Log into the db-class portal for CS145. Click on the Quizzes & Exercises link. Complete one exercise set:
Before beginning the problems, you may want to glance through the SQLite documentation for the CREATE TRIGGER statement, focusing specifically on "INSTEAD OF" triggers. Alternatively, you may find it easier to start with the examples in the "View modifications using triggers" video demonstration; the script is here.

You can develop your solutions using the View-Update Workbench that is integrated with the website. Or, you may prefer to install SQLite on your computer, download the data file, and work offline:
The automated system tests your triggers using SQLite, which is the only DBMS we've been using that supports a convenient syntax for "INSTEAD OF" triggers. Note that the automated system tests your solutions by: (1) creating your trigger; (2) issuing specific modification commands against the relevant view; (3) running a query over the base tables after the view modifications to ensure that your trigger reacted correctly; (4) restoring the original state.

You may do the exercises as many times as you like, with no time lapse required between tries. Although all scores are recorded, we will only use the highest score you achieve as of the due date. And remember: Automated exercises receive no credit if submitted late.

SQL Recursion Exercises

You will experiment with the recursive SQL WITH statement as supported by PostgreSQL. Specifically, you will write a number of queries over a relational database that encodes a directed acyclic graph with colored nodes and weighted edges. The schema is:
   Node(nID,color)    // nID is a key
   Edge(n1,n2,weight) // (n1,n2) identify nIDs in table Node and together form a key
As a simple example, the following graph:
            
would be represented by tuples {(A,'red'),(B,'yellow'),(C,'blue')} in table Node and tuples {(A,B,4),(B,C,5)} in table Edge. (The actual data set you'll be experimenting with is small, but not as small as this one!)

Getting Started

PostgreSQL is the only DBMS we've been using that supports SQL recursion, so you will need to do these exercises using PostgreSQL. Unfortunately we weren't able to set up an automated workbench backed by PostgreSQL as we did for the exercises using SQLite, so you'll need to either install PostgreSQL on your own computer (see our quick guide), or use PostgreSQL (version 8.4.9) installed on the Corn machines. Regardless, we will be grading your assignment using the Corn installation, so make sure to test your queries there.

Before beginning the problems, you may want to glance through the PostgreSQL documentation for WITH queries, focusing specifically on recursion. Alternatively, you may find it easier to start with the examples in the "Basic recursive WITH statements" video demonstration; the script is here.

Once you're ready to start running queries, here are the step-by-step instructions:
  1. SSH into machine popcorn01:
          ssh <suid>@popcorn01.stanford.edu
  2. SSH into the PostgreSQL virtual machine:
          ssh popcorn-pgsql1
  3. Load the sample database for your assignment (don't worry about warning messages):
          psql < /afs/ir/class/cs145/recursion/create.sql
  4. To reload the database, repeat the previous step.
  5. To run PostgreSQL in command-line mode, so you can explore the database and try out some queries:
          psql   (use ctrl-d to exit)
  6. To run queries from a file:
          psql < filename.sql
Queries

You are to write the following five queries using the recursive WITH statement in PostgreSQL.
  1. Find all node pairs (n1,n2) that are both red, and there's a path of length 1 or more from n1 to n2. Sort by n1, then n2.
  2. Find all node pairs (n1,n2) that are both red, and there's a path of length 1 or more from n1 to n2 that passes through exclusively red nodes. Sort by n1, then n2.
  3. Find the highest-weight path(s) in the graph, i.e., the path (or paths) where the sum of the edge weights on the path is as high as any other path in the graph. Return the start node, end node, length of path, and total weight.
  4. Consider all pairs (n1,n2) of nodes that are both blue, and that have a path of length 1 or more from n1 to n2. Find the lengths of the shortest and longest paths between n1 and n2. Your result should have four columns: n1, n2, the length of the shortest path, and the length of the longest path. Sort by n1, then n2.
  5. Find all pairs (n1,n2) such that n1 is yellow, and there is a path of length 1 or more from n1 to n2 that alternates yellow and blue nodes. Sort by n1, then n2.
Your queries should be designed to work over any acyclic graph encoded in the schema. We will be testing your solutions over a different database from the one we are providing. As one sanity-check (but not a guarantee of correctness!), you should get the following answers over the provided database:
  1. (A,D),(A,G),(A,J),(D,G),(D,J)
  2. (A,D),(A,J),(D,J)
  3. (A,L,7,19)
  4. (C,I,2,3),(C,L,3,5),(F,I,2,3),(F,L,3,5),(I,L,1,2)
  5. (E,I),(E,K),(E,L),(H,I),(H,K),(H,L),(K,L)

Challenge Problems

Problem 1

There's a type of database authorization that was not covered in the video lectures called "statistical authorization". With statistical authorization, some users may be permitted to retrieve only aggregate information from the database, e.g., average GPAs but not individual GPAs. Furthermore, to prevent users from applying aggregates to very small numbers of tuples (such as the average of one GPA!), the system requires that a certain minimum number of tuples contribute to each aggregate result. Finally, to prevent the user from using intersection properties to deduce a single value (e.g., the user could ask for X=sum(A1,A2,A3,A4,A5), then ask for Y=sum(A2,A3,A4,A5), then compute X-Y to deduce the value of A1), the system may require that the tuples participating in multiple queries issued by the same user have a small intersection. In this problem you will explore how, even with such measures, specific information can still be deduced from the database.

Here's the problem. Consider a simple table Student(studentID,GPA) where studentID is a key. Suppose that the following restrictions are made on any user U's set of queries against this table:

Give a set of queries that satisfies the above restrictions, and from whose answers you can determine the GPA of the student with a given studentID = X. Assume that X is in the range 1-10, there are 50 students with IDs in the range 1-50, and attribute GPA is of type float. Write the queries in SQL, then show the computation that produces the GPA for the student with ID = X from the query results. Use as few queries as you can.

Problem 2

Consider the following table containing enrollment information for students in courses:
  Took(studentID, year, dept, course, units)
Suppose the following two queries are asked frequently over this data:
   Q1: // Total units grouped by year and department
       Select year, dept, Sum(units)
       From Took
       Group By year, dept

   Q2: // Total units grouped by department and course
       Select dept, course, Sum(units)
       From Took
       Group By dept, course

(a) Assume table Took is extremely large. Specify a view V over Took, chosen so that if V is stored as a materialized view, then V can be used to substantially speed up the execution of both of the above queries. (Please do not use "Cube" or "Rollup" in defining view V; see parts (c) and (d) below.)

(b) Rewrite queries Q1 and Q2 into equivalent queries that use your materialized view V from part (a) instead of table Took.

(c) Now suppose the following materalized view C has been created.

   C: Select year, dept, course, Sum(units) as s
      From Took
      Group By Cube(year, dept, course)
Rewrite queries Q1 and Q2 into equivalent queries that use view C instead of table Took. Your queries should be as simple as possible.

(d) Is it possible to create a materialized view using Rollup instead of Cube that can be used for both of the above queries? (Consider only views with one Rollup; as we have seen we can always emulate Cube with the union of several Rollups.) If it is not possible, briefly explain why not. If it is possible, show the view definition and how the queries are rewritten to use the view.

Problem 3

Consider a fact table F(D1,D2,...,Dn,x) where D1...Dn are dimension attributes and x is the dependent attribute. Suppose F has all possible "backward" functional dependencies among dimension attributes: Di D(i-1) for i=2..n. Consider a query Q over F that ends in "GROUP BY D1,D2,...,Dn WITH CUBE", and another query Q' over F that's exactly the same as Q except it specifies "WITH ROLLUP" instead of "WITH CUBE". Does Q provide more information than Q', less information, or the same amount of information? Justify your answer with a specific example and/or short explanation.

Submission Instructions

  SQL Recursion Exercises

For each of the five queries in the SQL Recursion exercises, create a file named queryN.sql containing your solution for query N, and nothing else. (Your files should be named query1.sql, query2.sql, ..., query5.sql. If you didn't solve all of the queries, just omit the files for the unfinished ones.) You only need to submit the query files themselves -- as part of the grading process we will be running your queries against a test database. Important: Please make sure each of your submitted query files runs correctly when executed using "psql < filename.sql" as specified in the "Getting Started" instructions above. If a file does not run properly, no credit will be given.

To submit your work, log into one of the Stanford Unix machines and make sure you are in the directory containing your query files, properly named as above. From this directory execute the script:

   /afs/ir/class/cs145/bin/submit-recursion
and follow the instructions. As with other electronic submissions in CS145, you may resubmit as many times as you like. However, only the latest submission and timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.

Challenge Problems

Solutions should be submitted in pdf format. For students who prefer hand-writing their solutions or using some other format, solutions may be scanned and submitted as jpg. Solutions must be submitted through Coursework, and must be in a single file named challenge7.pdf (or challenge7.jpg). Login to Coursework. Under Assignments you will see Challenge Problems #7 (for Assignment #7 challenge problems -- don't worry, there were no Challenge Problems #5 or Challenge Problems #6). Upload your solution file and submit for grading. You may resubmit as many times as you like, however only the latest file and the latest timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.

Reminders: Late Policy and Honor Code