CS145 - Fall 2011
Introduction to Databases
Assignment #7
Due Thursday December 8
|
This assignment consists of four automated quizzes, one set of
automated view-modification exercises, SQL recursion query
exercises using the PostgreSQL system, and a
few challenge problems. It's your grand finale assignment for the
course; enjoy.
|
Automated Quizzes for Basic Material
|
Log into the
db-class portal for
CS145. Click on the
Quizzes
& Exercises link. Complete four quizzes:
- Views Quiz (3
problems)
- Authorization Quiz
(4 problems)
- Recursion Quiz (3
problems)
- OLAP Quiz (2
problems)
You can take each quiz as many
times as you like, with different instances of the problems
delivered each time, but with a 10-minute break enforced between
each try. Although all scores are recorded, we will only use the
highest score you achieve as of the due date. We strongly
suggest that you start early and continue quizzing until you've
achieved a perfect score at least once, and you feel you
understand the material thoroughly. And remember: Automated
quizzes receive no credit if submitted late.
|
Automated View-Modification Exercises
|
Log
into the
db-class portal for
CS145. Click on the
Quizzes
& Exercises link. Complete one exercise set:
- SQL Movie-Ratings View
Modification Exercises (10 problems)
Before beginning the problems, you
may want to glance through the SQLite documentation for the
CREATE
TRIGGER statement, focusing specifically on "
INSTEAD
OF" triggers. Alternatively, you may find it easier to
start with the examples in the "View modifications using
triggers" video demonstration; the script is
here.
You can develop your solutions
using the View-Update
Workbench that is integrated with the website. Or, you
may prefer to install SQLite on your computer, download the data
file, and work offline:
- Here is a quick
guide for installing and using SQLite (and other
systems)
- Here is the movie-rating
data set as a set of commands to create and load the
database, including the view definitions used in these
exercises
The automated system tests your
triggers using SQLite, which is the only DBMS we've been using
that supports a convenient syntax for "INSTEAD OF"
triggers. Note that the automated system tests your solutions
by: (1) creating your trigger; (2) issuing specific modification
commands against the relevant view; (3) running a query over the
base tables after the view modifications to ensure that your
trigger reacted correctly; (4) restoring the original state.
You may do the exercises as many times as you like, with no time
lapse required between tries. Although all scores are recorded,
we will only use the highest score you achieve as of the due
date. And remember: Automated exercises receive no credit if
submitted late.
You will experiment with the recursive SQL WITH statement
as supported by PostgreSQL. Specifically,
you will write a number of queries over a relational database that
encodes a directed acyclic graph with colored nodes and weighted
edges. The schema is:
Node(nID,color) // nID is a key
Edge(n1,n2,weight) // (n1,n2) identify nIDs in table Node and together form a key
As a simple example, the following graph:

would be represented by tuples {(A,'red'),(B,'yellow'),(C,'blue')}
in table Node and tuples {(A,B,4),(B,C,5)} in
table Edge. (The actual data set you'll be experimenting
with is small, but not as small as this one!)
Getting Started
PostgreSQL is the only DBMS we've been using that supports SQL
recursion, so you will need to do these exercises using PostgreSQL.
Unfortunately we weren't able to set up an automated workbench
backed by PostgreSQL as we did for the exercises using SQLite, so
you'll need to either install PostgreSQL on your own computer (see
our quick
guide), or use PostgreSQL (version 8.4.9) installed on the
Corn machines. Regardless, we will be grading your assignment using
the Corn installation, so make sure to test your queries there.
Before beginning the problems, you may want to glance through the
PostgreSQL documentation for WITH queries,
focusing specifically on recursion. Alternatively, you may find it
easier to start with the examples in the "Basic recursive WITH
statements" video demonstration; the script is here.
Once you're ready to start running queries, here are the
step-by-step instructions:
- SSH into machine popcorn01:
ssh
<suid>@popcorn01.stanford.edu
- SSH into the PostgreSQL virtual machine:
ssh popcorn-pgsql1
- Load the sample database for your assignment (don't worry
about warning messages):
psql <
/afs/ir/class/cs145/recursion/create.sql
- To reload the database, repeat the previous step.
- To run PostgreSQL in command-line mode, so you can explore the
database and try out some queries:
psql (use ctrl-d
to exit)
- To run queries from a file:
psql < filename.sql
Queries
You are to write the following five queries using the recursive WITH
statement in PostgreSQL.
- Find all node pairs (n1,n2) that are both red, and
there's a path of length 1 or more from n1 to n2.
Sort by n1, then n2.
- Find all node pairs (n1,n2) that are both red, and
there's a path of length 1 or more from n1 to n2
that passes through exclusively red nodes. Sort by n1,
then n2.
- Find the highest-weight path(s) in the graph, i.e., the path
(or paths) where the sum of the edge weights on the path is as
high as any other path in the graph. Return the start node, end
node, length of path, and total weight.
- Consider all pairs (n1,n2) of nodes that are both
blue, and that have a path of length 1 or more from n1
to n2. Find the lengths of the shortest and longest
paths between n1 and n2. Your result should
have four columns: n1, n2, the length of the
shortest path, and the length of the longest path. Sort by n1,
then n2.
- Find all pairs (n1,n2) such that n1 is
yellow, and there is a path of length 1 or more from n1
to n2 that alternates yellow and blue nodes. Sort by n1,
then n2.
Your queries should be designed to work over any acyclic graph
encoded in the schema. We will be testing your solutions over a
different database from the one we are providing. As one
sanity-check (but not a guarantee of correctness!), you should get
the following answers over the provided database:
- (A,D),(A,G),(A,J),(D,G),(D,J)
- (A,D),(A,J),(D,J)
- (A,L,7,19)
- (C,I,2,3),(C,L,3,5),(F,I,2,3),(F,L,3,5),(I,L,1,2)
- (E,I),(E,K),(E,L),(H,I),(H,K),(H,L),(K,L)
Problem 1
There's a type of database authorization that was not covered in the
video lectures called "statistical authorization". With statistical
authorization, some users may be permitted to retrieve only
aggregate information from the database, e.g., average GPAs but not
individual GPAs. Furthermore, to prevent users from applying
aggregates to very small numbers of tuples (such as the average of
one GPA!), the system requires that a certain minimum number of
tuples contribute to each aggregate result. Finally, to prevent the
user from using intersection properties to deduce a single value
(e.g., the user could ask for X=sum(A1,A2,A3,A4,A5),
then ask for Y=sum(A2,A3,A4,A5), then compute X-Y
to deduce the value of A1), the system may require
that the tuples participating in multiple queries issued by the same
user have a small intersection. In this problem you will explore
how, even with such measures, specific information can still be
deduced from the database.
Here's the problem. Consider a simple table Student(studentID,GPA)
where studentID is a key. Suppose that the following
restrictions are made on any user U's set of queries
against this table:
- The result of every query must be a single aggregate value --
a SQL aggregate function applied to one of the attributes of the
table.
- At least 4 different tuples must be used in the aggregate to
produce each query's result.
- For any two queries issued by user U, the sets of
tuples used to produce the two query results must have an
intersection of size no larger than 2.
Give a set of queries that satisfies the above restrictions, and
from whose answers you can determine the GPA of the
student with a given studentID = X. Assume
that X is in the range 1-10, there are 50 students with
IDs in the range 1-50, and attribute GPA is of type
float. Write the queries in SQL, then show the
computation that produces the GPA for the student with ID
= X from the query results. Use as few queries as you can.
Problem 2
Consider the following table containing enrollment information for
students in courses:
Took(studentID, year, dept, course, units)
Suppose the following two queries are asked frequently over this
data:
Q1: // Total units grouped by year and department
Select year, dept, Sum(units)
From Took
Group By year, dept
Q2: // Total units grouped by department and course
Select dept, course, Sum(units)
From Took
Group By dept, course
(a) Assume table Took is extremely large.
Specify a view V over Took, chosen so
that if V is stored as a materialized view, then V
can be used to substantially speed up the execution of both of the
above queries. (Please do not use "Cube" or "Rollup"
in defining view V; see parts (c) and (d) below.)
(b) Rewrite queries Q1 and Q2
into equivalent queries that use your materialized view V
from part (a) instead of table Took.
(c) Now suppose the following materalized view C
has been created.
C: Select year, dept, course, Sum(units) as s
From Took
Group By Cube(year, dept, course)
Rewrite queries Q1 and Q2 into
equivalent queries that use view C instead of table Took.
Your queries should be as simple as possible.
(d) Is it possible to create a materialized view using Rollup
instead of Cube that can be used for both of the above
queries? (Consider only views with one Rollup; as we have seen we can always emulate Cube with the union of several Rollups.) If it is not possible, briefly explain why not. If it is possible, show the view
definition and how the queries are rewritten to use the view.
Problem 3
Consider a fact table F(D1,D2,...,Dn,x) where D1...Dn
are dimension attributes and x is the dependent
attribute. Suppose F has all possible "backward"
functional dependencies among dimension attributes: Di → D(i-1) for i=2..n.
Consider a query Q over F that ends in "GROUP
BY D1,D2,...,Dn WITH CUBE", and another
query Q' over F that's exactly the same
as Q except it specifies "WITH ROLLUP"
instead of "WITH CUBE". Does Q
provide more information than Q', less information, or
the same amount of information? Justify your answer with a specific
example and/or short explanation.
SQL Recursion Exercises
For each of the five queries in the SQL Recursion exercises,
create a file named queryN.sql containing your
solution for query N, and nothing else. (Your files
should be named query1.sql, query2.sql,
..., query5.sql. If you didn't solve all of the
queries, just omit the files for the unfinished ones.) You only
need to submit the query files themselves -- as part of the
grading process we will be running your queries against a test
database. Important: Please make sure each of your
submitted query files runs correctly when executed using "psql
< filename.sql" as specified in the "Getting Started"
instructions above. If a file does not run properly, no credit
will be given.
To submit your work, log into one of the Stanford Unix machines
and make sure you are in the directory containing your query
files, properly named as above. From this directory execute the
script:
/afs/ir/class/cs145/bin/submit-recursion
and follow the instructions. As with other electronic submissions in
CS145, you may resubmit as many times as you like. However, only the
latest submission and timestamp are saved, and those are what we
will use for grading your work and determining late penalties. Submissions
via email will not be accepted.
Challenge
Problems
Solutions should be submitted in pdf
format. For students who prefer hand-writing their solutions or
using some other format, solutions may be scanned and submitted as jpg.
Solutions must be submitted through Coursework, and must be
in a single file named challenge7.pdf (or challenge7.jpg).
Login to Coursework.
Under Assignments you will
see Challenge Problems #7 (for Assignment #7 challenge
problems -- don't worry, there were no Challenge Problems #5 or Challenge Problems #6). Upload your solution file
and submit for grading. You may resubmit as many times as you like,
however only the latest file and the latest timestamp are saved, and
those are what we will use for grading your work and determining
late penalties. Submissions via email will not be accepted.
|
Reminders: Late Policy and Honor Code
|
- Your score for each automated quiz and exercise set is the
highest score achieved as of 11:59pm on the due date. No
credit is given for automated quizzes or exercises submitted
after the due date.
- The SQL Recursion Exercises and the Challenge Problems are
submitted separately as specified above. They are due at
11:59pm on the due date. Submissions after the deadline but
less than 24 hours late will be accepted but penalized 10%,
and submissions more than 24 hours but less than 48 hours late
will be penalized 30%. No submissions will be accepted more
than 48 hours late.
- Since emergencies do arise, each student is allowed a total
of four unpenalized late days (four periods up to 24 hours
each) for non-automated work, but no single assignment may be
more than two days late. Important: The SQL Recursion
Exercises and the Challenge Problems are considered separate
assignments when calculating late days used. For
example, if you turn both of them in two days late, that
requires using all of your unpenalized late days for the
entire course.
- For detailed discussion of the Stanford Honor Code as it
pertains to CS145, please see the Assigned
Work page under Honor Code. In summary: You must
indicate on all submitted work any assistance (human
or otherwise) that you received. Any assistance received that
is not given proper citation will be considered a violation of
the Honor Code. In any event, you are responsible for
understanding and being able to explain on your own all
material that you submit.