CS145 - Spring 2003
Introduction to Databases

Challenge Problems #5 -- Due 11:59pm, Tuesday June 3rd Email to cs145-submit@cs.stanford.edu

NO LATE DAYS will be permitted on this assignment. In order that you receive solutions in time for the final, we cannot accept submissions past the marked due date.
Submission Instructions & Guidelines for Typing Assignments: All challenge problems should be submitted by email to cs145-submit@cs.stanford.edu. Please refer to Challenge Problem 1 for guidelines for typing assignments and accepted formats. Please remember that challenge problems must be submitted electronically, and plan accordingly.
Honor Code reminder: For more detailed discussion of the Stanford Honor Code as it pertains to CS145, please see the Assigned Work page under Honor Code. In summary: You must indicate on your written and programming assignments any assistance (human or otherwise) that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding and being able to explain on your own all material that you submit.
Partners are not permitted. Problems must be completed individually.

Challenge Problems

Statistical authorization
There's a type of database security that was not covered in class or in the textbook called "statistical authorization". With statistical authorization, some users may be permitted to retrieve only aggregate information from the database, e.g., average salaries but not individual salaries. Furthermore, to prevent users from applying aggregates to very small numbers of tuples (such as the average of one salary!), the system requires that a certain minimum number of tuples contribute to each aggregate result. Finally, to prevent the user from using intersection properties to deduce a single value (e.g., the user could ask for X=sum(A1,A2,A3,A4,A5), then ask for Y=sum(A2,A3,A4,A5), then compute X-Y to deduce the value of A1), the system may require that the tuples participating in multiple queries issued by the same user have a small intersection. In this problem you will explore how, even with such security measures, specific information can still be deduced from the database.
Here's the problem. Consider the simple relation Student(ID,GPA), where ID is a key. Suppose that, for security reasons, the following restrictions are made on user U's set of queries against this relation:
- The result of every query must be a single aggregate value - a SQL aggregate function applied to one of the attributes of the relation.
- At least 4 different tuples must be used in the aggregate to produce each query's result.
- For any two queries issued by user U, the sets of tuples used to produce the two query results must have an intersection no larger than 2.
Give a set of queries that satisfies the above restrictions, and from whose answers you can determine the GPA of the student with a given ID I. You may assume that I is in the range 1-10, there are 50 students with IDs in the range 1-50, and attribute GPA is of type float. Write the queries in SQL, then show the computation that produces the GPA for the student with ID=I from the query results. Use as few queries as you can.
Object-relational to pure relational
The goal of this problem is to specify a mapping from object-relational SQL to regular SQL.
(a) Consider object-relational schemas that allow REFs, and consider the corresponding queries with dereferencing. Suggest a method for translating schemas with REFs in them into schemas without REFs. (Yes, you may make assumptions about keys, and yes, this part is very easy.) Then suggest a method for translating queries that use dereferencing over the original schema into queries without dereferencing over the new schema.
(b) Consider object-relational schemas with nested structures, and consider the corresponding queries with extra dots. Suggest a method for translating schemas with nested structures in them into schemas without nested structures. (If it helps you may make assumptions about keys.) Then suggest a method for translating queries that use nested dots over the original schema into queries without nested dots over the new schema. Note that in order to keep the issues separate, you may assume for this part that schemas and queries do not include references/dereferencing.
(c) Are there any subtleties involved in combining your solutions for (a) and (b) into a complete translation scheme?
Temporal relational algebra
Consider a relation Visit(candidate,state,dates), which records visits made by candidates to different states. Visit is a temporal relation, with timestamp attribute dates specifying all intervals of time during which the candidate was visiting the state. You may assume the granularity of time is one day. By definition of temporal relations, (candidate,state) together form a key for the relation. Write an expression in temporal relational algebra that finds all candidates who visited two or more different states on the same day within the last 30 days.
Association rules in SQL
Consider a relation Likes(student,song). This relation contains a variant of the "market basket" data discussed in class - you can think of a student as representing a "transaction" (in the shopping checkout sense, not the database sense), and think of the songs liked by that student as the "items purchased in the transaction." Attributes [student,song] together form a key for this relation.
We want to learn about correlations between songs liked by students. Specifically, we want to mine from this relation all association rules with certain support and confidence thresholds (as defined in class). To keep things simple, you only need to find rules with exactly two items (songs) on the left-hand side.
Specifically, write a SQL query that produces all triples of distinct songs [S1,S2,S3] such that {S1,S2} -> S3 is an association rule with support > 0.02 and confidence > 0.6. If you find it useful to define a sequence of one or more views used by the final query you may do so.