CS145 - Spring 2004
Introduction to Databases
Challenge Problems #5
Due Tuesday June 5

The Problems

  1. Statistical authorization

    There's a type of database authorization that was not covered in class or in the textbook called "statistical authorization". With statistical authorization, some users may be permitted to retrieve only aggregate information from the database, e.g., average salaries but not individual salaries. Furthermore, to prevent users from applying aggregates to very small numbers of tuples (such as the average of one salary!), the system requires that a certain minimum number of tuples contribute to each aggregate result. Finally, to prevent the user from using intersection properties to deduce a single value (e.g., the user could ask for X=sum(A1,A2,A3,A4,A5), then ask for Y=sum(A2,A3,A4,A5), then compute X-Y to deduce the value of A1), the system may require that the tuples participating in multiple queries issued by the same user have a small intersection. In this problem you will explore how, even with such measures, specific information can still be deduced from the database.

    Here's the problem. Consider the simple relation Student(ID,GPA), where ID is a key. Suppose that the following restrictions are made on user U's set of queries against this relation:

    Give a set of queries that satisfies the above restrictions, and from whose answers you can determine the GPA of the student with a given ID X. You may assume that X is in the range 1-10, there are 50 students with IDs in the range 1-50, and attribute GPA is of type float. Write the queries in SQL, then show the computation that produces the GPA for the student with ID=X from the query results. Use as few queries as you can.

  2. Data cube queries

    Consider a fact table with four dimension attributes and one dependent attribute:

       Fact(d1,d2,d3,d4,v)
    
    Suppose we create a materialized view that "cubes" this table on three of its dimension attributes:
       Create Materialized View MV as
         Select d1, d2, d3, d4, aggr(v) as v
         From Fact
         Group By d1, Cube(d2,d3,d4)
    
    Assume that aggr is one of Sum, Min, or Max.

    Consider any query over MV that further aggregates v (using the same aggregate function), and includes as part of its Where clause "And d3 Is Null". Can we always replace that condition with "And d3 Is Not Null" and get the same query result? You may assume:

    If you answered yes, briefly argue why the queries are always equivalent. If you answered no, show the simplest example query pair and data where the "Is Null" version returns a different answer from the "Is Not Null" version.

  3. Association rules in SQL

    Consider a relation Took(studentID,courseID). This relation contains a variant of the "market basket" data discussed in class -- you can think of a studentID as representing a "transaction" (in the shopping checkout sense, not the database sense), and think of the courses taken by that student as the "items purchased in the transaction." Attributes [studentID,courseID] together form a key for this relation.

    We want to learn about correlations between courses taken. Specifically, we want to mine from this relation all association rules with certain support and confidence thresholds (as defined in class). To keep things simple, you only need to find rules with exactly two items (courseID's) on the left-hand side.

    Specifically, write a SQL query that produces all triples of distinct courseID's [C1,C2,C3] such that {C1,C2} -> C3 is an association rule with support > 0.05 and confidence > 0.7. If you find it useful to define a sequence of one or more views used by the final query you may do so.