CS145 - Spring 2002
Introduction to Databases
 Assignment #8   --   Due Wednesday June 5
• There is no late deadline for this assignment. On-campus students must turn it in by 4:00 PM on Wednesday June 5. SCPD assignments must be timestamped by the courier no later Thursday June 6. No exceptions.

• This assignment contains written work only. Your project was completed in Assignment #7.

 Exercises

1. Consider the following relation:
```  Advised(Advisor, Student, Year) // Student is a key
```
A tuple (A,S,Y) in Advised specifies that advisor A advised PhD student S who graduated in year Y (and then may have become a professor and advised his or her own students).

In one statement of SQL-99 using the recursive WITH construct, write a query that finds all "descendants" of Widom, i.e., all students whose advisor was Widom, or whose advisor's advisor was Widom, or whose advisor's advisor's advisor was Widom, and so on. For each descendant, return the Student attribute along with the graduation Year attribute.

2. The lecture on recursion in SQL introduced a relation:
```  Flight(orig, dest, airline, cost) // all four attributes are the only key; cost > 0
```
containing nonstop flights from one city (orig) to another (dest). We were interested in finding the cheapest cost of flying from city A to city B, regardless of the number of stops. We observed that the query could not be written in SQL2.

(a) Write the query in one statement of SQL-99, using the recursive WITH construct. You may assume that A and B are constants that you may use anywhere in the query.

(b) Does your query in part (a) take full advantage in the WITH clause of knowing cities A and B, i.e., does it use one or both of them to "limit" the recursion to the relevant data? If not, modify your query to take advantage of A and/or B to the extent possible.

(c) Now write a single recursive SQL-99 statement that computes the cheapest cost of flying between all pairs of cities, regardless of the number of stops. The result of the query should have schema (orig,dest,cost) where [orig,dest] together form a key. (The exact attribute names are not important.)

(d) Can your query for part (c) return tuples where orig and dest are the same city? If so, what is the cheapest cost of traveling from a city to itself? Modify your solution to part (c) so that no such tuples appear in the query result.

3. Consider the following object-relational SQL schema taken from the object-relational lecture notes:
```   CREATE TYPE AddressType AS (street char(50), city char(50), zip integer)

CREATE TYPE ScoresType AS (GPA float, SAT integer)
METHOD composite() RETURNS float

CREATE TYPE StudentType AS

CREATE TABLE Student OF TYPE StudentType (PRIMARY KEY (ID))
```

(a) Write a query that returns the zipcodes (no duplicates) of all students with composite scores greater than 100.

(b) Write a query that returns all pairs of students by their ID's such that the students come from the same city and have identical scores. Do not return students with themselves, and do not return pairs and their inverses.

(c) Add to the schema above the statement:

```    CREATE ORDERING FOR ScoresType EQUALS ONLY BY STATE
```
Does this change your solution to part (b)?

4. Do parts (c) and (d) of Exercise 9.5.1 in the textbook (pages 460-461).

5. Consider the following relation that uses a temporal TIMESTAMP attribute as described in class. This relation captures the on-and-off dating habits of couples (sorry for the heterosexual bias).
```  CREATE TABLE Dating (boy STRING, girl STRING, time TIMESTAMP)
```

Recall that TIMESTAMP values are sets of disjoint intervals, and that NOW may be used as an interval endpoint. You may assume that names are unique (i.e., the same name indicates the same person), and that (boy,girl) is a key for this relation. Write the following queries using the temporal relational algebra introduced in class. You may call built-in functions and test conditions on time values as illustrated in class.

(a) Find all boy-girl pairs who are dating each other now.

(b) Find all boy-girl pairs who are dating each other now and have dated each other for a total of at least 6 months.

(c) Find all boys who have dated more than one girl at the same time.

(d) Find all girls who have been dating someone (not necessarily the same person) for a total of at least 2 years.

6. Consider the following fact table that records student exam scores over time.
```  Scores(studentID, quarter, courseID, exam#, score)
```

Consider the following two queries over the Scores relation. (Average scores obviously would make more sense than sums of scores, but we're intentionally keeping the problem simple.)

```  // Total score grouped by quarter and course:
SELECT quarter, courseID, SUM(score)
FROM Scores
GROUP BY quarter, courseID

// Total score grouped by course and exam#
SELECT courseID, exam#, SUM(score)
FROM Scores
GROUP BY courseID, exam#
```

Specify a view V over the Scores relation. You should choose V so that if V is stored as a materialized view, then V can be used to substantially speed up the execution of both of the above queries, assuming that the Scores relation is very large. In addition to specifying V, show how each of the two queries above can be rewritten into an equivalent query that uses V instead of Scores.

7. Consider a much simpler relation Took(courseID), which just lists courses taken over time. The relation has no key, i.e., a value for courseID will appear multiple times in Took, once for each time the course was taken. One fairly straightforward data mining problem is called the top-N items (or frequent items) problem: find those items that appear most often in a relation. Suppose we want to find the top ten courseID's in Took based on how often they appear. Write a SQL query to produce the desired result. If you find it useful to define a sequence of one or more views used by the final query you may do so. Also, if it helps you may assume that frequencies are unique, i.e., that there are no two courseID's that appear the same number of times in Took.

8. Now consider a slightly more complicated version of the Took relation than in the previous problem: Took(studentID,courseID). This relation contains a variant of the "market basket" data discussed in class - you can think of a studentID as representing a "transaction" (in the shopping checkout sense, not the database sense), and think of the courseID's appearing with that student as the "items purchased in the transaction." Attributes [studentID,courseID] together form a key for this relation.

We want to learn about correlations between courses taken. Specifically, we want to mine from this relation all single-attribute association rules with certain support and confidence thresholds (as defined in class). Specifically, write a SQL query that produces all pairs of distinct courseID's [C1,C2] such that C1->C2 is an association rule with support > 0.05 and confidence > 0.6. If you find it useful to define a sequence of one or more views used by the final query you may do so.

 Challenge Problems

1. Let us revisit Challenge Problem #1(b) from Assignment #1. In that problem, we wanted to find the median value in a relation GPA(ID,gpa), and we were unable to do so using relational algebra. In this truly challenging problem you are to write the same query in SQL-99 with recursion and without aggregation. Specifically, your recursive query should return the ID of the student with the median gpa in the relation, without using any aggregation functions. (Recall that the median is the number in the middle, i.e., there are an equal number of higher and lower numbers.) You may assume that both ID's and gpa's are unique, and that there is an odd number of tuples in the relation. In developing a solution, remember that although UNION is used frequently in recursive WITH definitions, it is not a requirement. Also, for the purposes of this problem, you may drop the SQL-99 restriction prohibiting nonlinear recursion.

2. Consider a temporal relation T(A1,A2,...,An,ts) where ts is the special TIMESTAMP attribute, and the remaining attributes are a key for the relation. Recall that each ts contains a disjoint set of time intervals. For simplicity, assume that NOW is not used, i.e., all time interval endpoints represent a fixed point in time.

(a) Specify a schema (including any keys) for a "normal" (nontemporal) relation R that can encode the same data as in relation T. Explain how a set of tuples in T would be translated to a set of tuples in R.

(b) Can you translate a temporal relational algebra projection over temporal relation T into a SQL query over nontemporal relation R? That is, suppose you wish to execute PROJECT_Ai(T) (with its temporal interpretation) by translating it into a SQL query over the corresponding data in relation R. Either show the translation or explain why it can't be done.

(c) Same as part (b), except consider the temporal join of two temporal relations T1(A,B,ts) and T2(B,C,ts). Assume that the temporal relations are represented as two nontemporal relations, R1 and R2, using the same encoding you gave in part (a). You wish to execute   T1 NATURAL-JOIN T2 (with its temporal interpretation) by translating it into a SQL query over R1 and R2. Either show the translation or explain why it can't be done.

3. Consider a fact table F in an OLAP application, and consider the extended table CUBE(F) as defined in class.

(a) Suppose F has five dimension attributes with two values each, and all combinations of dimension values appear in F. Let F have one dependent attribute that is summed in the CUBE tuples representing aggregation. How many tuples are there in F? How many in CUBE(F)?

(b) Now suppose F has N dimension attributes but the values are very sparse, i.e., not nearly all combinations of dimension values are represented in F (but assume F is not empty). Consider the ratio of the size of CUBE(F) to the size of F. How large can that ratio be?

4. Same as Exercise 8, except find all association rules with either one or two attributes on the left-hand side (and one attribute on the right-hand side). Note that all attributes in any association rule should be distinct.

(a) Write the new SQL query. If you find it useful to define a sequence of one or more views used by the final query you may do so.

(b) Are there any principles you can use to "prune" the search, i.e., so you don't need to consider all possible rules with one or two distinct attributes on the left-hand side and one on the right-hand side? Don't worry about incorporating the pruning into your SQL code, just describe it in English.

```

```