Introduction to Databases

Assignment #8 -- Due Wednesday June 5 |

**There is no late deadline for this assignment. On-campus students must turn it in by 4:00 PM on Wednesday June 5. SCPD assignments must be timestamped by the courier no later Thursday June 6. No exceptions.**- This assignment contains written work only. Your project was completed in Assignment #7.

Exercises |

- Consider the following relation:
Advised(Advisor, Student, Year) // Student is a key

A tuple`(A,S,Y)`in`Advised`specifies that advisor`A`advised PhD student`S`who graduated in year`Y`(and then may have become a professor and advised his or her own students).In one statement of SQL-99 using the recursive

`WITH`construct, write a query that finds all "descendants" of Widom, i.e., all students whose advisor was Widom, or whose advisor's advisor was Widom, or whose advisor's advisor's advisor was Widom, and so on. For each descendant, return the`Student`attribute along with the graduation`Year`attribute. - The lecture on recursion in
SQL introduced a relation:
Flight(orig, dest, airline, cost) // all four attributes are the only key; cost > 0

containing nonstop flights from one city (`orig`) to another (`dest`). We were interested in finding the cheapest cost of flying from city`A`to city`B`, regardless of the number of stops. We observed that the query could not be written in SQL2.**(a)**Write the query in one statement of SQL-99, using the recursive`WITH`construct. You may assume that`A`and`B`are constants that you may use anywhere in the query.**(b)**Does your query in part (a) take full advantage in the`WITH`clause of knowing cities`A`and`B`, i.e., does it use one or both of them to "limit" the recursion to the relevant data? If not, modify your query to take advantage of`A`and/or`B`to the extent possible.**(c)**Now write a single recursive SQL-99 statement that computes the cheapest cost of flying between all pairs of cities, regardless of the number of stops. The result of the query should have schema`(orig,dest,cost)`where`[orig,dest]`together form a key. (The exact attribute names are not important.)**(d)**Can your query for part (c) return tuples where`orig`and`dest`are the same city? If so, what is the cheapest cost of traveling from a city to itself? Modify your solution to part (c) so that no such tuples appear in the query result. - Consider the following object-relational SQL schema taken from
the object-relational lecture notes:
CREATE TYPE AddressType AS (street char(50), city char(50), zip integer) CREATE TYPE ScoresType AS (GPA float, SAT integer) METHOD composite() RETURNS float CREATE TYPE StudentType AS (ID integer, name char(30), address AddressType, scores ScoresType) CREATE TABLE Student OF TYPE StudentType (PRIMARY KEY (ID))

**(a)**Write a query that returns the zipcodes (no duplicates) of all students with composite scores greater than 100.**(b)**Write a query that returns all pairs of students by their ID's such that the students come from the same city and have identical scores. Do not return students with themselves, and do not return pairs and their inverses.**(c)**Add to the schema above the statement:CREATE ORDERING FOR ScoresType EQUALS ONLY BY STATE

Does this change your solution to part (b)? - Do parts (c) and (d) of Exercise 9.5.1 in the textbook (pages 460-461).
- Consider the following relation that uses a temporal
`TIMESTAMP`attribute as described in class. This relation captures the on-and-off dating habits of couples (sorry for the heterosexual bias).CREATE TABLE Dating (boy STRING, girl STRING, time TIMESTAMP)

Recall that

`TIMESTAMP`values are sets of disjoint intervals, and that`NOW`may be used as an interval endpoint. You may assume that names are unique (i.e., the same name indicates the same person), and that`(boy,girl)`is a key for this relation. Write the following queries using the temporal relational algebra introduced in class. You may call built-in functions and test conditions on time values as illustrated in class.**(a)**Find all boy-girl pairs who are dating each other now.**(b)**Find all boy-girl pairs who are dating each other now and have dated each other for a total of at least 6 months.**(c)**Find all boys who have dated more than one girl at the same time.**(d)**Find all girls who have been dating someone (not necessarily the same person) for a total of at least 2 years. - Consider the following
*fact table*that records student exam scores over time.Scores(studentID, quarter, courseID, exam#, score)

Consider the following two queries over the

`Scores`relation. (Average scores obviously would make more sense than sums of scores, but we're intentionally keeping the problem simple.)// Total score grouped by quarter and course: SELECT quarter, courseID, SUM(score) FROM Scores GROUP BY quarter, courseID // Total score grouped by course and exam# SELECT courseID, exam#, SUM(score) FROM Scores GROUP BY courseID, exam#

Specify a view

`V`over the`Scores`relation. You should choose`V`so that if`V`is stored as a*materialized view*, then`V`can be used to substantially speed up the execution of both of the above queries, assuming that the`Scores`relation is very large. In addition to specifying`V`, show how each of the two queries above can be rewritten into an equivalent query that uses`V`instead of`Scores`. - Consider a much simpler relation
`Took(courseID)`, which just lists courses taken over time. The relation has no key, i.e., a value for`courseID`will appear multiple times in`Took`, once for each time the course was taken. One fairly straightforward data mining problem is called the*top-N items*(or*frequent items*) problem: find those items that appear most often in a relation. Suppose we want to find the top ten`courseID`'s in`Took`based on how often they appear. Write a SQL query to produce the desired result. If you find it useful to define a sequence of one or more views used by the final query you may do so. Also, if it helps you may assume that frequencies are unique, i.e., that there are no two`courseID`'s that appear the same number of times in`Took`. - Now consider a slightly more complicated version of the
`Took`relation than in the previous problem:`Took(studentID,courseID)`. This relation contains a variant of the "market basket" data discussed in class - you can think of a`studentID`as representing a "transaction" (in the shopping checkout sense, not the database sense), and think of the`courseID`'s appearing with that student as the "items purchased in the transaction." Attributes`[studentID,courseID]`together form a key for this relation.We want to learn about correlations between courses taken. Specifically, we want to mine from this relation all single-attribute association rules with certain

*support*and*confidence*thresholds (as defined in class). Specifically, write a SQL query that produces all pairs of distinct`courseID`'s`[C1,C2]`such that`C1->C2`is an association rule with support > 0.05 and confidence > 0.6. If you find it useful to define a sequence of one or more views used by the final query you may do so.

Challenge Problems |

- Let us revisit Challenge Problem #1(b) from Assignment #1. In that problem, we wanted to find
the median value in a relation
`GPA(ID,gpa)`, and we were unable to do so using relational algebra. In this truly challenging problem you are to write the same query in SQL-99 with recursion and without aggregation. Specifically, your recursive query should return the`ID`of the student with the median`gpa`in the relation, without using any aggregation functions. (Recall that the median is the number in the middle, i.e., there are an equal number of higher and lower numbers.) You may assume that both`ID`'s and`gpa`'s are unique, and that there is an odd number of tuples in the relation. In developing a solution, remember that although`UNION`is used frequently in recursive`WITH`definitions, it is not a requirement. Also, for the purposes of this problem, you may drop the SQL-99 restriction prohibiting nonlinear recursion. - Consider a temporal relation
`T(A1,A2,...,An,ts)`where`ts`is the special`TIMESTAMP`attribute, and the remaining attributes are a key for the relation. Recall that each`ts`contains a disjoint set of time intervals. For simplicity, assume that`NOW`is not used, i.e., all time interval endpoints represent a fixed point in time.**(a)**Specify a schema (including any keys) for a "normal" (nontemporal) relation`R`that can encode the same data as in relation`T`. Explain how a set of tuples in`T`would be translated to a set of tuples in`R`.**(b)**Can you translate a temporal relational algebra*projection*over temporal relation`T`into a SQL query over nontemporal relation`R`? That is, suppose you wish to execute`PROJECT_Ai(T)`(with its temporal interpretation) by translating it into a SQL query over the corresponding data in relation`R`. Either show the translation or explain why it can't be done.**(c)**Same as part (b), except consider the temporal*join*of two temporal relations`T1(A,B,ts)`and`T2(B,C,ts)`. Assume that the temporal relations are represented as two nontemporal relations,`R1`and`R2`, using the same encoding you gave in part (a). You wish to execute`T1 NATURAL-JOIN T2`(with its temporal interpretation) by translating it into a SQL query over`R1`and`R2`. Either show the translation or explain why it can't be done. - Consider a
*fact table*`F`in an OLAP application, and consider the extended table`CUBE(F)`as defined in class.**(a)**Suppose`F`has five dimension attributes with two values each, and all combinations of dimension values appear in`F`. Let`F`have one dependent attribute that is summed in the`CUBE`tuples representing aggregation. How many tuples are there in`F`? How many in`CUBE(F)`?**(b)**Now suppose`F`has`N`dimension attributes but the values are very sparse, i.e., not nearly all combinations of dimension values are represented in`F`(but assume`F`is not empty). Consider the ratio of the size of`CUBE(F)`to the size of`F`. How large can that ratio be? - Same as Exercise 8, except find all association rules with
either one or two attributes on the left-hand side (and one attribute
on the right-hand side). Note that all attributes in any association
rule should be distinct.
**(a)**Write the new SQL query. If you find it useful to define a sequence of one or more views used by the final query you may do so.**(b)**Are there any principles you can use to "prune" the search, i.e., so you don't need to consider all possible rules with one or two distinct attributes on the left-hand side and one on the right-hand side? Don't worry about incorporating the pruning into your SQL code, just describe it in English.