CS145 - Spring 2002
Introduction to Databases

Assignment #3 -- Due Wednesday May 1

The procedure for turning in this assignment, the late policy, and the grading (and non-grading) of exercises and challenge problems, is exactly the same as for Assignment #1.
Suggestion: You may want to do the written problems first, since the relational design material covered in the written work is directly applicable to the project work.

Exercises

Consider a relation R(A,B,C). Suppose R contains the following four tuples:
```
   A  B  C
   -------
   1  2  3
   1  2  4
   5  2  3
   5  2  6
```
(a) List all completely nontrivial functional dependencies that hold on this instance of R.
(b) List all nontrivial multivalued dependencies that hold on this instance of R.
Do Exercise 3.6.4 in the textbook (page 117).
Consider a database for a university that may include information about students, courses, professors, etc.
(a) Within this general domain, specify the schema for an example relation and a set of functional dependencies (FD's) over the relation such that the relation is not in Boyce-Codd Normal Form (BCNF). Your relation need not be extensive (i.e., it need not include many attributes and it may represent only a very small part of the complete university information), but the schema and the FD's should be realistic in their modeling of the real world. You should make no assumptions besides those captured by the schema and the FD's.
(b) Decompose your relation from part (a) so that the new schema is in BCNF.
(c) Now specify the schema for an example relation and a set of FD's and multivalued dependencies (MVD's) over the relation such that the relation is in BCNF but is not in Fourth Normal Form (4NF). Again, your relation need not be extensive but the schema and the dependencies should be realistic in their modeling of the real world, and you should make no assumptions besides those captured by the schema and the dependencies. The relation you give for this part of the problem can capture similar or different information from that captured in part (a).
(d) Decompose your relation from part (c) so that the new schema is in 4NF.
Do part (c) of Exercise 3.5.4 in the textbook (page 101).
Do part (b) of Exercise 3.7.8 in the textbook (page 127).
Consider the following rule for functional dependencies:
```
   IF A -> BB, CC -> D, and CC is a subset of BB
   THEN A -> D
```
where A and D are single attributes and BB and CC are sets of attributes. For simplicity you may assume that A and D are not in BB or CC. Prove the correctness of this rule. Your proof should not be based on closure or on other rules for functional dependencies, but rather it should be based on the formal definition of functional dependencies:
```
   AA -> BB =
   for all tuples t and u, if t[AA]=u[AA] then t[BB]=u[BB]
```
where AA and BB are sets of attributes.
Your proof might have roughly the following form: Suppose A -> BB holds, CC -> D holds, and CC is a subset of BB. Then for all tuples t and u, ... [fill in] ... To prove that A -> D holds, we need to prove that for all tuples t and u, ... [fill in]. Therefore A -> D holds.
Another way to prove the rule in Exercise 6 is to use closure of attributes. Given the facts:
```
   A -> BB, CC -> D, CC is a subset of BB
```
compute the closure {A}+ of the attribute set {A}. Does {A}+ contain attribute D? If so, you have proven the rule, i.e., you have shown A -> D. (As an example of this type of proof, see the justification of the Transitive Rule on page 96 of the textbook.)
A third way to prove the rule in Exercise 6 is to use other rules for functional dependencies that already have been proven. By applying the splitting, combining, and transitive rules it is possible to infer A -> D from the three facts:
```
   A -> BB, CC -> D, CC is a subset of BB
```
Show the steps in the inference.

Challenge Problems

(This one is relatively straightforward for a challenge problem, but more time-consuming than a typical exercise.)
Consider the following rule for multivalued dependencies:
```
  IF AA ->> BB and AA ->> CC
  THEN AA ->> (BB minus CC)
```
where AA, BB, and CC are sets of attributes, and minus performs set difference. For simplicity you may assume that AA does not intersect BB or CC. As in Exercise 6, your proof should be based on the formal definition of (in this case) multivalued dependencies, and not on other rules for multivalued dependencies.
Your proof might have roughly the following form: Suppose AA ->> BB and AA ->> CC hold. Then for all tuples t and u there exists a tuple v such that ... [fill in] ... Let DD = BB minus CC. To prove that AA ->> DD holds, we need to prove that for all tuples t and u there exists a tuple v such that ... [fill in]. Therefore AA ->> DD holds.
Consider a relation R(A1, A2, ..., An) where n > 1. Give a general algorithm for constructing a relation instance that satisfies the functional dependencies:
```
   A1 -> A2
   A2 -> A3
   A3 -> A4
     ...
   A(n-1) -> An
```
and all functional dependencies that follow from these, but does not satisfy any other functional dependencies. Even better, give an algorithm that finds a minimal such instance.
Consider the Boyce-Codd Normal Form (BCNF) decomposition algorithm specified in class and discussed in the textbook. We start with a relation R and a set F of functional dependencies (FDs) for R. We decompose R into R1 and R2 based on a dependency in F that violates BCNF, and then we compute the sets of FDs F1 for R1 and F2 for R2 so we can continue with the algorithm. Suppose that when we compute the FDs for R1 and R2, instead of finding all FDs that follow from our original set F and that apply to R1 and R2 (i.e., that include only attributes from R1 or R2), we simply use the original set F of FDs specified for R and take those that apply to R1 and R2. Can we get a decomposition that is not in BCNF? If not, argue why the algorithms still works. If so, illustrate the problem: give a relation schema R and a set of FDs for R, and trace how the BCNF decomposition algorithm produces a final set of relations that is not in BCNF.

Project Part 3

Now that you're familiar with the basics of Oracle and interacting with Oracle from a Web interface, it's time to delve into the real project: AuctionBase. We're providing you with a significant amount of actual auction data encoded in XML files. (The data was downloaded in 2001 from eBay courtesy of the University of Wisconsin Databae Group, then transformed and augmented for use in CS145.) You will examine the data and design a relational schema for it. You will then use an XML parser (if you wish) as the basis for a program that transforms the data from its XML form into Oracle's load file format, conforming to your relational schema. Finally, you will create your schema in Oracle, load the data, and try a few queries.

Part A: Examine the XML data files

We are providing an XML-encoded auction data set for you to use in your project. The data files are located in directory /usr/class/cs145/ebay_data/ (symbolically linked for your convenience as http://www.stanford.edu/class/cs145/ebay_data/).

As a small data set for initial experimentation and debugging we suggest you use just one file: /usr/class/cs145/ebay_data/items-0.xml (also http://www.stanford.edu/class/cs145/ebay_data/items-0.xml). It contains 500 auctions, comprising about 900K bytes of ASCII data.
Your AuctionBase system also must work on the large data set, which consists of all 40 files: /usr/class/cs145/ebay_data/items-?.xml, for ? = 0..39. There are a total of about 20,000 auctions, comprising about 38 megabytes of ASCII data.

Each XML data file is valid with respect to the XML DTD specified in file /usr/class/cs145/ebay_data/items.dtd (also http://www.stanford.edu/class/cs145/ebay_data/items.dtd).

Your first task is to examine the DTD and the XML files to completely understand the data you will be starting with. You will translate this data into relations and load it into your AuctionBase system. Please read the auction data help file in /usr/class/cs145/ebay_data/items.txt (also http://www.stanford.edu/class/cs145/ebay_data/items.txt). One of the most important things to understand about the data you're starting with is that it represents a single point in time. (Very specifically, it represents the point in time December 20th, 2001, one second after midnight.) It contains items that have been auctioned off in the past and items that are "currently" up for auction. Once your AuctionBase system is online, you will provide the capability to enter bids on items, to "move forward" in time so that auctions close, and to add new users and auctions.

As you develop your AuctionBase system you are free to impose any reasonable auction rules you like -- in particular you are not required to follow eBay's rules -- as long as they are consistent with the provided data. (For example, if the provided data contains an auction with two bids that are 50 cents apart, you cannot impose a bid increment minimum of one dollar.)

Note that some of the Description elements in the XML data are quite long. We suggest that you use the Oracle attribute type VARCHAR2(4000) (which is the maximum length) to store descriptions, and simply truncate any that are too long to fit. If you really want to store the full descriptions you can try using the Oracle datatypes LONG or CLOB, but they have a number of restrictions and most likely will cause you grief. See Limits from SQL for Web Nerds for a discussion of the problems associated with storing long strings in Oracle.

Part B: Design your relational schema

Now that you understand the data you'll be working with, design a good relational schema for it.

List your relations. Please specify all keys that hold on each relation. You need not specify attribute types at this stage.
List all nontrivial functional dependencies that hold on each relation, excluding those that effectively specify keys.
Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-BCNF relations.
List all nontrivial multivalued dependencies that hold on each relation, excluding those that are also functional dependencies.
Are all of your relations in Fourth Normal Form (4NF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-4NF relations.

Part C: Write a data transformation program

Now write a program that transforms the XML data into Oracle load files that are consistent with the relational schema you settled on in Part B. This coding will most likely represent the bulk of your work for this project part. You may use absolutely any language you like to code your data tranformation program. However, you will probably find that your job is much easier if you make use of an XML parser. We are providing two parser skeletons, one for C++ users and one for Java users. If you use these skeleton programs, all of the "parsing" is done for you: the Java version invokes JDK 1.4's built-in XML parser, and the C++ version uses the Xerces C++ parser (http://xml.apache.org/xerces-c/). You need to fill in code that processes the internal representation of the XML tree and produces Oracle files according to the relational schema you designed in Part B. Detailed information is provided as comments in the respective skeletons:

The C++ skeleton program is in files /usr/class/cs145/hw3/MyParserSkeleton.cc and /usr/class/cs145/hw3/MyParserSkeleton.h.
The Java skeleton program is in file /usr/class/cs145/hw3/MyParserSkeleton.java.

You may also find the notes from the XML Parsing Help Session useful.

Naturally we suggest that you fully debug your program on the small data set before unleashing it on the large one.

Note on duplicate elimination: When transforming the XML data to relational tuples you may discover that you generate certain tuples multiple times but only want to retain one copy. There are several ways to eliminate duplicates, including:

Coding duplicate-elimination as part of your transformation program.
Using Unix tools (e.g., sort and uniq) directly on the generated load files to eliminate duplicate lines. We recommend this method as the easiest one, but the choice is yours.
Relying on the Oracle bulk loader, which will generate an error but continue loading when a tuple with a duplicate value in a key attribute is encountered, without loading the erroneous tuple. (It's a hack but it works!) If you use this approach you may need to increase your sqlldr errors setting or your load may not complete, and you should make sure to look out for other, unintentional, errors.

Part D: Load your data into Oracle

The final step is to create and populate your AuctionBase database. Using the sqlplus interface to Oracle, issue CREATE TABLE commands for all of the relations in your schema. Three suggestions:

For efficiency we suggest that you specify a PRIMARY KEY for each table that has at least one key. (See the document Getting Started with Oracle, available through the course Web site Project page, for details of specifying primary keys in table declarations.)
For attributes representing dates/times, we suggest that you use Oracle's built-in DATE type. Information is available in the document Oracle Dates and Times (available through the course Web site Project page), and in a chapter called Dates from SQL for Web Nerds, also linked from the course Web site Project page.
For attributes representing money, we suggest you use NUMBER(8,2) to specify a numeric type with 2 digits after the decimal point.

Once the tables are created, load your data through sqlplus in the same way that you loaded tables Courses and Grades in parts 1 and 2 of the project. Note that for long string fields like VARCHAR2(4000) you must specify the length of the field in your control file, as shown here for the Description attribute:

  LOAD DATA
  [...]
  (ItemID, ..., Description CHAR(4000))

At this point you may want to revisit the Note on maintaining your databases included in Assignment #2.

Part E: Test your Oracle database

Take your newly loaded database for a "test drive" by running a few SQL queries over it through sqlplus. Try some simple queries over one relation and some more complex queries involving joins. Make sure the results look correct.

What to submit

Create a submission directory containing the following files:

Your parser code files (C++, Java, Perl, etc.). Please name the main program file MyParser.xxx. For example, if you used our Java skeleton, you should submit MyParser.java, and if you used our C++ skeleton, you should submit MyParser.cc and MyParser.h.
A text file named README, containing the following items, clearly labeled (a) through (e):
(a) Any special compilation or running instructions for your program and for loading your data into Oracle (e.g., if you did not eliminate duplicates inside your transformation program, specify the procedure used to eliminate duplicates)
(b) The 5 items listed in Part B of the assignment description
(c) The CREATE TABLE commands and text of your sqlldr control files
(d) A sample few lines from each of the Oracle load files produced by your transformation program, suitably labeled so we can match them against your relational schema. Do not submit the entire load files!
(e) A transcript showing your table creation commands in Oracle, the successful loading of your complete database on the large data set, and the successful execution of 2-3 sample queries over the large data set demonstrating that the data set is correct. If your sample queries generate large results, please include only the first few tuples of each result in your submission.

Once your submission directory is created, submission instructions are exactly the same as for Project Parts 1 and 2. Remember that points may be deducted if you do not follow the submission procedures exactly as specified, including file naming and contents, and if you submit more than once, all submissions except your last will be ignored.