CS145 - Spring 2003
Introduction to Databases

Part 1: AuctionBase Schema and Data -- Due Thursday May 1st

Programming work: Programming work must be submitted electronically by Thursday at 11:59pm. Programming work submitted after the deadline but less than 24 hours late (i.e., by Friday 11:59pm) will be accepted but penalized 10%, and programming work submitted more than 24 hours but less than 48 hours late (i.e., by Saturday 11:59pm) will be penalized 30%. No programming work will be accepted more than 48 hours late.
For emergencies: Since emergencies do arise, each student is allocated four "chits" for turning in CS145 project work or challenge problems late with no penalty. Please see the Late Policy on the Assigned Work page for a detailed description of chits.
Honor Code reminder: For more detailed discussion of the Stanford Honor Code as it pertains to CS145, please see the Assigned Work page under Honor Code. In summary: You must indicate on your written and programming assignments any assistance (human or otherwise) that you received. Any assistance received that is not given proper citation will be considered a violation of the Honor Code. In any event, you are responsible for understanding and being able to explain on your own all material that you submit.
Reminder: Projects must be completed individually.

Overview

You will be given a large volume of data downloaded from the eBay Web site and stored in XML files. You will examine the data and design a good relational schema for it. You will then write a Java (or other) program to transform the data from its XML form into Oracle's load file format, conforming to your relational schema. You will create your schema in your own Oracle database and load the data.

Part A: Examine the XML data files

We are providing an XML-encoded auction data set for you to use in your project. The data files are located in directory /usr/class/cs145/project/ebay_data/.

As a small data set for initial experimentation and debugging we suggest you use just one file: /usr/class/cs145/project/ebay_data/items-0.xml. It contains 500 auctions, comprising about 900K bytes of ASCII data.
Your AuctionBase system also must work on the large data set, which consists of all 40 files: /usr/class/cs145/project/ebay_data/items-?.xml, for ? = 0..39. There are a total of about 20,000 auctions, comprising about 38 megabytes of ASCII data.

Each XML data file is valid with respect to the XML DTD specified in file /usr/class/cs145/project/ebay_data/items.dtd.

Your first task is to examine the DTD and the XML files to completely understand the data you will be starting with. You will translate this data into relations and load it into your AuctionBase system. Please read the auction data help file in /usr/class/cs145/project/ebay_data/items.txt. One of the most important things to understand about the data you're starting with is that it represents a single point in time. (Very specifically, it represents the point in time December 20th, 2001, one second after midnight.) It contains items that have been auctioned off in the past and items that are "currently" up for auction. Once your AuctionBase system is online, you will provide the capability to enter bids on items, to "move forward" in time so that auctions close, and to add new users and auctions.

As you develop your AuctionBase system you are free to impose any reasonable auction rules you like -- in particular you are not required to follow eBay's rules -- as long as they are consistent with the provided data. (For example, if the provided data contains an auction with two bids that are 50 cents apart, you cannot impose a bid increment minimum of one dollar.)

Note that some of the Description elements in the XML data are quite long. We suggest that you use the Oracle attribute type VARCHAR2(4000) (which is the maximum length) to store descriptions, and during data transformation simply truncate any that will be too long to fit. If you really want to store the full descriptions you can try using the Oracle datatypes LONG or CLOB, but they have a number of restrictions and most likely will cause you grief. See Limits from SQL for Web Nerds for a discussion of the problems associated with storing long strings in Oracle.

Part B: Design your relational schema

Now that you understand the data you'll be working with, design a good relational schema for it.

List your relations. Please specify all keys that hold on each relation. You need not specify attribute types at this stage.
List all nontrivial functional dependencies that hold on each relation, excluding those that effectively specify keys.
Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-BCNF relations.
List all nontrivial multivalued dependencies that hold on each relation, excluding those that are also functional dependencies.
Are all of your relations in Fourth Normal Form (4NF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-4NF relations.

Part C: Write a data transformation program

Now write a program that transforms the XML data into Oracle load files that are consistent with the relational schema you settled on in Part B. We are providing a parser skeleton in Java. In this skeleton all of the "parsing" is done for you by invoking JDK 1.4's built-in XML parser. You need to fill in code that processes the internal representation of the XML tree and produces Oracle files according to the relational schema you designed in Part B. Detailed information is provided as comments in the skeleton code, which may be found here:

/usr/class/cs145/project/pa1/MyParserSkeleton.java.

Naturally we suggest that you fully debug your program on the small data set before unleashing it on the large one.

Note on duplicate elimination: When transforming the XML data to relational tuples you may discover that you generate certain tuples multiple times but only want to retain one copy. There are several ways to eliminate duplicates, including:

Coding duplicate-elimination as part of your transformation program.
Using Unix tools (e.g., sort and uniq) directly on the generated load files to eliminate duplicate lines. We recommend this method as the easiest one, but the choice is yours.
Relying on the Oracle bulk loader, which will generate an error but continue loading when a tuple with a duplicate value in a key attribute is encountered, without loading the erroneous tuple. (It's a hack but it works!) If you use this approach you may need to increase your sqlldr errors setting or your load may not complete, and you should make sure to look out for other, unintentional, errors.

Note on using other languages and tools: You are not required to use the JDK XML parser, or to code your transformation program in Java. You are welcome to use any languages or tools you wish to build a program that translates XML data conforming to the provided DTD into Oracle load files conforming to your relational schema. (You may not, however, perform the transformation by hand!) Java and the JDK XML parser are the only tools for which we can guarantee support from the course staff in terms of system problems, knowledge, and general help. If you choose to use alternate languages or tools, you may be on your own, and of course you are still required to meet the project specifications.

Part D: Load your data into Oracle

The final step is to create and populate your AuctionBase database. Using the sqlplus interface to Oracle, issue CREATE TABLE commands for all of the relations in your schema. Three suggestions:

For efficiency we suggest that you specify a PRIMARY KEY for each table that has at least one key. (See the document Getting Started with Oracle, available through the course Web site Project Support Materials page, for details of specifying primary keys in table declarations.)
For attributes representing dates/times, we suggest that you use Oracle's built-in DATE type. Information is available in the document Oracle Dates and Times (available through the course Web site Project Support Materials page), and in a chapter called Dates from SQL for Web Nerds, also linked from the course Web site Project Support Materials page.
For attributes representing money, we suggest you use NUMBER(8,2) to specify a numeric type with 2 digits after the decimal point.

Once the tables are created, load your data through sqlplus in the same way that you loaded the Courses table in Project Part 0. Note that for long string fields like VARCHAR2(4000) you must specify the length of the field in your control file, as shown here for the Description attribute:

 LOAD DATA [...]  (ItemID, ..., Description CHAR(4000))

Note on maintaining your databases

You should consider two factors in the maintenance of your databases throughout the CS145 project: Oracle is not backed up, so anything you need long-term should be saved in the leland file system. As you start modifying data in your database, you will undoubtedly want the ability to get a "fresh start" easily from your original data. We recommend that you establish a routine of saving all data in Oracle load files, and perhaps reloading the database each time you start working with it. Remember to delete the contents of each table (or destroy and recreate the tables) before reloading. Otherwise, unless you take APPEND out of your control file (or there is a declared key), Oracle will happily append the new data to your old table, causing your table size to double, triple, quadruple, etc. To get rid of a table called T, issue the following command to sqlplus:

  drop table T;

If you want to get rid of all tuples in T without deleting the table itself, issue the command:

  delete from T;

Part E: Test your Oracle database

Take your newly loaded database for a "test drive" by running a few SQL queries over it through sqlplus. Try some simple queries over one relation and some more complex queries involving joins. Make sure the results look correct.

What to submit

Create a submission directory containing the following files:

Your parser code file(s). Please name the main program file MyParser.java.
A text file named README, containing the following items, clearly labeled (a) through (e):
(a) Any special compilation or running instructions for your program and for loading your data into Oracle (e.g., if you did not eliminate duplicates inside your transformation program, specify the procedure used to eliminate duplicates). If you did not use Java, please add a section to part 2(a) of your submission specifying clearly what tools you used for parsing and data transformation, what your code files are named, and (roughly) what they contain.
(b) The 5 items listed in Part B of the assignment description
(c) The CREATE TABLE commands and text of your sqlldr control files
(d) A sample few lines from each of the Oracle load files produced by your transformation program, suitably labeled so we can match them against your relational schema. Do not submit the entire load files!
(e) A transcript showing your table creation commands in Oracle, the successful loading of your complete database on the large data set, and the successful execution of 3 sample queries over the large data set demonstrating that the data set is correct. If your sample queries generate large results, please include only the first few tuples of each result in your submission.

From your submission directory, execute the command

/usr/class/cs145/bin/submit

Follow the instructions exactly. If you must resubmit, run the command

/usr/class/cs145/bin/submit -replace

If you have any problems, please send mail to cs145@cs.stanford.edu. Remember that points may be deducted if you do not follow the submission procedures exactly as specified, including file naming and contents.