CS145 - Spring 2004
Introduction to Databases
Project Part 1: AuctionBase Schema and Data
Due Thursday May 6th

Overview

You will be given a large volume of data downloaded from the eBay web site and stored in XML files. You will examine the data and design a good relational schema for it. You will then write a Java (or other) program to transform the data from its XML form into Oracle's load file format, conforming to your relational schema. You will create your schema in your own Oracle database, load your transformed data, and test it by running some SQL queries over it.

Part A: Examine the XML data files

We are providing an XML-encoded auction data set for you to use in your project. The data files are located in directory /usr/class/cs145/project/ebay_data/.

  1. As a small data set for initial experimentation and debugging we suggest you use just one file: /usr/class/cs145/project/ebay_data/items-0.xml. It contains 500 auctions, comprising about 900K bytes of plain-text data.

  2. Your AuctionBase system also must work on the large data set, which consists of all 40 files: /usr/class/cs145/project/ebay_data/items-?.xml, for ? = 0..39. There are a total of about 20,000 auctions, comprising about 38 megabytes of plain-text data.
Each XML data file is valid with respect to the XML DTD specified in file /usr/class/cs145/project/ebay_data/items.dtd.

Your first task is to examine the DTD and the XML files to completely understand the data you will be starting with. You will translate this data into relations and load it into your AuctionBase system. Please read the auction data help file in /usr/class/cs145/project/ebay_data/items.txt. One of the most important things to understand about the data you're starting with is that it represents a single point in time. (Very specifically, it represents the point in time December 20th, 2001, one second after midnight.) It contains items that have been auctioned off in the past and items that are "currently" up for auction. Once your AuctionBase system is online, you will provide the capability to enter bids on items, to "move forward" in time so that auctions close, and to add new users and auctions.

As you develop your AuctionBase system you are free to impose any reasonable auction rules you like -- in particular you are not required to follow eBay's rules -- as long as they are consistent with the provided data. (For example, if the provided data contains an auction with two bids that are 50 cents apart, you cannot impose a bid increment minimum of one dollar.)

Note that some of the Description elements in the XML data are quite long. We suggest that you use the Oracle attribute type VARCHAR2(4000) (which is the maximum length) to store descriptions, and during data transformation simply truncate any that will be too long to fit. If you really want to store the full descriptions you can try using the Oracle datatypes LONG or CLOB, but they have a number of restrictions and most likely will cause you grief. See Limits from SQL for Web Nerds for a discussion of the problems associated with storing long strings in Oracle.

Part B: Design your relational schema

Now that you understand the data you'll be working with, design a good relational schema for it.

  1. List your relations. Please specify all keys that hold on each relation. You need not specify attribute types at this stage.

  2. List all completely nontrivial functional dependencies that hold on each relation, excluding those that effectively specify keys.

  3. Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-BCNF relations.

  4. List all nontrivial multivalued dependencies that hold on each relation, excluding those that are also functional dependencies.

  5. Are all of your relations in Fourth Normal Form (4NF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-4NF relations.
Document these five steps in a text file for submission; see What to submit below.

Part C: Write a data transformation program

Now write a program that transforms the XML data into Oracle load files that are consistent with the relational schema you settled on in Part B. We are providing a parser skeleton in Java. In this skeleton all of the "parsing" is done for you by invoking JDK 1.4's built-in XML parser. You need to fill in code that processes the internal representation of the XML tree and produces Oracle files according to the relational schema you designed in Part B. Detailed information is provided as comments in the skeleton code we are providing:
   /usr/class/cs145/project/pa1/MyParserSkeleton.java
Please use a Makefile for compiling your transformation program(s). A sample Makefile is provided at:
  /usr/class/cs145/project/pa1/Makefile
Most of you will simply need to replace the file name MyParserSkeleton.java in the sample Makefile with your parser's file name.

We strongly suggest that you fully debug your program on the small data set before unleashing it on the large one.

Duplicate elimination

When transforming the XML data to relational tuples you may discover that you generate certain tuples multiple times but only want to retain one copy. There are different ways to eliminate duplicates, including: Please do not rely on the Oracle bulk loader to eliminate duplicates. Oracle will generate an error but continue loading when a tuple with a duplicate value in a key attribute is encountered. However, using this "feature" to eliminate duplicates is an unaesthetic hack that also complicates our grading procedure.

Using other languages and tools

You are not required to use the JDK XML parser, or to code your transformation program in Java. You are welcome to use other languages or tools to build your program that translates XML data conforming to the provided DTD into Oracle load files conforming to your relational schema. (You may not, however, perform the transformation by hand!) Java and the JDK XML parser are the only tools for which we can guarantee support from the course staff in terms of system problems, knowledge, and general help. If you choose to use alternate languages or tools, you may be on your own, and of course you are still required to meet the project specifications. Furthermore, the TA's must be able to compile and run all submitted code on the leland machines without any additional packages or runtime support, and with no additional effort over that required for projects using the supported languages and tools.

Part D: Load your data into Oracle

The next step is to create and populate your AuctionBase database(s). We suggest that you first debug the schema creation and loading process interactively using sqlplus and sqlldr (see Section D.1), as in Project Part 0. When you are satisfied that everything is working, follow the instructions to set up for batch loading (see Section D.3), which allows a database to be conveniently recreated from scratch with one command.

D.1 Creating your databases interactively

Using the sqlplus interface to Oracle, issue "create table" commands for all of the relations in your schema. Some suggestions:

Once the tables are created, load your data through sqlldr in the same way that you loaded the Courses table in Project Part 0. Note that for long string fields like VARCHAR2(4000) you must specify the length of the field in your control file, as shown here for the Description attribute:

  load data [...]  (ItemID, ..., Description CHAR(4000))

D.2 Maintaining your databases

You should consider two factors in the maintenance of your databases throughout the CS145 project:
  1. Oracle is not backed up, so anything you need long-term should be saved in the leland file system.
  2. As you start modifying data in a database, you will undoubtedly want the ability to get a "fresh start" easily from your original data.
We recommend that you establish a routine of saving all data in Oracle load files, and perhaps reloading the database each time you start working with it. Remember that you need to delete the contents of each table (or destroy and recreate the tables) before reloading. Otherwise, unless you take APPEND out of your control file (or there is a declared key), Oracle will happily append the new data to your old table, causing your table size to double, triple, quadruple, etc. To get rid of a table called T, issue the following command to sqlplus:
  drop table T;
If you want to get rid of all tuples in T without deleting the table itself, issue the command:
  delete from T;

D.3 Automating the process

sqlplus provides a facility for reading a set of commands from a file and executing them in batch mode. You should use this facility for (re)building databases and running sample queries, and you must use it extensively in your submitted work. (See What to submit below.) For starters, create a command file called create.sql that includes the SQL commands for table creation that you debugged interactively (recall Section D.1). This file will look something like:
  create table Item ( .... );
  create table AuctionUser ( ... );
  ...
  quit;
Similarly, create a second command file called drop.sql that destroys all your tables. This file will look something like:
  drop table Item;
  drop table AuctionUser;
  ...
  quit;
To automate the entire process, create a shell script called runLoad. This script can do some or all of the following four things. (Note that the script you submit should do step 3 only; see What to submit below.)
  1. Drop any existing relevant tables in your database.
  2. Create all the relevant tables in your database.
  3. Call your parser to generate fresh load files.
  4. Call sqlldr to load the files into Oracle.
Here is a skeleton of what this file can look like when it includes all four steps:
  #!/bin/tcsh
  
  source /afs/ir/class/cs145/all.env
 
  # Run the drop.sql batch file to drop existing tables
  sqlplus USERNAME/PASSWORD @drop.sql
  
  # Run the create.sql batch file to create new tables
  sqlplus USERNAME/PASSWORD @create.sql

  # Run your parser to generate the appropriate load files
  java MyParser ...
  ...

  # Run sqlldr to load the data file
  sqlldr USERNAME/PASSWORD ...
Note that you must replace USERNAME/PASSWORD with your actual Oracle account information. To run this script, make sure permissions are set properly by executing "chmod 0755 runLoad", then simply type "./runLoad" at the shell prompt to run it.

Part E: Test your Oracle database

The final step is to take your newly loaded database for a "test drive" by running a few SQL queries over it. As with database creation, use sqlplus to test your queries interactively, then set up a command file to run them in batch mode. First try some simple queries over one relation, then more complex queries involving joins and aggregation. Make sure the results look correct. When you are confident that everything is correct, write SQL queries for the following specific tasks:
  1. Find the number of users in the database.
  2. Find the number of users from "Los Angeles", i.e., users whose location contains the exact string "Los Angeles".
  3. Find the number of auctions belonging to exactly four categories.
  4. Find the ID(s) of auction(s) with the highest current price.
  5. Find the number of sellers whose rating is higher than 1000.
  6. Find the number of users who are both sellers and bidders.
  7. Find the number of categories that include at least one item with a bid of more than $100.

Correctness check: Your answers to the above seven queries over the large data set should be (in order): 13422, 118, 8365, 1046871451, 3130, 6766, 150

Put all seven queries in a batch command file called queries.sql. It should look like:

  select ... from ... ;
  select ... from ... ;
  ...
  quit;
Make sure that the order of your queries in the file is exactly the same as the order specified above. We will be checking correctness automatically.


Having trouble?

If you run into difficulties, before contacting the course staff please consult the Project Part 1 FAQ Page. We will keep it up-to-date with answers to common questions.

What to submit

Create a submission directory containing the following files:

  1. Your parser source code file(s). Do not submit the binary version of your parser.

  2. Your Makefile for compiling your files.

  3. Your control files for sqlldr. Please use file names with a .ctl extension.

  4. Your batch command files create.sql, drop.sql, and queries.sql. Do not submit the output produced by running these files.

  5. The part of your runLoad script that handles running your parser (i.e., step 3 in Section D.3 above). Put this fragment in a file named runParser, and set it to invoke your parser over the large data set directly from the class directory, not over your own local copy. Do not submit the output from your parser.

  6. A text file named README.txt, containing the 5 items listed in Part B above.
Reminder: Please do NOT submit the output of your parser code and scripts, or the binary version of your parser.

Once your submission directory is properly assembled, with no extraneous files, from your submission directory execute the script:

  /afs/ir/class/cs145/bin/submit-project
You may resubmit as many times as you like, however only the latest submission and timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.

Important Note: Grading projects is a difficult and time-consuming task, made much harder when students do not follow submission instructions precisely. It is to your benefit to follow directions and keep the grader in a good mood! Points will be deducted if you do not follow the submission procedures exactly as specified, including which files to include (and which not to include), file naming, and file contents. Remember to allow sufficient time to prepare your submission once your work is complete.