CS145 Introduction to Databases - Spring 2004
Project Part 1: AuctionBase Schema and Data
Due Thursday, May 6th
- Submission: Programming work is submitted electronically and is
due by Thursday at 11:59 PM.
Programming work submitted after the deadline but less than 24 hours
late (i.e., by Friday 11:59 PM) will be accepted but penalized 10%,
and programming work submitted more than 24 hours but less than 48
hours late (i.e., by Saturday 11:59 PM) will be penalized 30%. No
programming work will be accepted more than 48 hours late.
- For emergencies: Since emergencies do arise,
each student is allowed a total of four unpenalized late days (four
periods of up to 24 hours each) for programming work together with
challenge problems, although no single assignment may be more than two
days late. See the Assigned
Work page for more information.
- Honor Code reminder: For more detailed
discussion of the Stanford Honor Code as it pertains to CS145, please
see the Assigned
Work page under Honor Code. In summary: You must
indicate on your written and programming assignments any
assistance (human or otherwise) that you received. Any assistance
received that is not given proper citation will be considered a
violation of the Honor Code. In any event, you are responsible for
understanding and being able to explain on your own all material that
you submit.
- Projects must be completed individually.
Overview
You will be given a large volume of data downloaded from the eBay web
site and stored in XML files. You will examine the data and design a
good relational schema for it. You will then write a Java (or other)
program to transform the data from its XML form into Oracle's load
file format, conforming to your relational schema. You will create
your schema in your own Oracle database, load your transformed data,
and test it by running some SQL queries over it.
Part A: Examine the XML data files
We are providing an XML-encoded auction data set for you to use in
your project. The data files are located in directory
/usr/class/cs145/project/ebay_data/.
- As a small data set for initial experimentation and
debugging we suggest you use just one file:
/usr/class/cs145/project/ebay_data/items-0.xml. It contains
500 auctions, comprising about 900K bytes of plain-text data.
- Your AuctionBase system must also work on the large data
set, which consists of all 40 files:
/usr/class/cs145/project/ebay_data/items-?.xml, for
? = 0..39. There are a total of about 20,000 auctions,
comprising about 38 megabytes of plain-text data.
Each XML data file is valid with respect to the XML DTD specified in
file /usr/class/cs145/project/ebay_data/items.dtd.
Your first task is to examine the DTD and the XML files to
completely understand the data you will be starting with. You will
translate this data into relations and load it into your AuctionBase
system. Please read the auction data help file in
/usr/class/cs145/project/ebay_data/items.txt. One of the most
important things to understand about the data you're starting with is
that it represents a single point in time. (Very specifically, it
represents the point in time December 20th, 2001, one second after
midnight.) It contains items that have been auctioned off in the past
and items that are "currently" up for auction. Once your AuctionBase
system is online, you will provide the capability to enter bids on
items, to "move forward" in time so that auctions close, and to add
new users and auctions.
As you develop your AuctionBase system you are free to impose any
reasonable auction rules you like -- in particular you are not
required to follow eBay's rules -- as long as they are consistent with
the provided data. (For example, if the provided data contains an
auction with two bids that are 50 cents apart, you cannot impose a bid
increment minimum of one dollar.)
Note that some of the Description elements in the XML data
are quite long. We suggest that you use the Oracle attribute type
VARCHAR2(4000) (which is the maximum length) to store
descriptions, and during data transformation simply truncate any that
will be too long to fit. If you really want to store the full
descriptions you can try using the Oracle datatypes LONG or
CLOB, but they have a number of restrictions and most likely
will cause you grief. See Limits from SQL for Web
Nerds for a discussion of the problems associated with storing
long strings in Oracle.
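As a concrete illustration of the suggested truncation, here is a minimal sketch of a helper your transformation program might use. The class and method names are purely illustrative and are not part of the provided skeleton:

```java
/** Illustrative helper (not part of the provided skeleton): truncate a
    description so it fits in an Oracle VARCHAR2(4000) column. */
public class DescriptionUtil {
    static final int MAX_DESC_LEN = 4000;

    public static String truncateDescription(String desc) {
        if (desc == null) return "";
        // Keep only the first MAX_DESC_LEN characters of overlong text.
        return desc.length() <= MAX_DESC_LEN
            ? desc
            : desc.substring(0, MAX_DESC_LEN);
    }
}
```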
Part B: Design your relational schema
Now that you understand the data you'll be working with, design a good
relational schema for it.
- List your relations. Please specify all keys that hold on each
relation. You need not specify attribute types at this stage.
- List all completely nontrivial functional dependencies that
hold on each relation, excluding those that effectively specify keys.
- Are all of your relations in Boyce-Codd Normal Form (BCNF)? If
not, either redesign them and start over, or explain why you feel it
is advantageous to use non-BCNF relations.
- List all nontrivial multivalued dependencies that hold on each
relation, excluding those that are also functional dependencies.
- Are all of your relations in Fourth Normal Form (4NF)? If not,
either redesign them and start over, or explain why you feel it is
advantageous to use non-4NF relations.
Document these five steps in a text file for submission; see What to
submit below.
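To make the BCNF check concrete, here is a generic worked example using an abstract relation, deliberately not drawn from the auction data:

```
R(A, B, C)    with FDs:  A -> B,  B -> C     (so the only key of R is A)

B -> C violates BCNF, because B is not a superkey of R.
Decompose on the violating FD:
    R1(B, C)   key: B
    R2(A, B)   key: A
Both R1 and R2 are in BCNF.
```

Your own analysis should apply the same test to each relation in your schema, using the dependencies that actually hold on the auction data.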
Part C: Write a data transformation program
Now write a program that transforms the XML data into Oracle load
files that are consistent with the relational schema you settled on in
Part B. We are providing a parser skeleton in Java. In this skeleton
all of the "parsing" is done for you by invoking JDK 1.4's built-in
XML parser. You need to fill in code that processes the internal
representation of the XML tree and produces Oracle files according to
the relational schema you designed in Part B. Detailed information is
provided as comments in the skeleton code we are providing:
/usr/class/cs145/project/pa1/MyParserSkeleton.java
Please use a Makefile for compiling your transformation program(s). A
sample Makefile is provided at:
/usr/class/cs145/project/pa1/Makefile
Most of you will simply need to replace the file name
MyParserSkeleton.java in the sample Makefile with your
parser's file name.
We strongly suggest that you fully debug your program on the
small data set before unleashing it on the large one.
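To give a flavor of the kind of processing involved, here is a minimal, self-contained sketch that walks a parsed document with the standard DOM API and collects one attribute per item. It is not the provided skeleton; the element and attribute names (Item, ItemID) reflect our reading of the auction DTD, so verify them against items.dtd. Raw (non-generic) collections are used for JDK 1.4 compatibility:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative example (not the provided skeleton): collect the
    ItemID attribute of every Item element in a parsed auction file. */
public class ItemWalker {

    static List itemIds(Document doc) {
        List ids = new ArrayList();
        NodeList items = doc.getElementsByTagName("Item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            ids.add(item.getAttribute("ItemID"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File(args[0]));
        for (Iterator it = itemIds(doc).iterator(); it.hasNext(); ) {
            // One line per item; a real program would emit full tuples here.
            System.out.println(it.next());
        }
    }
}
```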
Duplicate elimination
When transforming the XML data to relational tuples you may discover
that you generate certain tuples multiple times but only want to
retain one copy. There are different ways to eliminate duplicates,
including:
- Coding duplicate-elimination as part of your transformation
program.
- Using Unix tools (e.g., sort and uniq)
directly on the generated load files to eliminate duplicate lines.
Please do not rely on the Oracle bulk loader to eliminate
duplicates. Oracle will generate an error but continue loading when a
tuple with a duplicate value in a key attribute is encountered.
However, using this "feature" to eliminate duplicates is an
unaesthetic hack that also complicates our grading procedure.
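If you take the first approach, a set data structure makes in-program duplicate elimination straightforward. A minimal sketch (the class name and delimiter are illustrative):

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative sketch: emit each load-file line at most once.
    A LinkedHashSet drops duplicates while preserving first-seen order. */
public class DedupWriter {
    private final Set seen = new LinkedHashSet();

    /** Returns true the first time a line is offered, false on repeats;
        write the line to the load file only when this returns true. */
    public boolean offer(String loadFileLine) {
        return seen.add(loadFileLine);
    }
}
```

The second approach amounts to running a command such as `sort -u` over each generated load file.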
Using other languages and tools
You are not required to use the JDK XML parser, or to code your
transformation program in Java. You are welcome to use other
languages or tools to build your program that translates XML data
conforming to the provided DTD into Oracle load files conforming to
your relational schema. (You may not, however, perform the
transformation by hand!) Java and the JDK XML parser are the only
tools for which we can guarantee support from the course staff in
terms of system problems, knowledge, and general help. If you choose
to use alternate languages or tools, you may be on your own, and of
course you are still required to meet the project specifications.
Furthermore, the TAs must be able to compile and run all submitted
code on the leland machines without any additional packages or
runtime support, and with no additional effort over that required for
projects using the supported languages and tools.
Part D: Load your data into Oracle
The next step is to create and populate your AuctionBase database(s).
We suggest that you first debug the schema creation and loading
process interactively using sqlplus and sqlldr (see
Section D.1), as in Project
Part 0. When you are satisfied that everything is working, follow
the instructions to set up for batch loading (see Section D.3),
which allows a database to be conveniently recreated from scratch with
one command.
D.1 Creating your databases interactively
Using the sqlplus interface to Oracle, issue "create
table" commands for all of the relations in your schema. Some
suggestions:
- For efficiency we suggest that you specify a primary
key for each table that has at least one key. (See the document
Getting Started with Oracle, available through the
course web site Project
Support Materials page, for details of specifying primary keys in
table declarations.)
- For attributes representing dates/times, we suggest that you
use Oracle's built-in DATE type. Information is available in
the document Oracle Dates and Times (available through the
course web site Project
Support Materials page), and in a chapter called Dates from SQL for Web
Nerds, also linked from the course web site Project
Support Materials page.
- For attributes representing money, we suggest you use
NUMBER(8,2) to specify a numeric type with 2 digits after the
decimal point.
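Putting these suggestions together, a declaration might look like the following. The table, attribute names, and key shown here are hypothetical; yours must match the schema you designed in Part B:

```sql
create table Bid (
    UserID   VARCHAR2(255),
    ItemID   NUMBER,
    BidTime  DATE,
    Amount   NUMBER(8,2),
    primary key (UserID, ItemID, BidTime)
);
```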
Once the tables are created, load your data through sqlldr
in the same way that you loaded the Courses table in Project Part 0. Note that for long string
fields like VARCHAR2(4000) you must specify the length of the
field in your control file, as shown here for the Description
attribute:
load data [...] (ItemID, ..., Description CHAR(4000))
D.2 Maintaining your databases
You should consider two factors in the maintenance of your databases
throughout the CS145 project:
- Oracle is not backed up,
so anything you need long-term should be saved in the leland file
system.
- As you start modifying data in a database, you will
undoubtedly want the ability to get a "fresh start" easily from
your original data.
We recommend that you establish a routine of saving all data in Oracle
load files, and perhaps reloading the database each time you start
working with it. Remember that you need to delete the contents of each
table (or destroy and recreate the tables) before
reloading. Otherwise, unless you take APPEND out of your
control file (or there is a declared key), Oracle will happily append
the new data to your old table, causing your table size to double,
triple, quadruple, etc. To get rid of a table called T, issue the
following command to sqlplus:
drop table T;
If you want to get rid of all tuples in T without
deleting the table itself, issue the command:
delete from T;
D.3 Automating the process
sqlplus provides a facility for reading a set of commands
from a file and executing them in batch mode. You should use this
facility for (re)building databases and running sample queries, and
you must use it extensively in your submitted work. (See What to
submit below.) For starters, create a command file called
create.sql that includes the SQL commands for table creation
that you debugged interactively (recall Section D.1). This
file will look something like:
create table Item ( .... );
create table AuctionUser ( ... );
...
quit;
Similarly, create a second command file called drop.sql that
destroys all your tables. This file will look something like:
drop table Item;
drop table AuctionUser;
...
quit;
To automate the entire process, create a shell script called
runLoad. This script can do some or all of the following four
things. (Note that the script you submit should do step 3 only; see
What to submit below.)
- Drop any existing relevant tables in your database.
- Create all the relevant tables in your database.
- Call your parser to generate fresh load files.
- Call sqlldr to load the files into Oracle.
Here is a skeleton of what this file can look like when it includes
all four steps:
#!/bin/tcsh
source /afs/ir/class/cs145/all.env
# Run the drop.sql batch file to drop existing tables
sqlplus USERNAME/PASSWORD @drop.sql
# Run the create.sql batch file to create new tables
sqlplus USERNAME/PASSWORD @create.sql
# Run your parser to generate the appropriate load files
java MyParser ...
...
# Run sqlldr to load the data file
sqlldr USERNAME/PASSWORD ...
Note that you must replace USERNAME/PASSWORD with your actual
Oracle account information. To run this script, make sure permissions
are set properly by executing "chmod 0755 runLoad", then
simply type "./runLoad" at the shell prompt to run it.
Part E: Test your Oracle database
The final step is to take your newly loaded database for a "test
drive" by running a few SQL queries over it. As with database
creation, use sqlplus to test your queries interactively,
then set up a command file to run them in batch mode. First try some
simple queries over one relation, then more complex queries involving
joins and aggregation. Make sure the results look correct. When you
are confident that everything is correct, write SQL queries for the
following specific tasks:
- Find the number of users in the database.
- Find the number of users from "Los Angeles", i.e., users whose location contains the exact string "Los Angeles".
- Find the number of auctions belonging to exactly four categories.
- Find the ID(s) of auction(s) with the highest current price.
- Find the number of sellers whose rating is higher than 1000.
- Find the number of users who are both sellers and bidders.
- Find the number of categories that include at least one item with a bid of more than
$100.
Correctness check: Your answers to the above seven queries
over the large data set should be (in order): 13422, 118,
8365, 1046871451, 3130, 6766, 150.
Put all seven queries in a batch
command file called queries.sql. It should look like:
select ... from ... ;
select ... from ... ;
...
quit;
Make sure that the order of your queries in the file is exactly the same as
the order specified above. We will be checking correctness automatically.
Having trouble?
If you run into difficulties, before contacting the course staff
please consult the Project Part 1
FAQ Page. We will keep it up-to-date with answers to common
questions.
What to submit
Create a
submission directory containing the following files:
- Your parser source code file(s). Do not submit the binary
version of your parser.
- Your Makefile for compiling your files.
- Your control files for sqlldr. Please use file names
with a .ctl extension.
- Your batch command files create.sql,
drop.sql, and queries.sql. Do not submit the
output produced by running these files.
- The part of your runLoad script that handles running
your parser (i.e., step 3 in Section D.3 above). Put this fragment in
a file named runParser, and set it to invoke your parser
over the large data set directly from the class directory, not over your
own local copy.
Do not submit the
output from your parser.
- A text file named README.txt, containing the five items
listed in Part B above.
Reminder: Please do NOT submit the output of your parser
code and scripts, or the binary version of your parser.
Once your submission directory is properly assembled, with no
extraneous files, from your submission directory execute the script:
/afs/ir/class/cs145/bin/submit-project
You may resubmit as many times as you like; however, only the latest
submission and timestamp are saved, and those are what we will use for
grading your work and determining late penalties. Submissions via
email will not be accepted.
Important Note: Grading projects is a difficult and
time-consuming task, made much harder when students do not follow
submission instructions precisely. It is to your benefit to follow
directions and keep the grader in a good mood! Points will be
deducted if you do not follow the submission procedures exactly as
specified, including which files to include (and which not to
include), file naming, and file contents. Remember to allow
sufficient time to prepare your submission once your work is
complete.