CS145 - Spring
2003
Introduction to Databases
Part 1: AuctionBase Schema and Data   --   Due
Thursday May 1st |
- Programming work: Programming work must be submitted
electronically by Thursday at 11:59pm. Programming work submitted
after the deadline but less than 24 hours late (i.e., by Friday
11:59pm) will be accepted but penalized 10%, and programming work
submitted more than 24 hours but less than 48 hours late (i.e., by
Saturday 11:59pm) will be penalized 30%. No programming work will be
accepted more than 48 hours late.
- For emergencies: Since emergencies do arise, each
student is allocated four "chits" for turning in CS145 project work or
challenge problems late with no penalty. Please see the Late Policy on
the Assigned
Work page for a detailed description of chits.
- Honor Code reminder: For more detailed discussion
of the Stanford Honor Code as it pertains to CS145, please see
the Assigned
Work page under Honor Code. In summary: You must
indicate on your written and programming assignments any
assistance (human or otherwise) that you received. Any
assistance received that is not given proper citation will be
considered a violation of the Honor Code. In any event, you are
responsible for understanding and being able to explain on your
own all material that you submit.
- Reminder: Projects must be completed
individually.
Overview
You will be given a large volume of data downloaded from the eBay
Web site and stored in XML files. You will examine the data and
design a good relational schema for it. You will then write a
Java (or other) program to transform the data from its XML form into
Oracle's load file format, conforming to your relational schema.
You will create your schema in your own Oracle database and load
the data.
Part A: Examine the XML data files
We are providing an
XML-encoded auction data set for you to use in your project. The
data files are located in directory
/usr/class/cs145/project/ebay_data/.
- As a small data set for initial experimentation and debugging we
suggest you use just one file: /usr/class/cs145/project/ebay_data/items-0.xml.
It contains 500 auctions, comprising about 900K bytes of ASCII data.
- Your AuctionBase system also must work on the large data set, which
consists of all 40 files: /usr/class/cs145/project/ebay_data/items-?.xml, for
? = 0..39. There are a total of about 20,000 auctions, comprising
about 38 megabytes of ASCII data.
Each XML data file is valid with
respect to the XML DTD specified in file
/usr/class/cs145/project/ebay_data/items.dtd.
Your first task is to examine the DTD and the XML files to
completely understand the data you will be starting with. You will
translate this data into relations and load it into your AuctionBase
system. Please read the auction data help file in
/usr/class/cs145/project/ebay_data/items.txt. One of the most
important things to understand about the data you're starting with is
that it represents a single point in time. (Very specifically, it
represents the point in time December 20th, 2001, one second after
midnight.) It contains items that have been auctioned off in the past
and items that are "currently" up for auction. Once your AuctionBase
system is online, you will provide the capability to enter bids on
items, to "move forward" in time so that auctions close, and to add
new users and auctions.
As you develop your AuctionBase system you
are free to impose any reasonable auction rules you like -- in
particular you are not required to follow eBay's rules -- as long as
they are consistent with the provided data. (For example, if the
provided data contains an auction with two bids that are 50 cents
apart, you cannot impose a bid increment minimum of one dollar.)
Note that some of the Description elements in the XML data
are quite long. We suggest that you use the Oracle attribute type
VARCHAR2(4000) (which is the maximum length) to store
descriptions, and during data transformation simply truncate any that
will be too long to fit. If you really want to store the full descriptions
you can try using the Oracle datatypes LONG or CLOB,
but they have a number of restrictions and most likely will cause you
grief. See Limits from SQL for Web
Nerds for a discussion of the problems associated with storing
long strings in Oracle.
Part B: Design your relational schema
Now that you understand the data
you'll be working with, design a good relational schema for it.
- List your relations. Please specify all keys that hold on each relation.
You need not specify attribute types at this stage.
- List all nontrivial functional dependencies that hold on each relation,
excluding those that effectively specify keys.
- Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either
redesign them and start over, or explain why you feel it is advantageous to
use non-BCNF relations.
- List all nontrivial multivalued dependencies that hold on each relation,
excluding those that are also functional dependencies.
- Are all of your relations in Fourth Normal Form (4NF)? If not, either
redesign them and start over, or explain why you feel it is advantageous to
use non-4NF relations.
Part C: Write a data transformation program
Now write a
program that transforms the XML data into Oracle load files that
are consistent with the relational schema you settled on in Part
B. We are providing a parser skeleton in Java. In this skeleton
all of the "parsing" is done for you by invoking JDK 1.4's
built-in XML parser. You need to fill in code that processes the
internal representation of the XML tree and produces Oracle files
according to the relational schema you designed in Part B.
Detailed information is provided as comments in the skeleton
code, which may be found here:
/usr/class/cs145/project/pa1/MyParserSkeleton.java.
Naturally we suggest that you fully debug your program on the small
data set before unleashing it on the large one.
Note on
duplicate elimination: When transforming the XML data to
relational tuples you may discover that you generate certain tuples
multiple times but only want to retain one copy. There are several
ways to eliminate duplicates, including:
- Coding
duplicate-elimination as part of your transformation program.
- Using Unix tools (e.g., sort and uniq) directly
on the generated load files to eliminate duplicate lines. We recommend
this method as the easiest one, but the choice is yours.
- Relying
on the Oracle bulk loader, which will generate an error but continue
loading when a tuple with a duplicate value in a key attribute is
encountered, without loading the erroneous tuple. (It's a hack but it
works!) If you use this approach you may need to increase your
sqlldr errors setting or your load may not complete,
and you should make sure to look out for other, unintentional,
errors.
Note on using other languages and tools: You are not
required to use the JDK XML parser, or to code your transformation
program in Java. You are welcome to use any languages or tools you
wish to build a program that translates XML data conforming to the
provided DTD into Oracle load files conforming to your relational
schema. (You may not, however, perform the transformation by hand!)
Java and the JDK XML parser are the only tools for which we can
guarantee support from the course staff in terms of system problems,
knowledge, and general help. If you choose to use alternate languages
or tools, you may be on your own, and of course you are still required
to meet the project specifications.
Part D: Load your data into Oracle
The
final step is to create and populate your AuctionBase database. Using
the sqlplus interface to Oracle, issue CREATE TABLE
commands for all of the relations in your schema. Three suggestions:
- For efficiency we suggest that you specify a PRIMARY
KEY for each table that has at least one key. (See the document
Getting Started with Oracle, available through
the course Web site Project Support Materials page, for details of
specifying primary keys in table declarations.)
- For attributes representing dates/times, we suggest that
you use Oracle's built-in DATE type. Information is
available in the document Oracle Dates and Times (available through the
course Web site Project Support Materials page), and in a chapter
called Dates from SQL for Web
Nerds, also linked from the course Web site Project Support
Materials page.
- For attributes representing money, we suggest you use NUMBER(8,2)
to specify a numeric type with 2 digits after the decimal point.
Once the tables are created, load your data through
sqlplus in the same way that you loaded the Courses
table in Project Part 0. Note that for long string fields like
VARCHAR2(4000) you must specify the length of the field in
your control file, as shown here for the Description
attribute:
LOAD DATA [...] (ItemID, ..., Description CHAR(4000))
Note on maintaining your databases
You should consider two factors in the maintenance of your
databases throughout the CS145 project: Oracle is not backed up,
so anything you need long-term should be saved in the leland file
system. As you start modifying data in your database, you will
undoubtedly want the ability to get a "fresh start" easily from
your original data. We recommend that you establish a routine of
saving all data in Oracle load files, and perhaps reloading the
database each time you start working with it. Remember to delete
the contents of each table (or destroy and recreate the tables)
before reloading. Otherwise, unless you take APPEND out
of your control file (or there is a declared key), Oracle will
happily append the new data to your old table, causing your table
size to double, triple, quadruple, etc. To get rid of a table
called T, issue the following command to sqlplus:
drop table T;
If you want to get rid of all tuples in T without
deleting the table itself, issue the command:
delete from T;
Part E: Test your Oracle database
Take your newly loaded
database for a "test drive" by running a few SQL queries over it
through sqlplus. Try some simple queries over one
relation and some more complex queries involving joins. Make sure
the results look correct. What to submit
Create a
submission directory containing the following files:
- Your parser code file(s). Please name the main program
file MyParser.java.
- A text file named README, containing the following items, clearly
labeled (a) through (e):
(a) Any special compilation or running instructions for
your program and for loading your data into Oracle (e.g., if you did
not eliminate duplicates inside your transformation program, specify
the procedure used to eliminate duplicates). If you did not use
Java, please add a section to part 2(a) of your submission
specifying clearly what tools you used for parsing and data
transformation, what your code files are named, and (roughly) what
they contain.
(b) The 5 items listed in Part B of the assignment description
(c) The CREATE TABLE commands and text of your
sqlldr control files
(d) A sample few lines from each of the Oracle load files produced
by your transformation program, suitably labeled so we can match them against
your relational schema. Do not submit the entire load files!
(e) A transcript showing your table creation commands in Oracle, the
successful loading of your complete database on the large data set, and the
successful execution of 3 sample queries over the large data set
demonstrating that the data set is correct. If your sample queries generate
large results, please include only the first few tuples of each result in your
submission.
From your submission directory, execute the command
/usr/class/cs145/bin/submit
Follow the
instructions exactly. If you must resubmit, run the command
/usr/class/cs145/bin/submit -replace
If you have
any problems, please send mail to cs145@cs.stanford.edu.
Remember that points may be deducted if you do not follow the
submission procedures exactly as specified, including file naming
and contents.