Introduction to Databases

Assignment #3 -- Due Wednesday May 1 |

- The procedure for turning in this assignment, the late policy, and the
grading (and non-grading) of exercises and challenge problems, is
exactly the same as for
*Assignment #1*. **Suggestion:**You may want to do the written problems first, since the relational design material covered in the written work is directly applicable to the project work.

Exercises |

- Consider a relation
`R(A,B,C)`. Suppose`R`contains the following four tuples:A B C ------- 1 2 3 1 2 4 5 2 3 5 2 6

**(a)**List all completely nontrivial functional dependencies that hold on this instance of`R`.**(b)**List all nontrivial multivalued dependencies that hold on this instance of`R`. - Do Exercise 3.6.4 in the textbook (page 117).
- Consider a database for a university that may include information
about students, courses, professors, etc.
**(a)**Within this general domain, specify the schema for an example relation and a set of functional dependencies (FD's) over the relation such that the relation is not in Boyce-Codd Normal Form (BCNF). Your relation need not be extensive (i.e., it need not include many attributes and it may represent only a very small part of the complete university information), but the schema and the FD's should be realistic in their modeling of the real world. You should make no assumptions besides those captured by the schema and the FD's.**(b)**Decompose your relation from part (a) so that the new schema is in BCNF.**(c)**Now specify the schema for an example relation and a set of FD's and multivalued dependencies (MVD's) over the relation such that the relation is in BCNF but is not in Fourth Normal Form (4NF). Again, your relation need not be extensive but the schema and the dependencies should be realistic in their modeling of the real world, and you should make no assumptions besides those captured by the schema and the dependencies. The relation you give for this part of the problem can capture similar or different information from that captured in part (a).**(d)**Decompose your relation from part (c) so that the new schema is in 4NF. - Do part (c) of Exercise 3.5.4 in the textbook (page 101).
- Do part (b) of Exercise 3.7.8 in the textbook (page 127).
- Consider the following rule for functional dependencies:
IF A -> BB, CC -> D, and CC is a subset of BB THEN A -> D

where A and D are single attributes and BB and CC are sets of attributes. For simplicity you may assume that A and D are not in BB or CC. Prove the correctness of this rule. Your proof should not be based on closure or on other rules for functional dependencies, but rather it should be based on the formal definition of functional dependencies:AA -> BB = for all tuples t and u, if t[AA]=u[AA] then t[BB]=u[BB]

where AA and BB are sets of attributes.Your proof might have roughly the following form:

*Suppose A -> BB holds, CC -> D holds, and CC is a subset of BB. Then for all tuples t and u, ... [fill in] ... To prove that A -> D holds, we need to prove that for all tuples t and u, ... [fill in]. Therefore A -> D holds.* - Another way to prove the rule in Exercise 6 is to use
*closure of attributes*. Given the facts:A -> BB, CC -> D, CC is a subset of BB

compute the closure`{A}+`of the attribute set`{A}`. Does`{A}+`contain attribute`D`? If so, you have proven the rule, i.e., you have shown`A -> D`. (As an example of this type of proof, see the justification of the Transitive Rule on page 96 of the textbook.) - A third way to prove the rule in Exercise 6 is to use other
rules for functional dependencies that already have been proven.
By applying the
*splitting*,*combining*, and*transitive*rules it is possible to infer`A -> D`from the three facts:A -> BB, CC -> D, CC is a subset of BB

Show the steps in the inference.

Challenge Problems |

- (This one is relatively straightforward for a challenge
problem, but more time-consuming than a typical exercise.)

Consider the following rule for multivalued dependencies:IF AA ->> BB and AA ->> CC THEN AA ->> (BB minus CC)

where AA, BB, and CC are sets of attributes, and`minus`performs set difference. For simplicity you may assume that AA does not intersect BB or CC. As in Exercise 6, your proof should be based on the formal definition of (in this case) multivalued dependencies, and not on other rules for multivalued dependencies.Your proof might have roughly the following form:

*Suppose AA ->> BB and AA ->> CC hold. Then for all tuples t and u there exists a tuple v such that ... [fill in] ... Let DD = BB minus CC. To prove that AA ->> DD holds, we need to prove that for all tuples t and u there exists a tuple v such that ... [fill in]. Therefore AA ->> DD holds.* - Consider a relation
`R(A1, A2, ..., An)`where`n`> 1. Give a general algorithm for constructing a relation instance that satisfies the functional dependencies:A1 -> A2 A2 -> A3 A3 -> A4 ... A(n-1) -> An

and all functional dependencies that follow from these, but does not satisfy any other functional dependencies. Even better, give an algorithm that finds a minimal such instance. - Consider the Boyce-Codd Normal Form (BCNF) decomposition
algorithm specified in class and discussed in the textbook. We start
with a relation
`R`and a set`F`of functional dependencies (FDs) for`R`. We decompose`R`into`R1`and`R2`based on a dependency in`F`that violates BCNF, and then we compute the sets of FDs`F1`for`R1`and`F2`for`R2`so we can continue with the algorithm. Suppose that when we compute the FDs for`R1`and`R2`, instead of finding all FDs that follow from our original set`F`and that apply to`R1`and`R2`(i.e., that include only attributes from`R1`or`R2`), we simply use the original set`F`of FDs specified for`R`and take those that apply to`R1`and`R2`. Can we get a decomposition that is not in BCNF? If not, argue why the algorithms still works. If so, illustrate the problem: give a relation schema`R`and a set of FDs for`R`, and trace how the BCNF decomposition algorithm produces a final set of relations that is not in BCNF.

Project Part 3 |

- As a
*small data set*for initial experimentation and debugging we suggest you use just one file:`/usr/class/cs145/ebay_data/items-0.xml`(also http://www.stanford.edu/class/cs145/ebay_data/items-0.xml). It contains 500 auctions, comprising about 900K bytes of ASCII data. - Your AuctionBase system also must work on the
*large data set*, which consists of all 40 files:`/usr/class/cs145/ebay_data/items-?.xml`, for`?`= 0..39. There are a total of about 20,000 auctions, comprising about 38 megabytes of ASCII data.

Your first task is to examine the DTD and the XML files to
completely understand the data you will be starting with. You will
translate this data into relations and load it into your AuctionBase
system. Please read the auction data help file in
`/usr/class/cs145/ebay_data/items.txt` (also http://www.stanford.edu/class/cs145/ebay_data/items.txt).
One of the most important things to understand about the data you're
starting with is that it represents a single point in time. (Very
specifically, it represents the point in time December 20th, 2001, one
second after midnight.) It contains items that have been auctioned
off in the past and items that are "currently" up for auction. Once
your AuctionBase system is online, you will provide the capability to
enter bids on items, to "move forward" in time so that auctions close,
and to add new users and auctions.

As you develop your AuctionBase system you are free to impose any reasonable auction rules you like -- in particular you are not required to follow eBay's rules -- as long as they are consistent with the provided data. (For example, if the provided data contains an auction with two bids that are 50 cents apart, you cannot impose a bid increment minimum of one dollar.)

Note that some of the `Description` elements in the XML data
are quite long. We suggest that you use the Oracle attribute type
`VARCHAR2(4000)` (which is the maximum length) to store
descriptions, and simply truncate any that are too long to fit. If
you really want to store the full descriptions you can try using the
Oracle datatypes `LONG` or `CLOB`, but they have a
number of restrictions and most likely will cause you grief. See Limits from
SQL for
Web Nerds for a discussion of the problems associated with storing
long strings in Oracle.

- List your relations. Please specify all keys that hold on each relation. You need not specify attribute types at this stage.
- List all nontrivial functional dependencies that hold on each relation, excluding those that effectively specify keys.
- Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-BCNF relations.
- List all nontrivial multivalued dependencies that hold on each relation, excluding those that are also functional dependencies.
- Are all of your relations in Fourth Normal Form (4NF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-4NF relations.

- The C++ skeleton program is in files
`/usr/class/cs145/hw3/MyParserSkeleton.cc`and`/usr/class/cs145/hw3/MyParserSkeleton.h`. - The Java skeleton program is in file
`/usr/class/cs145/hw3/MyParserSkeleton.java`.

Naturally we suggest that you fully debug your program on the small data set before unleashing it on the large one.

**Note on duplicate elimination**: When transforming the XML
data to relational tuples you may discover that you generate certain
tuples multiple times but only want to retain one copy. There are
several ways to eliminate duplicates, including:

- Coding duplicate-elimination as part of your transformation program.
- Using Unix tools (e.g.,
`sort`and`uniq`) directly on the generated load files to eliminate duplicate lines. We recommend this method as the easiest one, but the choice is yours. - Relying on the Oracle bulk loader, which will generate an error
but continue loading when a tuple with a duplicate value in a key
attribute is encountered, without loading the erroneous tuple. (It's
a hack but it works!) If you use this approach you may need
to increase your
`sqlldr``errors`setting or your load may not complete, and you should make sure to look out for other, unintentional, errors.

- For efficiency we suggest that you specify a
`PRIMARY KEY`for each table that has at least one key. (See the document*Getting Started with Oracle*, available through the course Web site Project page, for details of specifying primary keys in table declarations.) - For attributes representing dates/times, we suggest that you use
Oracle's built-in
`DATE`type. Information is available in the document*Oracle Dates and Times*(available through the course Web site Project page), and in a chapter called*Dates*from*SQL for Web Nerds*, also linked from the course Web site Project page. - For attributes representing money, we suggest you use
`NUMBER(8,2)`to specify a numeric type with 2 digits after the decimal point.

Once the tables are created, load your data through
`sqlplus` in the same way that you loaded tables
`Courses` and `Grades` in parts 1 and 2 of the project.
Note that for long string fields like `VARCHAR2(4000)` you
must specify the length of the field in your control file, as shown
here for the `Description` attribute:

LOAD DATA [...] (ItemID, ..., Description CHAR(4000))

At this point you may want to revisit the **Note on maintaining
your databases** included in *Assignment
#2*.

Create a submission directory containing the following files:

- Your parser code files (C++, Java, Perl, etc.). Please name the
main program file
`MyParser.xxx`. For example, if you used our Java skeleton, you should submit`MyParser.java`, and if you used our C++ skeleton, you should submit`MyParser.cc`and`MyParser.h`. - A text file named
`README`, containing the following items, clearly labeled (a) through (e):**(a)**Any special compilation or running instructions for your program and for loading your data into Oracle (e.g., if you did not eliminate duplicates inside your transformation program, specify the procedure used to eliminate duplicates)**(b)**The 5 items listed in Part B of the assignment description**(c)**The`CREATE TABLE`commands and text of your`sqlldr`control files**(d)**A sample few lines from each of the Oracle load files produced by your transformation program, suitably labeled so we can match them against your relational schema.*Do not submit the entire load files!***(e)**A transcript showing your table creation commands in Oracle, the successful loading of your complete database on the large data set, and the successful execution of 2-3 sample queries over the large data set demonstrating that the data set is correct.*If your sample queries generate large results, please include only the first few tuples of each result in your submission.*