CS145 Lecture Notes (2) -- Data: Relations and XML

Terminology reminder

Data model: general conceptual way of structuring data
Schema: structure of a particular database under a certain data model
Instance: actual data conforming to a schema

Relational Model versus XML: Fundamental Differences

Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
Relations: Rigid flat table structure
XML: Flexible hierarchical structure (also arbitrary graphs)
Relations: Easy to understand, simple query language
XML: Can be harder to understand, more complex query language (if any)

Example Applications

University records: students, courses, grades, etc.
Relations or XML? Why?
University Web site: news, academics, admissions, events, research, etc.
Relations or XML? Why?
Family tree.
Relations or XML? Why?

Application domain we use for example schemas and instances:

Textbook: Movies
Lectures: Booksellers, UC admissions
Project: Online auction

The Relational Model

Used by most commercial Database Managements Systems (DBMSs)
Very simple model
Enables simple, clean, declarative query languages

Database = set of relations (or tables), each with a distinct name

(Example: Student, Campus)

Each relation has a set of attributes (or columns), with a distinct name within its relation.

(Example: label Student, Campus)

Each tuple (or row) in a relation has a value for each attribute.

(Example: some tuples for Student, Campus relations)

Each attribute has a type (or domain).

  ID:char(9), name:char(25), GPA:float, age:integer

We'll use atomic (indivisible) types only for now.

schema = complete description of structure of relations in database: relation names, attribute names, types, etc.
instance = actual contents (tuples) of relations

Specifying the schema (on paper):

Without types:
Student(ID, name, GPA, age)
With types:
Student(ID: char(9), name: char(25), GPA: float, age: integer)

Naming conflicts:

    Student(ID, name, GPA, age)
    Campus(name, enrollment)      <- duplicate use of "name" okay
    Apply(ID, Campus)             <- "ID" okay, "Campus" not okay

Null values:

Can use special null value in attribute of any type
Can be problematic (e.g., for comparisons)

(Unknown age example)

Keys: A key for a relation is a set of attributes such that no two tuples can have the same values for all of their key attributes.

Key values identify specific tuples.
System may build special indexes over key values.
Other tuples may use key values as logical "pointers".
Specify keys by underlining (so no good way to specify multiple keys on paper)

(Example: keys for Student, Campus)

Relations in SQL

Declared as above with "create table", see book and Oracle help documents for details
SQL includes ways to add/drop attributes and specify automatic default values for attributes.

XML

If you're interested in reading about the XML standard in detail, visit:

   http://www.w3.org/TR/2006/REC-xml-20060816/

It will keep you busy for days. Less detailed references are linked from the course schedule and are required reading.

XML = Extensible Markup Language

A relatively new standard for data representation and exchange, intended initially for the internet
A document format: a superset of HTML, a subset of SGML (roughly)
Some say that XML is to data what Java is to programming.
Take note:
- XML can be clunky.
- The official specification is enormous, but the basic idea is very simple. In this class we care about the basic idea.

Like SGML and HTML, basic XML consists of three things:

Tagged elements, which may be nested within one another
Attributes on elements
Text

In HTML, tags denote formatting: <Title>, <I>, <Table>, etc.

In XML, tags denote meaning of data: <Student>, <Book_Title>, etc.

XML data can be formatted using CSS (Cascading Style Sheets) or XSL (Extensible Stylesheet Language) to translate XML to HTML.

Well-formed XML

A well-formed XML document is any XML document that follows the basic rules: single root element, matched tags, unique attribute names, etc.

Example: bookstore data

   <?xml version="1.0" standalone="yes"?>
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd">
         <Title>A First Course in Database Systems</Title>
         <Authors>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75">
         <Title>Database Systems: The Complete Book</Title>
         <Authors>
            <Author>
               <First_Name>Hector</First_Name>
               <Last_Name>Garcia-Molina</Last_Name>
            </Author>
            <Author>
               <First_Name>Jeffrey</First_Name>
               <Last_Name>Ullman</Last_Name>
            </Author>
            <Author>
               <First_Name>Jennifer</First_Name>
               <Last_Name>Widom</Last_Name>
            </Author>
         </Authors>
         <Remark>
         Amazon.com says: Buy this book bundled with "A First Course,"
         it's a great deal!
         </Remark>
      </Book>
   </Bookstore>

A well-formed XML document can contain regular data (as above) or very irregular data.

Valid XML

It's possible to define a kind of schema for XML data, called a Document Type Descriptor (DTD).

A DTD is a grammar that describes the legal attributes of tagged elements and the legal ordering and nesting of the elements.

Example DTD

   <!ELEMENT Bookstore (Book | Magazine)*>
   <!ELEMENT Book (Title, Authors, Remark?)>
   <!ATTLIST Book ISBN CDATA #REQUIRED
                Price CDATA #REQUIRED
                Edition CDATA #IMPLIED>
   <!ELEMENT Magazine (Title)>
   <!ATTLIST Magazine Month CDATA #REQUIRED Year CDATA #REQUIRED> 
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Authors (Author+)>
   <!ELEMENT Remark (#PCDATA)>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>

The DTD is specified at the top of the document or in a separate file referenced at the top of the document. In both cases use standalone="no".

Question: What are the benefits of using a DTD?

Question: What are the benefits of not using a DTD?

ID and IDREF(S) Attributes

Element pointers: assign a special ID attribute to an element, then point to that element with a special IDREF or IDREFS attribute in another element.

Example: reorganized bookstore

   <?xml version="1.0" standalone="no"?>
   <!DOCTYPE Bookstore SYSTEM "bookstore.dtd">
   <Bookstore>
      <Book ISBN="ISBN-0-13-035300-0" Price="$65" Edition="2nd" Authors="JU JW">
         <Title>A First Course in Database Systems</Title>
      </Book>
      <Book ISBN="ISBN-0-13-031995-3" Price="$75" Authors="HG JU JW">
         <Title>Database Systems: The Complete Book</Title>
         <Remark>
            Amazon.com says: Buy this book bundled with
            <BookRef book="ISBN-0-13-035300-0" />,
            It's a great deal!
         </Remark>
      </Book>
      <Author Ident="HG">
         <First_Name>Hector</First_Name>
         <Last_Name>Garcia-Molina</Last_Name>
      </Author>
      <Author Ident="JU">
         <First_Name>Jeffrey</First_Name>
         <Last_Name>Ullman</Last_Name>
      </Author>
      <Author Ident="JW">
         <First_Name>Jennifer</First_Name>
         <Last_Name>Widom</Last_Name>
      </Author>
   </Bookstore>

DTD for this data:

   <!ELEMENT Bookstore (Book*, Author*)>
   <!ELEMENT Book (Title, Remark?)>
   <!ATTLIST Book ISBN ID #REQUIRED
                Price CDATA #REQUIRED
                Edition CDATA #IMPLIED
                Authors IDREFS #REQUIRED>
   <!ELEMENT Title (#PCDATA)>
   <!ELEMENT Remark (#PCDATA | BookRef)*>
   <!ELEMENT BookRef EMPTY>
   <!ATTLIST BookRef book IDREF #REQUIRED>
   <!ELEMENT Author (First_Name, Last_Name)>
   <!ATTLIST Author Ident ID #REQUIRED>
   <!ELEMENT First_Name (#PCDATA)>
   <!ELEMENT Last_Name (#PCDATA)>

Digression on (#P)CDATA

The details are messy (recall comments about clunky XML and official specs above), but overall:

#PCDATA is only for element content, CDATA is only for attribute types, you cannot use them interchangeably, and there are no keywords PCDATA or #CDATA. Both CDATA and #PCDATA essentially specify text where the special characters "<", ">", and "&" have to be escaped as "<", ">" and "&" respectively.
Recommendation: Use CDATA for string-valued attributes, use #PCDATA for elements containing text. If you want an element to contain a mixture of text and other elements, do so by specifying the element types along with #PCDATA in a 0-or-more list, e.g., (#PCDATA | Author | Editor)*.

XML Coverage

Not covering: Namespaces, XLink, XPointer
Covering later: XPath, XQuery
Probably will not cover: CSS, XSLT, XML Schema

Relational Model versus XML

(First three are same as earlier)

Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
Relations: Rigid flat table structure
XML: Flexible hierarchical structure; graphs
Relations: Easy to understand, simple query language
XML: Can be harder to understand, more complex query language
Relations: Ordering of data not relevant (tuple ordering or attribute ordering)
XML: Ordering forced by document format, may or may not be relevant
Relations: Transmission and sharing can be problematic
XML: Designed for easy representation and exchange
Relations: "Native" data model for all current serious commercial DBMSs
XML: "Add-on," often implemented on top of relations