CS145 Lecture Notes (2) -- Data: Relations and XML
Terminology reminder
- Data model: general conceptual way of structuring data
- Schema: structure of a particular database under a certain data model
- Instance: actual data conforming to a schema
Relational Model versus XML: Fundamental Differences
- Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
- Relations: Rigid flat table structure
XML: Flexible hierarchical structure (also graphs)
- Relations: Easy to understand, simple query languages
XML: Can be harder to understand, more complex query languages
Example Applications
- University records: students, courses, grades, etc.
Relations or XML? Why?
- University Web site: news, academics, admissions, events, research, etc.
Relations or XML? Why?
- Family tree.
Relations or XML? Why?
Application domains we use for example schemas and instances:
- Textbook: Movies
- Lectures: Booksellers, UC admissions
- Query assignments: Pizza-eaters, Movies
- Project: Online auction
The Relational Model
- Used by most commercial Database Managements Systems (DBMSs)
- Very simple model
- Enables simple, clean, declarative query languages
Database = set of relations (or tables), each with a distinct name
(Example: Student, Campus)
Each relation has a set of attributes (or columns), with
a distinct name within its relation.
(Example: label Student, Campus)
Each tuple (or row) in a relation has a value for each attribute.
(Example: some tuples for Student, Campus relations)
Each attribute has a type (or domain).
ID:char(9), name:char(25), GPA:float, age:integer
We'll use atomic (indivisible) types only for now.
Schema = complete description of structure of relations in
database: relation names, attribute names, types, etc.
Instance = actual contents (tuples) of relations
Specifying the schema (on paper):
- Without types:
Student(ID, name, GPA, age)
- With types:
Student(ID: char(9), name: char(25), GPA: float, age: integer)
Naming conflicts:
Student(ID, name, GPA, age)
Campus(name, enrollment) <- duplicate use of "name" okay
Apply(ID, Campus) <- "ID" okay, "Campus" not okay
Null values:
- Can use special null value in attribute of any type
- Can be problematic (e.g., for comparisons)
(Unknown age example)
Keys: A key for a relation is a set of attributes such that no two tuples
can have the same values for all of their key attributes.
- Key values identify specific tuples.
- System may build special indexes over key values.
- Other tuples may use key values as logical "pointers".
- Specify keys by underlining (so no good way to specify multiple keys on paper)
(Example: keys for Student, Campus)
Relations in SQL
- Declared as above with "
create table"; see book for details
- SQL includes ways to add/drop attributes and specify automatic default
values for attributes.
XML
- "Extensible Markup Language"
- A relatively new standard for data representation and exchange,
intended initially for the internet
- In document form it looks similar to HTML
- The official spec is enormous but the basic idea is very simple. In this class we care about the basic idea.
Like HTML, basic XML consists of three things:
- Tagged elements, which may be nested within one another
- Attributes on elements
- Text
In HTML, tags denote formatting: <Title>, <I>, <Table>, etc.
In XML, tags denote meaning of data: <Student>, <Book_Title>, etc.
XML data can be formatted using CSS (Cascading Style Sheets)
or XSL (Extensible Stylesheet Language) to translate XML to
HTML.
Well-formed XML
A well-formed XML document is any XML document that follows the basic
rules: single root element, matched tags, unique attribute names, etc.
Valid XML
It's possible to constrain the structure of XML data, using either a
Document Type Descriptor (DTD), or an XML Schema
Definition (XSD).
- A DTD is a grammar that describes the legal attributes of tagged
elements and the legal ordering and nesting of the elements.
- XML Schema is a more expressive language, including types,
keys, pointers, and (lots of!) other features
[[On-Line XML Data and Validation]]
Question: What are the benefits of using a DTD/XSD?
Question: What are the benefits of not using a DTD/XSD?
Notes on (#P)CDATA in DTDs
The details are messy (recall comment about official spec), but
overall:
-
#PCDATA is only for element content, CDATA is
only for attribute types, you cannot use them interchangeably, and
there are no keywords PCDATA or #CDATA. Both
CDATA and #PCDATA essentially specify text where the
special characters "<", ">", and "&" have to be escaped as
"<", ">" and "&" respectively.
- Recommendation: Use
CDATA for string-valued
attributes, use #PCDATA for elements containing text. If you
want an element to contain a mixture of text and other elements, do so
by specifying the element types along with #PCDATA in a
0-or-more list, e.g., (#PCDATA | Author | Editor)*.
XML Coverage
- Not covering: Namespaces, XLink, XPointer
- Covering later: XPath, XQuery, XSLT
Relational Model versus XML
(First three are same as earlier)
- Relations: Schema must be fixed in advance
XML: Does not require predefined, fixed schema
- Relations: Rigid flat table structure
XML: Flexible hierarchical structure; graphs
- Relations: Easy to understand, simple query languages
XML: Can be harder to understand, more complex query languages
- Relations: Ordering of data not relevant (tuple ordering or attribute ordering)
XML: Ordering forced by document or stream format, may or may not be relevant
- Relations: Transmission and sharing can be problematic
XML: Designed for easy representation and exchange
- Relations: "Native" data model for all current serious commercial DBMSs
XML: "Add-on," often implemented on top of relations