Building a DBMS tailored for XML exposes many interesting new
research issues. Many "standard" components of a database systemfrom
query language to query
optimization to user interfaceneed to be revisited. Further,
Lore is an effective platform for enabling completely new research
areas, such as seamlessly querying external data
sources, and searching for objects based on proximity to others. Please also see our more general
writeup,
Data Management for XML - Research Directions.
Lorel Query Language
We have developed the Lorel language specifically for querying XML or
other semistructured data. Based on OQL, Lorel provides powerful path
traversal operators and makes extensive use of type coercion to help
yield "intuitive" results for all queries over XML data. The original
language specification appears in this paper,
with XML-specific modifications described in this paper.
Indexing XML Data
While it is well understood how traditional indexing structures such
as B-trees and persistent hash tables can help performance of
relational database systems, determining how best to apply such
techniques over XML data is a challenge we are addressing in Lore.
Please see this paper.
Since queries over XML may often involve path traversal, we are also
looking into ways to bulid path indexes to improve
performance, using DataGuides.
Query Optimization
Depending on the structure of the underlying database and the
available indexes, there may be many different ways to evaluate a
single Lorel query. We have designed and implemented a complete
cost-based query optimizer that analyzes the search space to determine
effective query plans. While all of the usual problems associated
with cost-based query optimization apply to XML data as well, a number
of additional problems arise, such as vastly different query execution
strategies for different XML databases, more complicated notions of
database statistics, and novel uses of indexing. Details can be found
in this paper.
DataGuides
Unlike traditional relational or object-oriented databases, XML
requires no explicitly specified schema: data and structure are
intertwined in a single, simple format. Of course, DTDs may be
specified, and we are currently looking into integrating DTD support
into Lore. When a DTD is not available, having information about the
structure of any database is still essential for formulating
meaningful queries. Further, knowing the structure of the data can be
very useful for optimizing query processing. Hence, for each Lore
database, we maintain a DataGuide, a structural summary of
all paths in that database. In effect, we've reversed the traditional
relationship between schema and data. While relational and
object-oriented databases force data to adhere to a predefined schema,
we allow free-form data and then generate (and incrementally maintain)
a structural summary based on that data. DataGuides are featured in
our online demo of Lore, where users can interactively explore
the structure of a database and specify queries directly from the
DataGuide. Our query optimizer also uses DataGuides for storage of
statistics and as path indexes to improve query processing
performance. DataGuides are summarized in this
paper, and modifications to DataGuides to
incorporate XML's ordering are discussed in this
paper.
Managing External Data
Lore can integrate information dynamically fetched from one or more
external data sources. The mechanism that provides this functionality
is called the external data manager and is a component
implemented within the Lore system. Lore's external data manager
enables dynamic retrieval and integration of data from arbitrary,
heterogeneous external sources during query processing. The
distinction between Lore-resident and external data is invisible to
the user. To limit the amount of data fetched from the external
source we introduce a flexible notion of arguments. We have
also incorporated optimizations to reduce the number of calls to an
external source. The details can be found in this paper.
Proximity Search
While query languages such as Lorel (or even SQL in the relational
world) are vital for exploiting known data relationships, users are
often interested in finding data based on "fuzzy" relationships.
Search engines on the Web are one good example: someone interested in
finding a movie starring Nicolas Cage and John Travolta may simply
type "movie cage travolta." Engines return pages containing those
words, and the chances are strong that the most relevant documents
will have the words near each other. Ranking strategies are vital for
organizing the results returned by the engine, since there is no easy
way to precisely determine whether a returned page is "correct" or
"incorrect." Unfortunately, for a movie database with structured or
semistructured data, neither SQL (nor Lorel) have the machinery to
handle a query of the form "movie cage travolta." Since any database
can be viewed as a collection of interlinked objects, we are not
searching for individual objects containing all of these keywords;
rather we somehow want to rank objects based on their "proximity" to
other objects containing these terms. This concept is not supported by
traditional query languages.
We have developed technology that can rank database objects based on
their proximity to other objects, where we measure proximity based on
distances in the graph linking the objects together. While the
proximity engine is general enough for many applications, we currently
use it to support proximity keyword search in Lore, as shown in the demo.
Our work in proximity search is summarized in this paper,
and modifications to incorporate XML's ordering are discussed in this
paper.
Views
Defining a view over an XML database introduces many new problems. We
have created a view specification language and considered the problem
of answering queries posed over views. In this paper
we outline the two main techniques for supporting views: query
rewriting and view materialization. An in-depth analysis of
incremental maintenance of materialized views can be found in this
paper. The current Lore system includes a materialized views
facility, but at this time does not maintain views incrementally.
Ozone
While we have built the Lore prototype from scratch for XML and
semistructured data, we are also developing a system called
Ozone on top of an object-oriented database system that
allows structured (ODMG) and semistructured (XML) data to coexist in
the same database, with references back and forth. The goal is to
provide one OQL-based query language for all of the data. Details can
be found in this paper.
|