Building a DBMS tailored for XML exposes many interesting new
research issues. Many "standard" components of a database system|
Lorel Query LanguageWe have developed the Lorel language specifically for querying XML or other semistructured data. Based on OQL, Lorel provides powerful path traversal operators and makes extensive use of type coercion to help yield "intuitive" results for all queries over XML data. The original language specification appears in this paper, with XML-specific modifications described in this paper.
Indexing XML DataWhile it is well understood how traditional indexing structures such as B-trees and persistent hash tables can help performance of relational database systems, determining how best to apply such techniques over XML data is a challenge we are addressing in Lore. Please see this paper. Since queries over XML may often involve path traversal, we are also looking into ways to bulid path indexes to improve performance, using DataGuides.
Query OptimizationDepending on the structure of the underlying database and the available indexes, there may be many different ways to evaluate a single Lorel query. We have designed and implemented a complete cost-based query optimizer that analyzes the search space to determine effective query plans. While all of the usual problems associated with cost-based query optimization apply to XML data as well, a number of additional problems arise, such as vastly different query execution strategies for different XML databases, more complicated notions of database statistics, and novel uses of indexing. Details can be found in this paper.
DataGuidesUnlike traditional relational or object-oriented databases, XML requires no explicitly specified schema: data and structure are intertwined in a single, simple format. Of course, DTDs may be specified, and we are currently looking into integrating DTD support into Lore. When a DTD is not available, having information about the structure of any database is still essential for formulating meaningful queries. Further, knowing the structure of the data can be very useful for optimizing query processing. Hence, for each Lore database, we maintain a DataGuide, a structural summary of all paths in that database. In effect, we've reversed the traditional relationship between schema and data. While relational and object-oriented databases force data to adhere to a predefined schema, we allow free-form data and then generate (and incrementally maintain) a structural summary based on that data. DataGuides are featured in our online demo of Lore, where users can interactively explore the structure of a database and specify queries directly from the DataGuide. Our query optimizer also uses DataGuides for storage of statistics and as path indexes to improve query processing performance. DataGuides are summarized in this paper, and modifications to DataGuides to incorporate XML's ordering are discussed in this paper.
Managing External DataLore can integrate information dynamically fetched from one or more external data sources. The mechanism that provides this functionality is called the external data manager and is a component implemented within the Lore system. Lore's external data manager enables dynamic retrieval and integration of data from arbitrary, heterogeneous external sources during query processing. The distinction between Lore-resident and external data is invisible to the user. To limit the amount of data fetched from the external source we introduce a flexible notion of arguments. We have also incorporated optimizations to reduce the number of calls to an external source. The details can be found in this paper.
Proximity SearchWhile query languages such as Lorel (or even SQL in the relational world) are vital for exploiting known data relationships, users are often interested in finding data based on "fuzzy" relationships. Search engines on the Web are one good example: someone interested in finding a movie starring Nicolas Cage and John Travolta may simply type "movie cage travolta." Engines return pages containing those words, and the chances are strong that the most relevant documents will have the words near each other. Ranking strategies are vital for organizing the results returned by the engine, since there is no easy way to precisely determine whether a returned page is "correct" or "incorrect." Unfortunately, for a movie database with structured or semistructured data, neither SQL (nor Lorel) have the machinery to handle a query of the form "movie cage travolta." Since any database can be viewed as a collection of interlinked objects, we are not searching for individual objects containing all of these keywords; rather we somehow want to rank objects based on their "proximity" to other objects containing these terms. This concept is not supported by traditional query languages. We have developed technology that can rank database objects based on their proximity to other objects, where we measure proximity based on distances in the graph linking the objects together. While the proximity engine is general enough for many applications, we currently use it to support proximity keyword search in Lore, as shown in the demo. Our work in proximity search is summarized in this paper, and modifications to incorporate XML's ordering are discussed in this paper.
ViewsDefining a view over an XML database introduces many new problems. We have created a view specification language and considered the problem of answering queries posed over views. In this paper we outline the two main techniques for supporting views: query rewriting and view materialization. An in-depth analysis of incremental maintenance of materialized views can be found in this paper. The current Lore system includes a materialized views facility, but at this time does not maintain views incrementally.
OzoneWhile we have built the Lore prototype from scratch for XML and semistructured data, we are also developing a system called Ozone on top of an object-oriented database system that allows structured (ODMG) and semistructured (XML) data to coexist in the same database, with references back and forth. The goal is to provide one OQL-based query language for all of the data. Details can be found in this paper.