Data Management for XML

Research Directions

Jennifer Widom, Stanford University

Created April 28 1999; last updated June 17 1999. A postscript version of this document as of July 1999 appears in IEEE Data Engineering Bulletin, Special Issue on XML, 22(3):44-52, September 1999.
  • The XML Revolution
  • Commercial Perspective (A Disclaimer)
  • Database Research Opportunities
  • The Lore Project at Stanford
  • Personal Research Agenda
  • More Information and Related Work
  • Acknowledgements

    The XML Revolution

    XML -- the eXtensible Markup Language -- has recently emerged as a new standard for data representation and exchange on the Internet. The basic ideas underlying XML are very simple: tags on data elements identify the meaning of the data, rather than, e.g., specifying how the data should be formatted (as in HTML), and relationships between data elements are provided via simple nesting and references. Yet the potential impact is significant: Web servers and applications encoding their data in XML can quickly make their information available in a simple and usable format, and such information providers can interoperate easily. Information content is separated from information rendering, making it easy to provide multiple views of the same data. (XML data files can be rendered via specifications in XSL, the eXtensible Stylesheet Language.) Laborious, error-prone, and unmaintainable "screen-scraping" as a method for extracting useful data from HTML Web pages is greatly reduced, since XML is designed for data representation -- XML is simple, easily parsed, and self-describing. As an example, consider the following HTML fragment (extracted from my own publications Web page without modification), describing two publications:

       <UL>
       <LI>
       R. Goldman, J. McHugh, and J. Widom.
       <A href="ftp://db.stanford.edu/pub/papers/xml.ps">
       From Semistructured Data to XML: Migrating the Lore Data Model
       and Query Language
       </A>.
       Proceedings of the 2nd International Workshop on the Web and Databases
       (WebDB '99), pages 25-30, Philadelphia, Pennsylvania, June 1999.
       <LI>
       T. Lahiri, S. Abiteboul, and J. Widom.
       <A href="ftp://db.stanford.edu/pub/papers/ozone.ps">
       Ozone: Integrating Structured and Semistructured Data
       </A>.
       Technical Report, Stanford Database Group, October 1998.
       </UL>
    

    One way of encoding the same information in XML is:

       <Publication URL="ftp://db.stanford.edu/pub/papers/xml.ps" Authors="RG JM JW">
         <Title>From Semistructured Data to XML: Migrating the Lore Data Model
            and Query Language</Title>
         <Published>Proceedings of the 2nd International Workshop on the Web
            and Databases (WebDB '99)</Published>
         <Pages>25-30</Pages>
         <Location>
           <City>Philadelphia</City>
           <State>Pennsylvania</State>
         </Location>
         <Date>
           <Month>June</Month>
           <Year>1999</Year>
         </Date>
       </Publication>
       <Publication URL="ftp://db.stanford.edu/pub/papers/ozone.ps" Authors="TL SA JW">
         <Title>Ozone: Integrating Structured and Semistructured Data</Title>
         <Published>Technical Report</Published>
         <Institution>Stanford University Database Group</Institution>
         <Date>
           <Month>October</Month>
           <Year>1998</Year>
         </Date>
       </Publication>
       <Author ID="SA">S. Abiteboul</Author>
       <Author ID="RG">R. Goldman</Author>
       <Author ID="TL">T. Lahiri</Author>
       <Author ID="JM">J. McHugh</Author>
       <Author ID="JW">J. Widom</Author>
    

    Clearly the XML encoding, although more verbose, provides the information in a far more convenient and usable format from a data management perspective. Furthermore, the XML data can be transformed and rendered as desired using simple XSL specifications.

    There is great excitement in industry over XML. True believers think XML will radically change the face and uses of the Web. Leading software vendors are committed to XML and are quickly moving towards using XML internally as well as creating XML-oriented tools and products. XML technology startups are proliferating. Commercial enterprises and scientists alike are generating their data in XML, and we expect that the amount and variety of data made available in XML form, and the tools to accompany that data, will grow rapidly.


    Commercial Perspective (A Disclaimer)

    The remainder of this document is written from a true research standpoint. For better or worse, I've ignored or cast aside certain important considerations from a commercial perspective. For example:

    Database Research Opportunities

    In addition to the promise of greatly facilitating information integration on the Internet, as discussed in The XML Revolution above (see also More on Data Management for XML, by Alon Levy), another exciting promise of XML is that it "turns the Web into a database." Migrating Web information to XML is a significant first step in enabling efficient execution of ad-hoc, expressive queries over large amounts of Web data -- a core feature of traditional database management systems. Consider the current state of query processing over information on the Web:

    Most of us are familiar with Web sites that contain vast databases of useful information, but whose query and search facilities are surprisingly primitive. As a specific query scenario, consider a large collection of publication information such as that in our above examples, drawn from one or more data sources.

    Encoding information in XML is a first step to enabling expressive, database-like queries over the information, but many query processing issues still need to be addressed, as discussed in the bullets below. Furthermore, the tendency to mix traditional data elements with free text in XML, the ability to encode data ranging from fully structured to highly unstructured, and the inherent dichotomy between documents and databases, poses new challenges in combining techniques from database systems and information retrieval.

    A few sample research topics for the database community follow, with a primary focus on database-like treatment of XML (as opposed to focusing on XML-based information integration, clearly a very important topic as well). There has been preliminary work in several of these areas, while some topics are still virtually untouched.


    The Lore Project at Stanford

    The Lore project at Stanford began around 1995, with the premise of building a complete database management system for "semistructured data." We defined semistructured data as data that may be irregular or incomplete, and whose structure may change rapidly and unpredictably. Information integrated from heterogeneous sources, structured text, and "screen-scraped" HTML pages were our original motivating sources of data for Lore.

    By 1998 we had largely achieved our goal. We had built a complete, robust (as university prototypes go), multi-user database system based on a fairly traditional DBMS architecture, with a number of extra features such as dynamic structural summaries (DataGuides), keyword and proximity search, and an external data manager. Our schema-less, self-describing data model, called the Object Exchange Model (OEM), was essentially a directed labeled graph -- or equivalently, nested tagged data with references. Our query language, Lorel, was based on OQL, with modifications and extensions suitable for semistructured data. Many aspects of building a DBMS top-to-bottom needed to be revisited in the context of semistructured data. We published papers, distributed our system, and continued to work on a variety of issues.

    When XML came along, the similarity between OEM and XML was striking, and exciting. In late 1998 and early 1999 we migrated Lore to be based on a true XML-oriented data model and modified our query language accordingly. (See our short paper on the topic, which also happens to be one of the publication examples used above.) A number of subtleties were involved in both design and implementation, and the first public release of the XML version of Lore was made in May 1999.

    While some aspects of Lore could still benefit from refitting and tuning for XML, by and large we have one of the only complete database systems designed specifically for storing and querying "native" XML. We believe that Lore provides an ideal testbed for further research in this area.


    Personal Research Agenda

    As outlined in the Database Research Opportunities section above, there is a broad range of research issues to be explored in XML data management. Here I list a number of more specific topics that are on my personal research agenda. In some cases the topics are obvious follow-ons to previous work in Lore, in other cases the topics are just plain interesting.

    Storage and Indexing

    Lore uses a simple storage manager that parses XML into the basic units of elements, attributes, and text strings, and stores them using a straightforward depth-first clustering heuristic. We can build a wide variety of indexes on Lore databases. The types of indexes we support were designed for Lore's original data model, OEM, but with minor changes the indexes also work for XML. There are several further avenues to pursue in storage and indexing:

    DataGuides and DTDs

    Lore builds and dynamically maintains a DataGuide for every database, which is a summary of the current structure of the database and serves some of the functions a schema serves in a traditional DBMS. Since an XML DTD (Document Type Definition) is a set of grammar rules that restrict the form of an XML document, there is a close relationship between DataGuides and DTDs. Note that a DTD acts more as a traditional schema, since it restricts the allowable XML data, while a DataGuide infers rather than imposes structure.

    Currently we can store DTDs in Lore databases, and we can use a DTD to build an "approximate" DataGuide. There can be a significant performance advantage to building DataGuides from DTDs instead of from the database itself: in some cases, particularly in highly connected and cyclic databases, building a DataGuide can be prohibitively expensive. However, the DTD does not capture the structure of a database as accurately as a DataGuide: attributes and elements included in a DTD need not actually appear in the database, and DTDs cannot specify restrictions on the types of elements referenced by IDREF attributes. Furthermore, the lack of accuracy in DTDs inhibits other ways we use DataGuides in Lore, such as storing statistics and encoding path indexes.

    There are several avenues to pursue in the general area of DTDs, DataGuides, and their relationship, listed here in no particular order:

    Databases and Information Retrieval

    The typical paradigm for querying document collections is to use information retrieval style searches based on keyword matching and word position within documents. By contrast, the typical paradigm for querying databases is through an expressive, declarative query language (such as SQL) that relies on database structure.

    Lore implements the declarative OQL-based query language Lorel, along with a keyword and proximity search feature. Keyword search in Lore is straightforward: we find all objects (elements or attributes) whose tag, name, or value contains the specified keyword. Proximity search is more interesting. In a document collection, a typical "X near Y" search might return all documents containing both X and Y, ranked by how near X and Y are to each other within the document. There are two problems with this approach: (1) If the documents are structured, as XML documents are, then nearness based on the linear encoding of the document, rather than on the document's structure, may not work well. For example, in our sample XML data above, the year of a publication is closer in a document sense to the title of the following publication in the list than it is to its own title, and author names are nowhere near the relevant references. (2) Document divisions may be artificial: items that are near each other semantically may not even appear in the same document, and the concept of separate documents may be lost when loading XML into a database. In Lore we solve both of these problems by encoding all XML structure in the database, and by measuring proximity based on (weighted) path length in the database graph.

    We're excited about how the techniques we've developed for proximity search in Lore might be used to improve searching over XML on the Web. Still, there is more work to be done on proximity search in Lore, as well as generally integrating database and information retrieval concepts:

    Other Database Features

    Three useful features provided by most traditional database management systems are views, constraints, and triggers. We have done some initial work on materialized views in Lore, and we also have done some work on "change management" -- tracking, representing, and querying changes in semistructured databases. However, full view support for XML, including both virtual and materialized views, is a largely unexplored area. The same is true of constraints; in fact, even the notion of keys has not been introduced into XML or semistructured database work in a significant way, as far as I know. With respect to triggers (or active database capabilities in the Web/XML context in general), it is important to consider the relationship with recent developments in publish-subscribe technology.

    Mixing Semistructured and Structured Data

    A frequent and valid criticism of work in the area of semistructured data management is that all data is assumed to be semistructured in nature, i.e., (as defined earlier) the data may be irregular or incomplete, and its structure may change rapidly and unpredictably. As a result, when structure does happen to be present and stable in a semistructured database, the structure is not exploited to the advantage of the user or the system. Lore is certainly guilty on this count.

    There are two issues to address: (1) finding the structure; (2) exploiting the structure. There has been some work on issue (1), although with XML the presence of a DTD certainly alleviates this problem to a great extent. The second issue -- how to exploit structure when it is present, but not require structure for effective or efficient processing -- is largely yet to be addressed. Structure might be exploited at all levels of the system: object layout, indexing, query processing, query formulation, etc. We have done some preliminary work in this general area by "marrying" the ODMG and OEM data models (and corresponding query languages) in a system called Ozone, but clearly there is more to be explored.

    XML in/on a Traditional DBMS

    We decided to build Lore from scratch so that we could explore every nook and cranny of building a complete DBMS for semistructured data (now XML). In retrospect, we probably should have used somebody else's low-level page or object storage manager, since we haven't done much work at that level, and writing a transaction manager was a pain. More generally, it would be interesting to explore just how much of an XML-savvy DBMS actually needs to be built from scratch, and how much can be lifted from existing DBMSs. (For example, at the other extreme we have considered schemes for translating semistructured XML data to relations and translating Lorel queries to SQL accordingly.) While I expect there are some interesting research problems to explore all along the spectrum, it's also a certainty that database companies are working fast and furious to figure out how XML data and queries can fit into their systems.

    Performance Evaluation

    We have performed some preliminary evaluations of Lore's performance, particularly in the context of query optimization. However, we have had great difficulty figuring out what an appropriate benchmark for XML data should look like, in terms of the data itself (e.g., regular versus irregular, tree versus DAG versus general graph) as well as the type of queries and mix of queries and updates. So far we have not found any large, semistructured XML data sets to work with other than those we have constructed ourselves, although this dearth of interesting XML data surely will change. Meanwhile, flexible synthetic data generation for graph-structured data is a hard and interesting problem.

    More Information and Related Work

    I intentionally avoided filling this document with tempting hyperlinks. Information related to many of the topics discussed here, ranging from how to download Lore to research papers to links to commercial product sites, can be found by starting at the Lore project's home page. There's also a lot of good stuff to be found from the W3C's QL '98 Web site. I have not attempted to cover the very interesting and fast-growing body of great work from other researchers in XML and databases -- I hope I haven't stepped on anyone's toes.

    Alon Levy has created an interesting follow-up document to this one, More on Data Management for XML, covering two important topics omitted here: the role of XML in data integration, and new measures of complexity and scalability for XML query processing.


    Acknowledgements

    Many people have had a direct influence on my thinking and work on XML, including but not limited to (in alphabetical order): Serge Abiteboul, Adam Bosworth, Roy Goldman, Jason McHugh, Michael Rys, and Frank Tompa. For comments on this document, thanks go (so far) to: Roy Goldman, Alon Levy, Jason McHugh, and Dan Suciu.

    Funding for my research in semistructured data has come in the past from DARPA and the Air Force Rome Laboratories, and currently is supported by NASA and by the National Science Foundation under grant IIS-9811947.


    widom@db.stanford.edu