Roy Goldman
Integrated Query and Search of Databases, XML, and the Web
Thesis Available in PostScript
Thesis Defense Abstract:
Information today is decidedly split between the structured data stored in traditional databases and the huge amount of unstructured information available over the World-Wide Web. Traditional databases operate over well-structured, typed data, and languages such as SQL enable expressive queries. On the Web, millions of HTML pages are indexed by search engines, but search engines only support fairly simple keyword-based queries. Further, neither traditional database systems nor existing search engines are well-suited for managing semistructured data such as XML. We present three research contributions that serve to unify and integrate query functionality over semistructured XML data, traditional databases, and Web HTML.
First, we describe our work on Lore, a database management system we developed at Stanford for storing and querying semistructured data such as XML. Our focus within Lore has been on DataGuides: compact and accurate structural summaries of a semistructured database.
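To make the idea of a structural summary concrete, here is a minimal sketch in Python. It assumes the usual defining property of a DataGuide: every label path that occurs in the database occurs exactly once in the summary, and the summary contains no path that is absent from the database. The database encoding (a dict of labeled edges), the function name `build_dataguide`, and the sample data are all hypothetical illustrations, not Lore's actual data model or implementation.

```python
from collections import defaultdict

def build_dataguide(edges, root):
    """Summarize a labeled graph so each label path appears exactly once.

    edges: dict mapping an object id to a list of (label, child id) pairs.
    Each summary node is the set of database objects reachable by one
    label path (a subset construction, as in NFA-to-DFA conversion).
    """
    start = frozenset([root])
    guide = {}          # summary node -> {label: summary node}
    pending = [start]
    while pending:
        current = pending.pop()
        if current in guide:
            continue    # this set of objects was already expanded
        by_label = defaultdict(set)
        for obj in current:
            for label, child in edges.get(obj, []):
                by_label[label].add(child)
        guide[current] = {}
        for label, children in by_label.items():
            target = frozenset(children)
            guide[current][label] = target
            pending.append(target)
    return start, guide

# Hypothetical sample database: two restaurants, only one has a phone.
SAMPLE_EDGES = {
    1: [("restaurant", 2), ("restaurant", 3)],
    2: [("name", 4)],
    3: [("name", 5), ("phone", 6)],
}
START, GUIDE = build_dataguide(SAMPLE_EDGES, root=1)
```

In this sample the two `restaurant` edges collapse into a single summary edge, so the summary records that restaurants may have a `name` and a `phone` without repeating either path.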
Next, we discuss our work on proximity search in databases. The Web has shown that keyword search can be very effective for interactive searches: with a good ranking scheme, a user can quickly focus in on relevant data just by typing a few keywords. Unfortunately, traditional databases have no such interface--all queries must be expressed in declarative languages such as SQL, and results are never ranked by relevance. Proximity search is a traditional information retrieval technique for identifying words "close" to each other in a document. By applying this notion to measure "closeness" of data objects in a database, we make it simple to search structured or semistructured databases with keywords alone.
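As an illustration of measuring "closeness" between data objects, the sketch below treats the database as an undirected graph and ranks candidate objects by their shortest-path distance to any object that matches a keyword. The graph encoding, the names `find_set` and `near_set`, and the scoring rule (minimum hop count) are assumptions made for this example; the thesis may define distance and ranking differently.

```python
from collections import deque

def build_graph(edge_list):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph

def distances_from(graph, sources):
    """Breadth-first search giving hop counts from the nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def proximity_rank(graph, find_set, near_set):
    """Order objects in find_set by closeness to any object in near_set."""
    dist = distances_from(graph, near_set)
    reachable = [obj for obj in find_set if obj in dist]
    return sorted(reachable, key=lambda obj: (dist[obj], obj))

# Hypothetical data: movies m1..m3, actors a1 and a2, and one studio.
EDGES = [("m1", "a1"), ("m2", "a2"), ("a2", "studio"), ("studio", "a1")]
GRAPH = build_graph(EDGES)
# "Find movies near the keyword match a1"; m3 is unconnected and is dropped.
RANKED_MOVIES = proximity_rank(GRAPH, ["m1", "m2", "m3"], ["a1"])
```

Here `m1` is one hop from `a1` and `m2` is three hops away, so `m1` ranks first, and the user never has to write a query beyond naming the keyword.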
Finally, we describe WSQ ("wisk"), which stands for Web-Supported Queries. With WSQ, we can efficiently leverage results from multiple Web searches to enhance SQL queries over a local relational database. WSQ relies on a novel query processing technique to overcome the bottleneck introduced by high-latency external Web searches during query execution.
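The latency problem can be illustrated with a small Python sketch. It shows one general way to hide the cost of many slow external calls, namely issuing them concurrently instead of one row at a time; it is not a description of WSQ's own query processing technique. The function `fake_web_count`, the table `FAKE_HIT_COUNTS`, and the hit counts themselves are stand-ins invented for this example, and no real search engine is contacted.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Invented hit counts standing in for a Web search engine's answers.
FAKE_HIT_COUNTS = {"Colorado": 120, "Utah": 75, "Nevada": 90}

def fake_web_count(term, latency=0.05):
    """Pretend to ask a search engine how many pages mention `term`."""
    time.sleep(latency)  # simulate a high-latency network round trip
    return FAKE_HIT_COUNTS.get(term, 0)

def rank_rows_by_web_count(rows, search=fake_web_count, max_workers=8):
    """Attach a Web hit count to each local row and sort by that count.

    All searches are submitted to a thread pool, so the total wait is
    close to one round trip rather than one round trip per row.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(search, rows))
    return sorted(zip(rows, counts), key=lambda pair: pair[1], reverse=True)

# Rows that would come from a local relational table in a real setting.
STATES = ["Colorado", "Utah", "Nevada"]
RANKED_STATES = rank_rows_by_web_count(STATES)
```

Issued sequentially, three searches would take about three times the simulated latency; submitted together, they overlap, which is the kind of saving that matters when a query triggers many Web searches.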