Thesis Summary of Chen-Chuan Kevin Chang

Summary of Thesis Research

Thesis: Query and Data Mapping Across Heterogeneous Information Sources

An online demo of our "approximate query translation" system is now available!

Users access information sources by asking a query to retrieve the desired data. For instance, a query [fname = Tom] AND [lname = Hanks] for an online media store (to find the movies by Tom Hanks) may return data (title: "Apollo 13", price: $30, shipping: $5). Integrating heterogeneous sources is difficult because of their non-uniform query languages and data representations. For instance, to search source amazon.com, we must translate the example query as [actors contains "Tom Hanks"]. To uniformly present data from different sources, we may also need to map the result as (title: "Apollo 13", cost: $35). This mapping must transform differences in attributes (e.g., lname and fname to actors), values (e.g. 30 plus 5 to 35), and operators (e.g., "=" to "contains").

My thesis work focuses on mapping queries and data across disparate contexts. Such translation is critical especially since the Internet has brought together information sources worldwide. In particular, the mapping is essential for many important applications such as meta-searching (querying many sources by a mediator), e-commerce (e.g., comparison shopping), and web mining (by querying sources and analyzing data).

Exact Query Translation: First, I have developed a query translation mechanism for exact query processing [7]. The mechanism will find the minimal-superset mappings that do not miss any answers (i.e., no false-negatives) and that incur as few extra ones as possible (i.e., minimal false-positives). To obtain the exact query results, I also designed the derivation of filter queries for removing false-positives [5]. The algorithms guarantee that the translated queries minimally subsume the original ones and that the filter queries are of minimal cost. Furthermore, since the translation machinery relies on separately-supplied rules for rewriting basic query constraints, I also developed algorithms for rewriting IR predicates commonly used for document retrieval [2], and evaluated the post-filtering cost for such rewritings [11].

Data Translation: Second, I have adopted this query mapping framework for data translation by developing the modeling of data as a set of conjunctive constraints [8]. The machinery can deal with flat data (i.e., attribute-value pairs as shown above) as well as hierarchically structured information such as XML.

Approximate Query Translation: Third, I have further developed general approximate query translation that finds the closet mappings under virtually any closeness criteria, such as minimal-superset, maximal-subset, or some hybrid scheme that combines both precisionand recall [6]. I defined a customizable notion of closeness and designed a general translation machinery. As the basis, I have studied fundamental theorems on the constraint dependencies and the separabilityacross query conjuncts and disjuncts. The results are essential for any algorithm that attempts approximate query translation. see our online demo.

Last update: Mar. 20 2000

Chen-Chuan K. Chang, kevin@db.stanford.edu