DataMotion Report May 2006

DataMotion Report June 2008

--------------------------

During our fifth year, we have continued our DataMotion research,

on a variety of information management topics.

** Access to Uncertain Data

Trio is a system for managing data, uncertainty, and lineage. It is developed on top of a conventional DBMS. Uncertain data with lineage is encoded in relational tables, and Trio queries are translated to SQL queries on the encoding. Such a layered approach reaps significant benefits in terms of architectural simplicity, and the ability to use an off-the-shelf query processing engine. We developed special-purpose indexes and statistics that complement the layered approach to further enhance Trio’s performance. First, we identified a well-defined structure of Trio queries, relations, and their encoding that can be exploited by the underlying query optimizer to improve the performance using Trio's layered approach. We studied several mechanisms for indexing Trio's uncertain relations and studied when these indexes were useful. We then developed an interesting order, and an associated operator, which are especially useful to consider when composing query plans. The decision of which query plan to use for a Trio query is dictated by various statistical properties of the input data. We have identified the statistical data that can guide the underlying optimizer, and design histograms that enable estimating the statistics accurately.

Das Sarma, Anish; Agrawal, Parag; Nabar, Shubha; Widom, Jennifer.

Title Towards Special-Purpose Indexes and Statistics for Uncertain Data.

Proceedings of the 2008 Workshop on Management of Uncertain Data, Auckland, New Zealand, August 2008.

Available at http://dbpubs.stanford.edu/pub/2008-16.

** Confidence Computations

We studied the problem of computing query results with confidence values in ULDBs: relational databases with uncertainty and lineage. ULDBs, which subsume probabilistic databases, offer an alternative decoupled method of computing confidence values: Instead of computing confidences during query processing, compute them afterwards based on lineage. This approach enables a wider space of query plans, and it permits selective computations when not all confidence values are needed. We have developed a suite of algorithms and optimizations for a broad class of relational queries on ULDBs. In particular, we obtained confidence computation algorithms for single data items, as well as efficient batch algorithms to compute confidences for an entire relation or database. All algorithms incorporate memoization to avoid redundant computations, and they have been implemented in the Trio prototype ULDB database system. Performance characteristics and scalability of the algorithms have been evaluated through experiments over a large synthetic dataset.

Das Sarma, Anish; Theobald, Martin; Widom, Jennifer.

Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases.

Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, April 2008.

Available at http://dbpubs.stanford.edu/pub/2007-15.

** Theory of ULDBs

As discussed above, ULDBs are an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation, however many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We have been able to show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We have defined two notions of ULDB minimality, data-minimal and lineage-minimal, and have studied minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We developed an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. We also shown how ULDBs enable a new approach to query processing in probabilistic databases.

Benjelloun, Omar; Das Sarma, Anish; Halevy, Alon; Theobald, Martin; Widom, Jennifer.

Databases with Uncertainty and Lineage.

VLDB Journal, 17(2):243-264, March 2008.

Available at http://dbpubs.stanford.edu/pub/2007-26.

** Aggregation in Trio

“Exact” aggregation in uncertain databases can produce exponentially-sized results, so we have developed three alternatives: a low bound on the aggregate value, a high bound on the value, and the expected value. These variants return a single result instead of a set of possible results, and they are generally very efficient to compute for both full-table and grouped aggregation queries. We have formal obtained definitions and semantics for aggregation, and have implemented our algorithms in the Trio system.

http://dbpubs.stanford.edu/pub/2007-7

Submitted on 1st of June 2007

Murthy, Raghotham; Widom, Jennifer.

Title Making Aggregation Work in Uncertain and Probabilistic Databases.

Proceedings of the 2007 Workshop on Management of Uncertain Data, pages 76-90, Vienna, Austria, September 2007.

Available at http://dbpubs.stanford.edu/pub/2007-7.

** Query Rewriting for Sponsored Search

We have explored a new problem area for our project, that of query rewriting for sponsored search. We base rewrites on a historical click graph that records the ads that have been clicked on in response to past user queries. Given a query q, we first considered Simrank as a way to identify queries similar to q, i.e., queries whose ads a user may be interested in. We discovered that Simrank fails to properly identify query similarities in our application, and we thus developed two enhanced versions of Simrank: one that exploits weights on click graph edges and another that exploits "evidence." We experimentally evaluated our new schemes against Simrank, using actual click graphs and queries form Yahoo!, and using a variety of metrics. Our results showed that the enhanced methods can yield more and better query rewrites.

Antonellis, Ioannis; Garcia-Molina, Hector; Chang, Chi-Chao.

Simrank++: Query Rewriting through Link Analysis of the Click Graph.

17th International World Wide Web Conference (WWW 2008) - Beijing, China 21-25 April, 2008.

Available at http://dbpubs.stanford.edu/pub/2008-3.

** Integration of Data Hierarchies

One of the main challenges in integrating two hierarchies is determining the correspondence between the edges of each hierarchy. Traditionally, this process, which we call hierarchy matching, is done by comparing the text associated with each edge. We have instead used the placement of objects present in both hierarchies to infer how the hierarchies relate. We developed two algorithms that, given a hierarchy with known facets (label-value pairs that define what objects are placed under an edge), determine feasible facets for a second hierarchy, based on shared objects. One algorithm is rule-based and the other is statistics- based. We have also experimentally evaluated both algorithms, and explored how their performances varies based on the amount of noise in the hierarchies.

Ikeda, Robert; Zhao, Kai; Garcia-Molina, Hector.

Matching Hierarchies Using Shared Objects.

ECDL 2008 European Conference for Digital Libraries, 14-19 September 2008, Aarhus, Denmark.

Available at http://dbpubs.stanford.edu/pub/2008-4.