DataMotion Report June 2009
----------------------------
During our sixth year, we have continued our DataMotion research,
on a variety of information management topics.
** Continuous Uncertainty
Our Trio system supports data uncertainty and lineage. We have
extended Trio so it can manage continuous uncertainty. Data items
with uncertain possible values drawn from a continuous domain are
represented through a generic set of functions. Our approach enables
precise and efficient representation of arbitrary probability
distribution functions, along with standard distributions such as
Gaussians. We also studied how queries are processed efficiently over
this representation, without knowledge of specific distributions. For
queries that cannot be answered exactly, we can provide approximate
answers using sampling or histogram approximations, offering the user
a cost-precision trade-off. Our approach exploits Trio's lineage and
confidence features, with smooth integration into the overall data
model and system.
Continuous Uncertainty in Trio. Agrawal, Parag and Widom, Jennifer.
Proc. Workshop on Management of Uncertain Data, Lyon, France, August 2009.
Available at: http://ilpubs.stanford.edu:8090/928/
** Outerjoins with Uncertain Data
We have studied the problem of incorporating outerjoins into uncertain
databases. Outerjoin is a widely used operation in traditional
relational databases, because its result does not lose data when the
join is incomplete. With uncertain data, it seems even more likely
that non-matching tuples will occur in a join, especially if the data
is very noisy, or if the join condition is too strict. Thus, we
expect outerjoins to be very useful in uncertain databases. However,
outerjoins with uncertain data are tricky because standard
possible-worlds semantics may be inappropriate. We have explored a
variety of alternative semantics, and we have explored various
implementation considerations. Our results are presented in the paper
cited below.
Outerjoins in Uncertain Databases. Ikeda, Robert and Widom, Jennifer.
Proc. Workshop on Management of Uncertain Data, Lyon, France, August 2009.
Available at: http://ilpubs.stanford.edu:8090/925/
** LIVE: A Versioned DBMS
LIVE is a complete DBMS designed for applications with many stored
derived relations, and with a need for simple versioning capabilities
when base data is modified. Target applications include, for example,
scientific data management and data integration. A key feature of LIVE
is the use of lineage (provenance) to support modifications and
versioning in this environment. In our system, lineage significantly
facilitates both: (1) efficient propagation of modifications from base
to derived data; and (2) efficient execution of a wide class of
queries over versioned, derived data. LIVE is fully implemented; the
report below presents detailed experimental results that validate our
techniques.
LIVE: A Lineage-Supported Versioned DBMS
Das Sarma, Anish and Theobald, Martin and Widom, Jennifer.
Technical Report. Stanford InfoLab, June 2009.
Available at: http://ilpubs.stanford.edu:8090/926/
** Indexing Boolean Expressions
Together with Yahoo! Research, we have studied the problem of
efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive
Normal Form (CNF) Boolean expressions over a high-dimensional
multi-valued attribute space. The goal is to rapidly find the set of
Boolean expressions that evaluate to true for a given assignment of
values to attributes. A solution to this problem has applications in
online advertising (where a Boolean expression represents an
advertiser's user targeting requirements, and an assignment of values
to attributes represents the characteristics of a user visiting an
online page) and in general any publish/subscribe system (where a
Boolean expression represents a subscription, and an assignment of
values to attributes represents an event). All existing solutions that
we are aware of can only index a specialized sub-set of conjunctive
and/or disjunctive expressions, and cannot efficiently handle general
DNF and CNF expressions (including NOTs) over multi-valued
attributes. We have developed a novel solution based on the inverted
list data structure that enables us to index arbitrarily complex DNF
and CNF Boolean expressions over multi-valued attributes. An
interesting aspect of our solution is that, by virtue of leveraging
inverted lists traditionally used for ranked information retrieval, we
can efficiently return the top-N matching Boolean expressions. This
capability enables emerging applications such as ranked
publish/subscribe systems, where only the top subscriptions that match
an event are desired. For example, in online advertising there is a
limit on the number of advertisements that can be shown on a given
page and only the "best" advertisements can be displayed. We have
evaluated our proposed technique based on data from an online
advertising application, and the results show a dramatic performance
improvement over prior techniques.
Indexing Boolean Expressions.
Whang, Steven; Brower, Chad; Shanmugasundaram, Jayavel;
Vassilvitskii, Sergei; Vee, Erik; Yerneni, Ramana; Garcia-Molina, Hector.
VLDB Conference, Lyon, France, August 2009.
Available at: http://ilpubs.stanford.edu:8090/927/
** Entity Resolution with Iterative Blocking
Entity Resolution (ER) is the problem of identifying which records in
a database refer to the same real-world entity. An exhaustive ER
process involves computing the similarities between pairs of records,
which can be very expensive for large datasets. Various blocking
techniques can be used to enhance the performance of ER by dividing
the records into blocks in multiple ways and only comparing records
within the same block. However, most blocking techniques process
blocks separately and do not exploit the results of other blocks. We
have developed an iterative blocking framework where the ER results of
blocks are reflected to subsequently processed blocks. Blocks are now
iteratively processed until no block contains any more matching
records. Compared to simple blocking, iterative blocking may achieve
higher accuracy because reflecting the ER results of blocks to other
blocks may generate additional record matches. Iterative blocking may
also be more efficient because processing a block now saves the
processing time for other blocks. We have implemented a scalable
iterative blocking system and demonstrated that iterative blocking is
more accurate and efficient than blocking, especially for large
datasets.
Entity Resolution with Iterative Blocking
Whang, Steven Euijong; Menestrina, David; Koutrika, Georgia;
Theobald, Martin; Garcia-Molina, Hector.
SIGMOD Conference, June 29 - July 2, 2009,
Providence, Rhode Island.
Available at: http://ilpubs.stanford.edu:8090/915/