DataMotion Report June 2009

----------------------------

During our sixth year, we have continued our DataMotion research,

on a variety of information management topics.

** Continuous Uncertainty

Our Trio system supports data uncertainty and lineage. We have

extended Trio so it can manage continuous uncertainty. Data items

with uncertain possible values drawn from a continuous domain are

represented through a generic set of functions. Our approach enables

precise and efficient representation of arbitrary probability

distribution functions, along with standard distributions such as

Gaussians. We also studied how queries are processed efficiently over

this representation, without knowledge of specific distributions. For

queries that cannot be answered exactly, we can provide approximate

answers using sampling or histogram approximations, offering the user

a cost-precision trade-off. Our approach exploits Trio's lineage and

confidence features, with smooth integration into the overall data

model and system.

Continuous Uncertainty in Trio. Agrawal, Parag and Widom, Jennifer.

Proc. Workshop on Management of Uncertain Data, Lyon, France, August 2009.

Available at: http://ilpubs.stanford.edu:8090/928/

** Outerjoins with Uncertain Data

We have studied the problem of incorporating outerjoins into uncertain

databases. Outerjoin is a widely used operation in traditional

relational databases, because its result does not lose data when the

join is incomplete. With uncertain data, it seems even more likely

that non-matching tuples will occur in a join, especially if the data

is very noisy, or if the join condition is too strict. Thus, we

expect outerjoins to be very useful in uncertain databases. However,

outerjoins with uncertain data are tricky because standard

possible-worlds semantics may be inappropriate. We have explored a

variety of alternative semantics, and we have explored various

implementation considerations. Our results are presented in the paper

cited below.

Outerjoins in Uncertain Databases. Ikeda, Robert and Widom, Jennifer.

Proc. Workshop on Management of Uncertain Data, Lyon, France, August 2009.

Available at: http://ilpubs.stanford.edu:8090/925/

** LIVE: A Versioned DBMS

LIVE is a complete DBMS designed for applications with many stored

derived relations, and with a need for simple versioning capabilities

when base data is modified. Target applications include, for example,

scientific data management and data integration. A key feature of LIVE

is the use of lineage (provenance) to support modifications and

versioning in this environment. In our system, lineage significantly

facilitates both: (1) efficient propagation of modifications from base

to derived data; and (2) efficient execution of a wide class of

queries over versioned, derived data. LIVE is fully implemented; the

report below presents detailed experimental results that validate our

techniques.

LIVE: A Lineage-Supported Versioned DBMS

Das Sarma, Anish and Theobald, Martin and Widom, Jennifer.

Technical Report. Stanford InfoLab, June 2009.

Available at: http://ilpubs.stanford.edu:8090/926/

** Indexing Boolean Expressions

Together with Yahoo! Research, we have studied the problem of

efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive

Normal Form (CNF) Boolean expressions over a high-dimensional

multi-valued attribute space. The goal is to rapidly find the set of

Boolean expressions that evaluate to true for a given assignment of

values to attributes. A solution to this problem has applications in

online advertising (where a Boolean expression represents an

advertiser's user targeting requirements, and an assignment of values

to attributes represents the characteristics of a user visiting an

online page) and in general any publish/subscribe system (where a

Boolean expression represents a subscription, and an assignment of

values to attributes represents an event). All existing solutions that

we are aware of can only index a specialized sub-set of conjunctive

and/or disjunctive expressions, and cannot efficiently handle general

DNF and CNF expressions (including NOTs) over multi-valued

attributes. We have developed a novel solution based on the inverted

list data structure that enables us to index arbitrarily complex DNF

and CNF Boolean expressions over multi-valued attributes. An

interesting aspect of our solution is that, by virtue of leveraging

inverted lists traditionally used for ranked information retrieval, we

can efficiently return the top-N matching Boolean expressions. This

capability enables emerging applications such as ranked

publish/subscribe systems, where only the top subscriptions that match

an event are desired. For example, in online advertising there is a

limit on the number of advertisements that can be shown on a given

page and only the "best" advertisements can be displayed. We have

evaluated our proposed technique based on data from an online

advertising application, and the results show a dramatic performance

improvement over prior techniques.

Indexing Boolean Expressions.

Whang, Steven; Brower, Chad; Shanmugasundaram, Jayavel;

Vassilvitskii, Sergei; Vee, Erik; Yerneni, Ramana; Garcia-Molina, Hector.

VLDB Conference, Lyon, France, August 2009.

Available at: http://ilpubs.stanford.edu:8090/927/

** Entity Resolution with Iterative Blocking

Entity Resolution (ER) is the problem of identifying which records in

a database refer to the same real-world entity. An exhaustive ER

process involves computing the similarities between pairs of records,

which can be very expensive for large datasets. Various blocking

techniques can be used to enhance the performance of ER by dividing

the records into blocks in multiple ways and only comparing records

within the same block. However, most blocking techniques process

blocks separately and do not exploit the results of other blocks. We

have developed an iterative blocking framework where the ER results of

blocks are reflected to subsequently processed blocks. Blocks are now

iteratively processed until no block contains any more matching

records. Compared to simple blocking, iterative blocking may achieve

higher accuracy because reflecting the ER results of blocks to other

blocks may generate additional record matches. Iterative blocking may

also be more efficient because processing a block now saves the

processing time for other blocks. We have implemented a scalable

iterative blocking system and demonstrated that iterative blocking is

more accurate and efficient than blocking, especially for large

datasets.

Entity Resolution with Iterative Blocking

Whang, Steven Euijong; Menestrina, David; Koutrika, Georgia;

Theobald, Martin; Garcia-Molina, Hector.

SIGMOD Conference, June 29 - July 2, 2009,

Providence, Rhode Island.

Available at: http://ilpubs.stanford.edu:8090/915/