DataMotion Report August 2010
-----------------------------

Final Report
------------

 

During our seventh and final project year, we wrapped up work on the
topics we had been exploring toward the end of the project: uncertain
data management and entity resolution.  In this report we summarize
the work of this last project year.  Our project home page at
http://infolab.stanford.edu/datamotion/ describes the project and
provides all of our progress reports from previous years.

 

 

 

** Generalized Uncertain Databases

 

Existing uncertain databases have difficulty managing data when exact
confidence values or probabilities are not available: confidence
values may be known only imprecisely or coarsely, or may be missing
altogether.  We studied a generalized uncertain database that can
manage data with such incomplete knowledge of uncertainty.  We
developed a semantics for generalized uncertain databases based on
Dempster-Shafer theory, and we proposed a representation scheme that
generalizes the Trio representation.  Our approach extends Trio's
query processing techniques to operate on generalized uncertain
databases.
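
As a rough illustration of the Dempster-Shafer view (a sketch of our
own, not the paper's representation), a tuple whose confidence is
known only coarsely can carry a mass function over the frame
{in, out}, from which lower and upper bounds on its membership
confidence follow:

    # Sketch only: the names below are illustrative assumptions, not
    # Trio's actual representation of generalized confidence values.

    def belief_plausibility(mass):
        """mass maps frozensets over {'in','out'} to masses summing to 1."""
        belief = sum(m for s, m in mass.items() if s <= {'in'})
        plausibility = sum(m for s, m in mass.items() if 'in' in s)
        return belief, plausibility

    # "The tuple is present with confidence at least 0.6; the rest is
    # unknown", i.e. mass 0.4 stays on the whole frame {in, out}.
    mass = {frozenset({'in'}): 0.6, frozenset({'in', 'out'}): 0.4}
    print(belief_plausibility(mass))   # -> (0.6, 1.0)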

 

Generalized Uncertain Databases: First Steps.
Agrawal, Parag and Widom, Jennifer.
To appear in Proc. Workshop on Management of Uncertain Data,
Singapore, September 2010.

 

 

** Foundations of Uncertain-Data Integration

 

There has been considerable past work on data integration and on
uncertain data, but largely in isolation from each other.  To bridge
this gap, we developed the theory of local-as-view (LAV) data
integration when the sources being integrated are uncertain.  In
particular, we defined containment of uncertain databases in this
setting, which allows us to express uncertain sources as views over a
virtual mediated uncertain database.  Next, we defined consistency of
a set of uncertain sources and showed that consistency-checking is
intractable in general; we also identified an interesting special
case for which consistency-checking is polynomial.  Finally, the
notion of certain answers from traditional LAV data integration does
not generalize to the uncertain setting, so we defined a
corresponding notion of correct answers.
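
For illustration only, a finite uncertain database can be viewed
naively as the set of its possible worlds, each world a set of
tuples; one deliberately simplified reading of containment then asks
whether every possible world of one database is also a possible world
of the other.  The paper's definitions for the LAV setting are more
refined than this sketch:

    # Naive possible-worlds sketch; not the paper's formal definition.
    def worlds_contained(d1, d2):
        """Every possible world of d1 is also a possible world of d2."""
        return d1 <= d2

    d1 = {frozenset({('alice', 'ny')}),
          frozenset({('alice', 'sf')})}
    d2 = d1 | {frozenset()}          # d2 also allows the empty world
    print(worlds_contained(d1, d2))  # -> True
    print(worlds_contained(d2, d1))  # -> False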

 

Foundations of Uncertain-Data Integration.
Agrawal, Parag; Das Sarma, Anish; Ullman, Jeffrey; and Widom, Jennifer.
To appear in Proc. 36th International Conference on
Very Large Data Bases, Singapore, September 2010.

 

 

** Evolving Rules in Entity Resolution

 

Entity resolution (ER) identifies database records that refer to the
same real-world entity.  In practice, ER is not a one-time process;
it is refined continually as the data, schema, and application become
better understood.  We addressed the problem of keeping the ER result
up-to-date when the ER logic evolves frequently.  A naive approach
that re-runs ER from scratch may not be tolerable on large datasets,
so we investigated when and how previous materialized ER results can
instead be exploited to avoid redundant work under the evolved logic.
We defined algorithm properties that facilitate evolution, and we
proposed efficient rule-evolution techniques for two clustering ER
models: match-based clustering and distance-based clustering.  Using
real data sets, we illustrated the cost of materializations and the
potential gains over the naive approach.
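
As an informal illustration of the match-based clustering model (the
rule below and the reuse condition are our own simplified
assumptions, not the paper's precise properties), records can be
clustered as connected components of a Boolean match rule, and when
the rule only becomes more permissive, clustering can restart from
the materialized clusters rather than from the raw records:

    from itertools import combinations

    def cluster(items, match):
        """Group items into connected components of the match relation."""
        clusters = [{x} for x in items]
        changed = True
        while changed:
            changed = False
            for c1, c2 in combinations(clusters, 2):
                if any(match(a, b) for a in c1 for b in c2):
                    clusters.remove(c1)
                    clusters.remove(c2)
                    clusters.append(c1 | c2)
                    changed = True
                    break
        return clusters

    records = ["J. Smith", "John Smith", "A. Smith", "B. Jones"]
    # Old rule: same last name and same first initial.
    old_rule = lambda a, b: (a.split()[-1] == b.split()[-1]
                             and a[0] == b[0])
    old_result = cluster(records, old_rule)

    # Evolved, more permissive rule: same last name only.  Reuse the
    # materialized clusters as the "records" of the new run.
    new_rule = lambda c1, c2: any(a.split()[-1] == b.split()[-1]
                                  for a in c1 for b in c2)
    new_result = cluster([frozenset(c) for c in old_result], new_rule)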

 

Entity Resolution with Evolving Rules.
Whang, Steven Euijong and Garcia-Molina, Hector.
In: PVLDB, September 13-17, 2010, Singapore.

 

 

** Evaluating the Results of ER

 

Various measures (e.g., pairwise F1, cluster F1) have been used for
evaluating entity resolution (ER) results.  However, ER measures tend
to be chosen in an ad hoc fashion, without careful thought as to what
defines a good result for the specific application at hand.  To
address these shortcomings, we first conducted an analysis of
existing ER measures, showing that they can often conflict with each
other by ranking the results of ER algorithms differently.  Second,
we explored a new distance measure for ER, called "generalized merge
distance" (GMD), inspired by edit distance on strings and using
cluster splits and merges as its basic operations.  A significant
advantage of GMD is that the cost functions for splits and merges can
be configured, making it clear what characteristics a particular GMD
measure captures.  Surprisingly, a state-of-the-art clustering
measure called Variation of Information is a special case of our
configurable GMD measure, and the widely used pairwise F1 measure can
be computed directly from GMD.  We developed an efficient linear-time
algorithm that correctly computes the GMD measure for a large class
of cost functions satisfying reasonable properties.
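
A minimal sketch of how a split/merge edit distance of this kind can
be computed in one pass over the result clustering (the function
names, and the assumption that a cost function takes the sizes of the
two pieces involved, are ours; the sketch simplifies the paper's
algorithm):

    # Distance from result clustering R to gold standard S, using cluster
    # splits and merges with configurable cost functions fs and fm.  Unit
    # costs simply count edit operations.
    def gmd(R, S, fs=lambda x, y: 1, fm=lambda x, y: 1):
        gold_of = {r: i for i, c in enumerate(S) for r in c}
        merged_size = {}   # size accumulated so far per gold cluster
        cost = 0
        for c in R:
            # How this result cluster overlaps the gold clusters.
            parts = {}
            for r in c:
                g = gold_of[r]
                parts[g] = parts.get(g, 0) + 1
            # Split the cluster into its overlapping parts...
            seen = 0
            for size in parts.values():
                if seen:
                    cost += fs(size, seen)
                seen += size
            # ...then merge each part into its (growing) gold cluster.
            for g, size in parts.items():
                if merged_size.get(g, 0):
                    cost += fm(size, merged_size[g])
                merged_size[g] = merged_size.get(g, 0) + size
        return cost

    R = [{1, 3}, {2, 4}]
    S = [{1, 2}, {3, 4}]
    print(gmd(R, S))   # -> 4: two splits and two merges under unit costs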

 

Evaluating Entity Resolution Results.
Menestrina, David; Whang, Steven Euijong; and Garcia-Molina, Hector.
In: PVLDB, September 13-17, 2010, Singapore.

 

** Combining our ER and Uncertain Data Work

 

Entity resolution was one of the original motivating applications for
the Trio system (part of our DataMotion Project).  Trio-ER is a new
variant of the Trio system tailored specifically as a workbench for
entity resolution, enabling rapid prototyping of an important basic
class of entity-resolution algorithms.  As part of this integration,
we developed new constructs in Trio's data model and query language
that enable easy specification and refinement of entity-resolution
matching and merging.  We have shown how iterative entity-resolution
algorithms are performed using Trio, how Trio's lineage capabilities
are integral to the process, and how confidence values are
incorporated at several levels.
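
As a purely illustrative sketch (this is not Trio's data model or its
query-language constructs), an iterative match/merge loop can carry
lineage and a combined confidence along with each derived record:

    # Records are (value, lineage, confidence) triples; merged records
    # keep the union of their base records' lineage.  Names and the
    # confidence combination rule are illustrative assumptions only.
    def resolve(records, match, merge):
        work = list(records)
        done = []
        while work:
            r = work.pop()
            partner = next((s for s in done if match(r, s)), None)
            if partner is None:
                done.append(r)
            else:
                done.remove(partner)
                work.append(merge(r, partner))
        return done

    base = [("John Smith", {"r1"}, 0.9),
            ("J. Smith",   {"r2"}, 0.7),
            ("Bob Jones",  {"r3"}, 0.8)]
    match = lambda a, b: a[0].split()[-1] == b[0].split()[-1]
    merge = lambda a, b: (max(a[0], b[0], key=len),  # keep the longer name
                          a[1] | b[1],               # union of lineage
                          min(a[2], b[2]))           # cautious confidence
    print(resolve(base, match, merge))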

 

Trio-ER: The Trio System as a Workbench for Entity-Resolution.
Agrawal, Parag; Ikeda, Robert; Park, Hyunjung; and Widom, Jennifer.
Technical Report. Stanford InfoLab.