DataMotion Report August 2010
-----------------------------
Final Report
------------
During our seventh and final project year, we wrapped up work on the
topics we have been exploring at the end of the project, uncertain
data management and entity resolution. In this report we summarize
the work in this last project year. On our project home page at
http://infolab.stanford.edu/datamotion/ we describe the project and
provide all our progress reports from previous years.
** Generalized Uncertain Databases
Existing uncertain databases have difficulty managing data when exact
confidence values or probabilities are not available. Confidence
values may be known imprecisely or coarsely, or even be missing
altogether. We studied a generalized uncertain database that can
manage data with such incomplete knowledge of uncertainty. We developed
a semantics for generalized uncertain databases based on
Dempster-Shafer theory. We also proposed a representation scheme for
generalized uncertain databases that generalizes the Trio
representation. Our approach builds upon Trio's query processing
techniques to extend them to operate on generalized uncertain
databases.
Generalized Uncertain Databases: First Steps.
Agrawal, Parag and Widom, Jennifer.
To appear in Proc. Workshop on Management of Uncertain Data,
Singapore, September 2010.
** Foundations of Uncertain-Data Integration
There has been considerable past work studying data integration and
uncertain data in isolation. To bridge this gap, we developed the
theory of local-as-view (LAV) data integration when the sources being
integrated are uncertain. In particular, we defined containment of
uncertain databases in these settings, which allows us to express
uncertain sources as views over a virtual mediated uncertain
database. Next, we defined consistency of a set of uncertain sources
and showed intractability of consistency-checking. We identified an
interesting special case for which consistency-checking is
polynomial. Finally, the notion of certain answers from traditional
LAV data integration does not generalize to the uncertain setting, so
we defined a corresponding notion of correct answers.
Foundations of Uncertain-Data Integration
Agrawal, Parag; Das Sarma, Anish; Ullman, Jeffrey and Widom, Jennifer.
To appear in Proc. 36th International Conference on
Very Large Data Bases, Singapore, September 2010.
** Evolving Rules in Entity Resolution
Entity resolution (ER) identifies database records that refer to the
same real world entity. In practice, ER is not a one-time process, but
is constantly improved as the data, schema and application are better
understood. We have addressed the problem of keeping the ER result
up-to-date when the ER logic evolves frequently. A naive approach that
re-runs ER from scratch may not be tolerable for resolving large
datasets. Thus, we investigated when and how we can instead exploit
previous materialized ER results to save redundant work with evolved
logic. We defined algorithm properties that facilitate evolution, and
we proposed efficient rule evolution techniques for two clustering ER
models: match-based clustering and distance-based clustering. Using
real data sets, we have illustrated the cost of materializations and
the potential gains over the naive approach.
Entity Resolution with Evolving Rules
Whang, Steven Euijong and Garcia-Molina, Hector.
In: PVLDB, September 13-17, 2010, Singapore.
** Evaluating the Results of ER
Various measures (e.g., pairwise F1, cluster F1) have been used for
evaluating Entity Resolution (ER) results. However, ER measures tend
to be chosen in an ad-hoc fashion without careful thought as to what
defines a good result for the specific application at hand. To
address these shortcomings we first conducted an analysis of existing
ER measures, showing that they can often conflict with each other by
ranking the results of ER algorithms differently. Second, we explored
a new distance measure for ER (called ``generalized merge distance''
or GMD) inspired by the edit distance of strings, using cluster splits
and merges as its basic operations. A significant advantage of GMD is
that the cost functions for splits and merges can be configured,
enabling us to clearly understand the characteristics of a defined GMD
measure. Surprisingly, a state-of-the-art clustering measure called
Variation of Information is a special case of our configurable GMD
measure, and the widely used pairwise F1 measure can be directly
computed using GMD. We developed an efficient linear-time algorithm
that correctly computes the GMD measure for a large class of cost
functions that satisfy reasonable properties.
Evaluating Entity Resolution Results.
Menestrina, David; Whang, Steven Euijong; and Garcia-Molina, Hector.
In: PVLDB, September 13-17, 2010, Singapore.
** Combining our ER and Uncertain Data Work
Entity-resolution was one of the original motivating applications for
the Trio system (part of our DataMotion Project). Trio-ER is a new
variant of the Trio system tailored specifically as a workbench for
entity-resolution. Trio-ER enables rapid prototyping of an important
basic class of entity-resolution algorithms. For this integration, we
developed new constructs in Trio's data model and query language that
enable easy specification and refinement of entity-resolution matching
and merging. We have shown how iterative entity-resolution algorithms
are performed using Trio, how Trio's lineage capabilities are integral
to the process, and how confidence values are incorporated at several
levels.
Trio-ER: The Trio System as a Workbench for Entity-Resolution.
Agrawal, Parag; Ikeda, Robert; Park, Hyunjung; and Widom, Jennifer.
Technical Report. Stanford InfoLab.