Stanford
The goal of the SERF project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as deduplication or record linkage) is an important information integration problem: the same "real-world entities" (e.g., customers or products) are referred to in different ways in multiple data records. For instance, two records on the same person may spell the name differently or list different addresses. The goal of ER is to "resolve" entities by identifying the records that represent the same entity and reconciling them to obtain one record per entity.
In our approach, the functions that "match" records (i.e., decide whether they represent the same entity) and "merge" them are viewed as black boxes, which permits generic, extensible ER solutions. This generic setting makes ER resemble a database join operation (of the initial set of records with itself), but there are two main differences: (a) in general, we have no knowledge about which records may match, so all pairs of records need to be compared using the match function, and (b) merged records may lead us to discover new matches, so a "feedback loop" must compare them against the rest of the data set.
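The two differences from a join can be illustrated with a deliberately naive sketch. It is not one of the SERF algorithms: it simply compares all pairs and restarts after every merge, so that a merged record is compared against the rest of the data. The record representation (a set of attribute/value pairs) and the functions `shares_value`, `union`, and `resolve_naive` are hypothetical names chosen for this illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record type: a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def shares_value(a: Record, b: Record) -> bool:
    """Toy black-box match: two records match if they share any pair."""
    return not a.isdisjoint(b)


def union(a: Record, b: Record) -> Record:
    """Toy black-box merge: keep every pair from both records."""
    return a | b


def resolve_naive(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Compare all pairs; after each merge, start over (the feedback loop)."""
    current = list(records)
    restart = True
    while restart:
        restart = False
        for i, j in combinations(range(len(current)), 2):
            if match(current[i], current[j]):
                merged = merge(current[i], current[j])
                # Drop the two matched records and add the merged one, then
                # restart so the merged record meets every remaining record.
                current = [r for k, r in enumerate(current) if k not in (i, j)]
                current.append(merged)
                restart = True
                break
    return current


r1 = frozenset({("name", "John Doe"), ("phone", "555-1234")})
r2 = frozenset({("name", "J. Doe"), ("email", "jd@example.com")})
r3 = frozenset({("phone", "555-1234"), ("email", "jd@example.com")})
r4 = frozenset({("name", "Jane Roe")})
resolved = resolve_naive([r1, r2, r3, r4], shares_value, union)
```

In this example `r1` and `r2` do not match directly. Only after `r1` is merged with `r3` does the merged record share an email with `r2`, which is the kind of new match the feedback loop exists to find. The loop terminates because every merge reduces the number of records by one.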
Some of the challenges we are addressing in the SERF project include:
[1] Swoosh: A Generic Approach to Entity Resolution
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom.
The VLDB Journal, vol. 18, no. 1, pp. 255-276, Jan. 2009. (available here)
[2] D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution
Omar Benjelloun, Hector Garcia-Molina, Heng Gong, Hideki Kawai, Tait Larson, David Menestrina, Sutthipong Thavisomboon.
In 27th IEEE International Conference on Distributed Computing Systems (ICDCS), June 2007. (available here)
[3] Bufoosh: Buffering Algorithms for Generic Entity Resolution
Hideki Kawai, Hector Garcia-Molina, Omar Benjelloun, Tait Larson, David Menestrina, Sutthipong Thavisomboon.
Technical Report, 2006. (available here)
[4] Generic Entity Resolution with Data Confidences
David Menestrina, Omar Benjelloun, Hector Garcia-Molina.
In First International VLDB Workshop on Clean Databases, Seoul, Korea, September 2006. (available here)
[5] Generic Entity Resolution with Negative Rules
Steven Euijong Whang, Omar Benjelloun, Hector Garcia-Molina.
The VLDB Journal, vol. 18, no. 6, pp. 1261-1277, Feb. 2009. (available here)
[6] Generic Entity Resolution in the SERF Project
Omar Benjelloun, Hector Garcia-Molina, Hideki Kawai, Tait Eliott Larson, David Menestrina, Qi Su, Sutthipong Thavisomboon, Jennifer Widom.
IEEE Data Engineering Bulletin, vol. 29, no. 2, pp. 13-20, June 2006. (available here)
[7] Entity Resolution with Iterative Blocking
Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina.
In Proc. 2009 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 219-232, Providence, Rhode Island, June 2009. (available here)
[8] Evaluating Entity Resolution Results
David Menestrina, Steven Euijong Whang, Hector Garcia-Molina.
In Proc. 36th Int'l Conf. on Very Large Data Bases (PVLDB), pp. 208-219, Singapore, Sept. 2010. (available here)
[9] Trio-ER: The Trio System as a Workbench for Entity-Resolution
Parag Agrawal, Robert Ikeda, Hyunjung Park, Jennifer Widom.
Technical Report, 2009. (available here)
[10] Entity Resolution with Evolving Rules
Steven Euijong Whang, Hector Garcia-Molina.
In Proc. 36th Int'l Conf. on Very Large Data Bases (PVLDB), pp. 1326-1337, Singapore, Sept. 2010. (available here)
[11] Pay-As-You-Go ER
Steven Euijong Whang, David Marmaros, Hector Garcia-Molina.
To appear in IEEE Transactions on Knowledge and Data Engineering, 2012. (available here)
[12] Joint Entity Resolution
Steven Euijong Whang, Hector Garcia-Molina.
To appear in Proc. 28th IEEE International Conference on Data Engineering (ICDE), Washington, DC, Apr. 2012. (available here)
[13] Developments in Generic Entity Resolution
Steven Euijong Whang, Hector Garcia-Molina.
IEEE Data Engineering Bulletin, vol. 34, no. 3, pp. 51-59, Sept. 2011. (available here)
[14] Disinformation Techniques for Entity Resolution
Steven Euijong Whang, Hector Garcia-Molina.
Technical Report, 2011. (available here)
Our first release of the SERF software can be downloaded here.
This package provides an implementation of the R-Swoosh algorithm described in reference [1]. The algorithm takes as input a dataset of records (in XML) and a "MatcherMerger" class that implements functions to match and merge pairs of records, and returns a dataset of resolved records.
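As a rough illustration of the input/output contract described above, the following is a minimal sketch of an R-Swoosh-style loop. It follows the general shape of the algorithm in [1]: keep a set of already-resolved records that do not match one another, and send each merged record back to the pending set. It is not the code shipped in the package. The class `SharedValueMatcherMerger`, its method names `match` and `merge`, and the function `r_swoosh` are assumptions made for this example, and XML loading is omitted because the file format is not described here.

```python
from typing import FrozenSet, Iterable, List

Record = FrozenSet[str]  # hypothetical record type: a set of values


class SharedValueMatcherMerger:
    """Hypothetical matcher/merger; the real package's interface may differ."""

    def match(self, a: Record, b: Record) -> bool:
        # Toy rule: records match if they have any value in common.
        return not a.isdisjoint(b)

    def merge(self, a: Record, b: Record) -> Record:
        # Toy rule: the merged record holds the values of both inputs.
        return a | b


def r_swoosh(records: Iterable[Record], mm: SharedValueMatcherMerger) -> List[Record]:
    pending = list(records)       # records still to be processed
    resolved: List[Record] = []   # records known not to match each other
    while pending:
        current = pending.pop()
        # Look for the first already-resolved record that matches `current`.
        buddy_index = next(
            (i for i, other in enumerate(resolved) if mm.match(current, other)),
            None,
        )
        if buddy_index is None:
            resolved.append(current)
        else:
            # Remove the match from the resolved set and put the merged
            # record back in `pending`, so it is compared again.
            buddy = resolved.pop(buddy_index)
            pending.append(mm.merge(current, buddy))
    return resolved


dataset = [
    frozenset({"a", "b"}),
    frozenset({"c", "d"}),
    frozenset({"b", "c"}),
    frozenset({"x"}),
]
result = r_swoosh(dataset, SharedValueMatcherMerger())
```

Keeping the match and merge logic behind one object means the loop never inspects record contents, so a different matcher/merger can be substituted without touching `r_swoosh`.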
A sample dataset of product records and a simple MatcherMerger implementation are provided as an example. Products are matched based on the similarity of their titles and prices.
The source code is also included, and is released under the BSD license.
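To give a feel for what a title-and-price matcher for product records might look like, here is a small hypothetical sketch. It does not reproduce the sample implementation in the package: the `Product` fields, the thresholds, the use of `difflib.SequenceMatcher` for title similarity, and the merge policy (keep the longer title and the lower price) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Product:
    # Hypothetical fields; the sample dataset's actual schema is not shown here.
    title: str
    price: float


def match(a: Product, b: Product,
          title_threshold: float = 0.8,
          price_tolerance: float = 0.1) -> bool:
    """Match when titles are similar and prices are within a relative gap."""
    title_sim = SequenceMatcher(None, a.title.lower(), b.title.lower()).ratio()
    higher = max(a.price, b.price)
    price_gap = abs(a.price - b.price) / higher if higher > 0 else 0.0
    return title_sim >= title_threshold and price_gap <= price_tolerance


def merge(a: Product, b: Product) -> Product:
    """Arbitrary merge policy: keep the longer title and the lower price."""
    title = a.title if len(a.title) >= len(b.title) else b.title
    return Product(title=title, price=min(a.price, b.price))


ipod_a = Product("Apple iPod Nano 8GB", 149.99)
ipod_b = Product("Apple iPod nano 8 GB", 145.00)
camera = Product("Canon PowerShot Camera", 149.99)
```

Here `match(ipod_a, ipod_b)` holds because the titles differ only in case and spacing and the prices are close, while `camera` has the same price as `ipod_a` but a dissimilar title, so price alone does not produce a match.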