Stanford Entity Resolution Framework



News

Overview

The goal of the SERF project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as deduplication or record linkage) is an important information integration problem: the same "real-world entities" (e.g., customers or products) are referred to in different ways in multiple data records. For instance, two records on the same person may spell the name differently or list different addresses. The goal of ER is to "resolve" entities by identifying the records that represent the same entity and reconciling them to obtain one record per entity.

In our approach, the functions that "match" records (i.e., decide whether they represent the same entity) and "merge" them are viewed as black boxes, which permits generic, extensible ER solutions. This generic setting makes ER resemble a database join operation (of the initial set of records with itself), but there are two main differences: (a) in general, we have no knowledge about which records may match, so all pairs of records need to be compared using the match function, and (b) merged records may lead us to discover new matches, so a "feedback loop" must compare them against the rest of the data set.
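The feedback loop described above can be illustrated with a minimal Python sketch. It is not the SERF implementation: the `resolve` function, the record model (a frozen set of "attribute=value" strings), and the example match and merge functions are all assumptions made for illustration. The match and merge functions are passed in as black boxes, and every merged record is put back into the pending list so that it is compared against the remaining records.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical record model: a record is a set of "attribute=value" strings.
Record = FrozenSet[str]


def resolve(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Resolve records using black-box match and merge functions."""
    pending = list(records)       # records still to be compared
    resolved: List[Record] = []   # records found not to match each other
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Feedback loop: the merged record goes back into the pending
            # list so that it is compared against the rest of the data.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


# Illustrative black boxes (assumptions, not part of SERF):
def share_two_values(r1: Record, r2: Record) -> bool:
    """Match when two records have at least two values in common."""
    return len(r1 & r2) >= 2


def union(r1: Record, r2: Record) -> Record:
    """Merge by keeping every value of both records."""
    return r1 | r2


a = frozenset({"name=John Doe", "phone=555-1234", "zip=94305"})
b = frozenset({"name=John Doe", "phone=555-1234", "email=jd@example.com"})
c = frozenset({"zip=94305", "email=jd@example.com"})
d = frozenset({"name=Jane Roe", "zip=10001"})

result = resolve([a, b, c, d], share_two_values, union)
```

In this example, record `c` matches neither `a` nor `b` on its own, but it does match the record obtained by merging `a` and `b`. That match is only found because the merged record is compared again.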

Some of the challenges we are addressing in the SERF project include:


Papers

[1] Swoosh: A Generic Approach to Entity Resolution
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom. The VLDB Journal, 2008. (available here).

[2] D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution
Omar Benjelloun, Hector Garcia-Molina, Heng Gong, Hideki Kawai, Tait Larson, David Menestrina, Sutthipong Thavisomboon. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS), 2007. (available here).

[3] Generic Entity Resolution with Negative Rules
Steven Euijong Whang, Omar Benjelloun, Hector Garcia-Molina. Technical Report, 2007. (available here).

[4] Generic Entity Resolution with Data Confidences
David Menestrina, Omar Benjelloun, Hector Garcia-Molina. In First International VLDB Workshop on Clean Databases, Seoul, Korea, 2006. (available here).

[5] Generic Entity Resolution in the SERF Project
Omar Benjelloun, Hector Garcia-Molina, Hideki Kawai, Tait Eliott Larson, David Menestrina, Qi Su, Sutthipong Thavisomboon, Jennifer Widom. IEEE Data Engineering Bulletin, June 2006. (available here).


Software

Our first release of the SERF software can be downloaded here.

This package provides an implementation of the R-Swoosh algorithm described in reference [1]. The algorithm takes as input a dataset of records (in XML) and a "MatcherMerger" class that implements functions to match and merge pairs of records, and returns a dataset of resolved records.

A sample dataset of product records and a simple MatcherMerger implementation are provided as an example. Products are matched based on the similarity of their titles and prices.
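To give a rough idea of what such a matcher and merger could look like, here is a small Python sketch. It is not the code shipped in the package: the `Product` and `ProductMatcherMerger` names, the thresholds, and the merge policy (keep the longer title and the lower price) are assumptions chosen for illustration, and title similarity is computed with `difflib.SequenceMatcher` from the Python standard library.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Product:
    """Hypothetical product record with only a title and a price."""
    title: str
    price: float


class ProductMatcherMerger:
    """Illustrative matcher/merger; thresholds are assumed values."""

    def __init__(self, min_title_similarity: float = 0.8,
                 max_price_gap: float = 0.1) -> None:
        self.min_title_similarity = min_title_similarity
        self.max_price_gap = max_price_gap  # relative price difference

    def match(self, p1: Product, p2: Product) -> bool:
        # Case-insensitive title similarity in [0, 1].
        similarity = SequenceMatcher(
            None, p1.title.lower(), p2.title.lower()
        ).ratio()
        # Price difference relative to the higher of the two prices.
        highest = max(p1.price, p2.price)
        gap = abs(p1.price - p2.price) / highest if highest > 0 else 0.0
        return (similarity >= self.min_title_similarity
                and gap <= self.max_price_gap)

    def merge(self, p1: Product, p2: Product) -> Product:
        # Assumed policy: keep the longer title and the lower price.
        title = max(p1.title, p2.title, key=len)
        return Product(title=title, price=min(p1.price, p2.price))


mm = ProductMatcherMerger()
ipod_a = Product("Apple iPod Nano 4GB", 149.99)
ipod_b = Product("Apple iPod nano 4 GB", 145.00)
camera = Product("Canon PowerShot A520", 199.00)

same_product = mm.match(ipod_a, ipod_b)   # similar titles, close prices
different = mm.match(ipod_a, camera)      # different titles and prices
merged = mm.merge(ipod_a, ipod_b)
```

Other merge policies, such as keeping all observed titles and prices, are equally plausible. The point is only that both decisions live in one class that the resolution algorithm can call without knowing their details.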

The source code is also included, and is released under the BSD license.


People

Faculty

Students

Postdocs/visitors

Alumni