Large Scale Copy Detection

My thesis work focuses on large-scale copy detection of digital objects such as textual documents, audio and video on the world-wide web. The web is a content publisher's nightmare come true. Currently, any small-time cyber-pirate can make copies of music CDs and books available on the web in digital format to a large audience at virtually no cost. In my thesis, I focus on building a copy detection system (CDS) into which content publishers register their valuable digital content. The CDS then crawls the web, compares the web content to the registered content and notifies the content owners of illegal copies. The key challenges in building such a system are to balance

accuracy, in terms of high precision and recall,
scalability, in terms of coping with several terabytes of data (or several tens of millions of web pages) when crawling, comparing and indexing, and
resiliency to "attacks" (such as audio clipping, and perceptual attacks).
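As a small illustration of the accuracy criterion, precision is the fraction of flagged pages that really are copies, and recall is the fraction of real copies that were flagged. The sketch below uses made-up page names and a hypothetical helper `precision_recall`; it is not taken from SCAM or FRAUD.

```python
def precision_recall(flagged, true_copies):
    """Return (precision, recall) for a set of flagged pages.

    flagged:     pages the detector reported as copies (hypothetical data)
    true_copies: pages that really are copies (hypothetical ground truth)
    """
    flagged, true_copies = set(flagged), set(true_copies)
    hits = len(flagged & true_copies)
    # By convention here, an empty denominator counts as a perfect score.
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(true_copies) if true_copies else 1.0
    return precision, recall


# Four pages flagged, three real copies, two of them found.
flagged = {"page1", "page2", "page3", "page4"}
true_copies = {"page2", "page3", "page5"}
p, r = precision_recall(flagged, true_copies)
```

Here two of the four flagged pages are real copies (precision 0.5), and two of the three real copies were found (recall 2/3). Flagging more pages tends to raise recall at the cost of precision, which is the balance referred to above.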

To meet these challenges, I have developed a core architecture that can be used to build a CDS for a variety of data types. As a proof of concept, I have built two prototype CDSs: (1) SCAM (Stanford Copy Analysis Mechanism), for finding textual copies on the web and (2) FRAUD (Finding Replicas of AUDio) for finding audio copies on the web.
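The register-then-compare workflow described above can be sketched for text as follows. This is a toy illustration only: it assumes word-level chunks ("shingles") and a simple containment score, which is a generic technique swapped in for illustration and not the actual SCAM algorithm. The class name `ToyCopyDetector`, the chunk size `k` and the `threshold` are all hypothetical.

```python
import re


def shingles(text, k=4):
    """Split text into the set of overlapping k-word chunks."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


class ToyCopyDetector:
    """Hypothetical in-memory stand-in for a copy detection system."""

    def __init__(self, k=4, threshold=0.5):
        self.k = k
        self.threshold = threshold
        self.registered = {}  # doc_id -> set of chunks

    def register(self, doc_id, text):
        """A content owner registers a document."""
        self.registered[doc_id] = shingles(text, self.k)

    def check(self, page_text):
        """Return (doc_id, overlap) for registered docs found in the page.

        overlap is the fraction of the registered document's chunks
        that also occur in the page.
        """
        page = shingles(page_text, self.k)
        matches = []
        for doc_id, chunks in self.registered.items():
            if not chunks:
                continue
            overlap = len(chunks & page) / len(chunks)
            if overlap >= self.threshold:
                matches.append((doc_id, overlap))
        return sorted(matches, key=lambda m: -m[1])


cds = ToyCopyDetector()
cds.register("doc-1", "the quick brown fox jumps over the lazy dog near the river bank")
copied = cds.check("Intro text here. The quick brown fox jumps over the lazy dog "
                   "near the river bank. More text.")
unrelated = cds.check("completely different content about databases and indexing")
```

In this sketch `copied` reports `doc-1` with full overlap, while `unrelated` is empty. A real system would replace the in-memory dictionary with an index that scales to many millions of pages, and would have to cope with the attacks listed above.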

SCAM was successfully used in May 1995 to find several instances of plagiarism in conference papers and journal articles. Click here for details.

Here is a little blurb that gives a 2-page overview of my thesis research. The following papers are detailed technical notes on various problems we attacked as part of my thesis.

Invited papers

Safeguarding and Charging for Information on the Internet
H. Garcia-Molina, S. P. Ketchpel, N. Shivakumar
International Conference on Data Engineering (ICDE'98)
The SCAM Approach To Copy Detection in Digital Libraries
N. Shivakumar, H. Garcia-Molina
D-Lib Magazine, November 1995.

Conference Publications
Computing Iceberg Queries Efficiently
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, J.D. Ullman
Proceedings of 1998 International Conference on Very Large Databases (VLDB'98), New York, August 1998.
Filtering with Approximate Predicates
N. Shivakumar, H. Garcia-Molina, C.S. Chekuri
Proceedings of 1998 International Conference on Very Large Databases (VLDB'98), New York, August 1998.
Finding near-replicas of documents on the web
N. Shivakumar, H. Garcia-Molina
Proceedings of Workshop on Web Databases (WebDB'98) held in conjunction with EDBT'98, Mar 1998.
Wave Indices: Indexing Evolving Databases
N. Shivakumar, H. Garcia-Molina
Proceedings of 1997 ACM International Conference on Management of Data (SIGMOD'97), Tucson, Arizona, May '97.
dSCAM: Finding Document Copies Across Multiple Databases
H. Garcia-Molina, L. Gravano, N. Shivakumar
Proceedings of 4th International Conference on Parallel and Distributed Systems (PDIS'96), Miami Beach, Dec '96.
Building a Scalable and Accurate Copy Detection Mechanism
N. Shivakumar, H. Garcia-Molina
Proceedings of 1st ACM Conference on Digital Libraries (DL'96), Bethesda, Maryland, Mar '96.
SCAM: A Copy Detection Mechanism for Digital Documents
N. Shivakumar, H. Garcia-Molina
Proceedings of 2nd International Conference in Theory and Practice of Digital Libraries (DL'95), Austin, Texas, June '95.

In Print

Check out these articles written about my SCAM work.

Plagiarism on the Web (Editorial)
Peter Denning
Communications of the ACM, December 1995.
Here is an unofficial copy of the article. Hope ACM doesn't get upset with me for copying their content.
Cops and robbers in Cyberspace
Philip Roth
Forbes Magazine, pages 134-139, September 9, 1996.
A Market Waiting to Happen
Dan Goodin
Intellectual Property, July 1997.