Large Scale Copy Detection
My thesis work focuses on large-scale copy detection of
digital objects such as textual documents, audio, and video on the
World Wide Web.
The web is a content publisher's nightmare come true.
Currently, any small-time cyber-pirate can make copies of music CDs and books
available on the web in digital format to a large audience at virtually no
cost.
In my thesis, I focus on building a copy detection system (CDS) into
which content publishers register their valuable digital content.
The CDS then crawls the web, compares the web content to the registered
content and notifies the content owners of illegal copies.
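The register, crawl, compare, and notify loop described above can be sketched roughly as follows. This is only a minimal illustration, not the actual SCAM or FRAUD implementation: the class name `CopyDetectionSystem`, the word-overlap measure, the 0.8 threshold, and the example URLs are all assumptions made for the sketch, and the "crawl" is simulated by a dictionary of already-fetched pages.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def word_set(text: str) -> set[str]:
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())


def overlap(registered: str, candidate: str) -> float:
    """Fraction of the registered document's words that also occur in the candidate."""
    reg = word_set(registered)
    if not reg:
        return 0.0
    return len(reg & word_set(candidate)) / len(reg)


@dataclass
class CopyDetectionSystem:
    # Hypothetical cutoff chosen for illustration; not a value from the thesis.
    threshold: float = 0.8
    # Maps a document id to (owner, text) for each registered document.
    registry: dict[str, tuple[str, str]] = field(default_factory=dict)

    def register(self, doc_id: str, owner: str, text: str) -> None:
        """A content publisher registers a document with the system."""
        self.registry[doc_id] = (owner, text)

    def scan(self, crawled_pages: dict[str, str]) -> list[tuple[str, str, str]]:
        """Compare crawled pages to registered content.

        Returns one (owner, doc_id, url) notification per suspected copy.
        """
        notifications = []
        for url, page_text in crawled_pages.items():
            for doc_id, (owner, text) in self.registry.items():
                if overlap(text, page_text) >= self.threshold:
                    notifications.append((owner, doc_id, url))
        return notifications


cds = CopyDetectionSystem()
cds.register("book-1", "publisher@example.com",
             "the quick brown fox jumps over the lazy dog")
pages = {
    "http://example.com/a": "the quick brown fox jumps over the lazy dog again",
    "http://example.com/b": "completely unrelated text",
}
notifications = cds.scan(pages)
print(notifications)
```

The pairwise comparison here is quadratic in the number of pages and registered documents, so it only illustrates the flow of data; a system at the scale discussed on this page would need an index rather than a nested loop.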
The key challenges in building such a system are to balance
- accuracy, in terms of high precision and recall,
- scalability, in terms of crawling, comparing, and indexing several
terabytes of data (or several tens of millions of web pages), and
- resiliency to "attacks" (such as audio clipping
and perceptual attacks).
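To make the accuracy criterion concrete, here is a small sketch of how precision and recall could be computed for one run of a copy detector. The function name `precision_recall` and the example document ids are hypothetical, and returning 1.0 for an empty set is just one common convention, not something specified by the thesis.

```python
from __future__ import annotations


def precision_recall(reported: set[str], true_copies: set[str]) -> tuple[float, float]:
    """Precision: share of reported items that are real copies.
    Recall: share of real copies that were reported."""
    hits = len(reported & true_copies)
    # Convention (an assumption): an empty denominator yields 1.0.
    precision = hits / len(reported) if reported else 1.0
    recall = hits / len(true_copies) if true_copies else 1.0
    return precision, recall


reported = {"a", "b", "c", "d"}   # pages the detector flagged
true_copies = {"a", "b", "e"}     # pages that really are copies
precision, recall = precision_recall(reported, true_copies)
print(precision, recall)
```

In this example two of the four flagged pages are real copies (precision 0.5) and two of the three real copies were found (recall 2/3).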
For this, I have developed a core architecture that can be used to
build a CDS for a variety of data types.
As proof of concept, I have built two prototype CDSs: (1)
SCAM (Stanford Copy Analysis Mechanism), for
finding textual copies on the web and (2) FRAUD (Finding Replicas of AUDio)
for finding audio copies on the web.
SCAM was successfully used in May 1995 to find several instances of
plagiarism in conference papers and journal articles. Click
here for details.
Here is a little blurb that gives
a 2-page overview of my thesis research.
The following papers are detailed technical notes on various
problems we attacked as part of my thesis.
Invited papers
- Safeguarding and Charging for Information on the Internet
H. Garcia-Molina, S. P. Ketchpel, N. Shivakumar
International Conference on Data Engineering (ICDE'98)
- The SCAM Approach To Copy Detection in Digital Libraries
N. Shivakumar, H. Garcia-Molina
D-Lib Magazine, November 1995.
Conference Publications
- Computing Iceberg Queries Efficiently
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, J.D. Ullman
Proceedings of the 1998
International Conference on Very Large Data Bases (VLDB'98),
New York, August 1998.
- Filtering with Approximate Predicates
N. Shivakumar, H. Garcia-Molina, C.S. Chekuri
Proceedings of the 1998
International Conference on Very Large Data Bases (VLDB'98),
New York, August 1998.
- Finding near-replicas of documents on the web
N. Shivakumar, H. Garcia-Molina
Proceedings of the Workshop on Web Databases (WebDB'98), held in conjunction
with EDBT'98, Mar 1998.
- Wave Indices: Indexing Evolving Databases
N. Shivakumar, H. Garcia-Molina
Proceedings of the 1997 ACM
International Conference on Management of Data (SIGMOD'97),
Tucson, Arizona, May '97.
- dSCAM: Finding Document Copies Across Multiple Databases
H. Garcia-Molina, L. Gravano, N. Shivakumar
Proceedings of the 4th
International Conference on Parallel and Distributed Information Systems (PDIS'96),
Miami Beach, Dec '96.
- Building a Scalable and Accurate Copy Detection Mechanism
N. Shivakumar, H. Garcia-Molina
Proceedings of the
1st ACM Conference on Digital Libraries (DL'96), Bethesda, Maryland,
Mar '96.
- SCAM: A Copy Detection Mechanism for Digital Documents
N. Shivakumar, H. Garcia-Molina
Proceedings of the
2nd International Conference on Theory and Practice of Digital Libraries (DL'95),
Austin, Texas, June '95.
In Print
Check out these articles written about my SCAM work.
- Plagiarism on the Web (Editorial)
Peter Denning
Communications of the ACM, December 1995.
Here is an unofficial copy of the article.
Hope ACM doesn't get upset with me for copying their content.
- Cops and robbers in Cyberspace
Philip Roth
Forbes Magazine, pages 134-139, September 9, 1996
- A Market Waiting to Happen
Dan Goodin
Intellectual Property, July 1997