My thesis work focuses on large-scale copy detection of digital objects such as textual documents, audio, and video on the world-wide web. The web is a content publisher's nightmare come true. Currently, any small-time cyber-pirate can make copies of music CDs and books available on the web in digital format to a large audience at virtually no cost. In my thesis, I focus on building a copy detection system (CDS) into which content publishers register their valuable digital content. The CDS then crawls the web, compares the web content to the registered content, and notifies the content owners of illegal copies. The key challenges in building such a system are to balance [...]. For this, I have developed a core architecture that can be used to build a CDS for a variety of data types. As proof of concept, I have built two prototype CDSs: (1) SCAM (Stanford Copy Analysis Mechanism), for finding textual copies on the web, and (2) FRAUD (Finding Replicas of AUDio), for finding audio copies on the web.
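To make the register-then-compare flow described above concrete, here is a minimal sketch for text only. It is an illustration under stated assumptions, not the actual SCAM or FRAUD algorithm: the names `shingles` and `CopyDetector`, the use of overlapping word windows ("shingles"), and the 0.5 threshold are all hypothetical choices made for this example. Crawling the web and notifying content owners are left out.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word windows (shingles) that occur in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


class CopyDetector:
    """Toy registry: publishers register content, fetched pages are checked."""

    def __init__(self, k=3, threshold=0.5):
        self.k = k                  # shingle length in words (assumed value)
        self.threshold = threshold  # minimum containment to report (assumed value)
        self.registered = {}        # doc_id -> set of shingles

    def register(self, doc_id, text):
        """Store the shingle set of a registered document."""
        self.registered[doc_id] = shingles(text, self.k)

    def check(self, page_text):
        """Return (doc_id, containment) pairs for likely copies, best first.

        Containment is the fraction of a registered document's shingles
        that also appear in the page being checked.
        """
        candidate = shingles(page_text, self.k)
        hits = []
        for doc_id, reg in self.registered.items():
            if not reg:
                continue
            containment = len(reg & candidate) / len(reg)
            if containment >= self.threshold:
                hits.append((doc_id, containment))
        return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    detector = CopyDetector()
    detector.register("book", "the quick brown fox jumps over the lazy dog")
    print(detector.check("Mirror: the quick brown fox jumps over the lazy dog!"))
```

Containment is used here instead of a symmetric similarity measure so that a registered document embedded inside a much larger page would still be flagged. A real system at web scale would also need an index rather than the linear scan over registered documents shown in this sketch.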

SCAM was successfully used in May 1995 to find several instances of plagiarism in conference papers and journal articles. Click here for details.

Here is a little blurb that gives a 2-page overview of my thesis research. The following papers are detailed technical notes on various problems we attacked as part of my thesis.

