The SCAM Approach To Copy Detection in Digital Libraries



Narayanan Shivakumar and Hector Garcia-Molina

{shiva, hector}@cs.stanford.edu

Department of Computer Science
Stanford University
Stanford, CA 94305, U.S.A


Scenario 1

Your local publishing company Books'R'Us decides to publish its latest book on the Internet to cut down on paper and printing costs and book distribution hassles, and to support the concept of Digital Libraries. Customers pay for the digital books using sophisticated electronic payment mechanisms from DigiCash, First Virtual or InterPay. When payment is received, the book distribution server at Books'R'Us sends a digital version of the book electronically to the paying customer. Books'R'Us expects to make higher profits on the digital book for that fiscal year due to lower production and distribution costs and higher availability on the Internet.

At the end of the year, however, very few books have been sold, since digital copies of the Books'R'Us book have been widely circulated on UseNet newsgroups and bulletin boards, and have been available for free on alternative ftp sites and Web servers. Books'R'Us retracts its digital publishing commitment, blaming the ease of retransmission of digital items on the Internet, and returns to traditional paper-based publishing.

Scenario 2

Sheng wants to buy a new Pentium portable, and hence wants to read articles on the different brands available and their reviews before choosing one to buy. She searches information services like Dialog, Lycos, Gloss and Webcrawler, follows UseNet newsgroups to find articles on the different portables available, and finds nearly 1500 articles. When she starts reading the articles, she finds that most of them are really duplicates of one another and contribute no new information to her search. She realizes this is because most databases maintain their own local copies of articles, perhaps in different formats (Word, Postscript, HTML), or have mirror sites that contain the same set of articles. Sheng then trudges through the articles one by one, wishing that somebody would build a system that removes exact or near-duplicates automatically so that she only needs to read each distinct article.

Around article number 150, Sheng decides not to buy a certain brand, since from the articles she learns that it has had problems with its color display since its release. But she has to continue looking at articles on that model, since they are already part of the result set. She adds to her wish list a dynamic search system in which she could discard any article that has more than a certain overlap with an article she had previously discarded.




In this article, we give a brief overview of some proposed mechanisms that address the problems illustrated by these two scenarios. In Copy Guarantees for Digital Publishers, we consider some proposals that show how to provide reasonable guarantees to digital publishers that their books are not being retransmitted easily on the Internet. In Duplicate Detection in Information Retrieval, we present some mechanisms that can be used to remove near-duplicates of documents (such as multiple formats). We then present the SCAM Registration Server approach, which addresses both the illegal copy detection problem and the duplicate document detection problem.


Copy Guarantees for Digital Publishers

Some publishing entities such as the Institute of Electrical and Electronics Engineers (IEEE) have sought to prevent illegal copying of documents by placing the documents on stand-alone CD-ROM systems, while others have chosen to use special purpose hardware [PoKl79], or active documents [Gr93], which are documents encapsulated by programs. We believe that such prevention techniques may be cumbersome, may get in the way of the honest user, and may make it more difficult to share information. In fact, in the software industry it has been observed that measures to prevent software piracy may actually reduce software sales, and hence software manufacturers currently prefer to detect piracy rather than prevent it [BDGa95].

Drawing on this analogy from the software industry, we advocate detecting illegal copies rather than trying to prevent copying. In the copy detection approach, there are two important orthogonal questions.

    1. Is a document at a certain Web site or an article posted on a newsgroup an illegal copy of some pre-existing document?
    2. If the document is an illegal copy, who was the originator of the illegal copy?
We will now look at some popular schemes to address each of the two questions.

Registration Server

One popular answer to the first question (that we also pursue) is to build a registration server: documents are registered into a repository, and query documents are checked against the documents in the repository to detect any possible copies [PaHa89, BDGa95, ShGa95]. In Figure 1 (below), we show an example of a generic registration server, which consists of a repository (possibly distributed) of documents. A stream of documents may arrive to be registered into the database, or to be checked against the current repository of documents for possible overlaps.
There have been several approaches to building efficient registration servers [BDGa95, Man94, ShGa95] that scale to hundreds of thousands of documents. In Architecture of SCAM, we report on the Stanford Copy Analysis Mechanism, one of the most recent registration servers from our research group, currently available on the Internet.
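To make the registration server idea concrete, here is a minimal Python sketch of one common technique, chunk hashing: each registered document contributes hashes of overlapping word sequences to an index, and a query document is scored by the fraction of its chunks that also appear in some registered document. The class, chunk size and overlap measure here are illustrative only; SCAM's actual chunking units and overlap measures are described in [ShGa95].

```python
from collections import defaultdict

def chunks(text, size=5):
    """Yield hashes of overlapping word n-grams ("chunks") of a document."""
    words = text.lower().split()
    for i in range(max(len(words) - size + 1, 1)):
        yield hash(tuple(words[i:i + size]))

class RegistrationServer:
    def __init__(self):
        self.index = defaultdict(set)   # chunk hash -> ids of documents containing it

    def register(self, doc_id, text):
        for c in chunks(text):
            self.index[c].add(doc_id)

    def check(self, text):
        """For each registered document, return the fraction of the
        query document's chunks that it also contains."""
        counts = defaultdict(int)
        query_chunks = list(chunks(text))
        for c in query_chunks:
            for doc_id in self.index.get(c, ()):
                counts[doc_id] += 1
        return {doc_id: n / len(query_chunks) for doc_id, n in counts.items()}
```

A query document whose score against some registered document exceeds a chosen threshold would then be flagged for human inspection as a possible full or partial copy.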


Figure 1

Document Signatures

Once a document is known to be an illegal copy (through one of the registration server schemes outlined above), it is sometimes useful to know who originated the illegal copy. For instance, Books'R'Us would like to know which one of its few paying customers is retransmitting its books for commercial advantage (or for free). In signature based schemes, a "signature" is added to each document, and this signature can be used to trace the origins of each document. One popular approach is to incorporate unique watermarks such as word spacings and checksums into documents [BoSh95, BLMG94a, BLMG94b, CMPS94], so that when an illegal copy is found, the book distribution server at Books'R'Us can check which customer was sold the book with that particular signature.
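As a toy illustration of the signature idea (and not a rendering of the marking schemes cited above), the sketch below encodes a hypothetical customer identifier in the spacing between words of an ASCII document. Real schemes such as [BLMG94a, BLMG94b] perturb line and word positions in the typeset page and are designed to be far more robust to copying and reformatting.

```python
import re

ID_BITS = 16   # width of the (hypothetical) customer identifier

def embed_signature(text, customer_id):
    """Mark a document: bit i of customer_id decides whether the i-th
    inter-word gap is one space or two (repeating every ID_BITS gaps)."""
    words = text.split()
    if not words:
        return text
    bits = [(customer_id >> i) & 1 for i in range(ID_BITS)]
    marked = [words[0]]
    for i, word in enumerate(words[1:]):
        marked.append(("  " if bits[i % ID_BITS] else " ") + word)
    return "".join(marked)

def extract_signature(marked_text):
    """Recover the customer identifier from the first ID_BITS gaps."""
    gaps = re.findall(r" +", marked_text)
    customer_id = 0
    for i, gap in enumerate(gaps[:ID_BITS]):
        if len(gap) > 1:
            customer_id |= 1 << i
    return customer_id
```

When an unauthorized copy surfaces, the distributor would extract the identifier and look up which customer's copy carried that signature.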


We believe that by combining the notion of a registration server to catch illegal copies with document signatures to find the originator of each illegal copy, we can catch most illegal copies that occur on the Internet (we will show how SCAM catches illegal copies in Possible Applications of SCAM). Of course, registration server based solutions cannot catch cases where a user prints out multiple copies of a digital document and redistributes them to his friends. We believe, however, that hard copies of digital books will not have the same "value" as digital books (no hyperlinks to alternate sites with more information on certain topics, for instance), and hence do not consider this a serious problem. Another problem is that users could illegally transmit books to a "small" number of friends without our system catching it, since our system can access only public sources such as Web sites, UseNet articles, public mailing lists and other public domains. However, we believe that our system will at least catch illegal digital retransmissions of documents on publicly accessible sources, thereby catching the more serious global transgressions.



Duplicate Detection in Information Retrieval

In this age of information overload, an important value-added service that should be performed by search engines and databases is to remove duplicates of articles before presenting search results to users. One information dissemination system that automatically removes duplicate netnews articles is the SIFT server built in our research group [YaGa95a]. Duplicates or near-duplicates of documents may exist due to multiple formats, or because of replication of articles by cross-posting to newsgroups, forwarding of articles, and so on. In [YaGa95b], we show how a CDB (Copy Detection Blackbox) may be used to automatically remove multiple copies of the same article, and how a user may dynamically discard certain classes of articles that have sufficient overlap.
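The sketch below shows the kind of filtering such a CDB would perform over Sheng's search results, assuming an overlap(a, b) function (for example, the chunk-based score sketched earlier) that returns the fraction of content two articles share. The function names and thresholds are illustrative and are not the interface of [YaGa95b].

```python
def dedupe(articles, overlap, threshold=0.8):
    """Keep one representative of each group of exact or near-duplicates."""
    kept = []
    for article in articles:
        if not any(overlap(article, seen) > threshold for seen in kept):
            kept.append(article)
    return kept

def discard_like(discarded_article, remaining, overlap, threshold=0.5):
    """Sheng's dynamic-search wish: discarding one article also drops every
    remaining article that substantially overlaps it."""
    return [a for a in remaining if overlap(a, discarded_article) <= threshold]
```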

A promising approach to building a CDB is to use document clustering techniques. In the Scatter/Gather clustering approach [CKPT92], users can dynamically recluster documents based on the topics they wish to pursue and those they wish to discard. However, this approach appears more natural as an online algorithm that people may use to browse a static set of documents (for a given search session) and dynamically recluster the articles in that set based on what they wish to pursue and what they discard.

We view the CDB as user-specific rather than session-specific, in that users may want to discard duplicate copies of articles they have seen even in earlier search sessions. For instance, a user may not want to receive multiple copies of a conference Call for Papers through different mailing lists over a period of time. In Architecture of SCAM, we present the underlying registration mechanism of SCAM, which could be used as a "plug-in" module for this abstract CDB.


Architecture of SCAM

Figure 2

We first show SCAM from the user's perspective as a black-box entity on the Internet. We will then briefly describe the underlying implementation of SCAM.

In Figure 2, we show conceptually how our current implementation of SCAM is accessible through the Internet. Netnews articles from UseNet groups and from certain large mailing lists are currently registered into a publicly accessible repository in SCAM on a daily basis. We have also developed a form-based Web client and a bulk loader, so that users across the Internet may send documents in different formats (such as ASCII, HTML, Postscript) to be registered into their private databases, or to be queried against their private databases or the public repository.
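A client-side sketch of such a submission is shown below. The endpoint path and form field names are hypothetical stand-ins for illustration only; the actual form-based SCAM interface and its parameters are not reproduced here.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and field names, for illustration only.
SCAM_SUBMIT_URL = "http://scam.stanford.edu/submit"

def submit_document(text, database, action="register", doc_format="ascii"):
    """POST a document to a registration server, either to register it in
    `database` or to query it against that database (action="query")."""
    form = urllib.parse.urlencode({
        "action": action,        # "register" or "query"
        "database": database,    # e.g. a private database or the public repository
        "format": doc_format,    # ascii, html, postscript
        "document": text,
    }).encode("utf-8")
    with urllib.request.urlopen(SCAM_SUBMIT_URL, data=form) as response:
        return response.read().decode("utf-8", errors="replace")
```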

Figure 3

In Figure 3, we show the underlying architecture of SCAM that provides the functionality in Figure 2. SCAM currently runs on a DEC 3000 at scam.stanford.edu. It has the traditional database components of a buffer manager controlling a buffer (10 megabytes) and a disk (1 gigabyte). There are several databases on the disk, which may be part of the public repository (such as Netnews and mailing lists) or may be personal user-owned databases (such as Sheng's or Books'R'Us's). Different servers (like the Web server) have been implemented to provide different interfaces for users accessing SCAM, and different parsers are employed to handle multiple formats (HTML, Postscript, ASCII, etc.).
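The role of the parser layer is to normalize each incoming document to plain text before it is chunked and registered. The sketch below illustrates that dispatch under stated assumptions: the HTML extractor is a simple stand-in and the Postscript parser is omitted, so this is not SCAM's actual parser code.

```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the markup."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def parse_html(raw):
    extractor = HTMLTextExtractor()
    extractor.feed(raw)
    return " ".join(extractor.parts)

def parse_ascii(raw):
    return raw

# Postscript is omitted here; a full system would extract its text as well.
PARSERS = {"html": parse_html, "ascii": parse_ascii}

def to_plain_text(raw, doc_format):
    """Dispatch to the parser for the document's format before chunking."""
    try:
        return PARSERS[doc_format](raw)
    except KeyError:
        raise ValueError("no parser for format: " + doc_format)
```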



Possible Applications of SCAM

There are several possible ways in which SCAM may be used. We now outline some of the more interesting applications of SCAM.
1. Book companies, authors, commercial UseNet groups and professional societies (like the ACM and SIGMOD) that have valuable digital documents may create their own personal databases with SCAM, and register their digital documents (through our Web server, mail server or bulk loaders) into those databases. SCAM will then check UseNet newsgroups, mailing lists and some Web sites on a daily basis for full or partial overlap (similar sentences, paragraph chunks, etc.) and will notify the appropriate user of the overlap and its source.

2. Less serious users can probe the public databases and check whether some document they are interested in overlaps with some article in one of the public sources (UseNet newsgroups, Web pages, mailing lists) from the previous few days.

3. Class instructors may create their own databases, store in them any articles they find relevant to the classes they teach, and also register digitally submitted homeworks into the database. When the class is offered again, they may use the database to check for significant overlap with previous homeworks and other registered articles. Similarly, Program Committee Chairs of conferences and editors of journals may use databases specific to their field to check whether a new submission overlaps significantly with some previous paper in the field.




We believe copy detection mechanisms such as SCAM will play an important role in the economics of Digital Libraries by providing digital publishers with illegal copy detection guarantees, thereby inducing paper-based publishers to switch to digital publishing. Automatic duplicate removal will also become increasingly important as the number of digital formats for documents, and the number of sites storing those documents, increases.

Since the number of digital documents is growing rapidly every day, an important area of research is how to make copy detection mechanisms scale to such large numbers of articles without losing accuracy in overlap detection. We are currently considering a distributed version of SCAM for reasons of scalability. We are also experimenting with different approaches to copy detection that have different levels of expected accuracy and expense.



References

[BDGa95]
S. Brin, J. Davis, H. Garcia-Molina, Copy Detection Mechanisms for Digital Documents. Proceedings of the ACM SIGMOD Annual Conference, San Francisco, CA, May 1995.
[BoSh95]
D. Boneh, J. Shaw, Collusion-secure fingerprinting for digital data. Technical Report 468, Computer Science Department, Princeton University, January 1995.
[BLMG94a]
J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman, Document marking and identification using both line and word shifting. Technical Report, AT&T Bell Labs, 1994, ftp://ftp.research.att.com/dist/brassil/docmark2.ps
[BLMG94b]
J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman, Electronic marking and identification techniques to discourage document copying. Technical Report, AT&T Bell Labs, 1994, ftp://ftp.research.att.com/dist/brassil/docmark2.ps
[CKPT92]
D.R. Cutting, D.R. Karger, J.O. Pedersen, J.W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proceedings of the 15th Annual ACM SIGIR Conference, Copenhagen, Denmark, June 1992.
[CMPS94]
A. Choudhury, N. Maxemchuk, S. Paul, H. Schulzrinne, Copyright protection for electronic publishing over computer networks. Technical Report, AT&T Bell Labs, 1994.
[Gr93]
G.N. Griswold, A method for protecting copyrights on networks. Joint Harvard-MIT Workshop on Technology Strategies for Protecting Intellectual Property in the Networked Multimedia Environment.
[Man94]
U. Manber, Finding similar files in a large file system. Proceedings of the USENIX Winter Technical Conference, pp. 1-10, San Francisco, CA, January 1994.
[PaHa89]
A. Parker, J.O. Hamblen, Computer algorithms for plagiarism detection. IEEE Transactions on Education, 32(2):94-99, May 1989.
[PoKl79]
G.J. Popek, C.S. Kline, Encryption and secure computer networks. ACM Computing Surveys, 11(4):331-356, December 1979.
[ShGa95]
N. Shivakumar, H. Garcia-Molina, SCAM: A Copy Detection Mechanism for Digital Documents. Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries, Austin, Texas, 1995.
[YaGa95a]
T. Yan, H. Garcia-Molina, SIFT - A Tool for Wide-area Information Dissemination. Proceedings of the USENIX Technical Conference, 1995.
[YaGa95b]
T. Yan, H. Garcia-Molina, Duplicate Detection in Information Dissemination. Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95), Zurich, Switzerland, September 1995.


Copyright © Narayanan Shivakumar and Hector Garcia-Molina