The SCAM Approach To Copy Detection in Digital Libraries
Narayanan Shivakumar and
Hector Garcia-Molina
{shiva, hector}@cs.stanford.edu
Department of Computer Science
Stanford University
Stanford, CA 94305, U.S.A
Scenario 1
Your local publishing company Books'R'Us decides to publish on
the Internet its latest book in an effort to cut down on printing paper
costs, book distribution hassles and in support of the concept of Digital
Libraries. Customers pay for the digital books using sophisticated electronic
payment mechanisms from DigiCash, First Virtual or InterPay. When the payment is
received, the book distribution server at Books'R'Us sends a digital version
of the book electronically to the paying customer. Books'R'Us expects to make
higher profits on the digital book for that fiscal year due to lower
production and distribution costs, and higher availability on the Internet.
At the end of the year, however, very few books have been sold, since
digital copies of the Books'R'Us book have been widely
circulated on UseNet newsgroups and bulletin boards, and have been available for
free on alternative ftp sites and Web servers. Books'R'Us retracts its
digital publishing commitment, blaming the ease of retransmission of digital
items on the Internet, and returns to traditional paper-based publishing.
Scenario 2

Sheng wants to buy a new Pentium portable, and hence wants to read
articles and reviews on the different brands available before
choosing one. She searches
information services like Dialog, Lycos, Gloss and Webcrawler, and follows
UseNet newsgroups to find articles on the different portables available and
finds nearly 1500 articles. When she starts reading the
articles, she finds that most of them are really duplicates of one another
and do not contribute any new information to her search. She realizes this
is because most databases maintain their own local copies of different
articles, perhaps in different formats (Word, Postscript, HTML), or perhaps
have mirror sites that contain the same set of articles.
Sheng then trudges through the articles one-by-one wishing that somebody would
build a system that can remove exact or near-duplicates automatically so that
she only needs to read each distinct article.
Around article number 150, Sheng decides not to buy a certain brand, since
she learns from the articles that it has had problems with its
color display since its release. But she has to continue looking at articles
on that model, since they are already part of the result set. She adds to
her wish list a dynamic search system in which she could discard any article
that has more than a certain overlap with some article she had previously
discarded.
In this article, we will give a brief overview of some proposed mechanisms
that address the problems illustrated by these two scenarios.
In Copy Guarantees for
Digital Publishers, we consider some proposals that show how to provide
reasonable guarantees to digital publishers that their books are not being
retransmitted easily on the Internet. In Duplicate Detection
in Information Retrieval, we present some mechanisms that can be used
to remove near-duplicate documents (such as multiple formats of the same document).
We will then present the SCAM Registration Server approach that solves both
the illegal copy detection problem, and the duplicate document detection
problem.
Copy Guarantees for Digital Publishers

Some publishing entities such as the Institute of Electrical and Electronics
Engineers (IEEE) have sought to prevent
illegal copying of documents by placing the documents on stand-alone CD-ROM
systems, while others have chosen to use special purpose
hardware [PoKl79] , or active documents [Gr93] which
are documents encapsulated by programs. We believe that such prevention
techniques may be cumbersome, may get in the way of the honest user, and
may make it more difficult to share information. In fact, in the software
industry, it has been observed that measures to prevent software piracy
may actually reduce software sales, and hence software manufacturers
currently prefer to detect piracy rather than prevent it [BDGa95].
Drawing on this analogy from the software industry, we advocate detecting
illegal copies rather than trying to prevent copying. In the copy
detection approach, there are two important orthogonal questions.
1. Is a document at a certain Web site or an article posted on a newsgroup
an illegal copy of some pre-existing document?
2. If the document is an illegal copy, who was the originator of the illegal
copy?
We will now look at some popular schemes to address each of the two
questions.
Registration Server
One popular answer to the first question (that we also pursue) is to build
a registration server: documents are registered into a repository, and query
documents are checked with the documents in the repository to detect
any possible copies [PaHa89, BDGa95, ShGa95].
In Figure 1 (below), we show an example of a generic registration server, which
consists of a (possibly distributed) repository of documents. A stream of
documents may arrive to be registered into the database, or to be checked
against the current repository of documents for possible overlaps.
There have been several approaches to building efficient registration servers
[BDGa95, Man94, ShGa95] that scale to storing several
hundred thousand documents. In Architecture of SCAM,
we report on the Stanford Copy Analysis Mechanism, one of the most recent
registration servers from our research group, currently available
on the Internet.
![](genCopyDet.gif)
Figure 1
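As a rough illustration of the register/check cycle (and not SCAM's actual comparison scheme or storage structures), a registration server's two operations might look like the following sketch, which uses word-chunk ("shingle") overlap as a stand-in measure; the class name and threshold here are purely illustrative.

```python
from collections import defaultdict

def shingles(text, k=5):
    """Return the set of k-word chunks ("shingles") in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

class RegistrationServer:
    def __init__(self):
        self.docs = {}                  # doc_id -> shingle set
        self.index = defaultdict(set)   # shingle -> ids of documents containing it

    def register(self, doc_id, text):
        """Store a document in the repository for later comparisons."""
        s = shingles(text)
        self.docs[doc_id] = s
        for sh in s:
            self.index[sh].add(doc_id)

    def check(self, text, threshold=0.3):
        """Return (doc_id, overlap) pairs for registered documents whose
        overlap with the query text is at least `threshold`."""
        s = shingles(text)
        counts = defaultdict(int)
        for sh in s:
            for doc_id in self.index.get(sh, ()):
                counts[doc_id] += 1
        denom = max(len(s), 1)
        return [(doc_id, c / denom) for doc_id, c in counts.items()
                if c / denom >= threshold]
```

In this picture, a stream of arriving articles would be fed to register(), while a suspect document would be passed to check() to see which registered documents it overlaps.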
Document Signatures
Once a document is known to be an illegal copy (through one of the registration
server schemes outlined above), it is sometimes useful to
know who was the originator of the illegal copy. For instance, Books'R'Us
would like to know which one of its few paying customers is retransmitting
its books for commercial advantage (or for free). In signature-based schemes,
a "signature" is added to each document, and this signature can be used
to trace the origins of each document. One popular approach is to incorporate
unique watermarks such as word spacings and checksums into documents
[BoSh95, BLMG94a, BLMG94b, CMPS94] so that when an
illegal document is found, the book distribution server at Books'R'Us
can check which customer was sold the book with that particular signature.
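The bookkeeping side of such a scheme can be pictured with a short sketch. The hypothetical SaleLedger below only records which signature token was sold to which customer and then maps a recovered signature back to its buyer; it deliberately ignores how the signature is actually embedded in the document (word spacings, line shifts, checksums), which is the subject of the cited watermarking work.

```python
import hmac, hashlib

SECRET_KEY = b"distribution-server-secret"   # hypothetical server-side key

def make_signature(customer_id: str, book_id: str) -> str:
    """Derive a signature token that is unique per (customer, book) sale."""
    msg = f"{customer_id}:{book_id}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:16]

class SaleLedger:
    """Maps embedded signatures back to the customer who bought that copy."""
    def __init__(self):
        self.by_signature = {}

    def record_sale(self, customer_id, book_id):
        sig = make_signature(customer_id, book_id)
        self.by_signature[sig] = (customer_id, book_id)
        return sig                      # to be embedded into the delivered copy

    def trace(self, signature_found_in_copy):
        """Given a signature extracted from an illegal copy, name its buyer."""
        return self.by_signature.get(signature_found_in_copy)
```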
We believe that by combining a registration server to
catch illegal copies with document signatures to find the originator
of each illegal copy, we can catch most illegal copies that
occur on the Internet (we will show how SCAM catches illegal copies in
Possible Applications of SCAM).
Of course, registration-server-based solutions cannot catch
cases where a user prints out multiple copies of a digital document
and redistributes them to his friends. We believe, however, that
hard copies of digital books will not have the same "value" as digital books
(no hyperlinks to alternate sites with more information on certain topics,
for instance), and hence we do not consider this a serious problem.
Another problem is that users could transmit books to a "small"
number of friends illegally without our system catching it since our
system can access only public sources such as Web sites, UseNet articles,
public mailing lists, and other public domains. However, we believe that
our system will at least catch illegal digital retransmissions of documents
on publicly accessible sources, thereby catching the more serious
global transgressions.
Duplicate Detection in Information Retrieval

In this age of information overload, an important value-added service that
should be performed by search engines and databases is to remove duplicate
articles before presenting search results to users.
An information dissemination system that automatically removes duplicate
netnews articles is the
SIFT server built in our research
group [YaGa95a]. Duplicates or near-duplicates
of documents may exist due to multiple formats, or because articles are
replicated by cross-posting to newsgroups, forwarding, and so on.
In [YaGa95b], we show how a CDB (Copy
Detection Blackbox) may be used to automatically remove multiple copies
of the same article, and how a user may dynamically discard certain
classes of articles that have sufficient overlap.
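One way to picture such a CDB is as a thin filter in front of a registration server. The sketch below reuses the RegistrationServer sketched earlier, with illustrative thresholds: it drops articles that are near-copies of ones already shown, as well as anything that overlaps an article the user has explicitly discarded. It is an approximation of the behavior described in [YaGa95b], not its actual interface.

```python
class CDBFilter:
    """User-specific filter: suppresses near-copies of articles already shown,
    and anything overlapping an article the user has explicitly discarded."""
    def __init__(self, server, dup_threshold=0.9, discard_threshold=0.5):
        self.server = server                  # a RegistrationServer-like object
        self.dup_threshold = dup_threshold
        self.discard_threshold = discard_threshold
        self.discarded = set()                # ids of articles the user discarded

    def present(self, doc_id, text):
        """Return True if the article should be shown to the user."""
        matches = self.server.check(text, self.discard_threshold)
        for match_id, overlap in matches:
            if match_id in self.discarded:
                return False                  # overlaps a discarded article
            if overlap >= self.dup_threshold:
                return False                  # near-duplicate of one already shown
        self.server.register(doc_id, text)    # remember it for later comparisons
        return True

    def discard(self, doc_id):
        """Mark an article as discarded so later overlapping ones are dropped."""
        self.discarded.add(doc_id)
```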
A promising approach to building a CDB is to use document clustering
techniques. In the
Scatter/Gather clustering approach [CKPT92], users
can dynamically recluster documents based on topics they wish to
pursue and those they wish to discard. However, this approach appears
more natural as an online algorithm which people may use to browse through
a static set (for a given search session) of different topics and dynamically
recluster articles in that set based on what they wish to pursue, and what
they discard.
We view the CDB as user-specific rather than session-specific,
in that users may want to discard duplicate copies of articles they
may have seen even in earlier search sessions. For instance, a user
may not want to receive multiple copies of a conference Call for Papers
through different mailing lists over a period of time. In
Architecture of SCAM, we present the underlying registration mechanism
of SCAM that could be used as a "plug-in" module for this abstract CDB.
Architecture of SCAM

Figure 2

We first show SCAM from the user's perspective as a black-box entity on
the Internet. We will then briefly describe the underlying implementation
of SCAM.
In Figure 2, we show conceptually how our current implementation of SCAM
is accessible through the Internet. Netnews articles from UseNet groups
and from certain large mailing lists are currently registered into SCAM's
publicly accessible repository on a daily basis. We have also
developed a form-based Web client and a bulk loader so that users across the
Internet may send documents of different formats (such as ASCII, HTML,
Postscript) to be registered into their private
databases, or to be queried against their private databases or the
public repository.
Figure 3
In Figure 3, we show the underlying architecture of SCAM that provides the
functionality in Figure 2. SCAM currently runs on a DEC 3000 at
scam.stanford.edu. It has the traditional database components of a buffer
manager controlling a buffer (10 Megabytes), and a disk (1 Gigabyte). There
are several databases on the disk which may be part of the public repository
(such as Netnews, Mailing Lists) or may be personal user-owned databases
(such as Sheng's or Books'R'Us). Different servers (like the Web Server)
have been implemented to provide different interfaces for users accessing
SCAM, and different parsers are employed to handle multiple formats (HTML,
Postscript, ASCII, etc.).
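As a rough illustration of how the front-end servers and parsers fit together, the sketch below routes an incoming document to a format-specific parser and then registers or checks it against the named database. The parser functions, database names, and handle_request interface are all hypothetical, and the sketch assumes the RegistrationServer class from the earlier sketch; a real Postscript converter is not shown.

```python
import re

def parse_ascii(data):
    return data

def parse_html(data):
    return re.sub(r"<[^>]+>", " ", data)       # crude tag stripping

def parse_postscript(data):
    raise NotImplementedError("a real Postscript-to-text converter is needed here")

PARSERS = {"ascii": parse_ascii, "html": parse_html, "ps": parse_postscript}

# Public and private databases, each backed by its own repository.
DATABASES = {"netnews": RegistrationServer(), "sheng": RegistrationServer()}

def handle_request(database, fmt, doc_id, raw, action="register"):
    """Parse `raw` according to `fmt`, then register it into or check it
    against the named database."""
    text = PARSERS[fmt](raw)
    server = DATABASES[database]
    if action == "register":
        server.register(doc_id, text)
        return []
    return server.check(text)
```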
Possible Applications of SCAM

There are several possible ways in which SCAM may be used. We now outline some
of the more interesting applications of SCAM.
- 1. Book companies, authors, commercial UseNet groups and professional
societies (like ACM, SIGMOD) that have valuable digital documents may create
their own personal databases with SCAM, and register their digital documents
(through our Web server, mail server or bulk loaders) into their databases. SCAM
will then check UseNet newsgroups, mailing lists and some Web sites on a daily
basis for full or partial overlap (similar sentences, paragraph chunks, etc.)
and will notify the appropriate user of the overlap and the source.
- 2. Less serious users can probe the public databases and check whether
some document they are interested in overlaps with some article in
one of the public sources (UseNet newsgroups, Web pages, mailing lists)
over the previous few days.
- 3. Class instructors may create their own databases, store in them any
articles they find relevant for the classes they teach, and also register
digitally submitted homeworks into the database. When the class is offered
again, they may use the database to check for any significant overlap with
previous homeworks and other registered articles. Similarly, program
committee chairs of conferences and editors of journals may use databases
specific to their field to check whether any new submission overlaps significantly
with some previous paper in the field.
We believe copy detection mechanisms such as SCAM will play an important
role in the economics of Digital Libraries by providing digital publishers
with illegal copy detection guarantees, thereby inducing paper-based publishers
to switch
to digital publishing. Automatic duplicate removal of documents will also
become increasingly important as the number of digital formats for documents,
and the number of sites storing these documents, increase.
Since the number of digital documents is increasing rapidly, an
important area of research is how to make copy detection mechanisms
scale to such large numbers of articles without losing accuracy in overlap
detection. We are currently considering a distributed version of SCAM for
reasons of scalability. We are also experimenting with different approaches
to copy detection which have different levels of expected accuracy and
expense.
- [BDGa95]
- S. Brin, J. Davis, H. Garcia-Molina, Copy Detection Mechanisms for Digital Documents.
Proceedings of the ACM SIGMOD Annual Conference, San Francisco, CA,
May 1995.
- [BoSh95]
- D. Boneh, J. Shaw, Collusion-secure fingerprinting for digital data.
Technical Report 468, Computer Science Department, Princeton University,
January 1995.
- [BLMG94a]
- J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman,
Document marking and identification using both line and word shifting.
Technical Report, AT&T Bell Labs, 1994, ftp://ftp.research.att.com/dist/brassil/docmark2.ps
- [BLMG94b]
- J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman,
Electronic marking and identification techniques to discourage document copying.
Technical Report, AT&T Bell Labs, 1994, ftp://ftp.research.att.com/dist/brassil/docmark2.ps
- [CKPT92]
- D.R. Cutting, D.R. Karger, J.O. Pedersen, J.W. Tukey,
Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections.
Proceedings of the 15th Annual SIGIR Conference, Copenhagen, Denmark, 1992.
- [CMPS94]
- A. Choudhury, N. Maxemchuk, S. Paul, H. Schulzrinne,
Copyright protection for electronic publishing over computer networks.
Technical Report, AT&T Bell Labs, 1994.
- [Gr93]
- G.N. Griswold, A method for protecting copyrights on networks.
Joint Harvard-MIT Workshop on Technology Strategies for Protecting
Intellectual Property in the Networked Multimedia Environment, 1993.
- [Man94]
- U. Manber, Finding similar files in a large file system.
Proceedings of USENIX Conference, 1-10, San Francisco, CA, January 1994.
- [PaHa89]
- A. Parker, J.O. Hamblen, Computer algorithms for plagiarism detection.
IEEE Transactions on Education, 32(2):94-99, May 1989.
- [PoKl79]
- G.J. Popek, C.S. Kline, Encryption and secure computer networks.
ACM Computing Surveys, 11(4): 331-356, December 1979.
- [ShGa95]
- N. Shivakumar, H. Garcia-Molina, SCAM: A Copy Detection Mechanism for Digital
Documents.
Proceedings of the 2nd International Conference on Theory and Practice
of Digital Libraries, Austin, Texas, 1995.
- [YaGa95a]
- T. Yan, H. Garcia-Molina, SIFT - A Tool for wide-area information
dissemination.
Proceedings of USENIX, 1995
- [YaGa95b]
- T. Yan, H. Garcia-Molina, Duplicate detection in information dissemination.
Proceedings of the Very Large Data Bases (VLDB '95) Conference, Zurich, Switzerland, September 1995.
Copyright © Narayanan Shivakumar and Hector Garcia-Molina