Abstract
In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible while the latter makes it easier to discover such activity.
In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.
1 Introduction
Digital libraries are a concrete possibility today because of many technological advances in areas such as storage and processor technology, networks, database systems, scanning systems, and user interfaces. In many respects, building a digital library today is just a matter of ``doing it.'' However, there is a real danger that such a digital library will either have relatively few documents of interest, or will be a patchwork of isolated systems that provide very restricted access.
The reason for this danger is that the electronic medium makes it much easier to illegally copy and distribute information. If an information provider gives a document to a customer, the customer can easily distribute it on a large mailing list or post it on a bulletin board. The danger of illegal copies is not new, of course; however, it is much more time consuming to reproduce and distribute paper, CD, or videotape copies than on-line documents.
Current technology does not strike a good balance between protecting the owners of intellectual property and giving access to those who need the information. At one extreme are the open sources on the Internet, where everything is free, but valuable information is frequently unavailable because of the dangers of unauthorized copying.
*This research was sponsored by the Advanced Research Projects Agency (ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI).
Clearly, one would like to have an infrastructure that gives users access to a wide variety of digital libraries and information sources, but that at the same time gives information providers good economic incentives for offering their information. In many ways, we believe this is the central issue for future digital information and library systems.
In this paper we present one component of the information infrastructure that addresses this issue. The key idea is quite simple: provide a
copy detection service where original documents can be registered, and copies can be detected. The service will detect not just exact copies, but also documents that overlap in significant ways. The service can be used (see Section 2) in a variety of ways by information providers and communications agents to detect violations of intellectual property laws. Although the copy detection idea is simple, there are several challenging issues involving performance, storage capacity, and accuracy that we address here. Furthermore, copy detection is relevant to the ``database community'' since its central component is a large database of registered documents.
We stress that copy detection is not the complete solution by any means; it is simply a helpful tool. There are a number of other important ``tools'' that will also assist in safeguarding intellectual property. For example, good encryption and authorization mechanisms are needed in some cases. It is also important to have mechanisms for charging for access to information. The articles in [5, 7, 9] discuss a variety of other topics related to intellectual property. These other tools and topics will not be covered in this paper.
2 Safeguarding Intellectual Property
How can we ensure that a document is only seen and used by a person who is authorized (e.g., has paid) to see it? One possibility lies in preventing violations from occurring. Some schemes along these lines have been suggested, such as secure printers with cryptographically secure communications paths [12] and active documents [6] where users may interact with documents only through a special program.
The problem with all such techniques is that they are cumbersome (requiring special software or hardware), restrictive (limiting the means of access to the document), and not completely secure (consider, for example, someone with an OCR program). The alternative is to use
detection techniques. That is, we assume most users are honest, allow them access to the documents, and focus on detecting those that violate the rules. Many software vendors have found this approach to be superior (protection mechanisms get in the way of honest users, and sales may actually decrease).
One possible direction is ``watermark'' schemes, where a publisher incorporates a unique, subtle signature (identifying the user) in a document when it is given to the user, so that when an unauthorized copy is found, the source will be known [13, 3, 4, 2]. One problem is that these schemes can easily be defeated by users who destroy the watermarks. For example, slightly varied pixels in an image would be lost if the image is passed through a lossy JPEG encoder.
A second approach, and one that we advocate in this paper (for text documents), is that of a
copy detection server [1, 11]. The basic idea is as follows: When an author creates a new work, he or she registers it at the server. The server could also be the repository for a copyright recordation and registration system, as suggested in [8]. As documents are registered, they are broken into small units, for now say sentences. Each sentence is hashed and a pointer to it is stored in a large hash table. Documents can be compared to existing documents in the repository, to check for plagiarism or other types of significant overlap. When a document is to be checked, it is also broken into sentences. For each sentence, we probe the hash table to see if that particular sentence has been seen before. If the document and a previously registered document share more than some
threshold number of sentences, then a violation is flagged. The threshold can be set depending on the desired checks: smaller if we are looking for copied paragraphs, larger if we only want to check whether documents share large portions. A human would then have to examine both documents to see if it was truly a violation.1
1 As just one example, Knight-Ridder Tribune recently (June 23, 1994) ceased publishing the Dave Barry and Mike Royko columns on ClariNet because subscribers re-distributed the articles on large mailing lists.
This basic scheme has much in common with
sif, a tool for finding similar files in a file system, created by Udi Manber [10]. However, there are a number of differences between finding similar files and finding similar sections of text, which COPS addresses. First, since we are dealing with text, we operate on a syntactic level and hash syntactic units as opposed to fixed length strings. We also consider the security of copy detection (Section 3.3) and attempt to maximize its flexibility by dealing with violations of varying granularities (Section 4). One of the most important differences is that it is much more difficult to test a system like COPS, since there are no databases of actual copyright violations (Section 5).
The copy detection server can be used in a variety of ways. For example, a publisher is legally liable for publishing material the author does not hold the copyright on; thus, it may wish to check if a soon-to-be-published document is actually an original document. Similarly, bulletin-board software may automatically check new postings in this fashion. An electronic mail gateway may also check the messages that go through (checking for ``transportation of stolen goods''). Program committee members may check if a submission overlaps too much with an author's previous paper. Lawyers may want to check subpoenaed documents to prove illegal behavior. (Copy detection can also be used for computer programs [11], but we only focus on text in this paper.) There are also applications that do not involve detection of undesirable behavior. For example, a user who is retrieving documents from an information retrieval system, or who is reading electronic mail, may want to flag duplicate items (with a given overlap threshold). Here the ``registered'' documents are those that have been seen already; the ``copies'' represent messages that are retransmitted or forwarded many times, different editions or versions of the same work, and so on. Of course, potential duplicates should not be deleted automatically; it is up to the user to decide if he wants to view possible duplicates.
In summary, we think that detecting copies of text documents is a fundamental problem for distributed information or database systems, and there are many issues that need to be addressed. For instance, should the decomposition units be paragraphs or something else instead of sentences? Should we take into account the order of the units (paragraphs or sentences), e.g., by hashing sequences of units? Is it feasible to hash only a fraction of the sentences of registered documents? This would make the hash table smaller, hopefully while still making it very likely that we will catch major violations. If the hash table is relatively small, it can be cloned. Our mail gateway above could then perform its checks locally, instead of having to contact a remote copy detection server for each message. There are also implementation issues that need to be addressed. For example, how are sentences extracted from, say, Latex or Word documents? Can one extract them from Postscript documents, or from bit maps via OCR?
These and other questions will be addressed in the rest of this paper. We start in Sections 3 and 4 by defining the basic terms, evaluation metrics, and options for copy detection. Then in Section 5 we describe our working prototype, COPS, and report on some initial experiments. A sampling technique that can reduce the storage space of registered documents or can speed up checking time is presented and analyzed in Section 6.
3 General Concepts
In this section we define some of the basic concepts for copy detection and for evaluating mechanisms that implement it. (As far as we know, text copy detection has not been formally studied, so we start from basics.) The starting point is the concept of a document, a body of text from which some structural information (such as word and sentence boundaries) can be extracted. In an initial phase, formatting information and non-textual components are removed from documents (see Section 5). The resulting canonical form document consists of a string of ASCII characters, with whitespace separating words, punctuation separating sentences, and possibly a standard method of marking the beginning of paragraphs.
A
violation occurs when a document infringes upon another document in some way (e.g., by duplicating portions of text). There are a number of violation types that can occur, including plagiarism of a few sentences, significant overlap, and outright copies of entire documents. Most of the violation tests we are interested in are not well defined and require a decision by a human being. For example, plagiarism is particularly difficult to test for. For instance, the sentence ``The proof is as follows'' may occur in many scientific papers and would not be considered plagiarism if it occurred in two documents, while this very sentence most certainly would. If we consider a test
Subset that detects if a document is essentially a subset of another one, we again need to consider whether the smaller document makes any significant contributions. This, again, requires human evaluation. The goal of a copy detection system is to implement well defined algorithmic tests, termed
operating tests (with the same notation as violation tests) that approximate the desired violation tests. For instance, consider the operating test t1(d, r) that holds if 90% of the sentences in d are contained in r. This test may be considered an approximation to the Subset test described above. If the system flags t1 violations, then a human can check whether they are indeed Subset violations.
3.1 Ordinary Operational Tests
In the rest of this paper we will focus on a specific class of operational tests, ordinary operational tests (OOTs), that can be implemented efficiently. We believe they can accurately approximate many violation tests of interest, such as Subset, Overlap, and Plagiarism.
Before we describe OOTs we need to define some primitives for specifying the level of detail at which we look at documents. As mentioned in Section 3, documents contain some structural information. In particular, documents can be divided into well defined parts consistent with the underlying structure, such as sections, paragraphs, sentences, words, or characters. We call each of these types of divisions a unit type, and particular instances of these unit types are called
units. We define a
chunk as a sequence of consecutive units in a document of a given unit type. A document may be divided into chunks in a number of ways, since chunks can overlap, may be of different sizes, and need not completely cover the document. For example, let us assume we have a document ABCDEFG, where the letters represent sentences or some other units. Then it can be organized into chunks as follows: A,B,C,D,E,F,G; or AB,CD,EF,G; or AB,BC,CD,DE,EF,FG; or ABC,CD,EFG; or A,D,G. A method of selecting chunks from a document divided into units is a chunking strategy. It is important to note that, unlike units, chunks have no structural significance to the document, and so chunking strategies cannot use structural information about the document.
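As a small illustration (our own sketch, not part of COPS), the following Python fragment generates a few of these chunkings from a sequence of units:

# A small illustration of two chunking families over a unit sequence;
# each letter stands in for a sentence or some other unit.

def nonoverlapping(units, k):
    # Chunks of k consecutive units with no overlap (e.g. AB, CD, EF, G).
    return ["".join(units[i:i + k]) for i in range(0, len(units), k)]

def overlapping(units, k):
    # Sliding window of k consecutive units (e.g. AB, BC, CD, ...).
    return ["".join(units[i:i + k]) for i in range(len(units) - k + 1)]

units = list("ABCDEFG")
print(nonoverlapping(units, 1))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(nonoverlapping(units, 2))   # ['AB', 'CD', 'EF', 'G']
print(overlapping(units, 2))      # ['AB', 'BC', 'CD', 'DE', 'EF', 'FG']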
An OOT,
o, uses hashing to detect matching chunks and is implemented by the set of procedures in Figure 1. The code is intended to convey key concepts, not an efficient or complete implementation. (Section 5 describes our actual prototype system.) First there is the preprocessing operation, PREPROCESS(R, H), that takes as input a set of registered documents R and creates the hash table H. Second, there are procedures for adding documents to H on the fly (registering new documents) and for removing them from H (un-registering documents). Third is the function EVALUATE(d, H) that computes o(d, R). To insert documents in the hash table, procedure
INSERT uses a function INS-CHUNKS(r) to break up a document r into its chunks. The function returns a set of tuples. Each <t,l> tuple represents one chunk in r, where t is the text in the chunk and l is the location of the chunk, measured in some unit. An entry is stored in the hash table for every <t,l> chunk in the document. Procedure
EVALUATE(d, H) tests a given document d for violations. The procedure uses the EVAL-CHUNKS function to break up d. The reason for using a different chunking function at evaluation time will become apparent in Section 6. For now, we can assume that INS-CHUNKS and EVAL-CHUNKS are identical, and we use CHUNKS to refer to both. After chunking, procedure
EVALUATE looks up the chunks in the hash table H, producing a set of tuples MATCH. Each <s, ld, r, lr> in MATCH represents a match: a chunk of size s at location ld in document d has the same hash key as a chunk at location lr in registered document r. The MATCH set is then given to the function DECIDE(MATCH, SIZE) (where SIZE is the number of chunks in d), which returns the set of matching registered documents. If the set is non-empty, then there was a violation, i.e., o(d, R) holds.
PREPROCESS(R,H)
CREATETABLE(H)
for each r in R INSERT(r,H)
INSERT(r,H)
C = INS-CHUNKS(r) /* OOT dependent */
for each <t, l> in C
h = HASH(t)
INSERTCHUNK(<h,r,l>, H)
DELETE(r,H)
C = INS-CHUNKS(r)
for each <t, l> in C
h = HASH(t)
DELETECHUNK(<h,r,l>, H)
EVALUATE(d,H)
C = EVAL-CHUNKS(d)
SIZE = |C|
MATCHES = {} /* empty set */
for each <t, ld> in C
h = HASH(t)
/* get all <r, lr> with matching h */
SS = LOOKUP(h, H)
for each <r, lr> in SS
MATCHES += <|t|, ld, r, lr>
return DECIDE(MATCHES,SIZE) /* OOT dependent */

Figure 1: Outline of the procedures implementing an OOT.
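The following Python fragment is a minimal executable counterpart to this pseudocode (a sketch of the idea, not the COPS implementation): it uses sentences as chunks and a simple ratio-based DECIDE with an illustrative threshold of 0.4.

# A minimal executable sketch of the Figure 1 procedures (our own code).
from collections import defaultdict
import hashlib

def chunks(text):
    # Split canonical-form text into sentence chunks paired with their positions.
    sentences = [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".")]
    return [(s, i) for i, s in enumerate(sentences) if s]

def key(chunk_text):
    return hashlib.sha1(chunk_text.lower().encode()).hexdigest()

class HashTable:
    def __init__(self):
        self.table = defaultdict(list)      # hash key -> list of (doc_id, location)

    def insert(self, doc_id, text):         # register a document
        for t, loc in chunks(text):
            self.table[key(t)].append((doc_id, loc))

    def evaluate(self, text, threshold=0.4):
        cs = chunks(text)
        counts = defaultdict(int)           # doc_id -> number of matching chunks
        for t, _ in cs:
            for doc_id, _ in self.table.get(key(t), []):
                counts[doc_id] += 1
        # DECIDE: flag registered documents sharing more than `threshold` of the chunks
        return {d for d, c in counts.items() if c / len(cs) > threshold}

H = HashTable()
H.insert("r1", "We define the units. Then we hash each sentence. Matches are counted.")
print(H.evaluate("Then we hash each sentence. Matches are counted. This part is new."))
# -> {'r1'}  (2 of the 3 test sentences match, exceeding the 0.4 threshold)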
Note that an instance of an OOT is specified simply by its chunking functions (INS-CHUNKS and EVAL-CHUNKS) and its decision function DECIDE.
In the code of Figure 1 we only store the ids of registered documents and the chunk locations in the hash table; the text of a chunk is represented only by its hash key. A simple example of a decision function is matchratio: it flags a registered document r if the number of chunks of the test document d that match chunks of r, divided by SIZE, exceeds a threshold ψ (0 < ψ ≤ 1).
3.2 Accuracy
To measure how well an operating test approximates a violation test, assume a random registered document Y drawn from a distribution R of registered documents, and a random test document X drawn from a distribution D of input documents (these distributions appear again in Section 6.1). The freq of a test t is the probability that it holds for such a random pair, i.e., freq(t) = P(t(X, Y)).
Intuitively, freq measures how frequently a test is true. If an operating test approximates a violation test well, then their freq's should be close but the converse is not true since they can accept on disjoint sets. If the freq of the operating test is small compared to the violation test it is approximating, then it is being too conservative. If it is too large then the operating test is too liberal.
Suppose we have an operating test
t2 and a violation test t1. Then we define the following measures for accuracy. (Note that these can also be applied between two operating tests and, in general, between any two tests.)
Definition 3.2 The Alpha metric corresponds to a measure of false negatives, i.e., Alpha(t1, t2) = P(¬t2(X, Y) | t1(X, Y)). Note that Alpha is not symmetric. A high Alpha(t1, t2) value indicates that operating test t2 is missing too many violations of t1.
Definition 3.3 The Beta metric is analogous to Alpha except that it measures false positives, i.e., Beta(t1, t2) = P(t2(X, Y) | ¬t1(X, Y)). Beta is not symmetric either. A high Beta(t1, t2) value indicates that t2 is finding too many violations not in t1.
Definition 3.4 The Error metric is the combination of Alpha and Beta in that it takes into account both false positives and false negatives; it is defined as Error(t1, t2) = P(t1(X, Y) ≠ t2(X, Y)). It is symmetric. A high Error value indicates that the two tests are dissimilar.
3.3 Security
So far we have assumed that the author of a test document does not know how our copy detection system works and does not intend to sabotage it. However, another important measure for an OOT is how hard it is for a malicious user to break it. We measure this notion of security in terms of how many changes need to be made to a registered document so that it will not be identified by the OOT as a copy.
Definition 3.5 The security of an OOT o (also applicable to any operating test) on a given document r, SEC(o, r), is the minimum number of characters that must be inserted, deleted, or modified in r to produce a new document r' such that o(r', r) is false. The higher the SEC(o, r) value, the more secure o is.
We can use this notion to evaluate and compare OOTs. For example, consider an OOT o1 that considers the entire document as a single chunk. Then SEC(o1, r) = 1 for all r, because changing a single character makes the document not detectable as a copy.2
As another example, consider an OOT o2 that uses sentences as chunks and a matchratio decision function with threshold ψ. Then SEC(o2, r) = (1 − ψ)·SIZE, where SIZE is the number of sentences in r. For instance, if ψ = 0.6 and our document has 100 sentences, we need to change at least 40 of them. As a third example, consider an OOT o3 that uses pairs of overlapping sentences as chunks. For instance, if the document has sentences A, B, C, ..., o3 considers chunks AB, BC, CD, ... . Here we need to modify half as many sentences as before (roughly), since each modification can affect two chunks. Thus, SEC(o3, r) is approximately equal to SEC(o2, r)/2, i.e., o3 is approximately half as secure as o2.
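Writing out the arithmetic of these two examples in the notation above (the ψ symbol is the matchratio threshold introduced in Section 3.1):

\mathrm{SEC}(o_2, r) = (1 - \psi)\,\mathrm{SIZE} = (1 - 0.6) \times 100 = 40,
\qquad
\mathrm{SEC}(o_3, r) \approx \mathrm{SEC}(o_2, r)/2 = 20 .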
Note that our security definition is weak because it assumes the adversary knows all about our OOT. However, by keeping certain information about our OOT secret we can enhance security. We can model this by having a large class of OOTs,
O, that vary only by some parameters, and then secretly choosing one OOT from O. We assume that the adversary does not know which OOT we have chosen and thus needs to subvert all of them. For this model we define SEC(O, r) as the number of characters that must be inserted, deleted, or modified to make o(r', r) false for all o in O. For examples of using classes of OOTs, see chunking strategy D of Section 4.2 and Section 6 (consider the seed for the random number generator as a parameter).
2 This assumes a decision function which does not flag a violation if there are no matches (a reasonable condition). For instance, if o1(d, r) were always true, whether or not there are any matches, then our statement would not hold.
Finally, notice that the security measures we have presented here do not address ``authorization'' issues. For example, when a user registers a document, how does the system ensure the user is who he claims to be and that he actually ``owns'' the document? When a user checks for violations, can we show him the matching documents, or do we just inform him that there were violations? Should the owner of a document be notified that someone was checking a document that violates his? Should the owner be given the identity of the person submitting the test document? These are important administrative questions that we do not attempt to address in this paper.
4 Taxonomy of OOTs
The units selected, the chunking strategy, and the decision function can affect the accuracy and the security of an OOT. In this section we consider some of the options and the tradeoffs involved.
4.1 Units
To determine how documents are to be divided into chunks, we must first choose the units. One key factor to consider is the number of characters in a unit. Larger units (all else being equal) will tend to generate fewer matches and hence will have a smaller freq and be more selective. This can, of course, be compensated for by changing the chunk selection strategy or the decision function.
Another important factor in the choice of a unit type is the ease of detecting the unit separators. For example, words, which are separated by spaces and punctuation, are easier to detect than paragraphs, which can be delimited in many different ways.
Perhaps the most important factor in unit selection is the violation test of interest. For instance, if it is more meaningful that sequences of sentences were copied rather than sequences of words (e.g., sentence fragments), then sentences and not words should be used as units.
4.2 Chunks
There are a number of strategies for selecting chunks. To contrast them we can consider the number of units involved, the number of hash table entries that are required for a document, and an upper bound for the security SEC(o, r).3 See Table 1 for a summary of the four strategies we consider. (There are also many variations not covered here.) In the table, |r| refers to the number of units in the document r being chunked, and k is a parameter of the strategies. The ``space'' column gives the number of hash table entries needed for r, while ``# units'' gives the chunk size.
(A) One chunk equals one unit. Here every unit (e.g. every sentence) is a chunk. This yields the smallest chunks. As with units, small chunks tend to make the freq of an OOT smaller. The major weakness is the high storage cost; |r| hash table entries are required for a document. However, it is the most secure scheme; SEC(o, r) is bounded by |r|. That is, depending on the decision function, it may be necessary to alter up to |r| characters (one per chunk) to subvert the OOT.
(B) One chunk equals k nonoverlapping units. In this strategy, we break the document up into sequences of k consecutive units and use these sequences as our chunks. It uses (1/k)th the space of Strategy A but is very insecure, since altering a document by adding a single unit at the start will cause it to have no matches with the original. We call this effect ``phase dependence''. This effect also leads to high Alpha errors.
3 For our discussion we assume that documents do not have significant numbers of repeating units.
(C) One chunk equals k overlapping units. Here every sequence of k consecutive units is a chunk (e.g., AB, BC, CD, ... for k = 2), so roughly |r| hash table entries are still needed. Overlap avoids the phase dependence of Strategy B, but, as illustrated by o3 in Section 3.3, each altered unit can affect up to k chunks, so security drops by roughly a factor of k.
(D) Use nonoverlapping units, determining break points by hashing units. We start by hashing the first unit in the document. If the hash value is equal to some constant x modulo k, then the first chunk is simply the first unit. If not, we consider the second unit. If its hash value equals x modulo k, then the first chunk is the first two units. If not, we consider the third unit, and so on until we find some unit that hashes to x modulo k; this identifies the end of the first chunk. We then repeat the procedure to identify the following nonoverlapping chunks.
It can be shown that the expected number of units in each chunk will be
k. Thus, Strategy D is similar to B in its hash table requirements. However, unlike B, it is not affected by phase dependence, since similar text will have the same break points. Strategy D, like C, has higher Alpha (and lower Beta) errors as compared to A. Furthermore, all else being the same, its freq should be only slightly less than that of C, because significant portions of duplicated text will be caught just as in C. The key advantage of Strategy
D is that it is very secure. (It is really a family of strategies with a secret parameter; see Section 3.3 .) Without knowing the hash function, one must change every unit of a test document to be sure it will get through the system without warnings.
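A possible implementation of Strategy D is sketched below (our own code, not the paper's); the hash seed, x, and k stand in for the secret parameters discussed in Section 3.3.

# A sketch of chunking strategy D: break points fall where a unit's hash value
# equals x modulo k.
import hashlib

def unit_hash(unit, seed):
    return int(hashlib.sha1((seed + unit).encode()).hexdigest(), 16)

def strategy_d_chunks(units, k=5, x=0, seed="secret"):
    chunks, current = [], []
    for u in units:
        current.append(u)
        if unit_hash(u, seed) % k == x:   # this unit closes the current chunk
            chunks.append(current)
            current = []
    if current:                           # any trailing units form a final chunk
        chunks.append(current)
    return chunks

sentences = ["sentence %d" % i for i in range(20)]
for c in strategy_d_chunks(sentences):
    print(c)                              # expected chunk length is about k units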
4.3 Decision Functions
There are many options for choosing decision functions. The matchratio function (Section 3.1) can be useful for approximating Subset and Overlap violation tests. Another simple decision function is matches (with parameter k), which simply tests whether the number of matches between the test and the registered document is above a certain value k. This would be useful for detecting violations such as Plagiarism. One might also consider using orderedmatches, which tests whether there are more than a certain number of matches occurring in the same order in both documents. This would be useful if unordered matches are likely to be coincidental.
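The sketches below show one way these three decision functions could be implemented over the MATCH entries for a single registered document (our own code; in particular, the longest-ordered-chain reading of orderedmatches and the default k and psi values are our assumptions).

# Each match is a (ld, lr) pair: chunk locations in the test and registered document.

def matchratio(match_locs, size, psi=0.4):
    # Flag if the fraction of the test document's chunks that match exceeds psi.
    return len(match_locs) / size > psi

def matches(match_locs, k=10):
    # Flag if there are more than k matching chunks, regardless of the ratio.
    return len(match_locs) > k

def orderedmatches(match_locs, k=10):
    # Flag if more than k matches occur in the same order in both documents
    # (read here as the longest chain increasing in both locations).
    pairs = sorted(match_locs)              # sort by test-document location ld
    best = [1] * len(pairs)
    for i in range(len(pairs)):
        for j in range(i):
            if pairs[j][1] < pairs[i][1]:   # registered-document location also increases
                best[i] = max(best[i], best[j] + 1)
    return bool(pairs) and max(best) > k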
5 Prototype and Preliminary Results
We have built a working OOT prototype to test our ideas and to understand how to select good CHUNKS and DECIDE functions. The prototype is called COPS (COpy Protection System) and Figure 2 shows its major modules. Documents can be submitted via email in TEX (including LATEX), DVI, troff and ASCII formats. New documents can be either registered in the system or tested against the existing set of registered documents. If a new document is tested, a summary is returned listing the registered documents that it violates.
Figure 2: Modules in COPS implementation.
COPS allows modules to be easily replaced, permitting experimentation with different strategies (e.g., different
INS-CHUNKS, EVAL-CHUNKS and DECIDE functions). We will begin our explanation with the simplest case (sentence chunking for both insertion and evaluation, and a matchratio decision function) and later discuss possible improvements. A document that has been submitted to the system is given a unique document ID. This ID is used to index a table of document information such as title and author. To register a document, it must first be converted into canonical form, i.e., plain ASCII text. The process by which this occurs depends on the document format. A TEX document can be piped through the Unix utility detex, while a document with troff formatting commands can be converted with nroff. Similarly, DVI and other document formats have filters to handle their conversion to plain ASCII text. After producing plain ASCII we are ready to determine sentence boundaries, hash each sentence, and store the hash keys in the table together with the document ID.
When we wish to check a new document against the existing set of registered documents, we use a very similar
procedure. We generate the plain ASCII, determine sentences, generate a list of hash keys, and look them up in the hash table (see Section 3.1). If more than ψ·SIZE sentences match those of any given registered document
we report a possible violation.
The conversion process is not perfect. If the document input format is DVI, then it is sometimes impossible to
distinguish ``equations'' from ``plain text''. Consider the sentence, ``Let X+Y equal the answer.'' This sentence
will be translated to ASCII exactly as it is shown. However, if we begin with TEX, then the equation will be
discarded, leaving the sentence ``Let equal the answer.'' Since the conversion to plain ASCII produced different
sentences, our system would be unable to recognize that a sentence match occurred. Later in this section we will
discuss some system enhancements that allow us to detect matching sentences despite imperfect translations.
Another complication with DVI is that it gives directions for placing text on a page but it does not specify
what text is part of the main body, and what is part of subsidiary structures like footnotes, page headers and
bibliographies. Our DVI converter does not attempt to rearrange text; it simply considers the text in the order it
appears on the page. However, one case it does handle is that of two column format. Instead of reading characters
left to right, top to bottom (which would corrupt most sentences in a two column format), the converter detects
the inter-column gap and reads down the left column and then the right one.
An input format COPS cannot handle in general is Postscript. Since Postscript is actually a programming
language, it is very difficult to convert its layout commands to plain ASCII text. Some Postscript generators such
as
In summary, the approach we have taken with the COPS converters is to do a reasonable job converting to
ASCII, but not necessarily perfect. Most matching sentences that are not translated identically will still be
found by the system, since enhancements discussed later attempt to negate the effects of common translation
misinterpretations. Even if some matching sentences are missed, there should be enough other matches in
overlapping documents so that COPS can still flag the violations. Later, we present experimental results that
confirm this.
The units used by COPS' OOT are words and sentences (see Section 3.1). COPS first converts each word
in the text to a hash key. The result is a sequence of hash keys with interspersed end-of-sentence markers. The
chunking of this sequence is done by calling a procedure COMBINE(N-UNITS, STEP, UNIT-TYPE), where N-UNITS
is the number of units to be combined into the next chunk, STEP is the number of units to advance for the next
chunk, and UNIT-TYPE indicates what should be considered a unit. For example, repeatedly calling COMBINE(1,
1, WORD) creates a chunk for each word in the input sequence. Calling COMBINE(1, 1, SENTENCE) creates a
chunk for each sentence. Using COMBINE(3, 3, WORD) takes every three words as a chunk, while COMBINE(3,
1, WORD) produces overlapping three word chunks. COMBINE(2, 1, SENTENCE) would produce overlapping two
sentence chunks. Thus, we can see that this scheme gives us great flexibility for experimenting with different
CHUNKS functions. However, it should be noted that once a CHUNKS function is chosen, it must be used consistently
for all documents. That is, the flexibility just described is useful only in an experimental setting.
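A sketch of such a COMBINE-style chunker is shown below (an illustration under our own assumptions about the key-sequence interface, not the COPS code).

# The input is the sequence of word hash keys produced by COPS, with an
# end-of-sentence marker (here None) interspersed.
EOS = None

def units_of(keys, unit_type):
    # A unit is either a single word key or a whole sentence (keys up to EOS).
    if unit_type == "WORD":
        return [(k,) for k in keys if k is not EOS]
    sentences, current = [], []
    for k in keys:
        if k is EOS:
            if current:
                sentences.append(tuple(current))
            current = []
        else:
            current.append(k)
    if current:
        sentences.append(tuple(current))
    return sentences

def combine(keys, n_units, step, unit_type):
    # Yield chunks of n_units consecutive units, advancing by step units each time.
    us = units_of(keys, unit_type)
    for i in range(0, max(len(us) - n_units + 1, 0), step):
        yield tuple(k for unit in us[i:i + n_units] for k in unit)

keys = [101, 102, 103, EOS, 104, 105, EOS, 106, 107, 108, EOS]
print(list(combine(keys, 2, 1, "SENTENCE")))   # overlapping two-sentence chunks
print(list(combine(keys, 3, 3, "WORD")))       # every three words as one chunk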
The documents in our test collection average approximately 7300 words and 450 sentences in length. Approximately half of these
documents are grouped into nine topical sets (labeled A, B, ..., I in the tables). The two or three documents
within each group are closely related, usually multiple revisions of a conference or journal paper describing the
same work. The documents in separate topical groups are unrelated except for the author's affiliation with our
research group at Stanford. The remaining half of the documents not in any topical group are drawn from outside
Stanford and not related to any document in our collection.
All of these documents were registered in COPS, and then each was queried against the complete set. Our
goal is to see if COPS can identify the closely related documents. Using the terminology of Section 3, we
are considering a violation test Related(d, r) that holds when d and r describe the same work, i.e., belong to the same topical group.
Table 2 shows results from our exploration. Instead of reporting the number of violations that a particular
matchratio threshold would yield, we show the percentage of matching sentences in each case. This gives us more detailed information: the effect of any particular threshold can then be read off directly.
The first result column in Table 2 gives the percent matches of each document against itself. That is, for
each document d in a group, we compute 100 × COUNT(d, MATCH)/SIZE (see Section 3.1), average the values
and report it in the row for that group. The fact that all values in the first column are 100% simply confirms
that COPS is working properly.
The numbers in the second column are computed as follows. For each document d in a group, we compute the percentage of matching sentences against each of the other documents in its group and average these values; this is the affinity reported for the group. The noise column is computed analogously, comparing each document against the registered documents outside its group.
Ideally, one wants affinity values that are as high as possible, and noise values that are as low as possible. This
makes it possible for a threshold value that is between the affinity and noise levels to distinguish between related
and unrelated documents. Table 2 reports that related documents have on average 53% matching sentences,
while unrelated ones have 0.6%. The reason why affinity is relatively low is that the notion of ``Related''
documents we have used here is very broad. For example, often the journal version and the conference version
of the same work are quite different.
The noise level of 0.6%, equivalent to 2 or 3 sentences, is larger than what we expected. The discrepancy
is caused by several things. A few sentences, such as ``This work partially supported by the NSF,'' are quite
common in journal articles, so that even two unrelated documents might both contain it. Other sentences may also
be exact replicas by coincidence. Hash collisions may be another factor, especially when there are large numbers
of registered documents, but are not an issue in our experiments. Also note the relatively large variance reported
in the table. In particular, some unrelated documents had on the order of 20 matching sentences.
The process by which a document is translated to ASCII also has some effect on the noise level. For example,
the translation we use to convert TEX documents produces somewhat less noise than does our translation from
DVI. This is caused by differences in the inclusion of references. Many unrelated documents cite the same
references, possibly generating matching sentences. Our TEX filter does not include references in its output
(they are in separate ``bib'' files), so noise is reduced. The differences in noise generated by ASCII translation
become less significant when the enhancements discussed later are added to our system.
The larger the noise level, the harder it is to detect plagiarism of small passages (e.g., a paragraph or two).
If we set the threshold ψ below the noise level, noise alone will trigger false violations; ψ therefore has to stay above the noise level, which means that copied passages contributing fewer matches than the noise level cannot be reliably detected.
The last row of Table 3 shows the effect of using all enhancements at once. One can see that the combined enhancements are quite effective at reducing the noise while keeping the affinity at roughly the same levels. We note that the parameter values we used for the enhancements (e.g., the number of occurrences that makes a chunk ``common'') worked well for our collection, but probably have to be adjusted for larger collections.
In Figure 3 we study the effect of increasing the number of overlapping sentences per chunk (without any of the enhancements of Table 3). The solid line shows the average noise as a function of the number of overlapping sentences in a chunk. As we see, the noise decreases dramatically as the number of overlapping sentences grows. This is beneficial since it decreases the minimum amount of plagiarism detectable. Figure 3 also shows an ``effective noise'' curve that is the average noise plus three standard deviations. If we assume that noise is a normally distributed variable, we can interpret the effective noise curve as a lower bound for the threshold needed to eliminate 99% of the false positives due to noise. For example, if we use three sentence chunks and set our threshold at ψ = 0.01, then the Beta error will be less than 1%. However, as described in Section 4.2, the Alpha error will increase as we combine sentences into chunks. This means that, for instance, we will be unable to detect plagiarism of multiple, non-contiguous sentences. Also, the security of the system is reduced (Section 4.2): it takes fewer changes to a document to make it pass as a new one.
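Written out, the bound implied by the effective-noise curve is the following (the numbers here are purely illustrative, not the measured values behind Figure 3):

\psi \;\ge\; \bar{n} + 3 s_n ,

where \bar{n} is the average noise and s_n its standard deviation for the chosen chunk size. For instance, hypothetical values \bar{n} = 0.002 and s_n = 0.002 for three-sentence chunks would require \psi \ge 0.008, consistent with the \psi = 0.01 setting mentioned above keeping the Beta error due to noise below about 1%.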
Figure 3: Average and effective noise as a function of the number of overlapping sentences per chunk.
5.5 Effect of Converters
A final issue we investigate is the impact of different input converters. For example, say a Latex document is initially registered in COPS. Later, the DVI version of the same document (produced by running the original through the Latex processor) is submitted for testing. We would like to find that (a) the DVI copy clearly matches the registered Latex original, and (b) the DVI copy has a similar number of matches with other documents as the original would have had.
Table 4 explores this issue. The first row is for the basic COPS algorithm; the second row is for the version that includes all the enhancements of Table 3. The first, third, and fifth columns are as before and are only included for reference. The ``Altered Self'' column reports the average percent of matching sentences when a DVI document is compared against its Latex original. The ``Altered Related'' column gives the average percent of matching sentences when a DVI document is compared to all of the related Latex documents. Although the
We believe that the results presented in this section, although not definitive, provide some insight into the
selection of a good threshold value for COPS, at least for the Related target test. A threshold value of, say, ψ = 0.05 (25 out of 500 sentences) seems to identify the vast majority of related documents, while not triggering false violations due to noise. We also conclude that detecting plagiarism of about 10 or fewer sentences (roughly 2% of a document) will be quite hard without either high Alpha or Beta errors.
6 Approximating OOTs
To illustrate, say we have an OOT with a matchratio threshold of 15%, and say a typical test document has 100 chunks. One sampling option is to sample at testing time: instead of looking up all of the test document's chunks in the hash table, we look up only a random sample of them and apply the 15% threshold to that sample. The savings here are in checking time; the cost is a loss of accuracy, which we quantify in Section 6.1.
Another sampling option is to sample registered documents. The idea here is to only insert in our hash table
a random sample of chunks for each registered document. For example, say that only 10% of the chunks are
hashed. Next, suppose that we are checking all 100 chunks of a new document and find 2 matches with a
registered document. Since the registered document was sampled, these 2 matches should be equivalent to 20
under the original OOT. Since 20/100 exceeds the 15% threshold, the document would be flagged as a violation.
In this case, the savings would be storage space: the hash table will have only 10% of the registered chunks. A
smaller hash table also makes it possible to distribute it to other sites, so that copy detection can be done in a
distributed fashion. Again, the cost is a loss of accuracy.
A third option is to combine these two techniques without sacrificing accuracy (beyond what either one alone loses) by sampling based on the hash values of the chunks [10]. For example, if in our test document we
one alone) by sampling based on the hash numbers of the chunks [10]. For example, if in our test document, we
sample exactly those chunks whose hash number is 0 mod 10, then there is no need to store the hash values of any
registered documents' chunks whose hash value is not 0 mod 10 since there could never be a collision otherwise.
However, this scheme has the drawback that one must always sample a fixed fraction of the documents' chunks
rather than, say, a fixed number of them.
Due to space limitations, in this paper we only consider the first option, sampling for testing. However, note
that the analysis for sampling at registration time, and for sampling at both times, is very similar to what we will present here,
and the results are analogous.
We start by giving a more precise definition of the sampling-at-testing strategy. We are given an OOT o1 whose evaluation chunking function is EVAL-CHUNKS1. We define a second OOT o2 that is identical to o1 except that its evaluation chunking function returns a random sample of N of o1's chunks (the decision function then applies the same matchratio threshold ψ to the sampled chunks):
EVAL-CHUNKS2(r)
C = EVAL-CHUNKS1(r)
return RANDOM-SELECT(N, C)
6.1 Accuracy of Randomized OOTs
Now we wish to determine how different o2 is from o1. As in Section 3.2, let D be our distribution of input documents and let R be the distribution of registered documents. Let X be a random document that follows D and Y be a random document that follows R. Let m(X, Y) be the proportion of chunks (according to o1's chunking function) in X which match chunks in Y. Then let W(x) be the probability density function of m(X, Y), i.e., $P(x_1 \le m(X, Y) \le x_2) = \int_{x_1}^{x_2} W(x)\,dx$. Using this we can compute Alpha(o1, o2), Beta(o1, o2), and Error(o1, o2). The details of the computation are in Appendix A; the results are as follows:

$$\mathrm{Alpha}(o_1, o_2) = \frac{\int_{\psi}^{1} W(x)\,Q(x)\,dx}{\int_{\psi}^{1} W(x)\,dx}, \qquad \mathrm{Beta}(o_1, o_2) = \frac{\int_{0}^{\psi} W(x)\,(1 - Q(x))\,dx}{\int_{0}^{\psi} W(x)\,dx},$$

$$\mathrm{Error}(o_1, o_2) = \int_{\psi}^{1} W(x)\,Q(x)\,dx + \int_{0}^{\psi} W(x)\,(1 - Q(x))\,dx, \qquad \text{where } Q(x) = \sum_{j=0}^{\lfloor \psi N \rfloor} \binom{N}{j}\, x^{j} (1 - x)^{N - j}.$$
6.2 Results
Before we can evaluate our expressions, we need to know the W(x) distribution. Recall that W(x) tells us how likely it is to have a proportion of x matches between a test and a registered document. One option would be to measure W(x) for a given body of documents, but then our results would be specific to that particular body. Instead, we use a parametrized function that lets us consider a variety of scenarios.
Using the observations of Section 5, we arrive at the following
W(x) function. With a very high probability p_a, the test document will be unrelated to the registered one. In this case there can still be noise matches, which we model as normally distributed with mean 0 and standard deviation σ_a (which will probably be very small). With probability p_b = 1 − p_a the test document is related to the registered one. In this case we assume that the proportion of matching chunks is normally distributed with mean μ_b and standard deviation σ_b. We would expect σ_b to be large since, as we have seen, related documents tend to have widely varying numbers of matches. Thus, our W(x) function is the weighted sum of two normal distributions (truncated at 0 and 1), normalized to make $\int_0^1 W(x)\,dx = 1$. Figure 4 shows a sample
W(x) function with exaggerated parameters to make its form more apparent. The area under the curve in the range 0 ≤ x ≤ 0.2 represents the likelihood of noise matches, while the rest of the range represents mainly matches of related documents. In practice, of course, we would expect p_a to be much closer to 1 (most comparisons will be between unrelated documents) and σ_a to be much smaller.
Given a parametrized
W(x), we can present results that show how good an approximation o2 is to o1. An important first issue to study is the number of samples N required for accurate results. Figure 5 shows the Alpha(o1, o2), Beta(o1, o2), and Error(o1, o2) values as a function of N, for ψ = 0.4. Recall that the value of 0.4 means that o1 is looking for registered documents that match 40% or more of the chunks of the test document. This value may have been picked, say, because we are interested in a Subset target test. The parameters for W(x) are given in the figure.
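The following Python sketch numerically evaluates the Section 6.1 expressions for the truncated-normal mixture W(x) just described (our own code, not the program used for the figures); pa, sa, mb and sb stand for p_a, σ_a, μ_b and σ_b, with default values taken from the Figure 5 parameters.

import math

def q(x, n, psi):
    # Probability that at most floor(psi*n) of n sampled chunks match,
    # when each sampled chunk matches independently with probability x.
    kmax = math.floor(psi * n)
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(kmax + 1))

def w_unnormalized(x, pa, sa, mb, sb):
    # Weighted sum of the "unrelated" (noise) and "related" normal components.
    noise = math.exp(-0.5 * (x / sa) ** 2) / (sa * math.sqrt(2 * math.pi))
    related = math.exp(-0.5 * ((x - mb) / sb) ** 2) / (sb * math.sqrt(2 * math.pi))
    return pa * noise + (1 - pa) * related

def errors(n, psi=0.4, pa=0.95, sa=0.02, mb=0.3, sb=0.8, steps=20000):
    dx = 1.0 / steps
    xs = [(i + 0.5) * dx for i in range(steps)]
    z = sum(w_unnormalized(x, pa, sa, mb, sb) for x in xs) * dx  # truncate to [0,1], normalize
    num_a = den_a = num_b = den_b = 0.0
    for x in xs:
        w = w_unnormalized(x, pa, sa, mb, sb) / z
        if x >= psi:                      # region where o1 flags a violation
            num_a += w * q(x, n, psi) * dx
            den_a += w * dx
        else:                             # region where o1 does not flag one
            num_b += w * (1 - q(x, n, psi)) * dx
            den_b += w * dx
    return num_a / den_a, num_b / den_b, num_a + num_b  # Alpha, Beta, Error

print(errors(n=20))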
In spite of the non-monotonicity, it is important to note how overall the Error decreases very rapidly as
N increases. For N > 10, the Error stays well below 0.01. This shows that o2 can approximate o1 well with a relatively small number of sampled chunks. Note, however, that the Alpha error does not decrease as rapidly, but this is not as serious. The Alpha error for N beyond, say, 20 is mainly caused by test documents whose match ratio is slightly higher than ψ = 0.4. (The area under the W(x) curve in the vicinity to the right of 0.4 gives the probability of getting one of these documents.) In these cases, the sampling OOT may not muster enough hits to trigger a detection. However, in such cases the original OOT o1 may not be very good at approximating the violation test of interest either. In other words, if the percentage of matches is close to 40%, it may not be clear whether the documents are related or not. Thus, the fact that o1 detects a violation but o2 does not is not as serious, we believe. Our results are sensitive to the
W(x) parameters used. For example, in Figure 6 we demonstrate the effect of σ_a. We can see from Figure 6 that the Error stays very low as long as σ_a is not near ψ = 0.4. If σ_a is close to ψ, we get more documents in the region where o2 has trouble identifying documents selected by o1. Similarly, we find that the error stays very low in the high p_a range, which is where we expect it to be in practice.
In summary, using sampling in OOTs seems to work very well under good conditions (when
is far from the bulk of the match ratios). There is a large gain in efficiency with only a small loss of accuracy. As stated earlier, the sample-at-registration OOT can be analyzed almost identically to what we have done here, and can be shown to substantially reduce the storage costs.
7 Conclusions
In this paper we have proposed a copy detection service that can identify partial or complete overlap of documents. We described a prototype implementation of this service, COPS, and presented experimental results that suggest the service can indeed detect violations of interest. We also analyzed several important variations, including ones for breaking up a document into chunks, and for sampling chunks for detecting overlap.
It is important to note that while we have described copy detection as a centralized function, there are many ways to distribute it. For example, copies of the registered-document hash table can be distributed to permit checking for duplicates at remote sites. If the table contains only samples (Section 6) it can be relatively small and more easily distributable. Document registration can also be performed at a set of distributed registration services. These services could periodically exchange information on newly registered documents they have seen.
Perhaps the most important question regarding copy detection is whether authors can be convinced to register their documents: Without a substantial body of documents, the service will not be very useful. We believe they can, especially if one starts with the documents of a particular community (e.g., netnews users, or SIGMOD authors). But regardless of the success of COPS and copy detection, we believe it is essential to explore and understand solutions for safeguarding intellectual property in digital libraries. Their success hinges on finding at least one approach that works.
References
[1] C. Anderson. Robocops: Stewart and Feder's mechanized misconduct search. Nature, 350(6318):454--455, April 1991.
[2] J. Brassil, S. Low, N. Maxemchuk, and L. O'Gorman. Document marking and identification using both line and word shifting. Technical report, AT&T Bell Laboratories, 1994. May be obtained from ftp://ftp.research.att.com/dist/brassil/docmark2.ps.
[4] A. Choudhury, N. Maxemchuk, S. Paul, and H. Schulzrinne. Copyright protection for electronic publishing over computer networks. Technical report, AT&T Bell Laboratories, 1994. Submitted to IEEE Network Magazine, June 1994.
[5] J. R. Garrett and J. S. Alen. Toward a copyright management system for digital libraries. Technical report, Copyright Clearance Center, 1991.
[6] G. N. Griswold. A method for protecting copyright on networks. In Joint Harvard MIT Workshop on Technology
Strategies for Protecting Intellectual Property in the Networked Multimedia Environment, April 1993.
[7] M. B. Jensen. Making copyright work in electronic publishing models.
[8] R. E. Kahn. Deposit, registration and recordation in an electronic copyright management system. Technical report,
Corporation for National Research Initiatives, Reston, Virginia, August 1992.
[9] P. A. Lyons. Knowledge-based systems and copyright.
[10] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, January 1994.
[11] A. Parker and J. O. Hamblen. Computer algorithms for plagiarism detection.
[12] G.J. Popek and C.S. Kline. Encryption and secure computer networks.
[13] D. Wheeler. Computer networks are said to offer new opportunities for plagiarists.
Figure 4: An exaggerated W(x) function.
Figure 5: The Effect of the Number of Sample Points on Accuracy (Alpha, Beta, and Error curves; p_a = 0.95, σ_a = 0.02, p_b = 0.05, μ_b = 0.3, σ_b = 0.8, ψ = 0.4).
Figure 6: The Effect of σ_a on Error (p_a = 0.95, p_b = 0.05, μ_b = 0.3, σ_b = 0.8, ψ = 0.4, N = 20).