Many web documents (such as Java FAQs) are 
replicated on the Internet. 
Often entire document collections (such as hyperlinked 
Linux manuals) are replicated many times.
In this paper, we make the case for identifying 
replicated documents and collections to improve web crawlers,
archivers, and ranking functions used in search engines.
We describe how to efficiently identify replicated
documents and hyperlinked document collections.
The challenge is to identify these replicas 
in an input data set of tens of millions of 
web pages and hundreds of gigabytes of textual data.
We also present two real-life case studies
in which we used replication information to improve a crawler 
and a search engine.
We report these results for a data set of 25 million web pages
(about 150 gigabytes of HTML data) crawled from the web.