In the ocean of Web data, Web search engines are the primary means of accessing content. Because the data volume is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is constantly evolving: the number of Web sites continues to grow rapidly (150 million in November 2007) and there are currently more than 20 billion indexed pages. At the same time, Internet users number above one billion and issue hundreds of millions of queries each day. In the near future, centralized systems are likely to become less effective against such a combined data and query load, suggesting the need for fully distributed search engines. Such engines must maintain high-quality answers, fast response times, high query throughput, high availability, and scalability, in spite of network latency and scattered data. In this talk we present the main challenges behind the design of a distributed Web retrieval system, and our past and ongoing work on solving some of them, including crawling, indexing, and query processing.