User-Centric Web Crawling
           Christopher Olston, CMU and Yahoo! Research

Given the considerable size, dynamicity, and degree of autonomy of the
Web, it is not feasible for a search engine to keep its local
repository exactly synchronized with the Web. As a result, answers to
search queries may be inaccurate. This problem can be especially
pronounced for topic-specific search engines such as science portals,
which do not always wield considerable computing and networking power.

We consider how to schedule Web pages for selective (re)downloading
into a search engine repository. Our scheduling objective is to
maximize the quality of the user experience for those who query the
search engine. We begin with a quantitative characterization of the
way in which the discrepancy between the content of the repository and
the current content of the live Web impacts the quality of the user
experience. This characterization leads to a user-centric metric of
the quality of a search engine's local repository. We use this metric
to derive a policy for scheduling Web page (re)downloading that is
driven by search engine usage and free of external tuning parameters.
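
As a purely illustrative sketch of what a usage-driven refresh policy
might look like (this is not the metric or policy derived in this work),
the Python snippet below ranks pages by a hypothetical score: how often
users see a page in search results, multiplied by an estimate of how far
the repository copy has drifted from the live page. The names PageStats,
view_weight, est_divergence, refresh_priority, and pick_pages_to_refresh
are assumptions introduced only for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    # Hypothetical: how often users encounter this page in query results.
    view_weight: float
    # Hypothetical: estimated discrepancy between the repository copy
    # and the live page, scaled to [0, 1].
    est_divergence: float


def refresh_priority(page: PageStats) -> float:
    """Illustrative score: staleness matters only as much as users see it."""
    return page.view_weight * page.est_divergence


def pick_pages_to_refresh(pages, budget):
    """Return the `budget` pages with the highest illustrative score."""
    return heapq.nlargest(budget, pages, key=refresh_priority)


pages = [
    PageStats("a", view_weight=100.0, est_divergence=0.01),
    PageStats("b", view_weight=5.0, est_divergence=0.9),
    PageStats("c", view_weight=50.0, est_divergence=0.5),
    PageStats("d", view_weight=0.0, est_divergence=1.0),
]
selected = [p.url for p in pick_pages_to_refresh(pages, budget=2)]
```

In this toy scoring, page "d" is never selected even though it is fully
stale, because no user ever sees it; a refresh policy based on change
frequency alone would have spent download budget on it.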

We provide empirical comparisons of our user-centric method against
prior Web page refresh strategies, using real Web data. Our results
demonstrate that our method requires far fewer resources to maintain
the same search engine quality level for users, leaving substantially more
resources available for incorporating new Web pages into the search
repository.