User-Centric Web Crawling

Christopher Olston, CMU and Yahoo! Research

Given the considerable size, dynamicity, and degree of autonomy of the Web, it is not feasible for a search engine to keep its local repository exactly synchronized with the Web. As a result, answers to search queries may be inaccurate. This problem can be especially pronounced for topic-specific search engines such as science portals, which do not always wield considerable computing and networking power. We consider how to schedule Web pages for selective (re)downloading into a search engine repository. Our scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of how the discrepancy between the content of the repository and the current content of the live Web affects the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We provide empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
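
The general idea of usage-driven refresh scheduling can be illustrated with a minimal sketch. It is not the policy derived in this paper. It assumes, purely for illustration, that each page carries two hypothetical estimates: `view_weight` (how much user attention the page receives in search results) and `change_prob` (how likely the repository copy is out of date). It then spends a fixed download budget on the pages with the highest product of the two. The names `Page`, `refresh_priority`, and `schedule_refresh` are invented for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    view_weight: float   # hypothetical: share of user attention in search results
    change_prob: float   # hypothetical: estimated chance the stored copy is stale


def refresh_priority(page: Page) -> float:
    """Illustrative score: expected user-visible staleness of one page."""
    return page.view_weight * page.change_prob


def schedule_refresh(pages: list[Page], budget: int) -> list[Page]:
    """Pick up to `budget` pages to (re)download, highest score first."""
    return heapq.nlargest(budget, pages, key=refresh_priority)


if __name__ == "__main__":
    repository = [
        Page("a", view_weight=0.9, change_prob=0.1),
        Page("b", view_weight=0.5, change_prob=0.5),
        Page("c", view_weight=0.01, change_prob=0.99),
        Page("d", view_weight=0.3, change_prob=0.4),
    ]
    for page in schedule_refresh(repository, budget=2):
        print(page.url, refresh_priority(page))
```

In this toy setting, page `c` changes almost constantly but is rarely seen by users, so it is not selected. A scheduler that considers only change frequency would spend its budget there first, which is the kind of contrast a user-centric metric is meant to capture.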