Web Search, Digital Libraries, and Metadata

Steve Lawrence*
NEC Research Institute


The web and search engines represent a significant advance in information access; however, existing techniques leave much room for improvement. Our results show that search engines index only a fraction of all publicly indexable web pages, do not index sites equally, and may not index new pages for months. We also analyze metadata and the volume and distribution of information on the web. We discuss CiteSeer, which is the largest free full-text index of scientific literature in the world. CiteSeer automatically extracts metadata from research articles and provides a number of novel features, including autonomous citation indexing and the extraction of citation context.


Steve Lawrence is a Research Scientist at NEC Research Institute in Princeton, NJ. Dr. Lawrence has published over 50 articles in areas such as information retrieval, web analysis, digital libraries, and machine learning, with articles appearing in Science, Nature, CACM, and IEEE Computer. He has been interviewed by many news organizations, including the New York Times, Wall Street Journal, Washington Post, Reuters, Associated Press, UPI, CNN, BBC, MSNBC, and NPR. Hundreds of articles about his research have appeared worldwide in over 10 languages.