Hector Garcia-Molina (PI), Chris Manning
(co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University
Dept. of Computer Science, Gates 4A
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Project Award Information
World-Wide Web, information retrieval, database, natural-language processing, data mining
Our proposed work is driven by the vision of a Global InfoBase (GIB): a ubiquitous and universal information resource, simple to use, up to date, and comprehensive. The project consists of four interrelated thrusts: (i) Combining Technologies: integrating technologies for information retrieval, database management, and hypertext navigation to achieve a "universal" information model; (ii) Personalization: developing tools for personalizing information management; (iii) Semantics: using natural-language processing and structural techniques to analyze the semantics of Web pages; and (iv) Data Mining: designing new algorithms for mining information in order to synthesize new knowledge.
Publications and Products
Goals, Objectives, and Targeted Activities
· Combining Technologies: We cannot achieve the challenging goals of our Global Information Base without (a) extending existing technologies to work in vast information spaces and (b) developing new technologies where existing ones fail to scale. We investigated several directions. Multi-Model Queries:  examines how relational and full-text queries may be combined to yield results over multiple sources with radically different data models (e.g., "Find all Web pages that contain the phrase 'National Science Foundation' and that are linked to by at least 10 other Web pages."). For our testbed, we use WebBase, a 120-million-page repository developed as part of our Digital Library sister project. Search Over New Sources: We have been integrating new classes of information into the Global InfoBase: music and Internet chat rooms. Sound analysis is a notoriously difficult problem, especially in the realm of music, where acoustically very different signals nevertheless represent semantically identical material. In  we document our new system for retrieving similar music pieces from an audio database without metadata or other symbolic information. The real-time, conversational nature of Internet Relay Chat (IRC) poses a number of interesting problems for indexing archives for effective search; we present our preliminary results in . Finally, in our technology-integration thrust, we developed new algorithms for scalable caches that serve vast numbers of cooperating sources. In  we present a best-effort synchronization scheduling policy that exploits cooperation between data sources and the cache.
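The multi-model query idea above can be illustrated with a small sketch. The repository, link data, and function names below are toy stand-ins, not WebBase's actual interface: the full-text predicate and the relational (link-count) predicate are evaluated separately, then joined.

```python
# Toy multi-model query: combine a full-text predicate with a
# link-structure predicate, as in "find pages containing a phrase
# that are linked to by at least N other pages".
# All data and names here are illustrative, not WebBase's API.

def multi_model_query(pages, links, phrase, min_inlinks):
    """pages: {url: text}; links: iterable of (src, dst) edges."""
    # Relational side: count in-links per page.
    inlink_count = {}
    for src, dst in links:
        inlink_count[dst] = inlink_count.get(dst, 0) + 1
    # Full-text side: phrase containment, joined with the counts.
    return sorted(
        url for url, text in pages.items()
        if phrase.lower() in text.lower()
        and inlink_count.get(url, 0) >= min_inlinks
    )

pages = {
    "a": "The National Science Foundation funds research.",
    "b": "National Science Foundation award information.",
    "c": "Unrelated page about music retrieval.",
}
links = [("c", "a"), ("b", "a"), ("a", "b")]
print(multi_model_query(pages, links, "National Science Foundation", 2))
# only "a" both contains the phrase and has at least 2 in-links
```

A real system must evaluate the two predicates over sources with different data models and scales, which is precisely the integration challenge the thrust addresses.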
· Personalization: If our Global Information Base harbors a dark side, it is an escalation of the well-documented 'information overload' problem. We have therefore focused on personalizing the interaction with information sources. Context-Sensitive Search: Many Web search engines compute absolute rankings for Web pages. While highly effective, such rankings do not take the user's context into account. In , we developed and implemented a topic-sensitive link-based ranking measure for Web search, which exploits search context, including query context (e.g., query history) and user context (e.g., bookmarks and browsing history), to enhance the precision of search engines. Structure-Based Similarity: In  and  we document our efforts to deduce Web page similarity through the analysis of Web structure; for example, pages with common parentage might be related. Personalized-Precision Retrieval: Building on our scalable caching work , we were able to provide personalized querying in a novel area: we enable users to specify standing queries augmented by a specification of how up to date the results need to be. This quality-of-service measure is important to our Global InfoBase, because standing queries are only feasible if the system has some 'wiggle room' to optimize its operation [5,6].
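The flavor of topic-sensitive link-based ranking can be sketched as PageRank-style power iteration in which the teleport vector is biased toward pages drawn from the user's context (e.g., bookmarks). The graph and topic set below are toy data, and the code is an illustrative simplification, not the algorithm from the cited paper:

```python
# Sketch of topic-sensitive link-based ranking: standard PageRank
# power iteration, but teleport mass is restricted to a topic set
# derived from the user's context. Toy graph; illustrative only.

def topic_sensitive_rank(graph, topic_pages, damping=0.85, iters=50):
    """graph: {page: [outlinks]}; topic_pages: teleport target set."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    teleport = {n: (1.0 / len(topic_pages) if n in topic_pages else 0.0)
                for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: redistribute via the teleport vector
                for m in nodes:
                    new[m] += damping * rank[n] * teleport[m]
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = topic_sensitive_rank(graph, topic_pages={"a"})
# Biasing teleportation toward "a" raises its score relative to a
# uniform teleport vector; "d", with no in-links and no teleport
# mass, ends up with negligible rank.
```

Precomputing such biased rankings per topic, and combining them at query time according to the user's context, is one way context sensitivity can be made efficient.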
· Semantics: We are attacking the problem of extracting more meaning from Web pages than simply the lists of words they contain. Much work has been done on clustering documents, but little on the problem of labeling clusters. For effective human navigation, the quality of the labeling is at least as important as the quality of the underlying clustering technique;  studies the effectiveness of current labeling techniques and devises algorithms for generating more effective labels. In , we address the problem of designing a crawler capable of extracting content from the hidden Web (pages behind forms). We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. In ongoing unpublished work, we are continuing our work on Web wrappers. Many Web sites contain large sets of pages generated from a database using a common template. We are developing an algorithm that extracts database values from Web pages by using sets of words with similar occurrence patterns across the input pages to reconstruct the template. Experiments show that the extracted values make semantic sense in most cases.
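A much-simplified sketch of the template/value separation idea: tokens whose occurrence pattern is identical across all pages generated from the same template are treated as template text, and the remaining tokens as database values. The tokenization and sample pages below are hypothetical, not the algorithm under development:

```python
# Sketch of template/value separation for pages generated from a
# common template. A token is classified as "template" if it occurs
# the same number of times (at least once) in every input page;
# everything else is treated as a data value. Toy data only.
from collections import Counter

def split_template(pages):
    """pages: list of token lists from the same site template."""
    counts = [Counter(p) for p in pages]
    vocab = set().union(*pages)
    template = {t for t in vocab
                if len({c[t] for c in counts}) == 1 and counts[0][t] > 0}
    values = [[t for t in p if t not in template] for p in pages]
    return template, values

pages = [
    "Title : Gone Author : Smith".split(),
    "Title : Dune Author : Herbert".split(),
]
template, values = split_template(pages)
print(sorted(template))  # the shared boilerplate tokens
print(values)            # the per-page database values
```

Real pages require far more care (a token can appear in both template and data, and ordering matters), which is why occurrence *patterns*, not just counts, drive the actual approach.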
· Data Mining: Work in this area applies data mining and machine learning techniques to automate tasks of Web analysis. In , we developed a methodology for evaluating various strategies for similarity search on the Web, using the Open Directory (a free Yahoo!-like hierarchy) as an external quality measure. The space of alternatives tried includes link structure, words in the documents, words in the "anchor text" leading to a document, and words surrounding the anchor. The best results use the text surrounding the anchor, with weights favoring words close to it. Clustering is a central problem in exploratory data mining, with strong Web applications, and we have pursued two more fundamental lines of research.  presents an improved method for data clustering in the presence of sparse prior knowledge, given in the form of pairwise instance constraints. By allowing these constraints to have space-level effects, we are able to exploit them more effectively than prior work. In , we observe that classical hierarchical agglomerative clustering methods, while widely used, have lacked a solid theoretical foundation, and we remedy this situation by providing probabilistic generative models for these methods.
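A minimal sketch of clustering with pairwise instance constraints: must-link pairs are merged up front, and single-link agglomerative merging then proceeds only between clusters that violate no cannot-link constraint. The 1-D data and the single-link distance are illustrative assumptions, not the method of the cited work (which additionally propagates constraints through the metric space):

```python
# Sketch of constrained agglomerative clustering with must-link and
# cannot-link pairwise constraints. Toy 1-D points; illustrative only.

def constrained_cluster(points, must_link, cannot_link, k):
    clusters = [{i} for i in range(len(points))]

    def find(i):
        return next(c for c in clusters if i in c)

    # Enforce must-link constraints first by merging their clusters.
    for i, j in must_link:
        a, b = find(i), find(j)
        if a is not b:
            clusters.remove(b)
            a |= b

    def ok(a, b):  # merging a and b must not join a cannot-link pair
        return not any((i in a and j in b) or (i in b and j in a)
                       for i, j in cannot_link)

    def dist(a, b):  # single-link (nearest-pair) cluster distance
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [(dist(a, b), ai, bi)
                 for ai, a in enumerate(clusters)
                 for bi, b in enumerate(clusters)
                 if ai < bi and ok(a, b)]
        if not pairs:
            break  # constraints block all remaining merges
        _, ai, bi = min(pairs)
        clusters[ai] |= clusters[bi]
        del clusters[bi]
    return clusters

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9]
out = constrained_cluster(points, must_link=[(0, 1)],
                          cannot_link=[(2, 3)], k=3)
# The cannot-link pair (2, 3) keeps the first two natural groups
# from ever merging, even though points 0.2 and 5.0 could otherwise
# end up together under aggressive merging.
```

Even this naive version shows why sparse constraints are valuable: a handful of pairwise hints can steer the clustering without any labeled data.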
The main references are listed under Publications and Products above.
The World-Wide Web has created a resource comprising much of the world's knowledge and is incorporating a progressively larger fraction each year. Yet today our ability to use the Web as an information resource -- whether to advance science, to enhance our lives, or just to get the best deal on a videotape -- is in a primitive state. Information on the Web is often hard to find and may be of dubious quality. Although information is presented in a universal HTML format, there are many fundamental differences across sites: words have different meanings, information is structured differently or not at all, and query interfaces vary widely. Our ultimate goal is not to replace the Web with a new information resource, but rather to add functionality and tools to the Web -- to transform it into the Global InfoBase.
Please see Publications and Products above.