FROM THE WEB TO THE GLOBAL INFOBASE

Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University

Contact Information

Hector Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom

WWW page

Our project's main Web page is at http://www-db.stanford.edu/gib/. Other sites relevant to the grant include the DB Group home page, the Infolab home page, the NLP Group home page, and the Digital Libraries project home page.

Project Award Information

Keywords

World-Wide Web, information retrieval, database, natural-language processing, data mining

Broader Impact Activities

There were four noteworthy activities this year (Jul 03-Jun 04):

(1) Three of the students working on the Global InfoBase (GIB) Project, Sep Kamvar, Taher Haveliwala, and Glen Jeh, founded a company, Kaltix, based on the search personalization technology developed by the GIB Project. After the founders gave talks at the major search engine companies (Google, Yahoo, Microsoft, ...) describing the GIB technologies and their implementation, Google acquired Kaltix in late 2003. The founders now work at Google, and the Google implementation of the core technology is currently showcased on the Google Labs site.

(2) In May 2003, David Hart at NSF produced a press release highlighting our work, first reported last year, on efficient personalized PageRank computation (NSF PR 03-56), timed to coincide with the World Wide Web 2003 conference. As well as correctly highlighting the strong linkage between basic mathematical research and real-world computing advances, the press release brought a great deal of media and industry attention to the work. In many ways it set in motion the sequence of events that led the Kaltix founders to commercialize the technology and start the company.

(3) We organized a Stanford-IBM Day on November 7, 2003. This one-day workshop highlighted work done at Stanford and at IBM, including GIB work; about 75 people from the two institutions participated, and details of the event are available on the project Web site. This type of event plays an important technology-transfer role: we at Stanford disseminate our results to industry, while at the same time learning what industry is up to and what problems it is facing.

(4) Similarly, we organized a Stanford-Google Day on April 8, 2004. About 25 Stanford faculty and students, including GIB participants, visited Google and made presentations on our current work. Researchers from Google discussed some of the problems they are focusing on. Again, this event provided another opportunity for technology transfer.

Students

Sep Kamvar worked on the mathematical foundations of PageRank-style methods, developing, with Taher Haveliwala, results on the eigengap and condition number of the Web matrix, as well as methods for effectively personalizing PageRank computations. Kamvar completed his PhD and is now working at Google.

Chris Olston worked in the area of approximate replication. Olston completed his PhD and is now an Assistant Professor in the Computer Science Department at Carnegie Mellon University.

Sriram Raghavan worked on the integration of information technologies, as well as on the efficient execution of complex Web queries. Raghavan is now working at the IBM Almaden Research Center and is expected to complete his thesis by the end of the summer of 2004.

Research Activities

In the period July 2003 - June 2004 we continued work on the GIB project, touching on the four project areas: combining technologies, personalization, semantics, and data mining. In what follows we provide some highlights of our research. For a complete list of publications and additional details, please visit the project Web site.

Adaptive Methods for the Computation of PageRank. We have observed that the convergence pattern of pages in the PageRank algorithm is nonuniform: many pages converge to their true PageRank quickly, while relatively few pages take much longer, and these slow-converging pages are generally those with high PageRank. We used this observation to devise a simple algorithm, which we call Adaptive PageRank, in which the PageRanks of pages that have already converged are not recomputed in subsequent iterations. This speeds up the computation of PageRank by nearly 30%. For additional details, see:
Sepandar D. Kamvar, Taher H. Haveliwala, and Gene H. Golub. Adaptive Methods for the Computation of PageRank. International Conference on the Numerical Solution of Markov Chains, September 2003. 

Computing PageRank using Power Extrapolation. We have developed a novel technique for speeding up the computation of PageRank, a hyperlink-based estimate of the "importance" of Web pages, based on the ideas presented in "Extrapolation Methods for Accelerating PageRank Computations". The original PageRank algorithm uses the Power Method to compute successive iterates that converge to the principal eigenvector of the Markov matrix representing the Web link graph. Our new algorithm, called Power Extrapolation, accelerates the convergence of the Power Method by subtracting from the current iterate the error along several nonprincipal eigenvectors, making use of the known nonprincipal eigenvalues of the Web hyperlink matrix. Empirically, we have shown that in realistic scenarios Power Extrapolation speeds up PageRank computation by 30% over the standard Power Method on a Web graph of 80 million nodes, and that the method is simple to understand and implement. For additional details, please see:
Taher Haveliwala, Sepandar Kamvar, Dan Klein, Christopher Manning, and Gene Golub. Computing PageRank using Power Extrapolation , Preprint, July 2003. 
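
A simplified sketch of the extrapolation step, assuming the dominant error lies along the second eigenvector and using the known second eigenvalue lambda_2 = c of the Web matrix. The graph, the single-step schedule, and the parameter names are invented for illustration:

```python
import numpy as np

def power_extrapolation(A, c=0.85, extrap_at=10, tol=1e-9, max_iter=200):
    """Power Method with one extrapolation step: writing the iterate
    as x = u + e, with e along the second eigenvector (eigenvalue c),
    two successive iterates can be solved for the principal vector u."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    for k in range(max_iter):
        x_new = c * A.T @ x + (1.0 - c) / n
        if k == extrap_at:
            # x_new ~ u + c*e and x ~ u + e  =>  u ~ (x_new - c*x)/(1-c)
            x_new = (x_new - c * x) / (1.0 - c)
            x_new /= x_new.sum()             # keep it a distribution
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x
```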

An Analytical Comparison of Approaches to Personalizing PageRank. PageRank, the popular link-analysis algorithm for ranking Web pages, assigns a query- and user-independent estimate of "importance" to Web pages. Query- and user-sensitive extensions of PageRank, which use a basis set of biased PageRank vectors, have been proposed in order to personalize the ranking function in a tractable way. We have analytically compared three recent approaches to personalizing PageRank and have studied the tradeoffs of each. For additional details, please see:
Taher Haveliwala, Sepandar D. Kamvar, and Glen Jeh. An Analytical Comparison of Approaches to Personalizing PageRank, Preprint, June 2003. 
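
The tractability of the basis-set approach rests on the linearity of personalized PageRank in the teleportation distribution: the personalized vector for a weighted mixture of preferences is the same mixture of the individual personalized vectors, so precomputed basis vectors can simply be combined at query time. A minimal numerical check of that property (the graph and weights are invented for illustration):

```python
import numpy as np

def personalized_pagerank(A, v, c=0.85, iters=300):
    """Power iteration with teleportation (preference) distribution v.
    A is a row-stochastic link matrix; c is the damping factor."""
    x = v.copy()
    for _ in range(iters):
        x = c * A.T @ x + (1.0 - c) * v
    return x
```

Because each iteration is linear in v, the vector for preference 0.3*e0 + 0.7*e1 equals 0.3 times the vector for e0 plus 0.7 times the vector for e1, which is exactly why a small basis of biased vectors suffices.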

The Condition Number of the PageRank Problem. We have determined analytically the condition number of the PageRank problem. Our result has implications for the accuracy to which PageRank can be computed, currently and as the web scales. Furthermore, it provides a simple proof that, for the parameters currently used by Google in computing PageRank, small changes in the link structure of the web do not cause large changes in the PageRanks of pages of the web. For additional details, please see:
Sepandar D. Kamvar and Taher H. Haveliwala, The Condition Number of the PageRank Problem , Preprint, June 2003.  
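
In brief (see the reference for the precise statement and proof), if c denotes the damping factor of the PageRank model, the condition number of the problem in the 1-norm is

```latex
\kappa_1 = \frac{1+c}{1-c}
```

so for the value c = 0.85 used in practice, kappa_1 is roughly 12.3: the problem is well-conditioned, which is what underlies the stability claim above.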

The Second Eigenvalue of the Google Matrix. We have determined analytically the modulus of the second eigenvalue for the web hyperlink matrix used by Google for computing PageRank. Our results have implications for the convergence rate of the standard PageRank algorithm as the web scales, for the stability of PageRank to perturbations to the link structure of the web, for the detection of Google spammers, and for the design of algorithms to speed up PageRank. For additional details, please see:
Taher H. Haveliwala and Sepandar D. Kamvar, The Second Eigenvalue of the Google Matrix. Preprint, April 2003. 
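
Stated compactly (modulo the row/column convention), the Google matrix is a convex combination of the link matrix P and a rank-one teleportation matrix, and the result is

```latex
A = cP + (1-c)\,\mathbf{1}v^{T}, \qquad |\lambda_2(A)| \le c
```

with equality when the Web's link graph contains at least two irreducible closed subsets, as realistic crawls do. Since the Power Method converges at the rate |lambda_2/lambda_1| = c, this pins down the convergence behavior of PageRank independently of the size of the graph, and it is what makes extrapolation with a known second eigenvalue possible.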

Mining Properties of Graph-Structured Information. We have developed a new data mining system for graph-structured information, called F-Miner. Prior to F-Miner, most data mining algorithms on graphs looked for nodes satisfying specific properties, such as particular notions of structural similarity or particular measures of link-based importance. While such analyses for predetermined properties can be effective in well-understood domains, identifying an appropriate property for analysis can sometimes be a challenge, and focusing on a single property may neglect other important aspects of the data. Glen and Jennifer developed a foundation for mining the properties themselves, including a theoretical framework defining the space of graph properties, a variety of mining queries enabled by the framework, and techniques to handle the enormous size of the query space. The F-Miner prototype system demonstrates the utility and feasibility of property mining. For additional details, please see:
G. Jeh and J. Widom. Mining the Space of Graph Properties. To appear in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, August 2004. 

Keyword Search over Relational Databases. We developed a new system called EKSO (Efficient Keyword Search through Offline indexing) that provides a general-purpose keyword search interface over any relational database. Typically, relational database systems require the user to learn SQL and to know the schema of the underlying data even to pose simple searches. EKSO supports highly efficient keyword-based search over relational databases: The database is "crawled" in advance, text-indexing virtual documents that correspond to interconnected database content. At query time, the text index supports keyword-based searches with instantaneous response, identifying database objects corresponding to the virtual documents matching the query. EKSO uses the DB2 Net Search Extender for indexing and keyword-search processing. Experimental results show that index size is manageable, query response time is instantaneous, and database updates (which are propagated incrementally as recomputed virtual documents to the text index) do not significantly hinder query performance. Qi and Jennifer also performed a user study confirming the superiority of keyword-based search over SQL for a wide range of database retrieval tasks. For additional details, please see:
Q. Su and J. Widom. Indexing Relational Database Content Offline for Efficient Keyword-Based Search. Submitted for conference publication, May 2004. 
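
The offline-indexing idea can be illustrated with a toy example: each row, joined with related rows via foreign keys, becomes a virtual text document, and an inverted index over those documents answers keyword queries without any SQL at query time. The tables, the foreign key, and the join rule below are invented for illustration; EKSO itself crawls a real relational schema and hands the virtual documents to the DB2 Net Search Extender.

```python
from collections import defaultdict

# Toy "database": two tables and a foreign key between them.
orders = {1: "blue widget", 2: "red gadget"}
customers = {1: "Alice Smith", 2: "Bob Jones"}
order_customer = {1: 2, 2: 1}   # order id -> customer id (foreign key)

# Offline "crawl": build one virtual document per order by
# concatenating the order's text with its customer's text.
virtual_docs = {
    oid: f"{text} {customers[order_customer[oid]]}"
    for oid, text in orders.items()
}

# Offline: invert the virtual documents into a keyword index.
index = defaultdict(set)
for oid, doc in virtual_docs.items():
    for word in doc.lower().split():
        index[word].add(oid)

def keyword_search(query):
    """AND-semantics keyword search over the prebuilt index,
    returning the ids of matching database objects."""
    words = query.lower().split()
    results = set.intersection(*(index[w] for w in words)) if words else set()
    return sorted(results)
```

Because all joins happen offline, a query touches only the text index, which is why response time stays essentially constant regardless of how the matching content is spread across tables.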

Approximate Replication. Chris Olston completed his Ph.D. thesis under the guidance of Jennifer Widom in the area of approximate replication. The thesis addresses the common case of distributed information management environments where data is replicated across multiple nodes. Keeping replicas exactly consistent poses a significant challenge due to the communication cost, but fortunately in many applications some amount of inaccuracy or staleness may be tolerable, in which case approximate replication is used. Chris's thesis identifies a fundamental tradeoff between precision and performance in these environments, and develops two complementary approaches to working with this tradeoff: (1) Maximize precision of replicas in the presence of constraints on communication costs. (2) Minimize communication cost in the presence of constraints on replica precision. For each approach, the thesis develops a formal problem definition, analyses, algorithms, and implementation techniques. Experiments were conducted with both synthetic and real-world data. A testbed network traffic monitoring system was built using the approximate replication techniques; it can track usage patterns and flag potential security hazards.
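
The precision/performance tradeoff can be made concrete with a toy bound-based protocol: the source pushes a refresh to the replica only when the master value drifts more than delta away from the last value sent, so replica reads are guaranteed accurate to within delta while most updates cost no communication. This is a simplified illustration of the tradeoff, not the thesis's actual algorithms.

```python
class ApproximateReplica:
    """Precision-bounded replication sketch: a wider bound delta
    means cheaper replication but less precise replica reads."""

    def __init__(self, delta):
        self.delta = delta      # precision bound on replica reads
        self.cached = None      # last value pushed to the replica
        self.messages = 0       # refresh messages sent so far

    def source_update(self, value):
        # Called at the source on every change to the master copy;
        # pushes a refresh only when the bound would be violated.
        if self.cached is None or abs(value - self.cached) > self.delta:
            self.cached = value
            self.messages += 1

    def read(self):
        # Replica read: guaranteed within delta of the master value.
        return self.cached
```

With delta = 5, a source that moves through 0, 1, 2, 10, 12, 30 sends only three refreshes (for 0, 10, and 30), yet every read stays within the stated bound.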

Exploiting Hierarchical Domain Structure to Compute Similarity. As part of our efforts to integrate information technologies, we studied the notion of similarity between objects. This notion finds use in many contexts, e.g., in search engines, collaborative filtering, and clustering. Objects being compared are often modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We have developed new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We also extended our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), e.g., to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We have also performed an experimental comparison of our measures against traditional similarity measures, including an informal user study that evaluated how well our measures match human intuition. For additional details, please see:
P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting Hierarchical Domain Structure to Compute Similarity. ACM Transactions on Information Systems (TOIS), Vol. 21, No. 1, 2003, pp. 64-93.
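
A toy sketch of the idea: two items are similar to the degree that their lowest common ancestor sits deep in the hierarchy, and two sets are compared by averaging each element's best match in the other set. The hierarchy and the scoring rule below are invented for illustration and are not the paper's actual measures.

```python
# Toy product hierarchy: item -> parent category (None marks the root).
parents = {
    "milk": "dairy", "cheese": "dairy",
    "dairy": "food", "bread": "food", "food": None,
}

def path_to_root(node):
    """Chain of nodes from `node` up to the hierarchy root."""
    path = [node]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path

def item_similarity(a, b):
    """Depth of the lowest common ancestor, normalized by the deeper
    item's depth: 1.0 for identical items, 0.0 when the only shared
    ancestor is the root."""
    pa, pb = path_to_root(a), path_to_root(b)
    lca_depth = max(len(path_to_root(x)) for x in set(pa) & set(pb)) - 1
    return lca_depth / (max(len(pa), len(pb)) - 1)

def set_similarity(s1, s2):
    """Symmetric average of each element's best match in the other set."""
    def one_way(src, dst):
        return sum(max(item_similarity(a, b) for b in dst) for a in src) / len(src)
    return (one_way(s1, s2) + one_way(s2, s1)) / 2
```

Under this scoring, a customer who bought milk and one who bought cheese score 0.5 rather than the 0.0 a plain intersection measure would report, matching the intuition that both shop for dairy products.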