This talk reports on recent efforts to extract structured information from the (unstructured) web. We focus primarily on graph-theoretic methods for this task, and describe why classical relational approaches and data mining tools both fall short in this context. We introduce a new class of algorithms that are "dual" in nature to commonly used data mining methods such as Apriori, in that they attempt to intelligently prune the data set rather than the candidate frequent sets. We report on the performance improvements obtained with this new algorithmic method. This is joint work with Ravi Kumar, Prabhakar Raghavan and Andrew Tomkins, and was reported in part at the VLDB'99 conference.
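The abstract does not spell out the algorithms themselves. As a rough illustration of what pruning the data set (rather than the candidate sets) can look like on a link graph, the sketch below assumes a hypothetical target pattern: an (i, j) core, meaning i "fan" pages that all link to the same j "center" pages. The function name, the pattern, and the sample data are assumptions made for this example, not details taken from the talk.

```python
from collections import defaultdict


def prune_for_cores(edges, i, j):
    """Iteratively drop edges that cannot belong to any (i, j) core.

    Assumption for this sketch: a fan inside a core must have
    out-degree >= j and a center must have in-degree >= i, so any
    edge touching a node that fails its test is removed. Removing
    edges lowers other degrees, so the pass repeats until stable.
    """
    edges = set(edges)
    while True:
        out_deg = defaultdict(int)
        in_deg = defaultdict(int)
        for fan, center in edges:
            out_deg[fan] += 1
            in_deg[center] += 1
        kept = {
            (fan, center)
            for fan, center in edges
            if out_deg[fan] >= j and in_deg[center] >= i
        }
        if kept == edges:
            return kept
        edges = kept


# Made-up sample: fans a, b, c all point to centers x and y,
# plus two stray links that cannot be part of a (3, 2) core.
sample = [
    ("a", "x"), ("a", "y"), ("a", "z"),
    ("b", "x"), ("b", "y"),
    ("c", "x"), ("c", "y"),
    ("d", "x"),
]
pruned = prune_for_cores(sample, i=3, j=2)
```

On this sample the stray links from `d` and to `z` are discarded before any candidate core is enumerated, which is the sense in which the data, not the candidate list, is being pruned. Surviving edges are only a necessary condition; a separate step would still have to confirm actual cores.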
Sridhar Rajagopalan received a B.Tech. from the Indian Institute of Technology, Delhi, in 1989 and a Ph.D. from the University of California, Berkeley, in 1994. He was a DIMACS postdoctoral fellow from 1994 to 1996. He is now a Research Staff Member at the IBM Almaden Research Center. His research interests are algorithms and algorithm engineering, randomization, information and coding theory, and information retrieval issues on the web.