References for introductory lecture

  1. Andrei Broder et al. "Graph Structure in the Web". WWW9 conference, 2000.
  2. Chris Anderson, "The Long Tail". Wired magazine, October 2004.
  3. Sergey Brin and Larry Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine". WWW7, 1998.
  4. Lada A. Adamic. "Zipf, Power-laws, and Pareto - a ranking tutorial."
  5. Lada A. Adamic and Bernardo A. Huberman. "Zipf's law and the Internet." Glottometrics 3, 2002, 143-150.

Web Crawling

  1. Junghoo Cho, Hector Garcia-Molina, Lawrence Page. "Efficient Crawling Through URL Ordering." Computer Networks and ISDN Systems, 30(1-7):161-172, 1998.
  2. Junghoo Cho, Hector Garcia-Molina. "Effective page refresh policies for Web crawlers." ACM Transactions on Database Systems, 28(4), December 2003.
  3. Ka Cheung Sia, Junghoo Cho. "Efficient Monitoring Algorithm for Fast News Alert". Technical report, UCLA, 2005.
  4. M. Najork and J. L. Wiener. "Breadth-First Crawling Yields High-Quality Pages." In Proceedings of the 10th International World Wide Web Conference, pages 114-118, Hong Kong, May 2001.
  5. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig. "Syntactic Clustering of the Web." WWW6, 1997.

PageRank, Hubs and Authorities

  1. Sergey Brin and Larry Page. The anatomy of a large-scale hypertextual web search engine. WWW7, 1998.
  2. J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46, 1999.
  3. Taher Haveliwala. Efficient Computation of PageRank. Technical Report, Stanford University, 1999.
  4. Taher Haveliwala. Topic-Sensitive PageRank. Proceedings of WWW11, 2002.
  5. Glen Jeh and Jennifer Widom. Scaling Personalized Web Search. Proceedings of WWW12, 2003.

Web Spam

  1. Zoltán Gyöngyi, Hector Garcia-Molina. Web Spam Taxonomy. First International Workshop on Adversarial Information Retrieval on the Web (at the 14th International World Wide Web Conference), Chiba, Japan, 2005.
  2. Zoltán Gyöngyi, Hector Garcia-Molina and Jan Pedersen. Combating Web Spam with TrustRank. 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, 2004.
  3. Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen. Link Spam Detection Based on Mass Estimation. 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006.

Recommendation Systems

  1. G. Adomavicius and A. Tuzhilin. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE TKDE, June 2005.
  2. Greg Linden, Brent Smith, and Jeremy York. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, Jan/Feb 2003.
  3. Sean McNee, John Riedl, Joseph A. Konstan. Accurate is not always good: How accuracy metrics have hurt recommender systems. ACM CHI 2006.
  4. Moses Charikar. Similarity Estimation Techniques from Rounding Algorithms. ACM STOC 2002.
  5. Monika Henzinger. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. ACM SIGIR 2006.

Relation Extraction

  1. Sergey Brin. Extracting Patterns and Relations from the World Wide Web. WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98, 1998.
  2. Eugene Agichtein and Luis Gravano. Snowball: Extracting Relations from Large Plain-Text Collections. Proceedings of the Fifth ACM International Conference on Digital Libraries, 2000.
  3. S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng. Web question answering: Is more always better? In Proceedings of SIGIR'02, Aug 2002, pp. 291-298.

Virtual Databases

  1. Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper Induction for Information Extraction. Intl. Joint Conference on Artificial Intelligence (IJCAI), 1997.
  2. Anand Rajaraman, Jeffrey D. Ullman. Querying Websites using Compact Skeletons. Journal of Computer and System Sciences 66(4): 809-851 (2003).