Web Mining References
References for Introductory Lecture
 Andrei Broder et al, "Graph Structure of the Web". WWW9 conference, 2000.
 Chris Anderson, "The Long Tail". Wired magazine, October 2004.
 Sergey Brin and Larry Page. The anatomy of a large scale hypertextual web search engine. WWW7, 1998.
 Lada A. Adamic. "Zipf, Power-laws, and Pareto - a ranking tutorial."
 Lada A. Adamic and Bernardo A. Huberman. "Zipf's
law and the Internet." Glottometrics 3, 2002, 143-150.
Web Crawling
 Junghoo Cho, Hector Garcia-Molina, Lawrence Page. "Efficient
Crawling Through URL Ordering." Computer Networks and ISDN
Systems, 30(1-7):161-172, 1998.
 Junghoo Cho, Hector Garcia-Molina.
"Effective page refresh policies for Web crawlers."
ACM Transactions on Database Systems, 28(4): December 2003.
 Ka Cheung Sia, Junghoo Cho
"Efficient Monitoring Algorithm for Fast News Alert".
Technical report, UCLA, 2005.
 M. Najork and J. L. Wiener.
"Breadth-First Crawling Yields High-Quality
Pages." In Proceedings of the 10th International World Wide Web Conference,
pages 114-118, Hong Kong, May 2001.
 Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig.
"Syntactic Clustering of the Web."
WWW6, 1997.
Page Rank, Hubs and Authorities
 Sergey Brin and Larry Page. The anatomy of a
large scale hypertextual web search engine. WWW7, 1998.
 J. Kleinberg. Authoritative
sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms,
1998. Extended version in Journal of the ACM 46(1999).
 Taher Haveliwala. Efficient Computation of PageRank. Technical Report, Stanford University, 1999.
 Taher Haveliwala.
Topic-Sensitive PageRank. Proceedings of WWW11, 2002.
 Glen Jeh and Jennifer Widom. Scaling Personalized Web
Search. Proceedings of WWW12, 2003.
Web Spam
 Zoltán Gyöngyi, Hector Garcia-Molina.
Web Spam Taxonomy.
First International Workshop on Adversarial Information Retrieval on the
Web (at the 14th
International World Wide Web Conference), Chiba, Japan, 2005.
 Zoltán Gyöngyi, Hector Garcia-Molina and Jan Pedersen.
Combating Web Spam with TrustRank.
30th International Conference on Very Large Data Bases (VLDB),
Toronto, Canada, 2004.
 Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen.
Link Spam Detection Based on Mass Estimation.
32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006.
Recommendation Systems
 G. Adomavicius and A. Tuzhilin. Towards the Next
Generation of Recommender Systems: A Survey of the State-of-the-Art
and Possible Extensions. IEEE TKDE, June 2005.
 Greg Linden, Brent Smith, and Jeremy York. Amazon.com
Recommendations: Item-to-Item Collaborative Filtering. IEEE
Internet Computing, Jan/Feb 2003.
 Sean McNee, John Riedl, Joseph A. Konstan. Being accurate is not
enough: How accuracy metrics have hurt recommender systems.
ACM CHI 2006.
 Moses Charikar. Similarity Estimation Techniques from
Rounding Algorithms. ACM STOC 2002.
 Monika Henzinger. Finding Near-Duplicate Web Pages: A
Large-Scale Evaluation of Algorithms. ACM SIGIR 2006.
Relation Extraction
 Sergey Brin. Extracting Patterns
and Relations from the
World Wide Web. WebDB Workshop at 6th International Conference on
Extending Database Technology, EDBT'98, 1998.
 Eugene Agichtein and Luis Gravano.
Snowball: Extracting Relations from Large Plain-Text Collections.
Proceedings of the Fifth ACM International Conference on Digital
Libraries, 2000.
 S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng (2002).
Web question answering: Is more always better? In Proceedings of SIGIR'02, Aug 2002,
pp. 291-298.
Virtual Databases
 Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos.
Wrapper Induction for Information Extraction.
Intl. Joint Conference on Artificial Intelligence (IJCAI), 1997.
 Anand Rajaraman, Jeffrey D. Ullman,
Querying Websites using Compact Skeletons.
Journal of Computer and System Sciences 66(4): 809-851 (2003).
Similarity Search

S. Chaudhuri, V. Ganti, and Raghav Kaushik,
A Primitive Operator
for Similarity Joins in Data Cleaning, 22nd ICDE (2006).

C. Xiao, W. Wang, X. Lin, and J. X. Yu,
Efficient Similarity
Joins for Near Duplicate Detection, 17th WWW Conference (2008), pp. 131-140.

P. Indyk and R. Motwani. "Approximate Nearest Neighbor:
Towards Removing the Curse of Dimensionality,"
30th STOC (1998), pp. 604-613.

A. Gionis, P. Indyk, and R. Motwani,
Similarity Search in High Dimensions
Via Hashing, 25th VLDB (1999), pp. 518-529.

E. Cohen. Size-Estimation Framework with Applications to
Transitive Closure and Reachability. Journal of Computer
and System Sciences 55 (1997), pp. 441-453.

A Debate About the "Long Tail"
between Chris Anderson and Anita Elberse.

A. S. Das, M. Datar, A. Garg, and S. Rajaram,
Google News Personalization:
Scalable On-Line Collaborative Filtering, 16th WWW Conference, pp. 271-280.

F. Chang et al.,
Bigtable: A Distributed Storage System
for Structured Data, 7th OSDI (2006).

B. H. Bloom,
"Space/time tradeoffs in hash coding with allowable errors,"
Comm. ACM 13:7 (1970), pp. 422-426.

W.H. Kautz and R.C. Singleton,
"Nonadaptive binary superimposed codes,"
IEEE Trans. Inform. Theory 10 (1964), pp. 363-377.
Netflix Prize

R. M. Bell, Y. Koren, and C. Volinsky,
The
BellKor 2008 Solution to the Netflix Prize.

A. Toscher and M. Jahrer,
The
BigChaos Solution to the Netflix Prize 2008
Association Rules

M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman,
"Computing Iceberg Queries Efficiently,"
1998 VLDB.

H. Toivonen, "Sampling Large Databases for Association Rules,"
VLDB 1996, pp. 134-145.

J. S. Park, M.-S. Chen, and P. S. Yu, "An Effective Hash-Based Algorithm
for Mining Association Rules,"
1995 SIGMOD, pp. 175-186.

R. Agrawal, T. Imielinski, A. Swami: "Mining Associations between Sets of Items
in Massive Databases", Proc. of the ACM
SIGMOD Int'l Conference on Management of Data,
Washington D.C., May 1993, 207-216.

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, Sept. 1994.
Stream Mining

M. Datar, A. Gionis, P. Indyk, and R. Motwani,
"Maintaining Stream Statistics Over Sliding Windows,"
SIAM J. Computing, 31 (2002): 1794-1813.

N. Alon, Y. Matias, and M. Szegedy,
"The Space Complexity of Approximating Frequency Moments,"
28th STOC, pp. 20-29, 1996.

P. Flajolet and G. N. Martin,
"Probabilistic Counting for Database Applications,"
JCSS 31:2 (Sept., 1985), pp. 182-209. Also 24th FOCS,
pp. 76-82, 1983.

J. Vitter,
"Random Sampling with a Reservoir,"
ACM Trans. on Mathematical Software 11:1 (1985), pp. 37-57.

Babcock et al.,
"Models and Issues in Data Streams,"
21st PODS (2002).
Clustering
 B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan,
"Maintaining Variance and k-Medians Over Data Stream Windows,"
2003 PODS.
 P. Bradley, U. Fayyad, and C. Reina, "Scaling Clustering Algorithms to
Large Databases," 1998 KDD.
 S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering
Algorithm for Large Databases," SIGMOD 1998.