Web Mining References
References for Introductory Lecture
- Andrei Broder et al., "Graph Structure of the Web". WWW9 conference, 2000.
- Chris Anderson, "The Long Tail". Wired magazine, October 2004.
- Sergey Brin and Larry Page. The anatomy of a large scale hypertextual web search engine. WWW7, 1998.
- Lada A. Adamic. "Zipf, Power-laws, and Pareto - a ranking tutorial."
- Lada A. Adamic and Bernardo A. Huberman. "Zipf's
law and the Internet." Glottometrics 3, 2002, 143-150.
Web Crawling
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page "Efficient
Crawling Through URL Ordering." Computer Networks and ISDN
Systems, 30(1-7):161-172, 1998.
- Junghoo Cho, Hector Garcia-Molina
"Effective page refresh policies for Web crawlers."
ACM Transactions on Database Systems, 28(4), December 2003.
- Ka Cheung Sia, Junghoo Cho
"Efficient Monitoring Algorithm for Fast News Alert".
Technical report, UCLA, 2005.
- M. Najork and J. L. Wiener.
"Breadth-First Crawling Yields High-Quality
Pages." In Proceedings of the 10th International World Wide Web Conference,
pages 114-118, Hong Kong, May 2001.
- Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig.
"Syntactic Clustering of the Web."
WWW6, 1997.
PageRank, Hubs and Authorities
- Sergey Brin and Larry Page. The anatomy of a
large scale hypertextual web search engine. WWW7, 1998.
- J. Kleinberg. Authoritative
sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms,
1998. Extended version in Journal of the ACM 46(1999).
- Taher Haveliwala. Efficient Computation of PageRank. Technical Report, Stanford University, 1999.
- Taher Haveliwala.
Topic-Sensitive PageRank. Proceedings of WWW11, 2002.
- Glen Jeh and Jennifer Widom. Scaling Personalized Web
Search. Proceedings of WWW12, 2003.
Web Spam
- Zoltán Gyöngyi, Hector Garcia-Molina.
Web Spam Taxonomy.
First International Workshop on Adversarial Information Retrieval on the
Web (at the 14th
International World Wide Web Conference), Chiba, Japan, 2005.
- Zoltán Gyöngyi, Hector Garcia-Molina and Jan Pedersen.
Combating Web Spam with TrustRank.
30th International Conference on Very Large Data Bases (VLDB),
Toronto, Canada, 2004.
- Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen.
Link Spam Detection Based on Mass Estimation.
32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006.
Recommendation Systems
- G. Adomavicius and A. Tuzhilin. Towards the Next
Generation of Recommender Systems: A Survey of the State-of-the-Art
and Possible Extensions. IEEE TKDE, June 2005.
- Greg Linden, Brent Smith, and Jeremy York. Amazon.com
Recommendations: Item-to-Item Collaborative Filtering. IEEE
Internet Computing, Jan/Feb 2003.
- Sean McNee, John Riedl, Joseph A. Konstan. Accurate is not
always good: How accuracy metrics have hurt recommender systems.
ACM CHI 2006.
- Moses Charikar. Similarity Estimation Techniques from
Rounding Algorithms. ACM STOC 2002.
- Monika Henzinger. Finding Near-Duplicate Web Pages: A
Large-Scale Evaluation of Algorithms. ACM SIGIR 2006.
Relation Extraction
- Sergey Brin. Extracting Patterns
and Relations from the
World Wide Web. WebDB Workshop at 6th International Conference on
Extending Database Technology, EDBT'98, 1998.
- Eugene Agichtein and Luis Gravano.
Snowball: Extracting Relations from Large Plain-Text Collections.
Proceedings of the Fifth ACM International Conference on Digital
Libraries, 2000.
- S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng.
Web question answering: Is more always better? In Proceedings of SIGIR'02, Aug 2002,
pp. 291-298.
Virtual Databases
- Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos.
Wrapper Induction for Information Extraction.
Intl. Joint Conference on Artificial Intelligence (IJCAI), 1997.
- Anand Rajaraman, Jeffrey D. Ullman,
Querying Websites using Compact Skeletons.
Journal of Computer and System Sciences 66(4): 809-851 (2003).
Similarity Search
- S. Chaudhuri, V. Ganti, and Raghav Kaushik,
A Primitive Operator
for Similarity Joins in Data Cleaning, 22nd ICDE (2006).
- C. Xiao, W. Wang, X. Lin, and J. X. Yu,
Efficient Similarity
Joins for Near Duplicate Detection, 17th WWW Conference (2008), pp. 131-140.
- P. Indyk and R. Motwani. "Approximate Nearest Neighbor:
Towards Removing the Curse of Dimensionality,"
30th STOC (1998), pp. 604-613.
- A. Gionis, P. Indyk, and R. Motwani,
Similarity Search in High Dimensions
Via Hashing, 25th VLDB (1999), pp. 518-529.
- E. Cohen. Size-Estimation Framework with Applications to
Transitive Closure and Reachability. Journal of Computer
and System Sciences 55 (1997), pp. 441-453.
- A Debate About the "Long Tail"
between Chris Anderson and Anita Elberse.
- A. S. Das, M. Datar, A. Garg, and S. Rajaram,
Google News Personalization:
Scalable On-Line Collaborative Filtering, 16th WWW Conference (2007), pp. 271-280.
- F. Chang et al.,
Bigtable: A Distributed System
for Structured Data, 7th OSDI (2006).
- B. H. Bloom,
"Space/time trade-offs in hash coding with allowable errors,"
Comm. ACM 13:7 (1970), pp. 422-426.
- W.H. Kautz and R.C. Singleton,
"Nonadaptive binary superimposed codes,"
IEEE Trans. Inform. Theory 10 (1964), pp. 363-377.
Netflix Prize
- R. M. Bell, Y. Koren, and C. Volinsky,
The
BellKor 2008 Solution to the Netflix Prize.
- A. Töscher and M. Jahrer,
The
BigChaos Solution to the Netflix Prize 2008
Association Rules
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman,
"Computing Iceberg Queries Efficiently,"
1998 VLDB.
- H. Toivonen, "Sampling Large Databases for Association Rules,"
VLDB 1996, pp. 134-145.
- J. S. Park, M.-S. Chen, and P. S. Yu, "An Effective Hash-Based Algorithm
for Mining Association Rules,"
1995 SIGMOD, pp. 175-186.
- R. Agrawal, T. Imielinski, A. Swami: "Mining Association Rules between Sets of Items
in Large Databases", Proc. of the ACM
SIGMOD Int'l Conference on Management of Data,
Washington D.C., May 1993, 207-216.
- R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, Sept. 1994.
Stream Mining
- M. Datar, A. Gionis, P. Indyk, and R. Motwani,
"Maintaining Stream Statistics Over Sliding Windows,"
SIAM J. Computing, 31 (2002): 1794-1813.
- N. Alon, Y. Matias, and M. Szegedy,
"The Space Complexity of Approximating Frequency Moments,"
28th STOC, pp. 20-29, 1996.
- P. Flajolet and G. N. Martin,
"Probabilistic Counting for Database Applications,"
JCSS 31:2 (Sept., 1985), pp. 182-209. Also 24th FOCS,
pp. 76-82, 1983.
- J. Vitter,
"Random Sampling with a Reservoir,"
ACM Trans. on Mathematical Software 11:1 (1985), pp. 37-57.
- Babcock et al.,
"Models and Issues in Data Streams,"
21st PODS (2002).
Clustering
- B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan,
"Maintaining Variance and k-Medians Over Data Stream Windows,"
2003 PODS.
- P. Bradley, U. Fayyad, and C. Reina, "Scaling Clustering Algorithms to
Large Databases," 1998 KDD.
- S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering
Algorithm for Large Databases," SIGMOD 1998.