CS345 Autumn 2005: Data Mining
Course Information
Instructors: Anand Rajaraman (anand @ cs dt stanford dt edu),
Jeffrey D. Ullman (ullman @ gmail dt com).
TA: Robbie Yan (xyan @ stanford dt edu).
Email Address for Questions: cs345a-aut0506-staff @ lists dt stanford dt edu (This is the
best way to reach all three of us simultaneously)
Meeting: MW 3:15 - 4:30PM; Room: Education 128. Would you believe
our room got changed again? This one is right across the mall from the
place we met the first time. We hope it will be the last change.
Office Hours: Instructors will be available after classes that they
teach. Jeff is in 433 Gates and Anand in 438 Gates. Robbie's office hours:
2-5PM Tuesdays in B26B Gates.
Prerequisites: CS145 or equivalent.
Materials: There is no textbook, but students will use the
Gradiance automated homework system, for which a nominal fee will be charged.
Notes and/or slides will be posted on-line.
You can see earlier versions of
the notes and slides covering Data Mining.
Not all of those topics will be covered this year.
Requirements: There will be periodic homeworks (some on-line, using
the Gradiance system), a final exam,
and a project on web-mining, using the Stanford WebBase.
The homework will count just enough to encourage you to do it, about
20%.
The project and final will account for the bulk of the credit, in
roughly equal proportions.
Handouts
Note: The slides labeled as "for Anand's lecture" are authored
by Anand Rajaraman.
- Introduction to Data Mining: PowerPoint slides for Jeff's half of the 9/26 lecture. [PDF]
- Introduction to Web Mining: PowerPoint slides for Anand's half of the 9/26 lecture and part of the 9/28 lecture. [PDF]
- Mining Frequent Pairs; A-Priori: PowerPoint slides for Jeff's 9/28 lecture. [PDF]
- Improved Methods for Frequent-Pair Mining: PowerPoint slides for Jeff's 10/3 lecture. [PDF]
- Web Crawling: PowerPoint slides for Anand's 10/5 lecture. [PDF]
- PageRank and Hubs/Authorities: PowerPoint slides for Jeff's 10/10 and 10/12 lectures. [PDF]
- Minhashing and Locality-Sensitive Hashing: PowerPoint slides for Jeff's 10/12 and 10/17 lectures. [PDF]
- Topic-Specific PageRank: PowerPoint slides for Anand's 10/19 lecture. [PDF]
- Web Spam: PowerPoint slides for Anand's 10/24 lecture. [PDF]
- Introduction to Stream Mining: PowerPoint slides for Jeff's 10/26 lecture. [PDF]
- Stream Mining, Estimating Frequencies: PowerPoint slides, Set #1 and Set #2, for Jeff's 10/31 and 11/2 lectures. [PDF] [PDF]
- Extracting Relational Data from the Web: PowerPoint slides for Anand's 11/07 lecture. [PDF]
- Virtual Databases: PowerPoint slides, Set #1 and Set #2, for Anand's 11/09 lecture. [PDF] [PDF]
- Introduction to Clustering, k-Means: PowerPoint slides for Jeff's 11/14 and 11/16 lectures. [PDF]
- More on Clustering: PowerPoint slides for Jeff's 11/28 lecture. [PDF]
- Optimizing Selection of Ads: PowerPoint slides for Anand's 11/30 lecture. [PDF]
Assignments
Some of the homework will be on the Gradiance system.
You should go there to open your account and enter the class code, which
will be announced in class.
You can try the work as many times as you like, and we hope everyone will
eventually get 100%. The secret is that each of the questions involves a
"long-answer" problem, which you should work through. The Gradiance system
presents a random selection of right and wrong answer choices each time you
open it, and thus samples your knowledge of the full problem. While there
are ways to game the system, we group several questions at a time, so it is
hard to get 100% without actually working the problems. Also notice that
you have to wait 10 minutes between openings, so brute-force random
guessing will not work.
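For a rough sense of why guessing fails (an illustrative calculation; the
actual mix of questions and answer choices varies): if a group had four
questions with four answer choices each, a blind submission would be
entirely correct with probability (1/4)^4 = 1/256, so pure guessing would
need about 256 openings on average, which at one opening per 10 minutes
comes to roughly 43 hours.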
- The first Gradiance assignment is now available. It closes at 11:59PM on
Monday, Oct. 10. After that time, you will be able to see the solutions,
provided you tried the work at least once.
- The second Gradiance assignment is now available. It closes at 11:59PM on
Monday, Oct. 17.
- The third Gradiance assignment is now available. It consists of three
separate homeworks of 4 questions each. One is on Minhash/LSH, and the
other two are on crawling. They all close at 11:59PM on Monday, Oct. 31,
so you have two weeks to do the assignments.
- The fourth Gradiance assignment is now available. The topic is stream
mining, covering material taught on Oct. 26 and 31. It closes at 11:59PM
on Monday, Nov. 7.
Project
CS345A project specification:
- Overview: a software project that discovers or leverages interesting
relationships within a significant amount of data. It is best if the
project leverages what we have learned in class.
- Some project ideas (these are merely suggestions; they should by no means
restrict your imagination). A sketch of the biased-PageRank core behind the
first two ideas appears after this list.
- Implement an anti-spam algorithm (e.g., TrustRank) on a collection of web pages
- Implement a better version of topic-sensitive PageRank on a collection of
web pages (by "better," we mean "incorporating your own ideas")
- Implement a collaborative-filtering technique on basket/item data (from
eBay or Amazon, for instance)
- Implement key components of a vertical search engine (e.g., a crawler)
- Implement a meta-search engine that post-processes Google's and Yahoo!'s results
- Implement a meta-search engine that queries multiple databases and merges
and presents their results in a meaningful way
- Implement a heuristic/algorithm for ranking Stanford-domain web pages
- Design and implement better one-pass algorithms for frequent-itemset identification
- Design and implement your own alternative to the locality-sensitive hashing algorithm
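To give a flavor of the implementation work behind the first two ideas,
here is a minimal sketch of biased ("personalized") PageRank computed by
power iteration, in Python. Everything in it is an illustrative assumption
rather than course-provided code: the function name, the dict-of-lists
graph representation, the damping factor beta = 0.85, and the fixed
iteration count.

def biased_pagerank(out_links, teleport_set, beta=0.85, iters=50):
    """out_links: dict mapping each page to the list of pages it links to
    (every link target is assumed to be a key of out_links).
    teleport_set: pages to teleport to; a topic sample for topic-sensitive
    PageRank, or trusted seed pages in the style of TrustRank.
    Returns a dict of page scores summing to 1."""
    pages = list(out_links)
    # The teleport probability is spread over the bias set only.
    teleport = {p: (1.0 / len(teleport_set) if p in teleport_set else 0.0)
                for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            if out_links[p]:  # spread beta of p's rank over its out-links
                share = beta * rank[p] / len(out_links[p])
                for q in out_links[p]:
                    nxt[q] += share
        # Probability lost to teleports and dead ends is reinserted
        # according to the bias vector.
        leaked = 1.0 - sum(nxt.values())
        for p in pages:
            nxt[p] += leaked * teleport[p]
        rank = nxt
    return rank

# Tiny example: three pages, teleporting only to page "a".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(biased_pagerank(graph, teleport_set={"a"}))

A real project would run this computation over WebBase-scale data, where
the rank vector and link structure no longer fit in memory; that
engineering is a large part of the work.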
- Deliverables:
- Project proposal (a 1-2 paragraph description of your project), due
11:59PM on Friday, November 4. You should include your project title and
delineate your
- project idea
- data source
- key algorithms/technology, and
- what you expect to submit at the end of the quarter.
It is to your benefit to flesh out your ideas as much as possible in this
proposal.
- Final project writeup (5-10 pages), due 11:59PM on Wednesday, December 7.
This is a comprehensive description of your project. You should include
the following:
- project idea
- your specific implementation
- key results and metrics of your system
- what worked, what did not work, what surprised you, and why
- Demo with professors and TA. We will schedule demos with each team during
dead week and finals week (dead week is the week before finals).
- Final presentation. On the last day of class (Wednesday, December 7),
each team presents its project to the rest of the class. The presentation
should have no more than 3 slides (not counting title, references, etc.)
and last no more than 5 minutes. 3 minutes will be allotted for questions.
Resources: Stanford WebBase project. Find a description here. Find out how
to access web pages in the repository here.
Course Outline
Some of the topics to be covered in the data-mining section are:
Data Mining
- Frequent Itemsets (Association Rules): A-Priori and improvements.
- Low-Frequency, High-Correlation Mining: minhashing, locality-sensitive hashing.
- Mining data streams.
- Clustering for large-scale data.
Web Mining
- Anatomy of a search engine.
- Crawling the web.
- Web-graph analysis.
- Extracting structured data from the web.
- Classification and vertical search.
- Web-log analysis.
- Search advertising and optimization.
Here is a Tentative Schedule
for the course.
It will be updated periodically.
Resources and Readings
Frequently Asked Questions
- For any given homework, which score on Gradiance counts? The best one or the last one?
The last one. After the homework
closes, you will be allowed to view your last submission with
solutions. Also, if you wish, you can then play with the assignment
again, e.g. to practice for an exam, but the score will not count.
Note, however, that if you don't submit at least once, you don't get
to see solutions.
-
Why should we believe that, just because no itemset in
the negative border is frequent in the full dataset, there is no other
itemset that is not frequent in the sample, yet is frequent in the
full dataset?
The proof is not trivial; it can be found in Toivonen's paper.
You can prove it yourself by a careful induction on the size of an itemset
S, showing that for every S, either:
1. S is frequent in the sample, or
2. S contains a member of the negative border.
(Note that 2 includes the case where S *is* a member of the negative border.)
We will leave the proof to you. But once you have that, suppose S is
frequent in the full dataset but not in the sample. Then (2) holds, so S
contains some member T of the negative border, and by monotonicity (every
subset of a frequent itemset is frequent), T is also frequent in the full
dataset.
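For reference, here is one way to state the claim formally. The notation
(F for the sample-frequent itemsets, B for the negative border) is ours,
not necessarily Toivonen's, and the sketch below is only the skeleton of
the induction left to you.

\textbf{Claim.} Let $F$ be the family of itemsets frequent in the sample,
and let the negative border $B$ be the family of itemsets not in $F$ all of
whose immediate subsets are in $F$ (taking $\emptyset \in F$ by convention).
Then every itemset $S$ satisfies
\[
  S \in F \quad\text{or}\quad \exists\, T \in B \text{ with } T \subseteq S.
\]
\textit{Induction on $|S|$:} if $S \notin F$ and every immediate subset of
$S$ is in $F$, then $S \in B$ by definition, so take $T = S$; otherwise
some immediate subset $S' \notin F$, and the inductive hypothesis applied
to $S'$ gives $T \in B$ with $T \subseteq S' \subseteq S$.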