 CS345 Autumn, 2005: Data Mining.

Course Info  Handouts 
Assignments 
Project 
Course Outline 
Resources and Reading 
Frequently Asked Questions
Course Information
Instructors: Anand Rajaraman (anand @ cs dt stanford dt edu),
Jeffrey D. Ullman (ullman @ gmail dt com).
TA: Robbie Yan (xyan @ stanford dt edu).
Email Address for Questions: cs345aaut0506staff @ lists dt stanford dt edu (This is the
best way to reach all three of us simultaneously)
Meeting: MW 3:15-4:30PM; Room: Education 128. Would you believe
our room got changed again? This one is right across the mall from the
place we met the first time. We hope it will be the last change.
Office Hours: Instructors will be available after classes that they
teach. Jeff is in 433 Gates and Anand in 438 Gates. Robbie's office hours:
2-5PM Tuesdays in B26B Gates.
Prerequisites: CS145 or equivalent.
Materials: There is no text, but students will use the
Gradiance automated homework system for which a nominal fee will be charged.
Notes and/or slides will be posted online.
You can see earlier versions of
the notes and slides covering Data Mining.
Not all these topics will be covered this year.
Requirements: There will be periodic homeworks (some online, using
the Gradiance system), a final exam,
and a project on web mining, using the Stanford WebBase.
The homework will count just enough to encourage you to do it, about
20%.
The project and final will account for the bulk of the credit, in
roughly equal proportions.
Handouts
Note: The slides labeled as "for Anand's lecture" are authored
by Anand Rajaraman.
 Introduction to Data Mining
Powerpoint
Slides for Jeff's half of 9/26 lecture.
[PDF]
 Introduction to Web Mining
Powerpoint
Slides for Anand's half of 9/26 lecture and part of 9/28 lecture.
[PDF]
 Mining Frequent Pairs; A-Priori
Powerpoint
Slides for Jeff's 9/28 lecture.
[PDF]
 Improved Methods for Frequent-Pair Mining
Powerpoint
Slides for Jeff's 10/3 lecture.
[PDF]
 Web Crawling
Powerpoint slides for Anand's 10/5 lecture.
[PDF]
 PageRank and Hubs/Authorities
Powerpoint
Slides for Jeff's 10/10 and 10/12 lectures.
[PDF]
 Minhashing and Locality-Sensitive Hashing
Powerpoint
Slides for Jeff's 10/12 and 10/17 lectures.
[PDF]
 Topic-Specific PageRank
Powerpoint
Slides for Anand's 10/19 lecture.
[PDF]
 Web Spam
Powerpoint
Slides for Anand's 10/24 lecture.
[PDF]
 Introduction to Stream Mining
Powerpoint
Slides for Jeff's 10/26 lecture.
[PDF]
 Stream Mining, Estimating Frequencies
Powerpoint
Slides - Set #1 and Powerpoint
Slides - Set #2 for Jeff's 10/31 and 11/2 lectures.
[PDF]
[PDF]
 Extracting Relational Data from the Web
Powerpoint
Slides for Anand's 11/07 lecture.
[PDF]
 Virtual Databases
Powerpoint
Slides - Set #1 and Powerpoint
Slides - Set #2 for Anand's 11/09 lecture.
[PDF]
[PDF]
 Introduction to Clustering, k-Means
Powerpoint
Slides for Jeff's 11/14 and 11/16 lectures.
[PDF]
 More on Clustering
Powerpoint
Slides for Jeff's 11/28 lecture.
[PDF]
 Optimizing Selection of Ads
Powerpoint
Slides for Anand's 11/30 lecture.
[PDF]
Assignments
Some of the homework will be on the Gradiance system.
You should go there to open your account, and enter the class code that will
be told to you in class.
You can try
the work as many times as you like, and we hope everyone will eventually
get 100%. The secret is that each of the questions involves a
"longanswer" problem, which you should work. The Gradiance system gives you
random right and wrong answers each time you open it, and thus samples
your knowledge of the full problem. While there are ways to game the
system, we group several questions at a time, so it is hard to get 100%
without actually working the problems. Also notice that you have to
wait 10 minutes between openings, so brute-force random guessing will
not work.

The first Gradiance assignment
is now available. It closes at 11:59PM on Monday Oct. 10. After that time, you
will be able to see the solutions, provided you tried the work at least once.

The second Gradiance assignment
is now available. It closes at 11:59PM on Monday Oct. 17.

The third
Gradiance assignment
is now available. This assignment consists of three separate homeworks of
4 questions each. One is on Minhash/LSH, and the other two are on crawling.
They all close at 11:59PM on Monday Oct. 31, so you have two weeks to do the
assignments.

The fourth
Gradiance assignment
is now available. The topic is stream mining, covering material taught on
Oct. 26 and 31.
It closes at 11:59PM on Monday Nov. 7.
Project
CS345A Project specification:
 Overview: a software project that discovers or leverages interesting relationships within a
significant amount of data. Best if the project leverages what we have learned in class.
 Some project ideas (these are merely suggestions; they should by no means restrict your
imagination)
 Implement an anti-spam algorithm (e.g. TrustRank) on a collection of web pages
 Implement a better version of topic-sensitive PageRank on a collection of web pages (by
"better," we mean "incorporating your own ideas")
 Implement a collaborative-filtering technique on some basket/item data (from eBay or Amazon,
for instance)
 Implement key components of a vertical search engine (e.g. a crawler)
 Implement a meta-search engine that post-processes Google's and Yahoo's results
 Implement a meta-search engine that queries multiple databases and meshes and presents their
results in a meaningful way
 Implement a heuristic/algorithm for ranking Stanford-domain web pages
 Design and implement frequent-itemset identification by better one-pass algorithms
 Design and implement your own alternative to the locality-sensitive hashing algorithm
 Deliverables:
 Project proposal (1-2 paragraph description of your project) due 11:59PM on Friday, November 4.
You should include your project title and delineate your
 project idea
 data source
 key algorithms/technology, and
 what you expect to submit at the end of the quarter.
It is to your benefit to flesh out your ideas as much as possible in this
proposal.
 Final project writeup (5-10 pages) due 11:59PM on Wednesday, December 7.
This is a comprehensive description of your project. You should include the following:
 project idea
 your specific implementation
 key results and metrics of your system
 what worked, what did not work, what surprised you, and why
 Demo with professors and TA
We will schedule demos with each team during dead week and finals week (dead week means the week
before finals)
 Final presentation
On the last day of class (Wednesday, December 7), each team presents their project to the rest of
the class. The presentation should have no more than 3 slides (not counting title/reference slides,
etc.) and last no more than 5 minutes. 3 minutes will be allotted for questions.
Resources: Stanford WebBase project. Find description here. Find how to access web
pages in the
repository here.
Course Outline
Some of the topics to be covered in the data-mining section are:
Data Mining

Frequent Itemsets (Association Rules): A-Priori and improvements.

Low-frequency, High-correlation Mining: minhashing, locality-sensitive
hashing.

Mining data streams.

Clustering for LargeScale Data.
Web Mining
 Anatomy of a search engine.
 Crawling the web.
 Web Graph Analysis
 Extracting structured data from the web.
 Classification and vertical search.
 Web Log Analysis.
 Search advertising and optimization.
Here is a Tentative Schedule
for the course.
It will be updated periodically.
Resources and Readings
Frequently Asked Questions
 For any given homework, which score on Gradiance counts? The best one or the last one?
The last one. After the homework
closes, you will be allowed to view your last submission with
solutions. Also, if you wish, you can then play with the assignment
again, e.g. to practice for an exam, but the score will not count.
Note, however, that if you don't submit at least once, you don't get
to see solutions.

Why should we believe that, just because no itemset in
the negative border is frequent in the full dataset, there is no other
itemset that is not frequent in the sample, yet is frequent in the
full dataset?
The proof is not trivial; it can be found in Toivonen's paper.
You can also prove it yourself, by a careful induction on the size of an itemset S,
that for every itemset S either:
1. S is frequent in the sample, or
2. S contains a member of the negative border.
(Note that (2) includes the case where S *is* a member of the negative border.)
We leave that proof to you. But once you have the claim, suppose S is
frequent in the full dataset but not in the sample. Then (2) holds, and by
monotonicity (every subset of a frequent itemset is also frequent), the member
of the negative border contained in S is frequent in the full dataset,
contradicting the assumption that no negative-border itemset is frequent there.
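The check described above can be sketched in a few lines of Python. This is a toy, brute-force illustration, not Toivonen's actual algorithm: the baskets, the support thresholds, and the exhaustive enumeration of itemsets are all hypothetical choices made for clarity (in practice the sample threshold is scaled down from the full-data threshold, and itemsets are found with A-Priori rather than enumeration).

```python
from itertools import combinations

def frequent_in(baskets, items, support, max_size):
    # Brute-force: every itemset up to max_size meeting the support
    # threshold. Fine for a toy example; real code would use A-Priori.
    freq = {frozenset()}  # the empty set is trivially frequent
    for k in range(1, max_size + 1):
        for cand in combinations(sorted(items), k):
            if sum(1 for b in baskets if set(cand) <= b) >= support:
                freq.add(frozenset(cand))
    return freq

def negative_border(freq, items, max_size):
    # Itemsets NOT frequent in the sample, all of whose immediate
    # subsets ARE frequent in the sample.
    border = set()
    for k in range(1, max_size + 1):
        for cand in combinations(sorted(items), k):
            s = frozenset(cand)
            if s not in freq and all(s - {i} in freq for i in s):
                border.add(s)
    return border

# Hypothetical full dataset and a "sample" of its first three baskets.
full = [set("ab"), set("ab"), set("abc"), set("ab"), set("a")]
sample = full[:3]
items = {"a", "b", "c"}

sample_freq = frequent_in(sample, items, support=2, max_size=3)
border = negative_border(sample_freq, items, max_size=3)

# Toivonen's check: if no negative-border itemset is frequent in the
# full dataset, then the itemsets frequent in the sample (verified
# against the full data) are ALL the frequent itemsets; otherwise we
# must resample and try again.
full_support = 3
bad = [s for s in border
       if sum(1 for b in full if s <= b) >= full_support]
print("negative border:", border)
print("possible false negatives:", bool(bad))
```

Here only {c} lands in the negative border (it is not frequent in the sample, but its sole immediate subset, the empty set, trivially is), and since {c} is also not frequent in the full data, the check passes and no frequent itemset can have been missed.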