CS345 Autumn 2005: Data Mining
Course Information
Instructors: Anand Rajaraman (anand @ cs dt stanford dt edu),
Jeffrey D. Ullman (ullman @ gmail dt com).
TA: Robbie Yan (xyan @ stanford dt edu).
Email Address for Questions: cs345a-aut0506-staff @ lists dt stanford dt edu (This is the
best way to reach all three of us simultaneously)
Meeting: MW 3:15 - 4:30PM; Room: Education 128. Would you believe
our room got changed again? This one is right across the mall from the
place we met the first time. We hope it will be the last change.
Office Hours: Instructors will be available after classes that they
teach. Jeff is in 433 Gates and Anand in 438 Gates. Robbie's office hours:
2-5PM Tuesdays in B26B Gates.
Prerequisites: CS145 or equivalent.
Materials: There is no textbook, but students will use the
Gradiance automated homework system, for which a nominal fee will be charged.
Notes and/or slides will be posted on-line.
You can see earlier versions of
the notes and slides covering Data Mining.
Not all of those topics will be covered this year.
Requirements: There will be periodic homeworks (some on-line, using
the Gradiance system), a final exam,
and a project on web-mining, using the Stanford WebBase.
The homework will count just enough to encourage you to do it, about
20%.
The project and final will account for the bulk of the credit, in
roughly equal proportions.
Handouts
Note: The slides labeled as "for Anand's lecture" are authored
by Anand Rajaraman.
- Introduction to Data Mining: PowerPoint slides for Jeff's half of the 9/26 lecture. [PDF]
- Introduction to Web Mining: PowerPoint slides for Anand's half of the 9/26 lecture and part of the 9/28 lecture. [PDF]
- Mining Frequent Pairs; A-Priori: PowerPoint slides for Jeff's 9/28 lecture. [PDF]
- Improved Methods for Frequent-Pair Mining: PowerPoint slides for Jeff's 10/3 lecture. [PDF]
- Web Crawling: PowerPoint slides for Anand's 10/5 lecture. [PDF]
- PageRank and Hubs/Authorities: PowerPoint slides for Jeff's 10/10 and 10/12 lectures. [PDF]
- Minhashing and Locality-Sensitive Hashing: PowerPoint slides for Jeff's 10/12 and 10/17 lectures. [PDF]
- Topic-Specific PageRank: PowerPoint slides for Anand's 10/19 lecture. [PDF]
- Web Spam: PowerPoint slides for Anand's 10/24 lecture. [PDF]
- Introduction to Stream Mining: PowerPoint slides for Jeff's 10/26 lecture. [PDF]
- Stream Mining, Estimating Frequencies: PowerPoint slides, Set #1 and Set #2, for Jeff's 10/31 and 11/2 lectures. [PDF] [PDF]
- Extracting Relational Data from the Web: PowerPoint slides for Anand's 11/07 lecture. [PDF]
- Virtual Databases: PowerPoint slides, Set #1 and Set #2, for Anand's 11/09 lecture. [PDF] [PDF]
- Introduction to Clustering, k-Means: PowerPoint slides for Jeff's 11/14 and 11/16 lectures. [PDF]
- More on Clustering: PowerPoint slides for Jeff's 11/28 lecture. [PDF]
- Optimizing Selection of Ads: PowerPoint slides for Anand's 11/30 lecture. [PDF]
Assignments
Some of the homework will be on the Gradiance system.
You should go there to open your account and enter the class code, which
will be announced in class.
You can try the work as many times as you like, and we hope everyone will
eventually get 100%. The secret is that each of the questions involves a
"long-answer" problem, which you should work through. The Gradiance system
presents a random selection of right and wrong answer choices each time you
open it, and thus samples your knowledge of the full problem. While there
are ways to game the system, we group several questions at a time, so it is
hard to get 100% without actually working the problems. Also notice that
you have to wait 10 minutes between openings, so brute-force random
guessing will not work.
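For a rough sense of why guessing fails (an illustrative calculation; the
actual mix of questions and answer choices varies): if a group had four
questions with four answer choices each, a blind submission would be
entirely correct with probability (1/4)^4 = 1/256, so pure guessing would
need about 256 openings on average, which at one opening per 10 minutes
comes to roughly 43 hours.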
- The first Gradiance assignment is now available. It closes at 11:59PM on
Monday, Oct. 10. After that time, you will be able to see the solutions,
provided you tried the work at least once.
- The second Gradiance assignment is now available. It closes at 11:59PM on
Monday, Oct. 17.
- The third Gradiance assignment is now available. It consists of three
separate homeworks of 4 questions each. One is on Minhash/LSH, and the
other two are on crawling. They all close at 11:59PM on Monday, Oct. 31,
so you have two weeks to do the assignments.
- The fourth Gradiance assignment is now available. The topic is stream
mining, covering material taught on Oct. 26 and 31. It closes at 11:59PM
on Monday, Nov. 7.
Project
CS345A project specification:
- Overview: a software project that discovers or leverages interesting
relationships within a significant amount of data. It is best if the
project leverages what we have learned in class.
- Some project ideas (these are merely suggestions; they should by no means
restrict your imagination). A sketch of the biased-PageRank core behind the
first two ideas appears after this list.
- Implement an anti-spam algorithm (e.g., TrustRank) on a collection of web pages
- Implement a better version of topic-sensitive PageRank on a collection of
web pages (by "better," we mean "incorporating your own ideas")
- Implement a collaborative-filtering technique on basket/item data (from
eBay or Amazon, for instance)
- Implement key components of a vertical search engine (e.g., a crawler)
- Implement a meta-search engine that post-processes Google's and Yahoo!'s results
- Implement a meta-search engine that queries multiple databases and merges
and presents their results in a meaningful way
- Implement a heuristic/algorithm for ranking Stanford-domain web pages
- Design and implement better one-pass algorithms for frequent-itemset identification
- Design and implement your own alternative to the locality-sensitive hashing algorithm
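To give a flavor of the implementation work behind the first two ideas,
here is a minimal sketch of biased ("personalized") PageRank computed by
power iteration, in Python. Everything in it is an illustrative assumption
rather than course-provided code: the function name, the dict-of-lists
graph representation, the damping factor beta = 0.85, and the fixed
iteration count.

def biased_pagerank(out_links, teleport_set, beta=0.85, iters=50):
    """out_links: dict mapping each page to the list of pages it links to
    (every link target is assumed to be a key of out_links).
    teleport_set: pages to teleport to; a topic sample for topic-sensitive
    PageRank, or trusted seed pages in the style of TrustRank.
    Returns a dict of page scores summing to 1."""
    pages = list(out_links)
    # The teleport probability is spread over the bias set only.
    teleport = {p: (1.0 / len(teleport_set) if p in teleport_set else 0.0)
                for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            if out_links[p]:  # spread beta of p's rank over its out-links
                share = beta * rank[p] / len(out_links[p])
                for q in out_links[p]:
                    nxt[q] += share
        # Probability lost to teleports and dead ends is reinserted
        # according to the bias vector.
        leaked = 1.0 - sum(nxt.values())
        for p in pages:
            nxt[p] += leaked * teleport[p]
        rank = nxt
    return rank

# Tiny example: three pages, teleporting only to page "a".
graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
print(biased_pagerank(graph, teleport_set={"a"}))

A real project would run this computation over WebBase-scale data, where
the rank vector and link structure no longer fit in memory; that
engineering is a large part of the work.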
- Deliverables:
- Project proposal (a 1-2 paragraph description of your project), due
11:59PM on Friday, November 4. You should include your project title and
delineate your
- project idea
- data source
- key algorithms/technology, and
- what you expect to submit at the end of the quarter.
It is to your benefit to flesh out your ideas as much as possible in this
proposal.
- Final project writeup (5-10 pages), due 11:59PM on Wednesday, December 7.
This is a comprehensive description of your project. You should include
the following:
- project idea
- your specific implementation
- key results and metrics of your system
- what worked, what did not work, what surprised you, and why
- Demo with professors and TA. We will schedule demos with each team during
dead week and finals week (dead week is the week before finals).
- Final presentation. On the last day of class (Wednesday, December 7),
each team presents its project to the rest of the class. The presentation
should have no more than 3 slides (not counting title, references, etc.)
and last no more than 5 minutes. 3 minutes will be allotted for questions.
Resources: Stanford WebBase project. Find a description here. Find out how
to access web pages in the repository here.
Course Outline
Some of the topics to be covered in the data-mining section are:
Data Mining
- Frequent Itemsets (Association Rules): A-Priori and improvements.
- Low-Frequency, High-Correlation Mining: minhashing, locality-sensitive hashing.
- Mining data streams.
- Clustering for large-scale data.
Web Mining
- Anatomy of a search engine.
- Crawling the web.
- Web-graph analysis.
- Extracting structured data from the web.
- Classification and vertical search.
- Web-log analysis.
- Search advertising and optimization.
Here is a Tentative Schedule
for the course.
It will be updated periodically.
Resources and Readings
Frequently Asked Questions
- For any given homework, which score on Gradiance counts? The best one or the last one?
The last one. After the homework
closes, you will be allowed to view your last submission with
solutions. Also, if you wish, you can then play with the assignment
again, e.g. to practice for an exam, but the score will not count.
Note, however, that if you don't submit at least once, you don't get
to see solutions.
-
Why should we believe that, just because no itemset in
the negative border is frequent in the full dataset, there is no other
itemset that is not frequent in the sample, yet is frequent in the
full dataset?
The proof is not trivial; it can be found in Toivonen's paper.
You can prove it yourself by a careful induction on the size of an itemset
S, showing that for every S, either:
1. S is frequent in the sample, or
2. S contains a member of the negative border.
(Note that 2 includes the case where S *is* a member of the negative border.)
We will leave the proof to you. But once you have that, suppose S is
frequent in the full dataset but not in the sample. Then (2) holds, so S
contains some member T of the negative border, and by monotonicity (every
subset of a frequent itemset is frequent), T is also frequent in the full
dataset.
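For reference, here is one way to state the claim formally. The notation
(F for the sample-frequent itemsets, B for the negative border) is ours,
not necessarily Toivonen's, and the sketch below is only the skeleton of
the induction left to you.

\textbf{Claim.} Let $F$ be the family of itemsets frequent in the sample,
and let the negative border $B$ be the family of itemsets not in $F$ all of
whose immediate subsets are in $F$ (taking $\emptyset \in F$ by convention).
Then every itemset $S$ satisfies
\[
  S \in F \quad\text{or}\quad \exists\, T \in B \text{ with } T \subseteq S.
\]
\textit{Induction on $|S|$:} if $S \notin F$ and every immediate subset of
$S$ is in $F$, then $S \in B$ by definition, so take $T = S$; otherwise
some immediate subset $S' \notin F$, and the inductive hypothesis applied
to $S'$ gives $T \in B$ with $T \subseteq S' \subseteq S$.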