CS345A, Winter 2009: Data Mining
Course Information
NEW ROOM (changed again): 200-002. This is the big auditorium in the basement of the History Corner.
It seats 163, so there should be plenty of room for us to spread out.
Instructors: Anand Rajaraman (anand @ kosmix dt com), Jeffrey D. Ullman (ullman @ gmail dt com).
TA: Anish Johnson (ajohna @ stanford dt edu).
Staff Mailing List:
cs345a-win0809-staff@mailman.stanford.edu
Meeting: MW 4:15 - 5:30PM; Room: History Corner basement 200-002.
Office Hours:
Anand Rajaraman: MW 5:30-6:30PM (after class, in the same room).
Jeff Ullman: 2-4PM on the days he teaches, in 433 Gates.
Anish Johnson (TA): Tuesdays 9:15-10:45AM in B26A Gates; Thursdays 1-3PM in B24B Gates.
Prerequisites: CS145 or equivalent.
Materials: There is no required text. However, if you have the
second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman,
Widom), you will find Section 20.2 and Chapters 22 and 23 relevant.
Slides from the lectures will be made available in PPT and PDF formats.
Students will use the Gradiance automated homework system, for which a fee is charged.
Note: if you already have Gradiance (GOAL) privileges from CS145 or CS245 within the past year,
you should also have access to the CS345A homework without paying an additional fee.
Notes and/or slides will be posted on-line.
You can see earlier versions of the notes and slides covering Data Mining;
not all of those topics will be covered this year.
Requirements: There will be periodic homeworks (some on-line, using
the Gradiance system), a final exam,
and a project on web-mining.
The homework will count just enough to encourage you to do it: about
20% of the grade.
The project and final will account for the bulk of the credit, in
roughly equal proportions.
Handouts
Date | Topic | PowerPoint Slides | PDF Document |
1/7 | Introductory Remarks (JDU) | PPT | PDF |
1/7 | Introductory Remarks (AR) | PPT | PDF |
1/12 | Map-Reduce | PPT | PDF |
1/14 | Frequent Itemsets 1 | PPT | PDF |
1/14-1/21 | Frequent Itemsets 2 | PPT | PDF |
1/16 | Peter Pawlowski's Talk on Aster Data | PPTX | PDF |
1/16 | Nanda Kishore's Talk on ShareThis | PPT | PDF |
1/26 | Recommendation Systems | PPT | PDF |
1/28 | Shingling, Minhashing, Locality-Sensitive Hashing | PPT | PDF |
2/2 | Applications and Variants of LSH | PPT | PDF |
2/2-2/4 | Distance Measures, Generalizations of Minhashing and LSH | PPT | PDF |
2/4 | High-Similarity Algorithms | PPT | PDF |
2/9 | PageRank | PPT | PDF |
2/11 | Link Spam, Hubs & Authorities | PPT | PDF |
2/18 | Generalization of Map-Reduce | PPT | PDF |
2/18-2/23 | Clustering | PPT | PDF |
2/23 | Streaming Data | PPT | PDF |
2/25 | Relation Extraction | PPT | PDF |
3/2 | On-Line Algorithms, Advertising Optimization | PPT | PDF |
3/4 | Algorithms on Streams | PPT | PDF |
Assignments
There will be assignments of two kinds.
Gradiance Assignments
Some of the homework will be on the Gradiance system.
You should go there to open your account, and enter the class token 83769DC9.
If you have taken CS145 or CS245 within the past year, your account for
that class should grant you free access for CS345A. If not, you will have
to purchase access on-line.
Note: If you have to purchase access, use either
Garcia-Widom-Ullman, 2nd Edition or Ullman-Widom 3rd Edition (the books
used for 145 and 245). Do not purchase access to the Tan-Steinbach-Kumar
materials, even though the title is "Data Mining."
You can try
the work as many times as you like, and we hope everyone will eventually
get 100%. The secret is that each of the questions is based on a
"long-answer" problem, which you should work through. Each time you open a
question, the Gradiance system presents a different random selection of
correct and incorrect answer choices, and thus samples
your knowledge of the full problem. While there are ways to game the
system, we group several questions at a time, so it is hard to get 100%
without actually working the problems. Also note that you must
wait 10 minutes between openings, so brute-force random guessing will
not work.
Solutions appear after the problem set is due. However, you must submit at
least once, since the solutions are shown embedded in your most recent
submission.
Challenge Problems
These are more complex problems for which written solutions are requested.
They will be "lightly graded," meaning that we shall accept any
reasonable attempt, and those doing exceptionally well will get "extra
credit," but there will not be exact numerical grades assigned.
Assignment | Due Date |
Gradiance HW #1 | Monday, January 26 (11:59PM) |
Challenge Problems #1 (Solution) | Wednesday, January 28 (in class) |
Gradiance HW #2 | Wednesday, January 28 (11:59PM) |
Challenge Problems #2 (Solution) | Wednesday, February 4 (in class) |
Gradiance HW #3 | Wednesday, February 4 (11:59PM) |
Project Proposal | Monday, February 9 (11:59PM) |
Gradiance HW #4 | Wednesday, February 11 (11:59PM) |
Gradiance HW #5 | Wednesday, February 18 (11:59PM) |
Gradiance HW #6 | Monday, March 9 (11:59PM) |
Gradiance HW #7 | Wednesday, March 11 (11:59PM) |
Final Project Report | Wednesday, March 11 (in class) |
Project
CS345A Project specification:
- Overview: a software project that discovers or leverages interesting relationships within a
significant amount of data. It is best if the project applies what we have learned in class.
- A project proposal must be submitted by midnight, Feb. 9.
- Access to Aster nCluster resources is now available.
- ShareThis datasets have been loaded onto both nClusters; look in the 'sharethis' database. Details on the database format are here.
- Some project ideas (these are merely suggestions and should by no means restrict your imagination):
- Implement an anti-spam algorithm (e.g., TrustRank) on a collection of web pages.
- Implement a better version of topic-sensitive PageRank on a collection of web pages (by "better,"
we mean "incorporating your own ideas"); a minimal PageRank sketch appears at the end of this project specification.
- Implement a collaborative-filtering technique on basket/item data (from eBay or Amazon,
for instance).
- Tell us something useful about a collection of documents (Web pages, news articles, reviews, blogs, etc.).
Possible goals include identifying sentiment (is a review positive or negative?),
telling wise blogs from foolish ones, or telling real news from publicity releases.
- Take a shot at the $1,000,000 Netflix Prize.
- Be sure to look under Resources to see what data sets are available.
- Deliverables:
- Final project writeup (5-10 pages) due Wednesday, March 11, in class.
This is a comprehensive description of your project. You should include the following:
- Project idea
- Your specific implementation
- Key results and metrics of your system
- What worked, what did not work, what surprised you, and why
- Final Presentation:
Each team presents their project to the rest of
the class. The presentation should have no more than 3 slides (not counting title/reference slides, etc.)
and last no more than 5 minutes; 3 minutes will be allotted for questions.
There are three sessions, at 4:15PM on Weds.-Fri., March 11-13.
We'll announce a way for people to pick slots shortly, but if you have absolute constraints, you should
contact the staff list now.
- Demo with Course Staff:
If you would like more time to demo your work, you can optionally schedule time
with the course staff during dead week and finals week.
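For those considering the link-analysis project ideas above, here is a minimal sketch of PageRank computed by power iteration, in Python. TrustRank and topic-sensitive PageRank are both variants that bias the teleport distribution. The four-page graph, the choice beta = 0.85, and all variable names below are made up for illustration; this is a sketch of the general technique, not a prescribed starting point for your project.

    # Minimal PageRank by power iteration (illustrative sketch only).
    # The four-page "web" below is made up; a real project would build
    # its link graph from a crawl, e.g., the Stanford WebBase data.

    beta = 0.85  # probability of following a link; 1 - beta = teleport

    # links[p] = list of pages that page p links to.
    # (Every page here has out-links, so no dead-end handling is needed.)
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    pages = sorted(links)
    n = len(pages)

    # Uniform teleport distribution; concentrating it on a trusted or
    # topic-specific set of pages gives TrustRank or topic-sensitive
    # PageRank, respectively.
    teleport = {p: 1.0 / n for p in pages}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(50):  # iterate until (approximately) converged
        new_rank = {p: (1.0 - beta) * teleport[p] for p in pages}
        for p in pages:
            share = beta * rank[p] / len(links[p])
            for q in links[p]:
                new_rank[q] += share
        rank = new_rank

    for p in pages:
        print(p, round(rank[p], 4))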
Course Outline
Here is a tentative schedule of topics:
Date | Topic | Lecturer |
1/7 | Introduction | JDU, AR |
1/12 | Map-Reduce | AR |
1/14 | Frequent Itemsets | JDU |
1/16 | Special Lecture on Aster/Map-Reduce, ShareThis | 5:15PM in B12 Gates |
1/21 | Frequent Itemsets | JDU |
1/26 | Recommendation Systems | AR |
1/28 | Similarity Search | JDU |
2/2 | Similarity Search | JDU |
2/4 | Similarity Search | JDU |
2/9 | Link Analysis | AR |
2/11 | Spam Detection | AR |
2/18 | Generalizing Map-Reduce; Clustering | JDU, AJA |
2/23 | Clustering, Streaming Data | JDU |
2/25 | Extracting Structured Data from the Web | AR |
3/2 | Advertising on the Web | AR |
3/4 | Stream Mining | JDU |
3/9 | Stream Sampling | DEK |
3/11 | Project Reports | students |
3/12 | Project Reports | students; Rm. 260-012 |
3/13 | Project Reports | students; Rm. 260-012 |
3/19 | Final Exam | 12:15-3:15PM, Rm. 200-002 (regular classroom) |
References and Resources
- Last Year's Final. Note that the subject matter is somewhat different this year, so you
should not assume the coverage on this year's final will be exactly the same. It will, however, cover all material in
the course up to and including the 3/4 lecture.
- References. In this file we'll put citations or links to papers that
you may wish to read to learn more about certain topics covered in the class. They are not required reading.
- Yahoo! Catalog of data sets available.
Note: Jeff Ullman will have to apply for any sets you want, and we must agree not to distribute them further.
There may be a delay, so get requests in early.
- Yannis Antonellis and Jawed Karim offer a file that contains information about the
search queries that were used to reach pages on the Stanford Web server.
The format is
visitor_hash | timestamp | requested_url | referer_from_a_search_engine |
E.g.,
a997c1950718d75c03f22ca8715e50b3 | [28/Feb/2007:23:45:47 -0800] | /group/svsa/cgi-bin/www/officers.php | "http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts" |
See http://www.stanford.edu/~antonell/tags_dataset.html
for more information about how to get and use this file; a small parsing sketch follows.
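As a quick illustration of working with this file, here is a minimal Python sketch that parses one line of the format above. The '|' delimiter and the whitespace/quote stripping are assumptions based on the example record; verify them against the actual file.

    # Hypothetical parser for the query-log format described above:
    # visitor_hash | timestamp | requested_url | referer_from_a_search_engine |
    def parse_log_line(line):
        fields = [f.strip() for f in line.split("|")]
        visitor_hash, timestamp, url, referer = fields[:4]
        return {
            "visitor": visitor_hash,
            "timestamp": timestamp,
            "url": url,
            "referer": referer.strip('"'),  # referer is quoted in the file
        }

    # Abbreviated version of the example record shown above.
    example = ('a997c1950718d75c03f22ca8715e50b3 | [28/Feb/2007:23:45:47 -0800] | '
               '/group/svsa/cgi-bin/www/officers.php | '
               '"http://www.google.com/search?q=sexy+random+facts" |')
    print(parse_log_line(example))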
- The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized
crawl if you have a need. Find the description here; find how to access web pages in the repository here.
- Here's a data set that might be interesting to some of you as you think
about your project, e.g., for news clustering, identifying trends in news
stories, etc. There is a nominal fee to get the DVD with the data, but
if someone is really interested, I'm sure we could arrange to make it
available:
http://open.blogs.nytimes.com/2009/01/12/fatten-up-your-corpus/
Excerpt:
Available for noncommercial research license from The Linguistic Data
Consortium (LDC), the corpus spans 20 years of newspapers between 1987
and 2007 (that's 7,475 issues, to be exact). This collection includes
the text of 1.8 million articles written at The Times (for wire service
articles, you'll have to look elsewhere). Of these, more than 1.5
million have been manually annotated by The New York Times Index with
distinct tags for people, places, topics and organizations drawn from a
controlled vocabulary. A further 650,000 articles also include summaries
written by indexers from the New York Times Index. The corpus is
provided as a collection of XML documents in the News Industry Text
Format and includes open source Java tools for parsing documents into
memory resident objects.
- A former CS345A student and the TA from last year have started a company, Cellixis,
to build a cellphone-based advisor. They offer two data sets that might be of interest; both are based on
restaurant reviews:
- A corpus of restaurants and reviews (100+ thousand restaurants; the text of the reviews can be tagged by part of speech).
They are interested, for example, in knowing the keywords or key phrases (consecutive words) that best characterize
different kinds of restaurants. As a baseline for word occurrence, they can also provide
a sample corpus of the web (10+ million pages) and average single-word statistics over that corpus.
- A training set of (user id, restaurant id, rating) tuples.
They can also provide a corpus of restaurant info and reviews (in case a model-based approach is used).
This data can be used in a manner similar to the Netflix data, though they are not offering $1M for a
good solution, and you would have to excise a small portion of the data yourself to measure your performance,
whereas Netflix retains the test data itself; a minimal sketch of such a held-out split follows this item.
If you are interested in obtaining either of these data sets, they can be emailed; contact love-cs345 at cellixis dt com.
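Since you must hold out part of this data yourself to measure performance, here is a minimal Python sketch of such a held-out split, scored with a trivial baseline predictor. The sample tuples, the 90/10 ratio, and the global-mean baseline are all arbitrary illustrative choices, not part of the Cellixis data.

    # Minimal held-out split for (user id, restaurant id, rating) tuples.
    import random

    ratings = [
        # (user_id, restaurant_id, rating) -- made-up sample rows
        (1, 101, 4.0), (1, 102, 2.5), (2, 101, 5.0),
        (2, 103, 3.0), (3, 102, 4.5), (3, 103, 1.0),
    ]

    random.seed(345)                 # reproducible split
    random.shuffle(ratings)
    cut = int(0.9 * len(ratings))    # 90% train / 10% test, arbitrarily
    train, test = ratings[:cut], ratings[cut:]

    # Score a trivial predictor (the global mean rating) by RMSE on the
    # held-out set; a real recommender should beat this baseline.
    global_mean = sum(r for _, _, r in train) / len(train)
    rmse = (sum((r - global_mean) ** 2 for _, _, r in test) / len(test)) ** 0.5
    print("train:", len(train), "test:", len(test), "baseline RMSE:", round(rmse, 3))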
- ACM has just issued its Multimedia Grand Challenge(s). Many of these involve images
in a way that we don't have the resources to deal with in the next month, but you might want to read the material
to see if anything looks doable and interesting.