CS345A, Winter 2007-8: Data Mining.


Course Information

Instructors: Anand Rajaraman (anand @ kosmix dt com), Jeffrey D. Ullman (ullman @ gmail dt com).

TA: Babak Pahlavan

Meeting: MW 4:15-5:30PM; Room: Math corner basement 380-380X.

Office Hours:
Anand Rajaraman: MW 5:30-6:30pm (after the class in the same room)
Jeff Ullman: 2-4PM on the days I teach, in 433 Gates.
Babak Pahlavan (TA): 9:30AM-12:30PM on Wednesdays in Gates Room #24B.

Prerequisites: CS145 or equivalent.

Materials: There is no text, but students will use the Gradiance automated homework system for which a nominal fee will be charged. Notes and/or slides will be posted on-line. We will also distribute some notes that will become part of the next edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom). You can see earlier versions of the notes and slides covering Data Mining. Not all these topics will be covered this year.

Requirements: There will be periodic homeworks (some on-line, using the Gradiance system), a final exam, and a project on web-mining, using the Stanford WebBase. The homework will count just enough to encourage you to do it, about 20%. The project and final will account for the bulk of the credit, in roughly equal proportions.


Handouts

Date     Topic                             PowerPoint Slides  PDF Document
1/9      Introductory Remarks (JDU)        PPT                PDF
1/9      Introductory Remarks (AR)         PPT                PDF
1/14     Association Rules I (JDU)         PPT                PDF
1/14-16  Association Rules II (JDU)        PPT                PDF
1/16-23  Map-Reduce (AR)                   PPT                PDF
1/23-28  PageRank (AR)                     PPT                PDF
1/28     HITS and Spam (AR)                PPT                PDF
2/4      Shingling, Minhashing (JDU)       PPT                PDF
2/6      Locality-Sensitive Hashing (JDU)  PPT                PDF
2/11     Recommendation Systems (AR)       PPT                PDF
2/13     Clustering I (JDU)                PPT                PDF
2/20     Clustering II (JDU)               PPT                PDF
2/25     Relation Extraction (AR)          PPT                PDF
2/27     Advertising (AR)                  PPT                PDF
3/3      Stream Mining I (JDU)             PPT                PDF
3/5      Stream Mining II (JDU)            PPT                PDF

Assignments

Some of the homework will be on the Gradiance system. Go there to open your account and enter the class code, which will be announced in class. You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each question is based on a "long-answer" problem, which you should work out in full. Each time you open a homework, the Gradiance system presents a random selection of right and wrong answers, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also note that you must wait 10 minutes between openings, so brute-force random guessing will not work.

Solutions appear after the problem set is due. However, you must submit at least once; the solutions are then shown embedded in your most recent submission.


Project

CS345A Project specification: