CS345, Autumn 2006: Data Mining.



Course Information

Final Exam: The final exam will be on Wed, Dec 13th, from 12:15pm-3:15pm. Location: 380-380X (in the basement of the math corner). The final is open book and open notes, but laptops are prohibited. Bring a calculator. Here is last year's final.

The Final for 2006.

Instructors: Anand Rajaraman (anand @ kosmix dt com), Jeffrey D. Ullman (ullman @ gmail dt com).

TA: Jeff Klingner

Email Address for Questions: cs345a-aut0607-staff @ lists dt stanford dt edu (This is the best way to reach all three of us simultaneously)

Meeting: MW 3:15 - 4:30PM; Room: 200-030 (in the history corner, the part of the quad closest to Hoover Tower).

Office Hours: Instructors will be available after classes that they teach. Jeff Ullman is in 433 Gates and Anand in 413 Gates. Jeff Klingner's office hours: Tuesdays 10am-noon & Thursdays 3pm-5pm, Gates 396, or by appointment.

Prerequisites: CS145 or equivalent.

Materials: There is no text, but students will use the Gradiance automated homework system, for which a nominal fee will be charged. Notes and/or slides will be posted on-line. You can see earlier versions of the notes and slides covering Data Mining. Not all of these topics will be covered this year.

Requirements: There will be periodic homeworks (some on-line, using the Gradiance system), a final exam, and a project on web-mining, using the Stanford WebBase. The homework will count just enough to encourage you to do it, about 20%. The project and final will account for the bulk of the credit, in roughly equal proportions.

Newsgroup: There is a class newsgroup: su.class.cs345a on nntp.stanford.edu. You can use the newsgroup to share datasets, form study groups, or find project partners. The course staff will not read the newsgroup regularly, and we won't use it for any official announcements. To get in touch with us, use cs345a-aut0607-staff @ lists dt stanford dt edu.


Handouts

Date | Topic | PowerPoint Slides | PDF Document
9/25 | Introductory Remarks | PPT | PDF
9/25 | Introduction to Web Mining | PPT | PDF
9/27 | Association Rules 1 | PPT | PDF
10/2 | Association Rules 2 | PPT | PDF
10/4 | Page Rank | PPT | PDF
10/9 | Topic-Specific Page Rank | PPT | PDF
10/11 | HITS and Spam | PPT | PDF
10/16 | Near-Neighbors and Minhashing | PPT | PDF
10/18 | Locality-Sensitive Hashing | PPT | PDF
10/23 | Clustering - Part 1 | PPT | PDF
10/25 | Recommendation Systems | PPT | PDF
10/30 | Clustering - Part 2 | PPT | PDF
11/01 | Structured Data Extraction | PPT | PDF
11/06 | Virtual Databases | PPT | PDF
11/06 | Compact Skeletons | PPT | PDF
11/13 | Online Algorithms, Search Advertising | PPT | PDF
11/15 | Stream Mining 1 | PPT | PDF
11/27 | Stream Mining 2 | PPT | PDF
11/27 | Stream Mining 3 | PPT | PDF
11/29 | Stream Mining 4 | PPT | PDF

Assignments

Some of the homework will be on the Gradiance system. Go there to open your account and enter the class code that will be announced in class. You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each question is based on a "long-answer" problem, which you should work out in full. Each time you open an assignment, the Gradiance system presents a random selection of right and wrong answers, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also note that you must wait 10 minutes between openings, so brute-force random guessing will not work.

Solutions appear after the problem set is due. However, you must submit at least once, so that your most recent submission appears with the solutions embedded.

Assignment | Due Date
Association Rules #1 | Tuesday, Oct. 10 (11:59PM)
Association Rules #2 | Wednesday, Oct. 11 (11:59PM)
Page Rank | Monday, Oct. 16 (11:59PM)
Minhashing, LSH | Wednesday, Oct. 30 (11:59PM)
HITS, TSPR, Spam | Monday, Oct. 30 (11:59PM)
Project Proposal | Wednesday, Nov. 1 (11:59PM)
Distance Measures | Monday, Nov. 6 (11:59PM)
Recommendation Systems | Wednesday, Nov. 8 (11:59PM)
Clustering | Monday, Nov. 13 (11:59PM)
Stream Mining | Wednesday, Dec. 6 (11:59PM)

Project

CS345A Project specification:

Presentation Schedule

Date | Time | Presenter(s) | Project Title
12/4 | 3:15-4:00 | Greg Linden | Guest Lecture: Amazon's Recommendation Engine
12/4 | 4:00-4:10 | Abhita Chugh and Ravi Tiruvury | Detecting Web Spam with CombinedRank
12/4 | 4:10-4:20 | Rahul Thathoo and Zahid Khan | Towards Implementing Better Movie Recommendation Systems
12/4 | 4:20-4:30 | Brian Tran and Minho Kim | Topic Specific Recommendation
12/4 | 4:30-4:40 | David Reiss | Identifying terms with similar meanings across corpora
12/4 | 4:40-4:50 | NielFred Picciotto | Finding Interesting Videos Early via Trend-Setting Viewers
12/4 | 4:50-5:00 | Sean Kandel | Web Data Extraction Using Tag Trees
12/4 | 5:00-5:10 | Priyank Chodisetti | A shot at Netflix Challenge - Hybrid Recommendation System
12/6 | 3:15-3:25 | Hayato Akatsuka | Weather Mining
12/6 | 3:25-3:35 | Alex Giladi | Using LSH for motion estimation
12/6 | 3:35-3:45 | Joseph Bonneau | Sports Performance and Salary
12/6 | 3:45-3:55 | Negin Nejati | Web Mining for Extracting Relations
12/6 | 3:55-4:05 | Vincenzo Di Nicola and Jyotika Prasad | 42: A Web Based Question Answering System
12/6 | 4:05-4:15 | Manjunath Rajashekhar | Frequent Itemsets Mining in Distributed Wireless Sensor Networks
12/6 | 4:15-4:25 | Hao Liu | Clustering Based News Event Detection and Tracking
12/6 | 4:25-4:35 | Jack Cheng | Improvements on Netflix Recommendation System Using Data-mining Algorithms
12/6 | 4:35-4:45 | Arpit Aggarwal and Omkar Mate | Recommendation System for Portfolio Management
12/6 | 4:45-4:55 | Romain Colle | Near-duplicates detection: Comparison of the two algorithms seen in class
12/6 | 4:55-5:05 | Alan Sheinberg and Greg Nelson | Netflix Challenge: Combined Collaborative Filtering
12/6 | 5:05-5:15 | Fred Wulff | Course Helper: A Course Recommendation System

Course Outline

Here is a tentative schedule of topics:

Date | Topic | Lecturer
09/25 | Introduction | JDU, AR
09/27 | Association Rules | JDU
10/02 | Association Rules | JDU
10/04 | Link Analysis | AR
10/09 | Link Analysis | AR
10/11 | Spam Detection | AR
10/16 | Minhashing, Shingles | JDU
10/18 | LSH | JDU
10/23 | Clustering | JDU
10/25 | Recommendation Systems | AR
10/30 | Clustering | JDU
11/01 | Extracting Structured Data from the Web | AR
11/06 | Extracting Structured Data from the Web | AR
11/08 | Data Visualization | JK
11/13 | Advertising on the web | AR
11/15 | Stream Mining | JDU
11/27 | Stream Mining | JDU
11/29 | Stream Mining | JDU
12/04 | Project Reports | students
12/06 | Project Reports | students
12/13 | Final Exam, 12:15pm - 3:15pm

Resources and Readings