Solution for CS345 Final
The best ideas I found were:

Use low-support, high-correlation mining for pairs of books that have
similar sets of customers.
I suspect that even a really low similarity threshold would identify a
relatively small number of book-pairs that are related.
For instance, if two books are each bought by 100 people, and two people
buy both books, that is probably unusual and significant.
It was good to see that many people observed that with 1,000,000
columns, a min-hash scheme might require more main memory than was
available.
The best solutions suggested were an LSH scheme, or a partition of the
columns into some small number of groups.
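One way to realize the min-hash/LSH suggestion is sketched below. Everything here is an illustrative assumption, not part of the exam: `buyers` maps each book to the set of integer customer ids who bought it, the hash functions are random affine maps mod a large prime, and the choice of 20 bands of 5 rows is arbitrary.

```python
import random

def minhash_signature(customers, num_hashes=100, seed=0):
    """Compute a min-hash signature for one book's set of customer ids."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large (Mersenne) prime
    # The same random hash functions h(x) = (a*x + b) mod p are used for
    # every book, so signature positions are comparable across books.
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * c + b) % p for c in customers) for a, b in params]

def lsh_candidates(signatures, bands=20, rows=5):
    """Return pairs of books whose signatures agree on every row of some band."""
    buckets = {}
    for book, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(book)
    pairs = set()
    for books in buckets.values():
        for x in books:
            for y in books:
                if x < y:
                    pairs.add((x, y))
    return pairs
```

The band/row counts trade false positives against false negatives; a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^5)^20 under this configuration.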

Use a non-Euclidean clustering algorithm to cluster books.
The distance function has to be based on the similarity, with high
similarity = low distance.
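A minimal sketch of such a distance function, assuming Jaccard similarity on customer sets (one reasonable choice; the problem does not fix the similarity measure):

```python
def jaccard_distance(buyers_a, buyers_b):
    """Distance = 1 - |A intersect B| / |A union B|, so that high
    similarity corresponds to low distance."""
    inter = len(buyers_a & buyers_b)
    union = len(buyers_a | buyers_b)
    return 1.0 if union == 0 else 1.0 - inter / union
```

For example, books bought by customer sets {1,2,3,4} and {3,4,5,6} are at distance 1 - 2/6 = 2/3.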
If you said this much (and nothing else), you would, under my grading
system, have a score so high that there is no grade on the grade sheet
high enough.
Unfortunately, most people either said these things as one of many
possibilities or said some things that really won't work.
Common Errors

Trying to use high-support itemset (or pair) mining.
The problem is that even a support as low as 2 people buying the same
pair of books is significant, so you wind up with too many candidate
pairs no matter what you do.
A tricky point: I expected you to figure out what the typical
distribution of nonzeros in the matrix was.
You have to think about it, but the following reasoning should be
available to anyone in the class:

CS students buy an above-average number of books and are more likely
than average to buy online.

Thus, the typical customer at Nile probably bought no more books there
than you bought at any one online bookseller, such as Amazon.

I'm guessing you didn't buy more than 10 books from Amazon.
If so, then each customer generates at most C(10,2) = 45 pairs, so with
10^7 customers there are only 4.5*10^8 pairs of books that were bought
by even one customer, or about 1/1000 of all possible pairs.
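The arithmetic behind this estimate can be checked directly, assuming (as the discussion implies) about 10^7 customers, 10^6 books, and at most 10 books per customer:

```python
from math import comb

customers = 10**7
books = 10**6
books_per_customer = 10

# Each customer's basket contributes at most C(10, 2) = 45 pairs.
pairs_bought = customers * comb(books_per_customer, 2)  # 4.5 * 10^8
total_pairs = comb(books, 2)                            # about 5 * 10^11

fraction = pairs_bought / total_pairs                   # about 1/1000
```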

Thus, even 1 customer buying a pair of books can be said to relate those
books strongly.
Notice how different this situation is from Wal-Mart, where probably
every pair of their 100,000 items has appeared in some basket or other.
If you did try to find high-support pairs, say support 100, what you
would get is a few pairs of books from series (everybody seemed to use
Harry Potter books, whatever they are, as an example), plus invalid
pairs like a calculus book and a biology book that were texts required
at the same university.
Aside: After writing the above, I noticed a real example:
Amazon says (in June, 2000)
that people who bought the Ullman/Widom DB book were likely
to buy John Mitchell's programming-languages book.
Does this ``nugget'' have something to do with the fact that both were used at
Stanford?

The next big mistake was people who tried to do Euclidean clustering.
A typical example would be to treat a book as a vector of length 10^7,
with a 1 in the position of each customer who bought the book and a 0
elsewhere.
The trouble is that this distance is worse than useless for clustering
by subject.
For instance, consider two books that were each bought by only one
person (a different person for each book).
Then the distance between these books is sqrt(2), which is very close
indeed.
On the other hand, consider two books in a popular series.
Perhaps 10,000 people bought each book, and 4000 of those people bought
both.
Then the distance between the two books is sqrt(12,000), or about 110,
which puts them very far apart when they should be close.
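Both distances can be verified in a few lines; the customer sets below are made up to match the numbers in the text:

```python
from math import sqrt

def euclidean_01_distance(buyers_a, buyers_b):
    """Euclidean distance between 0/1 customer vectors: the sqrt of the
    number of customers who bought exactly one of the two books."""
    return sqrt(len(buyers_a ^ buyers_b))

# Two obscure, unrelated books, each bought by one different customer:
d_obscure = euclidean_01_distance({1}, {2})           # sqrt(2), about 1.41

# Two books in a popular series: 10,000 buyers each, 4,000 in common.
series_1 = set(range(0, 10_000))
series_2 = set(range(6_000, 16_000))                  # overlap of 4,000
d_series = euclidean_01_distance(series_1, series_2)  # sqrt(12,000), about 110
```

So Euclidean distance calls the unrelated books close and the related books far apart, which is exactly backwards for clustering by subject.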

People tried to mine episodes from the purchase history.
You might be able to do so, but you have to be careful.
Simply citing [1] is not enough, since that paper assumes a single
event sequence, and you have 10^7 sequences.
The best idea seen was to concatenate the customer histories, with an
imaginary time gap of w (the window size) between them.
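A hypothetical sketch of that concatenation, assuming each history is a time-sorted list of (time, book) events; the function name and representation are mine, not taken from [1]:

```python
def concatenate_histories(histories, w):
    """Merge per-customer event sequences into one sequence, inserting an
    imaginary gap of w time units between customers so that no window of
    size w spans purchases by two different customers."""
    merged = []
    offset = 0
    for events in histories:  # events: sorted list of (time, book)
        if not events:
            continue
        start = events[0][0]
        # Shift this customer's events so they begin at the current offset.
        merged.extend((offset + t - start, book) for t, book in events)
        offset = merged[-1][0] + w  # leave a gap of w before the next customer
    return merged
```

For example, with w = 10, histories [(5, "a"), (7, "b")] and [(0, "c")] merge into [(0, "a"), (2, "b"), (12, "c")].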

Some people figured that query flocks were important, since the
instructor was an author of the paper, but no one was able to come up
with an example of their use in the problem at hand.

Some people tried to mine sequences as in [2].
The functions typically were f(t) = the (index)
number of the book bought
by this customer at time t.
If you know signal theory, what you have defined is ``white noise.''
That is, f is a random-valued function, and its Fourier transform has
roughly equal energy in all terms.
Thus, it will be impossible to distinguish one customer's function from
another's, and if you think you have, it is an example of ignorance of
Bonferroni's principle!

There were attempts to define a directed graph and some notion of
PageRank or hubs-and-authorities.
However, I saw nothing that couldn't be replaced (with an
efficiency gain) by
counting the number of sales for each book or purchases by each
customer.
I have the feeling that there is a books-as-authorities type of mining
that could be useful, but I haven't seen anyone make a good case.
Intuitively, you seed with some obvious books in a class; find customers
who bought some of those books, find other books heavily bought by those
customers, and so on.
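That intuition could be sketched as follows; the function, the threshold for ``heavily bought,'' and the fixed number of rounds are all my own illustrative choices, not an established algorithm:

```python
def expand_from_seeds(buyers, seeds, rounds=2, book_threshold=3):
    """Grow a subject cluster from seed books: find customers who bought
    any book in the cluster, then add every book bought by at least
    `book_threshold` of those customers, and repeat.
    `buyers` maps book -> set of customer ids."""
    cluster = set(seeds)
    for _ in range(rounds):
        # Customers who bought at least one book in the current cluster.
        fans = set()
        for book in cluster:
            fans |= buyers.get(book, set())
        # Books heavily bought by those customers join the cluster.
        cluster |= {book for book, custs in buyers.items()
                    if len(custs & fans) >= book_threshold}
    return cluster
```

On a toy example, seeding with one database book pulls in a second book shared by the same customers while leaving an unrelated book out.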
A Word About Diction
I didn't try to correct every misspelling, grammar error, or diction
error.
However, I did circle ``non-referential 'this'es'' whenever I noticed
one, and several of you had quite a few.
The problem is that ``this'' is a pronoun (or an adjective).
As a pronoun, it has to stand for the previous noun.
Many people use ``this'' or ``that'' to stand for ``one of the points I
just made; you know what I mean.''
The trouble is that your reader doesn't know what you mean in all
cases.
Often, a non-referential ``this'' hides the fact that you are not sure
yourself why something is true.
Thus, while I didn't deduct anything for this type of prose, I strongly
recommend that you get non-referential ``this''es out of your writing in
the future.
You'll be amazed at how much doing so
sharpens your writing and forces you to
be specific when dealing with tricky points!