Solution for CS345 Final
The best ideas I found were:

Use low-support, high-correlation mining for pairs of books that have
similar sets of customers.
I suspect that even a really low similarity threshold would identify a
relatively small number of book-pairs that are related.
For instance, if two books are each bought by 100 people, and two people
buy both books, that is probably unusual and significant.
It was good to see that many people observed that with 1,000,000
columns, a min-hash scheme might require more main memory than was
available.
The best solutions suggested were an LSH scheme, or a partition of the
columns into some small number of groups.
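One way to realize the min-hash/LSH suggestion is sketched below. Everything here is an illustrative assumption, not part of the exam: `buyers` maps each book to the set of integer customer ids who bought it, the hash functions are random affine maps mod a large prime, and the choice of 20 bands of 5 rows is arbitrary.

```python
import random

def minhash_signature(customers, num_hashes=100, seed=0):
    """Compute a min-hash signature for one book's set of customer ids."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large (Mersenne) prime
    # The same random hash functions h(x) = (a*x + b) mod p are used for
    # every book, so signature positions are comparable across books.
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * c + b) % p for c in customers) for a, b in params]

def lsh_candidates(signatures, bands=20, rows=5):
    """Return pairs of books whose signatures agree on every row of some band."""
    buckets = {}
    for book, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(book)
    pairs = set()
    for books in buckets.values():
        for x in books:
            for y in books:
                if x < y:
                    pairs.add((x, y))
    return pairs
```

The band/row counts trade false positives against false negatives; a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^5)^20 under this configuration.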

Use a non-Euclidean clustering algorithm to cluster books.
The distance function has to be based on the similarity, with high
similarity = low distance.
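A minimal sketch of such a distance function, assuming Jaccard similarity on customer sets (one reasonable choice; the problem does not fix the similarity measure):

```python
def jaccard_distance(buyers_a, buyers_b):
    """Distance = 1 - |A intersect B| / |A union B|, so that high
    similarity corresponds to low distance."""
    inter = len(buyers_a & buyers_b)
    union = len(buyers_a | buyers_b)
    return 1.0 if union == 0 else 1.0 - inter / union
```

For example, books bought by customer sets {1,2,3,4} and {3,4,5,6} are at distance 1 - 2/6 = 2/3.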
If you said this much (and nothing else), you would, under my grading
system, have a score so high that there is no grade on the grade sheet
high enough.
Unfortunately, most people either said these things as one of many
possibilities or said some things that really won't work.
Common Errors

Trying to use high-support itemset (or pair) mining.
The problem is that even a support as low as 2 people buying the same
pair of books is significant, so you wind up with too many candidate
pairs no matter what you do.
A tricky point: I expected you to figure out what the typical
distribution of nonzeros in the matrix was.
You have to think about it, but the following reasoning should be
available to anyone in the class:

CS students buy an above-average number of books and are more likely
than average to buy online.

Thus, the typical customer at Nile probably bought no more books there
than you bought at any one online bookseller, such as Amazon.

I'm guessing you didn't buy more than 10 books from Amazon.
If so, then each customer generates at most C(10,2) = 45 pairs, so with
10^7 customers there are only 4.5*10^8 pairs of books that were bought
by even one customer, or about 1/1000 of all possible pairs.
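The arithmetic behind this estimate can be checked directly, assuming (as the discussion implies) about 10^7 customers, 10^6 books, and at most 10 books per customer:

```python
from math import comb

customers = 10**7
books = 10**6
books_per_customer = 10

# Each customer's basket contributes at most C(10, 2) = 45 pairs.
pairs_bought = customers * comb(books_per_customer, 2)  # 4.5 * 10^8
total_pairs = comb(books, 2)                            # about 5 * 10^11

fraction = pairs_bought / total_pairs                   # about 1/1000
```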

Thus, even 1 customer buying a pair of books can be said to relate those
books strongly.
Notice how different this situation is from Wal-Mart, where probably
every pair of their 100,000 items has appeared in some basket or other.
If you did try to find high-support pairs, say support 100, what you
would get is a few pairs of books from series (everybody seemed to use
Harry Potter books, whatever they are, as an example), plus invalid
pairs like a calculus book and a biology book that were texts required
at the same university.
Aside: After writing the above, I noticed a real example:
Amazon says (in June, 2000)
that people who bought the Ullman/Widom DB book were likely
to buy John Mitchell's programming-languages book.
Does this ``nugget'' have something to do with the fact that both were used at
Stanford?

The next big mistake was people who tried to do Euclidean clustering.
A typical example would be to treat a book as a vector of length 10^7,
with a 1 in the position of each customer who bought the book and a 0
elsewhere.
The trouble is that this distance is worse than useless for clustering
by subject.
For instance, consider two books that were each bought by only one
person (a different person for each book).
Then the distance between these books is sqrt(2), which is very close
indeed.
On the other hand, consider two books in a popular series.
Perhaps 10,000 people bought each book, and 4000 of those people bought
both.
Then the distance between the two books is sqrt(12,000), or about 110,
which puts them very far apart when they should be close.
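Both distances can be verified in a few lines; the customer sets below are made up to match the numbers in the text:

```python
from math import sqrt

def euclidean_01_distance(buyers_a, buyers_b):
    """Euclidean distance between 0/1 customer vectors: the sqrt of the
    number of customers who bought exactly one of the two books."""
    return sqrt(len(buyers_a ^ buyers_b))

# Two obscure, unrelated books, each bought by one different customer:
d_obscure = euclidean_01_distance({1}, {2})           # sqrt(2), about 1.41

# Two books in a popular series: 10,000 buyers each, 4,000 in common.
series_1 = set(range(0, 10_000))
series_2 = set(range(6_000, 16_000))                  # overlap of 4,000
d_series = euclidean_01_distance(series_1, series_2)  # sqrt(12,000), about 110
```

So Euclidean distance calls the unrelated books close and the related books far apart, which is exactly backwards for clustering by subject.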

People tried to mine episodes from the purchase history.
You might be able to do so, but you have to be careful.
Simply citing [1] is not enough, since that paper assumes a single
event sequence, and you have 10^7 sequences.
The best idea seen was to concatenate the customer histories, with an
imaginary time gap of w (the window size) between them.
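A hypothetical sketch of that concatenation, assuming each history is a time-sorted list of (time, book) events; the function name and representation are mine, not taken from [1]:

```python
def concatenate_histories(histories, w):
    """Merge per-customer event sequences into one sequence, inserting an
    imaginary gap of w time units between customers so that no window of
    size w spans purchases by two different customers."""
    merged = []
    offset = 0
    for events in histories:  # events: sorted list of (time, book)
        if not events:
            continue
        start = events[0][0]
        # Shift this customer's events so they begin at the current offset.
        merged.extend((offset + t - start, book) for t, book in events)
        offset = merged[-1][0] + w  # leave a gap of w before the next customer
    return merged
```

For example, with w = 10, histories [(5, "a"), (7, "b")] and [(0, "c")] merge into [(0, "a"), (2, "b"), (12, "c")].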

Some people figured that query flocks were important, since the
instructor was an author of the paper, but no one was able to come up
with an example of their use in the problem at hand.

Some people tried to mine sequences as in [2].
The functions typically were f(t) = the (index)
number of the book bought
by this customer at time t.
If you know signal theory, what you have defined is ``white noise.''
That is, f is a random-valued function, and its Fourier transform has
roughly equal energy in all terms.
Thus, it will be impossible to distinguish one customer's function from
another's, and if you think you have, it is an example of ignorance of
Bonferroni's principle!

There were attempts to define a directed graph and some notion of
PageRank or hubs-and-authorities.
However, I saw nothing that couldn't be replaced (with an
efficiency gain) by
counting the number of sales for each book or purchases by each
customer.
I have the feeling that there is a books-as-authorities type of mining
that could be useful, but I haven't seen anyone make a good case.
Intuitively, you seed with some obvious books in a class; find customers
who bought some of those books, find other books heavily bought by those
customers, and so on.
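That intuition could be sketched as follows; the function, the threshold for ``heavily bought,'' and the fixed number of rounds are all my own illustrative choices, not an established algorithm:

```python
def expand_from_seeds(buyers, seeds, rounds=2, book_threshold=3):
    """Grow a subject cluster from seed books: find customers who bought
    any book in the cluster, then add every book bought by at least
    `book_threshold` of those customers, and repeat.
    `buyers` maps book -> set of customer ids."""
    cluster = set(seeds)
    for _ in range(rounds):
        # Customers who bought at least one book in the current cluster.
        fans = set()
        for book in cluster:
            fans |= buyers.get(book, set())
        # Books heavily bought by those customers join the cluster.
        cluster |= {book for book, custs in buyers.items()
                    if len(custs & fans) >= book_threshold}
    return cluster
```

On a toy example, seeding with one database book pulls in a second book shared by the same customers while leaving an unrelated book out.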
A Word About Diction
I didn't try to correct every misspelling, grammar error, or diction
error.
However, I did circle ``non-referential 'this'es'' whenever I noticed
one, and several of you had quite a few.
The problem is that ``this'' is a pronoun (or an adjective).
As a pronoun, it has to stand for the previous noun.
Many people use ``this'' or ``that'' to stand for ``one of the points I
just made; you know what I mean.''
The trouble is that your reader doesn't know what you mean in all
cases.
Often, a non-referential ``this'' hides the fact that you are not sure
yourself why something is true.
Thus, while I didn't deduct anything for this type of prose, I strongly
recommend that you get non-referential ``this''es out of your writing in
the future.
You'll be amazed at how much doing so
sharpens your writing and forces you to
be specific when dealing with tricky points!