Cluster Variance Distribution

Below is a histogram of cluster variance: for each variance value, the number of clusters that have that variance. Note how the plot appears to follow a power-law distribution.
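One rough way to check the power-law impression is to fit a straight line to the histogram on log-log axes. The sketch below is only an illustration: the function names and the synthetic data are made up, and a least-squares fit on a binned histogram is a crude check, not proof of a power law.

```python
import math
from collections import Counter


def variance_histogram(variances):
    """Count how many clusters have each variance value."""
    return Counter(variances)


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(value).

    A roughly straight line with a negative slope on log-log axes is
    consistent with a power law, but does not prove one.
    """
    points = [
        (math.log(value), math.log(count))
        for value, count in histogram.items()
        if value > 0 and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two positive histogram bins")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic variances (not our data): counts fall off as value ** -2.
example = [1] * 64 + [2] * 16 + [4] * 4
hist = variance_histogram(example)
slope = loglog_slope(hist)
```

On the real cluster data, the slope would estimate the power-law exponent, and a visibly curved log-log plot would argue against a power law.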


Representative Quote Length Distribution

Below is a histogram of representative quote length: for each length, the number of clusters whose representative quote has that length. The distribution looks roughly bell-shaped with a peak at around 12 words, apart from a large spike of very short quotes.


Cluster Lifespan Distribution

Below is a histogram of cluster lifespan: for each lifespan (in days), the number of clusters that have that lifespan. Note the noise in the tail as the lifespan increases. We think this is undesirable "spam" data that stays in the cluster base for too long because it comes from a reliably popular topic (such as Adele's Rolling in the Deep). We are currently working on getting rid of the spam! Otherwise, note that the cluster lifespan follows a less dramatic power-law distribution. This makes intuitive sense; far fewer clusters should survive as they age.


Representative Quote Length Vs. Cluster Lifespan

Below is a graph showing representative quote length vs. the average lifespan of all clusters whose representative quotes are of that length. As you can see, there is no obvious correlation between representative quote length and lifespan.
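The averaging behind this graph, plus a number to back up the "no obvious correlation" reading, can be sketched as follows. The input shape (pairs of representative quote and lifespan in days), the function names, and the sample clusters are all assumptions for illustration; quote length is counted in whitespace-separated words.

```python
import math
from collections import defaultdict


def mean_lifespan_by_quote_length(clusters):
    """Average lifespan (days) per representative quote length (words).

    `clusters` is an iterable of (representative_quote, lifespan_days)
    pairs; this input shape is an assumption.
    """
    totals = defaultdict(lambda: [0.0, 0])  # length -> [lifespan sum, count]
    for quote, lifespan_days in clusters:
        length = len(quote.split())
        totals[length][0] += lifespan_days
        totals[length][1] += 1
    return {
        length: total / count
        for length, (total, count) in sorted(totals.items())
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up clusters, for illustration only.
sample = [
    ("hello", 2),
    ("yes we can", 4),
    ("no we cannot", 6),
    ("to be or not to be", 3),
]
averages = mean_lifespan_by_quote_length(sample)
r = pearson(list(averages.keys()), list(averages.values()))
```

A coefficient close to 0 on the real data would support the visual impression; note that it only measures linear association.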


Representative Quote Length Vs. Cluster Variance

Below is a graph showing representative quote length vs. cluster variance (i.e. the number of unique quotes in a cluster). As you can see, the variance seems to grow slightly, and roughly linearly, with quote length: the longer the quote, the higher its potential variance! This makes intuitive sense because news sources would often prefer to cut a long quote into shorter variants. The title and axis labels on this graph are wrong; please ignore them. :)

Cluster Lifespan Vs. Average Cluster Size

Below is a graph showing cluster lifespan vs. average cluster size (i.e. number of source mentions). Popularity seems to grow linearly with lifespan: clusters that live longer are mentioned by more sources! This makes intuitive sense as well, because a more popular cluster would generally be able to live longer and die more slowly. We do find it intriguing that the relationship is linear.
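To put a number on how linear the relationship is, one option is an ordinary least-squares line together with its R^2. This is a minimal sketch under assumptions: `fit_line` is a hypothetical helper, and the lifespans and sizes below are made up rather than taken from our clusters.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of the variation in ys that the fitted line explains.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up numbers: lifespan in days vs. average cluster size (mentions).
lifespans = [1, 2, 3, 4, 5]
avg_sizes = [12.0, 19.0, 31.0, 42.0, 48.0]
slope, intercept, r_squared = fit_line(lifespans, avg_sizes)
```

An R^2 near 1 over the observed range would be consistent with a linear relationship; systematic curvature in the residuals would suggest otherwise.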

Next Steps

We will try to get more detailed data next week, such as a fine-grained view of when clusters are born, when they peak, and when they are killed off. This way we can get a better sense of whether shorter or longer quotes become more popular, show up first, etc., as suggested. We can then also analyze the most popular quote in each cluster as opposed to its representative quote. Please feel free to suggest anything else we might look into!