"Too Many Jelly Beans?"  

Tag Hierarchies

Background

Recently I have been interested in creating hierarchical taxonomies from flat tag data. Tagging systems like del.icio.us, Flickr, and CiteULike tend to have (relatively) flat tags. This means that while one can easily browse by a tag, like photography, one cannot as easily see which tags are broader or narrower than that tag. As a result, it is also difficult to get a broad overview of what tags exist in these systems, aside from frequency-based displays like tag clouds.

Some commentators have suggested that ontology is overrated, even irrelevant, and that there is no hierarchy in ideas, only links:

'Just Links' image courtesy of Clay Shirky.

This may be overstating the point a little. While many different hierarchies can often be created for any given set of data, hierarchies are clearly useful for one major type of information retrieval task: browsing. When we do not know exactly what we are looking for, it is much easier to broaden and narrow our area of interest than to perform some sort of random walk from idea to idea. The top few categories of a traditional hierarchy give us a much better idea of the contents of a media collection than thousands of individual tags, even if these tags are ranked by their frequency in the collection.

Tagging systems are excellent at the task they were designed for: allowing a large, disparate group of users to collaboratively label massive, dynamic information systems like the web, media collections of millions of images, and so on. We are working to make these systems better by automatically producing hierarchical taxonomies that describe the data from the raw flat tags generated by users.

I've found some interesting features of tagging datasets from del.icio.us and CiteULike which have in turn suggested reasonably good ways to create hierarchies. An example hierarchy generated using some of these methods from del.icio.us is here: mgfgsm-hierarchy.

Papers

Title: Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems
Authors: Paul Heymann and Hector Garcia-Molina
Type: Preliminary Technical Report
Accessible: (info) (ps) (pdf)
Description: This paper describes a simple algorithm for constructing hierarchies in social tagging systems that usually works reasonably well. The main contribution is a notion of generality in social tagging systems based on centrality in a similarity graph.
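
To make the idea of "generality as centrality in a similarity graph" more concrete, here is a minimal sketch in Python. It is not the algorithm from the report. It assumes cosine similarity between tags (based on the items they annotate), plain degree centrality as the generality score, a single hypothetical `edge_threshold` parameter, and a greedy placement step. All function and parameter names are made up for illustration.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_hierarchy(annotations, edge_threshold=0.3):
    """Return {tag: parent_tag}; top-level tags map to None.

    annotations: iterable of (tag, item) pairs.
    edge_threshold: assumed cutoff; two tags are neighbors in the
    similarity graph when their cosine similarity reaches it.
    """
    # One vector per tag: how often the tag was applied to each item.
    vectors = defaultdict(lambda: defaultdict(int))
    for tag, item in annotations:
        vectors[tag][item] += 1
    tags = sorted(vectors)

    sim = {(s, t): cosine(vectors[s], vectors[t])
           for s in tags for t in tags if s != t}

    # Assumed generality score: degree centrality in the thresholded graph.
    centrality = {t: sum(1 for s in tags if s != t and sim[(s, t)] >= edge_threshold)
                  for t in tags}

    # Place the most central (most general) tags first; break ties by name.
    parent = {}
    for tag in sorted(tags, key=lambda t: (-centrality[t], t)):
        best = max(parent, key=lambda p: sim[(tag, p)], default=None)
        if best is not None and sim[(tag, best)] >= edge_threshold:
            parent[tag] = best   # attach under the most similar placed tag
        else:
            parent[tag] = None   # nothing similar enough: new top-level tag
    return parent


# Toy data, invented for this example.
example = [
    ("photography", "a"), ("photography", "b"),
    ("photography", "c"), ("photography", "d"),
    ("camera", "a"), ("camera", "b"),
    ("portrait", "c"), ("portrait", "d"),
    ("python", "e"), ("python", "f"),
    ("django", "e"), ("flask", "f"),
]
hierarchy = build_hierarchy(example)
```

On this toy data, photography and python end up as top-level tags, with camera and portrait under photography and django and flask under python. Degree centrality is used only because it is easy to compute by hand; other centrality measures could be swapped in without changing the overall shape of the sketch.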