
Independent Study Opportunities

with Andreas Paepcke

ArcSpread:

Make deep investigations into massive Web archives no harder than setting up a spreadsheet

Goal

Future historians will be able to answer the following questions: Political scientists will discover trends like: Sociologists will answer our musings on:

Approach

Information for the questions in the box on the right lies buried in growing Web archives, like the Stanford InfoLab's WebBase, the Internet Archive, and collections at the Library of Congress.

None of these archives is nearly as usable as it should be, either for the public or for researchers. We are developing tools to remedy this shortcoming.

ArcSpread is the core of one of our tool sets.

Spreadsheet example

ArcSpread's spreadsheet style will be one of the interaction experiences for exploring the archives. Formulas will trigger Hadoop computations over entire crawls. Cells will fill with sets of result pages, links, page titles, or word sets. Specialized browsers will then allow exploration of the result cells. A word browser will specialize in visualization and statistical analysis of word usage. Link set tools will expose the comings and goings of links. Summarization tools will enable exploration of entire sites.
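The formula-and-cell idea above can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the formula names PAGES and WORDS, the in-memory CRAWL dictionary, and the evaluate helper are hypothetical, and a real system would hand the work to Hadoop rather than loop over a few strings.

```python
import re
from collections import Counter

# Hypothetical stand-in for one crawl: URL -> page text.
CRAWL = {
    "http://example.gov/a": "Budget hearing on energy policy",
    "http://example.gov/b": "Energy report and energy budget",
    "http://example.org/c": "Local sports news",
}

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def pages_containing(crawl, term):
    """Return the set of URLs whose text contains the word `term`."""
    term = term.lower()
    return {url for url, text in crawl.items() if term in tokenize(text)}

def word_counts(crawl, urls):
    """Count word occurrences across the given set of pages."""
    counts = Counter()
    for url in urls:
        counts.update(tokenize(crawl[url]))
    return counts

# Hypothetical formula names mapped to the functions that compute them.
FORMULAS = {"PAGES": pages_containing, "WORDS": word_counts}

def evaluate(sheet, cell):
    """Evaluate a cell; an argument naming another cell is evaluated first."""
    name, arg = sheet[cell]
    if arg in sheet:
        arg = evaluate(sheet, arg)
    return FORMULAS[name](CRAWL, arg)

# A1 holds a set of result pages; B1 holds word counts over the pages in A1.
sheet = {"A1": ("PAGES", "energy"), "B1": ("WORDS", "A1")}
pages = evaluate(sheet, "A1")
words = evaluate(sheet, "B1")
```

Here cell A1 fills with a set of result pages and cell B1 with word counts derived from it, which is the kind of result cell a word browser could then explore.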

An evolving three-tiered architecture offers room for lots of student initiative and ingenuity.

Architecture

Projects range from systems in the 'machine room' tier, through user experience development at the spreadsheet and browser level, up to visualization of data over time. Techniques from all across Computer Science could be applied: Natural language processing can summarize. Machine learning can categorize, or offer regressions for explaining and predicting. Parallel algorithms can enable rapid, on-demand time series analysis, and novel visualizations can make results immediately ready for intelligent interpretation.
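As one small illustration of time series analysis over an archive, the sketch below tracks how large a share of all words a single term takes up in each crawl. The CRAWLS data and the term_frequency_series function are hypothetical, and the loop runs sequentially; it only shows the shape of the computation that a parallel job would perform per crawl.

```python
import re

# Hypothetical data: crawl date -> texts of the pages captured in that crawl.
CRAWLS = {
    "2008-01": ["energy policy debate", "sports news"],
    "2008-02": ["energy energy prices", "energy budget"],
    "2008-03": ["budget hearing"],
}

def term_frequency_series(crawls, term):
    """Return (date, share of all words equal to `term`) in date order."""
    term = term.lower()
    series = []
    for date in sorted(crawls):
        words = [
            word
            for text in crawls[date]
            for word in re.findall(r"[a-z]+", text.lower())
        ]
        share = words.count(term) / len(words) if words else 0.0
        series.append((date, share))
    return series

series = term_frequency_series(CRAWLS, "energy")
```

Each crawl is processed independently of the others, so the per-date work is a natural candidate for being spread across cluster nodes.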

A 60-node Hadoop cluster with Pig is available to all projects. We also have WebBase, our own Web archive, filled with regular crawls of government and 40,000 other sites.

Specific Projects

Andreas Paepcke
Home Page