Approach
Information for the questions in the box on the right lies buried in
growing Web archives, like the Stanford InfoLab's WebBase, the
Internet Archive, and collections at the Library of Congress.
None of these archives are nearly as usable as they should be for the
public, or for researchers. We are developing tools to remedy this
shortcoming.
ArcSpread is the core for one of our tool sets.
ArcSpread's spreadsheet style will be one of the interaction
experiences for exploring the archives. Formulas will trigger Hadoop
computations over entire crawls. Cells will fill with sets of result
pages, links, page titles, or word sets. Specialized browsers will
then allow exploration of the result cells. A word browser will
specialize in visualization and statistical analysis of word
usage. Link set tools will expose the comings and goings of
links. Summarization tools will enable explorations of entire sites.
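As a rough illustration of the idea that a formula fills a cell with a
set of results, here is a minimal Python sketch. Everything in it is
an assumption made for illustration: the names Sheet, FORMULAS, and
PAGES_CONTAINING are hypothetical and are not ArcSpread's actual API,
and an in-memory dictionary stands in for a crawl where a real system
would launch a Hadoop computation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical stand-in for a crawl: page URL -> page text.
Crawl = Dict[str, str]


def pages_containing(crawl: Crawl, word: str) -> FrozenSet[str]:
    """Return the URLs of pages whose text contains `word` as a whole token."""
    needle = word.lower()
    return frozenset(
        url for url, text in crawl.items() if needle in text.lower().split()
    )


# Hypothetical formula table: formula name -> function computing a result set.
FORMULAS: Dict[str, Callable[[Crawl, str], FrozenSet[str]]] = {
    "PAGES_CONTAINING": pages_containing,
}


@dataclass
class Sheet:
    crawl: Crawl
    # Previously computed results, keyed by (formula name, argument).
    cache: Dict[Tuple[str, str], FrozenSet[str]] = field(default_factory=dict)
    jobs_run: int = 0

    def evaluate(self, formula: str, arg: str) -> FrozenSet[str]:
        """Fill a cell: reuse a cached result if present, else compute it."""
        key = (formula, arg)
        if key not in self.cache:
            # Assumption: a real system would submit a cluster job here.
            self.jobs_run += 1
            self.cache[key] = FORMULAS[formula](self.crawl, arg)
        return self.cache[key]
```

Evaluating the same formula twice computes the result once and then
serves it from the cache, which is the behavior an interactive
spreadsheet over large crawls would likely depend on.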
An evolving three-tiered architecture offers room for lots of student
initiative and ingenuity.
Projects range from systems in the 'machine room' tier, through user
experience development at the spreadsheet and browser level, up to
visualization of data over time. Techniques from all across Computer
Science could be applied: Natural language processing can
summarize. Machine learning can categorize, or offer regressions for
explaining and predicting. Parallel algorithms can enable rapid,
on-demand time series analysis, and novel visualizations can make
results immediately ready for intelligent interpretation.
A 60-node Hadoop cluster with Pig is available to all projects. We
also have WebBase, our own Web archive filled with regular crawls of
government sites and 40,000 other sites.
Specific Projects
- Annotate People and Places: We have a
processing module that uses a part of speech tagger to identify
proper nouns. Those could be people names, or geographic names. One
project would be to automatically separate these two
categories. Then each member of the resulting sets could be
enriched. Geographic names could be annotated with their
latitude/longitude, so that they can be placed on a map. Persons
could be annotated with a biography, news items about them, or other
information that can be retrieved from the current Web
automatically. As a side product, an index would be created for
locating mentions of the people or locations in the archive.
- Indexing Into Time: WebBase is unique in
that it contains a very carefully monitored time series of the same
Web sites over several years. We need to compute mass statistics
over these time spans, so that time phenomena can be queried, or
visualized. Statistics include word usage, ratios of image to text,
link structure, change rates, sentiment, and any others you can
think of.
- Queries and Visualization over Time Phenomena: Given the index
structures developed in the previous project, we want to develop a
vocabulary to talk about
site characteristics, and their changes over time. Visualizations
are then needed to expose those changes. For example, a
visualization could show how the links out of one site changed over
time: how many links changed? Do changes show a pattern, such as
favoring particular content of the target pages?
- Site Summaries: We need to develop
multiple techniques for summarizing entire Web sites. A former
student has developed technology for extracting 'important' text
from Web pages. Maybe an aggregation of those text snippets can
serve as a summary. Other options are image collages, or link
graphs.
- Spreadsheet Formulas: We have a spreadsheet
implementation that enables the use of formulas for tagging
photos, and for retrieving them by their tags. Starting with this
prototype we need to introduce formulas specific to Web
archives. Implementations of these formulas must either translate
into Hadoop jobs or discover and reuse cached, previously computed
results.
- Execution Optimization: Spreadsheet update performance will be a
challenge. While we do not expect to operate at human interaction
speeds,
judicious caching, informed execution plans, and streaming
of partial results will make interactions useful. This
project calls for interest in systems algorithms.
- Browsers: Recomputing the spreadsheet when formulas change will
fill cells with sets of items. Those
items might be words from Web pages, links, entire Web pages, Web
sites, or computed Web statistics. Double-clicking on a cell will
open a browser that is specialized for those item types. For
example, a set of Web pages might bring up an intra-archive Web
browser, like the Internet Archive Wayback Machine. Other data type
sets, like links, will require a
different type of specialized browser. Development of any such
browser is welcome.
- Define Your Own Project: Take a look at WebBase's site list. In
particular, scan the list of the government sites, and the general
crawl. Seeing which sites are
covered, you might have your own idea for what might be visualized
or studied, and which tools would need to be built to realize your
idea.
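
One question raised in the time phenomena project above, how many
links out of a site changed between crawls, comes down to set
differences between consecutive snapshots. The Python sketch below
illustrates that under stated assumptions: the function name
link_changes and the snapshot representation (an ISO date string
paired with a set of outgoing link URLs) are hypothetical, not
WebBase's actual data format.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical snapshot: (ISO crawl date, set of outgoing link URLs).
Snapshot = Tuple[str, Set[str]]


def link_changes(snapshots: List[Snapshot]) -> List[Dict[str, object]]:
    """Compare each snapshot with the next one in date order.

    ISO dates (YYYY-MM-DD) sort correctly as plain strings, so no date
    parsing is needed for ordering.
    """
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    changes = []
    for (date_a, links_a), (date_b, links_b) in zip(ordered, ordered[1:]):
        changes.append(
            {
                "from": date_a,
                "to": date_b,
                "added": links_b - links_a,    # links present only later
                "removed": links_a - links_b,  # links present only earlier
            }
        )
    return changes
```

The per-interval added and removed sets could then feed a
visualization, or be inspected for patterns in the target pages.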