Note: these are information fragments for our 2011 NSF annual
report. This fragmentation of information is required for data entry
to the NSF Web site. We provide these pieces here just for
completeness. The NSF Web site will coalesce this information into a
unified structure. Lines with a hash mark in the first column are
prompts on the NSF entry forms.
# What people have worked on your project?
Within Stanford we had three participants working on the project this
past reporting period. Petros Venetis is a Ph.D. student. Gary Wesley
maintained our WebBase archive facilities, and Andreas served as
project director and programmed prototypes.
# What other organizations have been involved as partners?
Old Dominion University is a close partner:
In-kind support (organization makes software, computers, equipment, etc. available to project staff)
Facilities (project staff use organization's facilities for project activities)
Collaborative research (organization's staff work with project staff on the project)
Old Dominion is a close collaborator as per the original proposal.
Harding University.
In-kind support (organization makes software, computers, equipment, etc. available to project staff)
Facilities (project staff use organization's facilities for project activities)
Collaborative research (organization's staff work with project staff on the project)
Harding is a close collaborator as per the original proposal.
# Have you had other collaborators or contacts?
We have been coordinating and exchanging resources with both the
California Digital Library and the Internet Archive. In addition, we
have consulted with our entire advisory board. A data sharing
procedure with IBM's Austin research center is in progress.
1. Describe the major research and education activities of the
project. Research Activities and Findings
The Stanford arm of the project has focused on the foundations for
tools that will allow historians, social scientists, and members of
other non-technical disciplines to analyze large Web archives.
2. Describe the major findings resulting from these activities.
During the 2010/2011 reporting period we concentrated our effort on
two areas. First, we developed early visualizations that help users
discover vocabulary changes across Web crawl time series. For
example, we have experimented with grids of small squares that
resemble crossword puzzles. Each column represents one crawl, and each
row represents one word. The color of each square encodes word
frequency. We make the squares very small so that we can show many
more word frequencies than, for example, the well-known word or tag
clouds can.
Second, we have worked on infrastructure that allows many archived Web
pages to be analyzed in parallel. For example, we can request word
co-occurrence information on, say, 10,000 pages of the U.S. State
Department's Web site from March 2006. A number of these analytics are
fast enough to be incorporated one day into an interactive analysis
tool.
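The two analyses described above can be illustrated with a minimal,
single-process Python sketch. It is not the project's Hadoop/Pig
implementation. The function names (tokenize, frequency_grid,
cooccurrence), the simple alphabetic tokenizer, and the choice to
count co-occurrence per page are all assumptions made for
illustration. The first function builds the word-by-crawl matrix that
a grid visualization would color. The second counts how many pages
contain both words of each pair.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_grid(crawls):
    """Build a word-by-crawl frequency matrix.

    `crawls` is a list of crawls, each a list of page texts.
    Returns (words, grid), where grid[r][c] is the number of times
    words[r] occurs in crawl c. Rows are words, columns are crawls.
    """
    counts = [
        Counter(word for page in pages for word in tokenize(page))
        for pages in crawls
    ]
    words = sorted(set().union(*counts))
    grid = [[crawl_counts[word] for crawl_counts in counts] for word in words]
    return words, grid


def cooccurrence(pages):
    """Count, for each unordered word pair, the pages containing both words."""
    pairs = Counter()
    for page in pages:
        vocabulary = sorted(set(tokenize(page)))
        pairs.update(combinations(vocabulary, 2))
    return pairs


if __name__ == "__main__":
    # Two tiny hypothetical crawls, one page each.
    words, grid = frequency_grid([["visa travel visa"], ["travel passport"]])
    for word, row in zip(words, grid):
        print(word, row)
    print(cooccurrence(["visa travel", "travel visa passport"]).most_common(3))
```

Counting each page separately is what makes the co-occurrence step
easy to parallelize: per-page pair counts can be computed
independently and then summed, which is the pattern a system such as
Hadoop is designed for.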
3. Describe the opportunities for training, development and
mentoring provided by your project.
The visualization work is entirely student-driven. Project members
have gained substantial experience with the parallel processing system
Hadoop and with Pig, the 'relationesque' layer built atop Hadoop.
Hadoop is used throughout data-intensive industries, and knowledge of
its workings will serve the students well.
4. Describe outreach activities your project has
undertaken. Training and development.
We held a workshop at Stanford with most of our advisory board in
attendance. On that occasion we explained our plans, and solicited
input from the attendees. The resulting understanding of the status
quo has been helping us plan next steps.
5. Publications and products
Data:
Description:
We began to convert selected WebBase holdings (past Web crawls) to the
WARC format, which makes the data accessible to tools that operate on
Internet Archive and Library of Congress data.
How to disseminate:
We disseminate the data via the open Web.
Software:
Describe:
We developed an Excel-compatible Load and Store module for the Pig
system on top of Hadoop.
How to disseminate:
We submitted the contribution to the respective open source
project. The software will be included in the next release.
6. Contributions
Within the discipline:
Our WebBase has over the past several years provided the computing
community with large streams of data. This 'wholesale' focus is unique
in that most archives merely offer search and retrieval of small
record sets. WebBase has enabled the community to analyze large
sets of Web pages. In this project we are adding wholesale analytics
to this data feed.
Contributions to other disciplines:
Stanford's portion of this larger project currently aims to help
researchers outside computer science, such as historians, political
scientists, and other social scientists, analyze the past Web. Such
tools are essential if we are to continue the study of society and of
the past. As political and social life shifts to the Web, analysis
tools must follow. Our initial focus is to equip professions that have
traditionally looked at society's artifacts so that they can continue
their work in the new medium.
Contributions to human resource development:
Stanford's Computer Science department has always insisted on
involving Ph.D. and Master's students in all research endeavors. We are
continuing this tradition.
Contributions beyond science:
Please see section "Contributions to other disciplines".
Other publications:
Eldar Sadikov, Montserrat Medina, Jure Leskovec, Hector Garcia-Molina
Correcting for Missing Data in Information Cascades
ACM International Conference on Web Search and Data Mining (WSDM 2011)
2011
Paul Heymann, Hector Garcia-Molina
Turkalytics: Real-time Analytics for Human Computation
International World Wide Web Conference (WWW)
2011
-----------------
One coherent piece to put into Findings:
We began the year with a kickoff meeting, to which we invited our
advisory board. We spent an entire day on presentations and
discussions.
Here is a partial list of points made by the attendees. Their feedback
was a mix of observations about archiving, and concrete pointers to
literature or projects.
Observations about Archiving
o Given that the entire Web cannot be preserved, the dilemma is that
only the future will tell what would have been worth collecting. In
this sense the selection part of curatorship is not all that different
from the past. Having many parties cover pieces of the Web, sampling
at different intervals offers the chance that important material can
at least be reconstructed from different collections.
o What is often missing from current Web archives is a statement of
the guarantees the collections provide. Making such guarantees
explicit would raise both respect for the collections and their
usefulness. Examples include crawl rate, crawl failure rate, and
completeness.
o The observation was made that the size of Web archives has a long
'tail,' meaning that many archives are very specialized, and hold
relatively little content. It is the integration over this set of
archives that is needed for comprehensive coverage.
o Many motivations exist for not sharing data. Some reasons are
privacy concerns, proprietary career-related considerations, and
simply the messiness of some data sets.
o Part of archiving is the issue of when and what to forget. Also: the
inverse; some information must remain hidden from view for a specified
amount of time (usually years), and can then finally be revealed.
o The old problem of collecting the deep (hidden) Web of dynamically
generated pages remains.
o It is not just text, images, and sound that might need to be
preserved, but also games and social worlds.
o It is important to make Web archives useful now, not just in the
future, or when a Web site crashes and loses data. Only then can we
count on participation in archive building.
Specific Projects and Collections
o We discussed the issues the Library of Congress struggles with
around the newly acquired Twitter collection. Among those are the
privacy concerns that are brought up by the public, even though the
data is open to begin with. While these concerns are therefore at
first glance unwarranted, the point was made that when aggregated,
even public data can grow into a qualitatively different data set.
o Attention was drawn to Zoetrope, a research project for viewing time
series Web snapshots
(http://www.adobe.com/technology/pdfs/uist08zoetrope.pdf).
o We heard an introduction to Memento, an infrastructure for
recovering lost Web sites.
o We heard about a Google effort to collect 14 billion tables from the
Web, and discover which tables contain data, rather than just being
used for layout control. Google Fusion Tables allows users to upload
tables.
o Many tools make it difficult to extract one's own data. The site
http://www.dataliberation.org/ attempts to remedy that situation.
o http://www.sitemaps.org allows Web archives to communicate with
search engine crawlers.
As planned in our NSF proposal, we began immediately after the kickoff
meeting to create an open inventory of what is being archived. The
result is available at
https://spreadsheets.google.com/spreadsheet/ccc?key=0AneDa9XDbDFpdDhEUFluLU82dGhIbnVZWlpIU2FRdXc&hl=en_US#gid=0
We collected over 1500 archives.