Stanford WAC Team at Datafest
On May 19, 2012, the Stanford arm of the Web Archive Cooperative
project participated in Datafest. The event was organized as part of
the Computational Reporting program. Participants included
journalists, political scientists, and computer scientists, who formed
teams and selected topics concerning campaign finance. A
representative of the Sunlight Foundation was on hand to consult about
available data sources and promising search strategies.
As projected in our WAC NSF proposal, the social sciences are
increasingly aware that mining the Web and its archives can yield
important insights. Groups began by sketching plans around a
whiteboard; once they had settled on topics to investigate, teams
large and small formed to pursue answers to their questions.
WAC Team Volunteers at San Jose Tech Museum: CS Outreach to Children
In an effort to spark interest in Computer Science at an early
age, members of the Web Archive Cooperative volunteered at the San
Jose Tech Museum.
Children as well as adults grew deeply involved in programming a large
robot to gesture and dance. One of the WAC members wrote the prototype
for the underlying software (under separate funding). The system was
exhibited on six consecutive weekends at the museum, with help from our
volunteers. Those attracted to the exhibit, and truly engrossed in it,
ranged in age from 2 years to full adulthood, an astounding span.
Web Archive Workshop Coming Up
Preparations are under way for our June 2012 workshop at Stanford,
with Harding organizing. Details here.
Web Archive Cooperative Volunteers at Hack-the-Future
The WAC team volunteered at a Bay Area event 'Hack-the-Future,' in
which children engage in technology projects designed to inspire them
towards careers in the sciences. Andreas Paepcke of the WAC team
developed a very simple method for programming robot
interactions (under separate funding). This Hack-the-Future project featured
an implementation of this method created by engineers at Willow
Garage, who provided the robot for the event.
This girl's program works...mostly.
Project Introduced to Low-Income High-Achievement High-School Students
With funding from Google, Stanford University's Computer Science Department organized a series of on-campus events for
high-achieving, low-income high school students of color from across the country. The students were selected for
their passion for math and science. The goal is to inspire the students towards an engineering career.
During the one-week residential and academic program, students take rigorous coursework that prepares
them to excel in science, technology, engineering, or mathematics, with a strong preference for Computer Science.
The WAC project PI Prof. Hector Garcia-Molina introduced the project and other research examples to a group of
fourteen Mexican Americans/Latinos and sixteen African Americans. Many of the students were first-generation citizens.
Web Archiving Receives Publicity
Mike Nelson's part of the project enjoyed a round of good publicity around web archiving, which started with a post on the Old Dominion Web Science and Digital Libraries Research Group blog.
That entry was picked up by the Chronicle of Higher Education in an article on July 6, 2011.
That story in turn led to an article in the Washington Post on July 17, 2011, which then elicited a short TV interview on "Canada AM" on July 21.
Global Web Archiving Workshop at JCDL
On June 17 and 18, 2011 we organized a workshop on Web Archive Globalization in the context of the Joint Conference on Digital Libraries. The roughly 20 participants had lively discussions in response to four prepared presentations. These talks were given by Eric Hetzner (California Digital Library), Nicholas Taylor (Library of Congress), Brad Tofel (Internet Archive), and Rob Sanderson (Los Alamos National Lab). The Library of Congress slides are available online, as are the presentations by the California Digital Library, the Internet Archive, and those of the Los Alamos National Lab.
The attending PIs of the Web Archive Cooperative project met ahead of
the workshop to coordinate both the workshop and the project in
general. More discussion was had over dinner. A side note, as free
travel advice: five-packs of paper underwear and socks travel light
and small (you had to be there to see the beauty...). We will add
additional workshop materials here as they become available.
From Alex Thurman
Web Collection Curator
Columbia University Libraries
535 W. 114th Street
New York, NY 10027
Here are some links to resources mentioned during the Workshop.
Columbia is currently surveying three user groups to help guide the design of our web archives access portal. The groups are: human rights researchers (students, faculty); content providers (NGOs whose sites we're archiving); and librarians/archivists. The surveys vary slightly for the 3 different groups, but are largely identical. If you'd like to see and/or complete the survey sent to librarians/archivists, the link follows. When we have our results I can share them with this group if desired.
Take the survey: http://www.surveymonkey.com/s/columbiawebarchives_L
More detailed web archives user studies are available from the Portuguese Web Archive at:
The results of the survey of web archiving initiatives that they conducted and posted on Wikipedia are at:
Two access portals that came up in discussion were the UK Web Archive and Trove
Directory of Existing Archives
As a first order of business we compiled a list of the Web archives that we are aware of. This list is available as a Google Docs spreadsheet. We invite the public to add entries for other archives as they become available. The current list comprises over 1500 entries.
JCDL Archiving Workshop
In the context of the Joint Conference on Digital Libraries (JCDL 2011) this project will organize a workshop.
The meeting will bring together interested parties from major archives: government, private, and academic. We will report on results.
We are working with the distributed computing infrastructure Hadoop. The goal is to stream our WebBase archive seamlessly through a compute cluster for analysis and processing. In this context we contributed an Excel load and store module to the Apache Pig open source project.
Please visit our WebBase archive, where we make several years of archived Web content available.
Advisory Board Meeting
We kicked the project off by inviting our advisory board to
Stanford. We presented our plans, and listened to the board's
suggestions. We received valuable pointers to efforts elsewhere, both
US and international. These leads later exposed significant
differences among US and many European national collection efforts. It
seems that while European government-run Web archiving efforts are
broad, and probably quite complete, the resulting archives are often
closed for all practical purposes. For example, several countries
limit access to a handful of terminals in their national library.
Challenges
Challenge: Describing Resources
Each federation member has a
set of ``resources'', e.g., web crawls, query logs, crawling software,
etc. To be usable, each resource needs to be described in a way that
can be understood by other federation members. How was the resource
obtained? On what dates? What does it contain? Who can access the
resource? How do we ``compare'' archives and their holdings? While
standards are emerging, their resource descriptions are not yet
detailed enough to allow integration with other resources. The
challenge is to identify descriptions that truly facilitate
experimentation and integration, and at the same time are reasonable
for the resource owner to generate.
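As a concrete illustration of the kind of description a federation member might publish, here is a minimal sketch of a crawl description record. All field names and values are illustrative assumptions, not a standard proposed by the project; the crawler name (Heritrix) is simply one widely used open-source crawler.

```python
# Illustrative resource description for one web crawl. Every field name
# and value here is a hypothetical example, not an emerging standard.
crawl_description = {
    "type": "web-crawl",
    "collected": {"start": "2011-01-03", "end": "2011-01-17"},
    "method": {"crawler": "Heritrix", "depth": 5, "per_site_page_cap": 10000},
    "contents": {"pages": 1200000, "sites": 4300},
    "access": {"who": "federation members", "api": "HTTP bulk download"},
}
```

A record like this answers the questions above (how obtained, on what dates, what it contains, who can access it) in a machine-readable form; the open challenge is agreeing on fields that are detailed enough for integration yet cheap for owners to produce.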
Challenge: Resource Discovery and Characterization.
Thrust 1 (Resource Description): Create a comprehensive ``taxonomy'' of descriptive
features and access mechanisms (APIs) for a wide range of resource types.
Develop a cost/benefit ratio for each feature/mechanism that
describes how difficult it is to obtain/implement the feature/mechanism, and how useful it is
to support experimentation and research.
Identify, promote and develop metrics for quantifying and comparing archives and their holdings.
Study how to include new features/mechanisms in existing or new standards.
Build a reference implementation for an archive
that supports advanced resource description/access
for a variety of resources.
A WAC needs a discovery service that lets researchers find resources of interest.
Resource owners can manually register resources
at the discovery service, or the service can automatically
harvest information about emerging resources
(e.g., by monitoring crawler traffic at Web sites).
If a resource is not fully described, the discovery service
may be able to analyze the resource and extract
its characteristics (e.g., site depth of a crawl, coverage, content statistics).
Challenge: Linking and Combining Resources.
Thrust 2 (Discovery)
Study and evaluate options for a resource discovery service.
We will explore three goals for such a service:
(1) the manual or automated discovery of entire existing Web related
archives; (2) the selection
among known archives of the ones that
support a specific research question; and (3)
the identification of individual resources from within the discovered archives.
We will also develop tools for characterizing discovered archives,
especially for the case where the archive does not provide rich
descriptions.
Characterization of an archive includes elements such as an estimate
of the archive's coverage, particulars of the crawling parameters,
like dates/frequencies, crawl duration, depth, per-site ceiling on the
number of collected pages, content statistics, and link structure.
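Some of these characteristics can be derived from nothing more than an archive's URL list. The following is a minimal sketch, computing just two of the elements named above (pages per site and maximum crawl depth); the function is an illustrative assumption, not the project's actual characterization tooling.

```python
from collections import defaultdict
from urllib.parse import urlparse

def characterize(urls):
    """Derive simple crawl statistics from a list of archived URLs.

    Returns (pages_per_site, max_depth): how many pages were collected
    per site, and the deepest URL path observed, where depth is counted
    as the number of non-empty path segments.
    """
    pages_per_site = defaultdict(int)
    max_depth = 0
    for url in urls:
        parts = urlparse(url)
        pages_per_site[parts.netloc] += 1
        depth = len([seg for seg in parts.path.split("/") if seg])
        max_depth = max(max_depth, depth)
    return dict(pages_per_site), max_depth
```

Richer characteristics (coverage estimates, crawl frequencies, link structure) would need the timestamps and page contents as well, not just the URLs.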
Using open-source software as much as possible,
we will build, operate, and evaluate a discovery service for our WAC.
Finally, we will also support what we call forward discovery,
i.e., the identification of candidates for future archiving.
For WAC forward discovery we will provide a clearinghouse
where the community can express recommendations.
Such a clearinghouse is needed because
recommending parties often do not themselves possess archiving
capacity. On the other hand, the community at large is an
indispensable resource for identifying niches of interest on the Web
that might be of importance in the future.
The WAC provides integrated access to independent resources. This
integration requires sophisticated resource and metadata translation
mechanisms. For instance, URLs in one archive need to be mapped to
ones in another; annotation tags in one resource need to be translated
to their synonyms in another. Redundant (or approximately redundant)
objects need to be identified, merged and possibly exploited (e.g., if
an archived URI is damaged, are there redundant or similar URIs that
can be substituted?). Inconsistencies in the way
resources were gathered need to be resolved, or at least described.
For example, how do we unify two Web page crawls, one that visited
sites every 3 days and another that visited sites every 5 days?
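One standard approach to the redundancy question is to compare archived pages by word-shingle overlap and flag pairs above a similarity threshold as near-duplicates. The sketch below assumes a shingle length of 4 words and an illustrative threshold; neither figure comes from the project.

```python
def shingles(text, k=4):
    """Set of k-word shingles (overlapping word windows) of a page's text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(pages, threshold=0.9):
    """Return index pairs of pages whose shingle similarity exceeds threshold.

    `pages` is a list of page texts; threshold is an illustrative choice.
    """
    sets = [shingles(p) for p in pages]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) > threshold]
```

At archive scale one would replace the pairwise comparison with a sketching scheme such as MinHash, but the similarity notion is the same.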
Challenge: Preserving Resources.
Thrust 3 (Archive Linking)
We will develop mechanisms for integrating diverse archives,
and will apply these mechanisms to site reconstruction
(from various archives) and archive views (a logical fusion
of resources from multiple sources).
Since integration issues are so challenging, we will
set up an experimental testbed with small but diverse resources.
The testbed will contain several crawls of the same target sites,
each obtained with different crawlers and using different parameters.
The testbed will also contain related resources, e.g., the tags
at Delicious for the same set of sites.
The testbed will let us study and quantify differences among
the crawls, and will let us evaluate strategies for combining
and linking resources.
The WAC preserves past Web states, but who preserves the
WAC content itself? In other words, WAC resources
stored at member archives can be lost due to
hardware failures or the member archive going out of business.
Resources can be preserved through replication, but
(a) member archives must be willing to store backup copies;
(b) the number of desired copies and their location must be determined, and
(c) update propagation mechanisms must be in place
to keep replicas synchronized.
The size and rate of change of WAC resources make
all these aspects especially challenging.
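For point (b), one standard way to determine where copies of a resource should live is rendezvous (highest-random-weight) hashing: each member archive gets a deterministic score for each resource, and the top-scoring archives hold the replicas. This is a sketch of the general technique with hypothetical archive names, not the project's chosen mechanism.

```python
import hashlib

def replica_sites(resource_id, archives, copies=3):
    """Pick which member archives hold backup copies of a resource.

    Rendezvous hashing: hash (resource, archive) pairs and keep the
    `copies` highest scores. The choice is deterministic, spreads load
    evenly, and adding or removing one archive only moves the replicas
    that archive would have won.
    """
    def score(archive):
        digest = hashlib.sha1(f"{resource_id}|{archive}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(archives, key=score, reverse=True)[:copies]
```

Placement is only half the problem; the synchronization question posed in Thrust 4 (one crawl replicated versus coordinated independent crawls) remains open either way.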
Challenge: Filling the Gaps.
Thrust 4 (Preservation)
We will explore storage trading schemes that allow members
to trade local backup space for remote space.
We will extend the notion of self-preserving objects
to develop a Web archive replication tool.
We will study alternatives for replica synchronization.
For example, is it best to have a single crawler generate
one Web archive that is replicated at two other sites,
or is it better to have three coordinated crawlers that
each create their own archive of the same target sites?
As we conduct our research, we are bound to see gaps in coverage:
data sets that researchers need but are not available anywhere,
or tools that researchers need but have not been developed.
Gaps occur when resources exist but are not shared
(e.g., query logs are often considered sensitive),
or for emerging applications where data collection
tools have not been developed
(e.g., the next Facebook or Twitter-like system).
Challenge: Community Building.
Thrust 5 (Filling Gaps)
We will study ways to fill the gaps. For example, it is
possible to gather query logs in a distributed fashion by using the
referrer field of HTTP requests from search engines. Thus, a
community of Web sites can gather a query log, filling an important
gap that exists today. We will also build data gathering tools for
emerging applications (e.g., an archive of Twitter feeds and
profiles). Another gap exists for what is called the Deep
Web, i.e., information that resides in
backend databases but is displayed through dynamic Web pages. Such
information is very valuable but is hard to find in open Web archives.
In addition to filling information gaps, we will also address gaps in
tools that facilitate archiving and resource sharing. For example, we
will adapt anonymization techniques explored by the security community
to the context of Web content, thus making owners more willing to
share their resources.
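The referrer-based query-log idea can be sketched as follows: when a visitor arrives from a search engine, the referrer URL carries the query they issued, and a participating site can log it. The engine list and parameter names below are illustrative assumptions that would need maintaining in practice.

```python
from urllib.parse import urlparse, parse_qs

# Query-string parameter each engine puts the search terms in.
# Illustrative subset; a real deployment would cover many more engines.
ENGINE_PARAMS = {
    "www.google.com": "q",
    "www.bing.com": "q",
    "search.yahoo.com": "p",
}

def query_from_referrer(referrer):
    """Extract the search query from a search-engine referrer URL.

    Returns the decoded query string, or None if the referrer does not
    come from a recognized search engine or carries no query.
    """
    parts = urlparse(referrer)
    param = ENGINE_PARAMS.get(parts.netloc)
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None
```

A community of sites each logging these extracted queries (after anonymization, as discussed above) could assemble a distributed query log without any search engine's cooperation.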
The success of a WAC will depend on the willingness of members
to gather, implement, and share resources.
In turn, this willingness will depend on
the availability of useful standards and tools,
on the initial seeding of the WAC with a substantial
number of resources, and on an understanding
of the legal and social issues related to research
of shared Web resources.
Thrust 6 (Community Building)
We will organize a number of workshops to bring together
key Web Science researchers, to discuss available resources
and impediments to sharing.
These workshops will drive our research,
identifying needed tools and protocols.
With small groups of participants, we will
establish challenge problems to attack,
e.g., combining a set of Web archives.
With restricted participation, we expect
to get access to more resources, demonstrating
that a collective effort can yield benefits to all.
Reports of these results at future workshops
can incentivize others to participate in the WAC.
In addition, we will set up an Advisory Board
of industrial, government, and academic experts
to guide our project.
To keep Web Science vibrant, future researchers and practitioners need
to be trained.
However, current knowledge
(e.g., how to effectively run massive Web crawls,
how to extract meaningful information from massive Web data sets)
is widely dispersed, and current tools are poorly documented.
Thrust 7 (Education)
We will run a Summer Institute for Web Science graduate students.
At this Institute, students will learn to use the latest tools,
and will learn from each other's experiences in dealing with Web data.
In addition, we will develop a one-day workshop which can be offered
at Web Science conferences (WWW, SIGIR, etc.) to educate participants
about the resources made available by the WAC. We will also develop
an undergraduate Web Science track for computer science majors that
will use WAC tools.
Project Advisory Board
Martha Anderson, LOC
Pamela Anderson, Berkeley
Christine Borgman, UCLA
Patricia Cruse, Cal. Digital Library
Richard Furuta, Texas A&M
Alon Halevy, Google
Carl Lagoze, Cornell
Gary Marchionini, U. North Carolina
Raghu Ramakrishnan, Yahoo
Herbert van de Sompel, LANL
Publications by the Stanford portion of the project (first eight months):
Eldar Sadikov, Montserrat Medina, Jure Leskovec, and Hector Garcia-Molina (2011). Correcting for Missing Data in Information Cascades. In: Fourth ACM International Conference on Web Search and Data Mining (WSDM 2011), 9-12 February 2011, Hong Kong.
Paul Heymann and Hector Garcia-Molina (2011). Turkalytics: Analytics for Human Computation. In: Proceedings of the 20th International Conference on World Wide Web (WWW '11), ACM, New York, NY, USA, 477-486. DOI: 10.1145/1963405.1963473, http://doi.acm.org/10.1145/1963405.1963473.
See also the fragments of our 2011 NSF annual report. The full report is available at NSF.