Web Archive Cooperative
Making Web Archives Useful Today

Supported by the National Science Foundation (1009916)

The Web Archive Cooperative (WAC) project aims to advance technology and practices for the non-commercial archiving of the Web. The effort includes principal investigators from Old Dominion University, Harding University, and Stanford University. You are currently visiting the Stanford project portion. Please also visit the Old Dominion and Harding arms of the project.

Please see below for a summary of challenges we set ourselves. You will also find a list of our advisory board. We are very fortunate to have these seasoned researchers guide us along the way. You might also scan our publications as we produce them.

Directly below, towards the top of this page, you will find project accomplishments as they evolve.

 

Senior Researchers

WAC Member One of Knight Fellow 'Favorite Professors'

In the context of the WAC project Andreas collaborated with one of the 2011/2012 Stanford Knight Fellow journalists, Liz McClure. In return, Liz nominated Andreas as her 'Favorite Professor' during her stay at Stanford. A reception brought together the outgoing Knight Fellows and 'their professors.'
Photos: Andreas with Knight Fellow and faculty; group photo of Stanford faculty and Knight Fellows.

Stanford WAC Team at Datafest

On May 19, 2012, the Stanford arm of the Web Archive Cooperative project participated in the Stanford Datafest. The event was organized as part of the Computational Reporting program. Participants included journalists, political scientists, and computing people. They formed teams, which selected topics concerning campaign finance. A representative of the Sunshine Foundation was on hand to consult about available data sources and promising search strategies.

As projected in our WAC NSF proposal, social scientists are increasingly aware that mining the Web and its archives can yield important insights. Groups began with plans around a whiteboard. Once groups had decided on topics to investigate, small and large teams formed to pursue answers to their questions.

Photos: small group at Datafest; large group at Datafest; multiple groups at Datafest.

WAC Team Volunteers at San Jose Tech Museum: CS Outreach to Kids

In an effort to spark interest in Computer Science at an early age, members of the Web Archive Cooperative volunteered at the San Jose Tech Museum. Children as well as adults grew deeply involved in programming a large robot to gesture and dance. One of the WAC members wrote the prototype for the underlying software (under separate funding). The system was exhibited on six consecutive weekends at the museum, with help from us volunteers. Those attracted to the exhibit, and truly engrossed in it, ranged in age from 2 years to full adulthood, an astounding spread.

Photos: small girl with robot; adults and kids; Girl Scouts and robot; girls at computers.

Web Archive Workshop Coming Up

Preparations are under way for our June 2012 workshop at Stanford. Harding University is organizing. Details here.

Web Archive Cooperative Volunteers at Hack-the-Future

The WAC team volunteered at a Bay Area event, 'Hack-the-Future,' in which children engage in technology projects designed to inspire them towards careers in the sciences. Andreas Paepcke of the WAC team developed a very simple method for programming robot interactions (separate funding). This Hack-the-Future project featured an implementation of this method created by engineers at Willow Garage, who provided the robot for the event.

Photos: girl in robot embrace; help with programming; this girl's program works...mostly.

Project Introduced to Low-Income High-Achievement High-School Students

With funding from Google, Stanford University's Computer Science Department organized a series of on-campus events for high-achieving, low-income high school students of color from across the country. The students were selected for their passion for math and science. The goal is to inspire the students towards an engineering career. During the one-week residential and academic program, students take rigorous coursework that prepares them to excel in science, technology, engineering, or mathematics, with a strong preference for Computer Science. The WAC project PI Prof. Hector Garcia-Molina introduced the project and other research examples to a group of fourteen Mexican Americans/Latinos and sixteen African Americans. Many of the students were first-generation citizens.

Photos: LEAD students with Prof. Garcia-Molina speaking; LEAD program summer students (1); LEAD students (2).

Web Archiving Receives Publicity

Mike Nelson's part of the project enjoyed a round of good publicity around web archiving, which started with a blog post by the Old Dominion Web Science and Digital Libraries Research Group. That entry was picked up by the Chronicle of Higher Education with an article on July 6, 2011. That story in turn led to an article in the Washington Post on July 17, 2011, which then elicited a short TV interview on "Canada AM" on July 21.

Global Web Archiving Workshop at JCDL

On June 17 and 18, 2011 we organized a workshop, Web Archive Globalization, in the context of the Joint Conference on Digital Libraries. The roughly 20 participants had lively discussions in response to four prepared presentations. These talks were given by Eric Hetzner (California Digital Library), Nicholas Taylor (Library of Congress), Brad Tofel (Internet Archive), and Rob Sanderson (Los Alamos National Lab). The Library of Congress slides are available online, as are the presentations by the California Digital Library, the Internet Archive, and the Los Alamos National Lab.

The attending PIs of the Web Archive Cooperative project met ahead of the workshop to coordinate both the workshop and the project in general.
Photos: WAC coordination meeting; JCDL 2011 group photo; dinner (two views); paper underwear.
More discussion was had over dinner. A side note: free travel advice. Light and small five packs of paper underwear and socks (you had to be there to see the beauty...). We will add additional workshop notes shortly.

Post-Workshop Information

From Alex Thurman
Web Collection Curator
Columbia University Libraries
535 W. 114th Street
New York, NY 10027
at2186@columbia.edu
Here are some links to resources mentioned during the Workshop.

Columbia is currently surveying three user groups to help guide the design of our web archives access portal. The groups are: human rights researchers (students, faculty); content providers (NGOs whose sites we're archiving); and librarians/archivists. The surveys vary slightly for the 3 different groups, but are largely identical. If you'd like to see and/or complete the survey sent to librarians/archivists, the link follows. When we have our results I can share them with this group if desired.

Take the survey: http://www.surveymonkey.com/s/columbiawebarchives_L

More detailed web archives user studies are available from the Portuguese Web Archive at: http://sobre.arquivo.pt/about-the-archive/publications

The results of the survey of web archiving initiatives that they conducted and posted on Wikipedia are at: http://en.wikipedia.org/wiki/List_of_Web_Archiving_Initiatives

Two access portals that came up in discussion were the UK Web Archive and Trove.

Directory of Existing Archives

As a first order of business we compiled a list of the Web archives that we are aware of. This list is available as a Google Docs spreadsheet. We invite the public to add entries for other archives as they become available. The current list comprises over 1500 entries.

JCDL Archiving Workshop

In the context of the Joint Conference on Digital Libraries (JCDL 2011) this project will organize a workshop. The meeting will bring together interested parties from major archives (government, private, and academic). We will report on results.

Software Releases

We are working with the distributed computing infrastructure Hadoop. The goal is to stream our WebBase archive seamlessly through a compute cluster for analysis and processing. In this context we contributed an Excel load and store module to the Apache Pig open source project.

Data Access

Please visit our WebBase archive, where we make several years of archived Web content available.

Advisory Board Meeting

We kicked the project off by inviting our advisory board to Stanford. We presented our plans and listened to the board's suggestions. We received valuable pointers to efforts elsewhere, both US and international. These leads later exposed significant differences between US and many European national collection efforts. It seems that while European government-run Web archiving efforts are broad, and probably quite complete, the resulting archives are often closed for all practical purposes. For example, several countries limit access to a handful of terminals in their national library buildings.

Photo: kickoff meeting.

Challenges

Challenge: Describing Resources
Each federation member has a set of "resources", e.g., web crawls, query logs, crawling software, etc. To be usable, each resource needs to be described in a way that can be understood by other federation members. How was the resource obtained? On what dates? What does it contain? Who can access the resource? How do we "compare" archives and their holdings? While standards are emerging, their resource descriptions are not yet detailed enough to allow integration with other resources. The challenge is to identify descriptions that truly facilitate experimentation and integration, and at the same time are reasonable for the resource owner to generate.
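To make the questions above concrete, here is a minimal sketch of what a machine-readable resource description might look like. It is an illustration only, not a format the project has adopted: the class name `ResourceDescription`, its field names, and the helper `overlap_days` are all hypothetical.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List
import json


@dataclass
class ResourceDescription:
    """Hypothetical record answering the questions posed above."""
    name: str
    kind: str            # e.g. "web crawl", "query log", "crawling software"
    obtained_by: str     # how the resource was obtained
    start_date: date     # first date covered by the resource
    end_date: date       # last date covered by the resource
    contents: str        # what the resource contains
    access: str          # who can access the resource
    tags: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to JSON so other federation members can read it."""
        record = asdict(self)
        record["start_date"] = self.start_date.isoformat()
        record["end_date"] = self.end_date.isoformat()
        return json.dumps(record, sort_keys=True)


def overlap_days(a: ResourceDescription, b: ResourceDescription) -> int:
    """Number of calendar days covered by both resources (0 if disjoint)."""
    start = max(a.start_date, b.start_date)
    end = min(a.end_date, b.end_date)
    return max(0, (end - start).days + 1)
```

Even a simple comparison such as `overlap_days` only becomes possible once both owners record their collection dates in the same way, which is one reason the level of detail in a description matters.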

Challenge: Resource Discovery and Characterization
A WAC needs a discovery service that lets researchers find resources of interest. Resource owners can manually register resources at the discovery service, or the service can automatically harvest information about emerging resources (e.g., by monitoring crawler traffic at Web sites). If a resource is not fully described, the discovery service may be able to analyze the resource and extract its characteristics (e.g., site depth of a crawl, coverage, diameter).

Challenge: Linking and Combining Resources
The WAC provides integrated access to independent resources. This integration requires sophisticated resource and metadata translation mechanisms. For instance, URLs in one archive need to be mapped to ones in another; annotation tags in one resource need to be translated to their synonyms in another. Redundant (or approximately redundant) objects need to be identified, merged and possibly exploited (e.g., if an archived URI is damaged, are there redundant or similar URIs that can be substituted?). Inconsistencies in the way resources were gathered need to be resolved, or at least described. For example, how do we unify two Web page crawls, one that visited sites every 3 days and another that visited sites every 5 days?

Challenge: Preserving Resources
The WAC preserves past Web states, but who preserves the WAC content itself? In other words, WAC resources stored at member archives can be lost due to hardware failures or the member archive going out of business. Resources can be preserved through replication, but (a) member archives must be willing to store backup copies; (b) the number of desired copies and their location must be determined, and (c) update propagation mechanisms must be in place to keep replicas synchronized. The size and rate of change of WAC resources make all these aspects especially challenging.

Challenge: Filling the Gaps
As we conduct our research, we are bound to see gaps in coverage: data sets that researchers need but are not available anywhere, or tools that researchers need but have not been developed. Gaps occur when resources exist but are not shared (e.g., query logs are often considered sensitive), or for emerging applications where data collection tools have not been developed (e.g., the next Facebook or Twitter-like system).

Challenge: Community Building
The success of a WAC will depend on the willingness of members to gather, implement, and share resources. In turn, this willingness will depend on the availability of useful standards and tools, on the initial seeding of the WAC with a substantial number of resources, and an understanding of the legal and social issues related to research of shared Web resources.

Challenge: Education
To keep Web Science vibrant, future researchers and practitioners need to be trained. However, current knowledge (e.g., how to effectively run massive Web crawls, how to extract meaningful information from massive Web data sets) is widely dispersed, and current tools are poorly documented.
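As a small illustration of the crawl-unification question raised under Linking and Combining Resources, the sketch below merges the visit dates of a 3-day crawl and a 5-day crawl into one timeline that records which crawl supplied each snapshot. It is one possible approach under simplifying assumptions (one snapshot per visit, day granularity), not the project's method; the function names `visit_dates`, `merge_timelines`, and `latest_snapshot` are hypothetical.

```python
from datetime import date, timedelta


def visit_dates(start, interval_days, count):
    """Dates on which a crawl with a fixed revisit interval saw a site."""
    return [start + timedelta(days=i * interval_days) for i in range(count)]


def merge_timelines(crawls):
    """Merge crawls into one sorted timeline.

    crawls maps a crawl name to its list of visit dates. The result is a
    sorted list of (date, [crawl names]) pairs, so the provenance of each
    snapshot is kept rather than hidden.
    """
    merged = {}
    for name, dates in crawls.items():
        for d in dates:
            merged.setdefault(d, []).append(name)
    return sorted(merged.items())


def latest_snapshot(timeline, query):
    """Most recent (date, sources) entry on or before query, else None."""
    best = None
    for d, sources in timeline:
        if d > query:
            break
        best = (d, sources)
    return best


start = date(2011, 1, 1)
timeline = merge_timelines({
    "A": visit_dates(start, 3, 6),  # visits every 3 days
    "B": visit_dates(start, 5, 4),  # visits every 5 days
})
```

The merged timeline is denser than either crawl alone, but its spacing is irregular. That irregularity is the kind of inconsistency that would need to be resolved, or at least described, for anyone analyzing change rates across the combined data.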

Project Advisory Board

Rakesh Agrawal, Microsoft
Martha Anderson, LOC
Pamela Anderson, Berkeley
Christine Borgman, UCLA
Patricia Cruse, Cal. Digital Library
Richard Furuta, Texas A&M
Alon Halevy, Google
Carl Lagoze, Cornell
Gary Marchionini, U. North Carolina
Raghu Ramakrishnan, Yahoo
Herbert van de Sompel, LANL

Project Publications

Publications by the Stanford portion of the project (first eight months):

Eldar Sadikov, Montserrat Medina, Jure Leskovec, and Hector Garcia-Molina. 2011. Correcting for Missing Data in Information Cascades. In Fourth ACM International Conference on Web Search and Data Mining (WSDM 2011), 9-12 February 2011, Hong Kong.


Paul Heymann and Hector Garcia-Molina. 2011. Turkalytics: analytics for human computation. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 477-486. DOI=10.1145/1963405.1963473 http://doi.acm.org/10.1145/1963405.1963473

See also the fragments of our 2011 NSF annual report. The full report is available at NSF.