WAC Summer Workshop:

Web Analytics with Hadoop and Pig

Aaron Binns

Thanks to slides.html5rocks.com for the HTML presentation template

About Me

  • Aaron Binns
  • aaron@archive.org
  • Joined Internet Archive January 2008
  • Senior Software Engineer
  • Full-text search
  • Web Analytics & other fun stuff!
  • IIPC Program Officer [2010-2012]

Internet Archive

Over 8.5PB of web, books, audio, movies, television and more!
Over 176,000,000,000 resources
Nearly 3PB of archived web content
Over 1,000,000 free ebook titles
Subscription web archiving service with over 200 partners


  • Web analytics — what is it?
  • Web Archive File Formats: WARC, CDX, WAT
  • Hadoop & Pig
  • Examples
    • Simple counting...at scale
    • Closest capture
    • Page similarity
    • Language identification (almost)

Web analytics — what is it?

"Web Analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing Web usage."
Web Analytics Association

"The term Web Data Mining is a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote business, understanding marketing dynamics, new promotions floating on the Internet, etc.

There is a growing trend among companies, organizations and individuals alike to gather information through web data mining to utilize that information in their best interest."
Web Data Mining.net


  • tld (.uk, .fr) snapshot: 20-30TB
  • Wide-web snapshot: 100+TB
  • Full Wayback: ~3,000 TB
  • ©opyright
  • Privacy
  • Embargoed content

Web Archive Formats: WARC

Web Archive Formats: CDX

  • Index for Wayback Machine — not a standard
  • Space-delimited text file
  • Only essential metadata needed by Wayback
    • URL
    • Content Digest (SHA-1)
    • Capture Timestamp
    • Content-Type
    • HTTP response code
    • etc.
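Since CDX is just space-delimited text, reading it needs no special tooling. A minimal sketch in Python, assuming one common field ordering (real CDX files declare their layout in a header line, and layouts vary):

```python
# Minimal sketch of reading one CDX line. The field names and their
# ordering below are an assumption for illustration; real CDX files
# declare their layout in a header line.
CDX_FIELDS = ["url", "timestamp", "original", "mime", "code",
              "digest", "redirect", "offset", "filename"]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field-name -> value dict."""
    values = line.rstrip("\n").split(" ")
    return dict(zip(CDX_FIELDS, values))

record = parse_cdx_line(
    "us,tn,state)/robots.txt 20090312223130 http://state.tn.us/robots.txt "
    "text/plain 200 Q4FXJXFW7O2MF52UIFEEA5BLPKDUTAFU - 677 TENN-000001.warc.gz"
)
print(record["mime"], record["code"])
```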

Web Archive Formats: WAT

  • Yet Another Metadata Format! ☺
  • Yet Another Metadata Format! ☹
  • Less than full WARC, more than CDX
  • Essential metadata for many types of analysis
  • Avoids barriers to data exchange
  • Work-in-progress: we want your feedback

Web Archive Formats: WAT

  • WAT is WARC ☺
    • WAT records are WARC metadata records
    • WARC-Refers-To header identifies original WARC record
  • WAT payload is JSON
    • Compact & Hierarchical
    • Supported by every programming environment
  • Contains "essential metadata"
    • page title
    • HTML "meta" keywords, description, etc.
    • Links with text

WAT: Example

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://state.tn.us/robots.txt
WARC-Date: 2009-03-12T22:31:30Z
WARC-Record-ID: <urn:uuid:6fe4e186-97d6-4e39-8c68-932137c281e1>
WARC-Refers-To: <urn:uuid:129978d2-04cc-4601-8b67-047aeae49fc2>
Content-Type: application/json
Content-Length: 1361

{
  "Envelope": {
    "Format": "WARC",
    "WARC-Header-Length": "343",
    "Block-Digest": "sha1:R2HAVAYDRXLBXOPOMIZX7IGNSKQJBRZY",
    "Actual-Content-Length": "306",
    "WARC-Header-Metadata": {
      "WARC-Type": "response",
      "WARC-Date": "2009-03-12T22:31:30Z",
      "Content-Length": "306",
      "WARC-Record-ID": "",
      "WARC-IP-Address": "",
      "WARC-Payload-Digest": "sha1:Q4FXJXFW7O2MF52UIFEEA5BLPKDUTAFU",
      "WARC-Target-URI": "http://state.tn.us/robots.txt",
      "Content-Type": "application/http; msgtype=response"
    },
    "Payload-Metadata": {
      "Trailing-Slop-Length": "4",
      "Actual-Content-Type": "application/http; msgtype=response",
      "HTTP-Response-Metadata": {
        "Headers": {
          "ETag": "\"c51f7-16-34ce5610\"",
          "Date": "Thu, 12 Mar 2009 22:31:30 GMT",
          "Content-Length": "22",
          "Last-Modified": "Tue, 27 Jan 1998 21:48:00 GMT",
          "Content-Type": "text/plain",
          "Connection": "close",
          "Accept-Ranges": "bytes",
          "Server": "Oracle-Application-Server-10g/ Oracle-HTTP-Server"
        },
        "Headers-Length": "284",
        "Entity-Length": "22",
        "Entity-Trailing-Slop-Bytes": "0",
        "Response-Message": {
          "Status": "200",
          "Version": "HTTP/1.1",
          "Reason": "OK"
        },
        "Entity-Digest": "sha1:Q4FXJXFW7O2MF52UIFEEA5BLPKDUTAFU"
      }
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "457",
      "Header-Length": "10",
      "Inflated-CRC": "125238600",
      "Inflated-Length": "653"
    },
    "Offset": "677",
    "Filename": "TENN-000001.warc.gz"
  }
}
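Because the WAT payload is ordinary JSON, extracting a field is just a chain of dictionary lookups in any language. A sketch in Python, using key names taken from the example record above (the JSON literal here is trimmed to the keys being read):

```python
import json

# The WAT payload is plain JSON, so any JSON library can read it.
# The key names below come from the example WAT record; the document
# here is trimmed to just the fields we look up.
wat = json.loads('''
{"Envelope":
  {"WARC-Header-Metadata": {"WARC-Target-URI": "http://state.tn.us/robots.txt"},
   "Payload-Metadata":
     {"HTTP-Response-Metadata": {"Response-Message": {"Status": "200"}}}}}
''')

env = wat["Envelope"]
uri = env["WARC-Header-Metadata"]["WARC-Target-URI"]
status = env["Payload-Metadata"]["HTTP-Response-Metadata"]["Response-Message"]["Status"]
print(uri, status)
```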

Relative data sizes


Toolkit: Goals

  • Arbitrary analysis of web archives
  • Lower barriers and costs to researchers
  • Easy and quick to learn
  • Scales up and down

Toolkit: Overview

  • Software
    • Apache Hadoop
    • Apache Pig
  • Web Archive Files
    • WARC
    • CDX
    • WAT

Why not just use an RDBMS?

  • Web data is {un,semi}-structured
  • Web data is heterogeneous
  • Web data is nasty
  • Duplicating data from web archive
  • Cost ($ € £ ¥)
  • Follow the industry leader(s)

Apache Hadoop

  • Distributed Computation Framework
  • Java
  • Open source / Apache Licensed
  • Inspired by Google MapReduce paper (2004)
  • Yahoo!, Facebook, Twitter, IBM, etc.
  • IBM Watson powered by Hadoop

Apache Hadoop, part 2

  • HDFS
    • Distributed storage
    • Durable, default 3x replication
    • Scalable: Yahoo! 60+PB HDFS
  • MapReduce
    • Distributed computation
    • You write Java functions
    • Hadoop distributes work across cluster
    • Tolerates failures
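The map/reduce model itself is small enough to sketch in a few lines. A toy illustration in plain Python, not Hadoop's actual Java API: you supply a map function and a reduce function, and the framework handles partitioning, shuffling, and failure recovery.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration of the MapReduce programming model -- not
# Hadoop's API. Map emits (key, value) pairs; the "shuffle" groups
# pairs by key; reduce folds each group into one result.

def map_fn(line):
    # The classic word count: emit (word, 1) for every word.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

def mapreduce(lines):
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["hello world", "hello hadoop"]))
# -> [('hadoop', 1), ('hello', 2), ('world', 1)]
```

In real Hadoop the pairs never sit in one list: map tasks run on the nodes holding the data blocks, and the shuffle moves each key's pairs to the reducer responsible for it.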

Apache Hadoop, part 3

  • Commodity Hardware
  • Horizontal scaling
  • Heterogeneous hardware
    • Different cpu, memory, disk
  • Increase capacity incrementally
  • Replace failed hardware when convenient
  • Yahoo! has 6,000 node Hadoop cluster

Oh noes!


Apache Pig

  • No Java required!
  • Scripting language
  • Similar to SQL
  • Target users: analysts
  • (Pretty) easy to learn
  • Extensible with custom functions
    • UDF: user-defined function
    • IA provides library of UDFs

Pig Example: HTML page titles

%default INPUT '';

REGISTER 'ia-tools.jar';

titles = LOAD '$INPUT' USING ArchiveJSONViewLoader(
           'Envelope.WARC-Header-Metadata.WARC-Target-URI',
           'Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Title' )
         AS ( src:chararray, title:chararray );

dump titles;
Run the Pig script on my laptop:
$ pig -p INPUT="test.wat.gz" html-page-titles.pig

Run the Pig script on a Hadoop cluster:
$ pig -p INPUT="/crawl/001/*.wat.gz" html-page-titles.pig

Simple counting...at scale

Consider a 2 billion web page collection

  • Generate Wayback Index for collection
  • How many URLs from each TLD?
  • How many URLs from each domain?
  • How many HTTP 2xx, 3xx, 4xx, 5xx?
  • etc.

Simple counting...at scale

REGISTER 'ia-tools.jar';

records = LOAD '*.cdx' AS (url,date,digest,mime,code,...);

mimes = GROUP records BY mime;
mimes = FOREACH mimes GENERATE group, COUNT(records);

codes = GROUP records BY code;
codes = FOREACH codes GENERATE group, COUNT(records);

urlinfos = FOREACH records GENERATE DOMAIN(url) AS domain, TLD(url) AS tld;

domains = GROUP urlinfos BY domain;
domains = FOREACH domains GENERATE group, COUNT(urlinfos);

tlds = GROUP urlinfos BY tld;
tlds = FOREACH tlds GENERATE group, COUNT(urlinfos);
...
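The same tallies can be sketched outside Hadoop to see what the script computes. A plain-Python version over a few CDX-like records; the field layout is assumed, and `tld` here is a simplistic stand-in for the TLD UDF:

```python
from collections import Counter
from urllib.parse import urlparse

# Plain-Python sketch of the same counting done by the Pig script.
# The (url, mime, code) layout is assumed for illustration, and
# tld() is a crude stand-in for the TLD UDF.
records = [
    ("http://example.co.uk/a", "text/html", "200"),
    ("http://example.co.uk/b", "text/html", "404"),
    ("http://archive.org/x",   "image/png", "200"),
]

def tld(url):
    return urlparse(url).hostname.rsplit(".", 1)[-1]

mimes = Counter(mime for _, mime, _ in records)
codes = Counter(code for _, _, code in records)
tlds  = Counter(tld(url) for url, _, _ in records)
print(mimes, codes, tlds)
```

The point of Pig is that the same GROUP BY / COUNT logic runs unchanged on 2 billion records, with Hadoop doing the distribution.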

Closest Capture

  • Measure temporal coherency of a page
[Timeline diagram: captures on March 5, 2008 and March 19, 2008]
  • Example:
2008-01-03 foo.com    2008-01-02 bar.com
2008-02-22 foo.com    2008-02-19 bar.com
2008-03-05 foo.com    2008-02-20 bar.com
2009-05-30 foo.com    2008-03-19 bar.com
2009-06-10 foo.com    2010-11-13 bar.com
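The core of the computation: for a capture of a page, find the capture of a linked page with the smallest non-negative time gap. A sketch in Python using the foo.com / bar.com dates above (`closest_after` is an illustrative helper, mirroring the MIN over non-negative differences in the Pig script on the next slide):

```python
from datetime import date

# Capture dates from the foo.com / bar.com example.
foo_captures = [date(2008, 1, 3), date(2008, 2, 22), date(2008, 3, 5),
                date(2009, 5, 30), date(2009, 6, 10)]
bar_captures = [date(2008, 1, 2), date(2008, 2, 19), date(2008, 2, 20),
                date(2008, 3, 19), date(2010, 11, 13)]

def closest_after(capture, candidates):
    """Smallest non-negative gap in days from `capture` to a candidate
    capture, or None if every candidate is earlier."""
    gaps = [(c - capture).days for c in candidates if c >= capture]
    return min(gaps) if gaps else None

# The 2008-03-05 foo.com capture: nearest bar.com capture at or after
# it is 2008-03-19, a 14-day gap.
print(closest_after(date(2008, 3, 5), bar_captures))  # -> 14
```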

Closest Capture

edges1 = LOAD '$INPUT' AS (src:chararray,tstamp:long,dest:chararray);
edges2 = LOAD '$INPUT' AS (src:chararray,tstamp:long,dest:chararray);

edges = JOIN edges1 BY dest, edges2 BY src;

edges = FOREACH edges GENERATE edges1::src    AS src,
                               edges1::tstamp AS tstamp,
                               edges1::dest   AS dest,
                               (edges2::tstamp - edges1::tstamp) AS difftime;

edges = GROUP edges BY (src,tstamp,dest);

edges = FOREACH edges {
          pos_diffs = FILTER edges BY difftime >= 0;
          neg_diffs = FILTER edges BY difftime <  0;
          GENERATE FLATTEN(group),
                   MIN(pos_diffs.difftime),
                   MAX(neg_diffs.difftime);
        }

STORE edges INTO '$OUTPUT';

Web Page Similarity

|A ⋂ B| / |A ⋃ B|

  • Example:
    A = { Aaron, Binns }      |A ⋂ B| = |{ Aaron }| = 1
    B = { Aaron, Ximm }       |A ⋃ B| = |{ Aaron, Binns, Ximm }| = 3

    Jaccard(A, B) = 1/3
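The Jaccard index is a one-liner over sets. A sketch in Python, reproducing the "Aaron Binns" vs. "Aaron Ximm" example:

```python
# Jaccard similarity of two token sequences: |A ∩ B| / |A ∪ B|.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = "Aaron Binns".split()
B = "Aaron Ximm".split()
# Intersection {Aaron} has size 1; union {Aaron, Binns, Ximm} has
# size 3, so the similarity is 1/3.
print(jaccard(A, B))
```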

Web Page Similarity

2011-08-22 foo.com  Hello, welcome to foo.com!
2011-08-23 foo.com  Hello, welcome to foo.com!
2011-08-24 foo.com  Welcome to foo.com!
2011-08-25 foo.com  Sorry, but foo.com is closed, bye!
  • ...compute Jaccard of pairs of web pages
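A sketch of that pairwise comparison in Python over the foo.com snapshots, tokenizing naively on whitespace (a real pipeline would normalize case and punctuation first):

```python
# Compare consecutive captures of a page by Jaccard over word
# tokens. Tokenization is a naive whitespace split, so identical
# captures score 1.0 and even a case change ("Hello" vs "hello")
# counts as a difference.
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

captures = [
    ("2011-08-22", "Hello, welcome to foo.com!"),
    ("2011-08-23", "Hello, welcome to foo.com!"),
    ("2011-08-24", "Welcome to foo.com!"),
    ("2011-08-25", "Sorry, but foo.com is closed, bye!"),
]

for (d1, t1), (d2, t2) in zip(captures, captures[1:]):
    print(d1, "->", d2, round(jaccard(t1, t2), 2))
```

A score of 1.0 between consecutive captures means the page didn't change (useful for deduplication); a sudden drop flags a redesign, like the cnet.com example on the next slide.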

Web Page Similarity

2000-01-18 cnet.com  ...not much changed
2000-12-06 cnet.com  ...not much changed
2001-07-12 cnet.com  Site redesign!

Language Identification

Exercise left to the reader...

Script Identification


From my own observation, biànyi is seldom used in Chinese daily conversations. When we speak 便宜, we just mean "cheap."

a az qalıb breyn rinq intellektual oyunu üzrə yarışın zona mərhələləri keçirilib miq un qalıqlarının dənizdən çıxarılması davam edir məhəmməd peyğəmbərin karikaturalarını çap edən qəzetin baş redaktoru iş otağında ölüb

آذربایجان دا انسان حاقلاری ائوی آچیلاجاق ب م ت ائلچيسي برمه موخاليفتي نين ليدئري ايله گؤروشه بيليب ترس شوونيسم فارس از آزادي ملتهاي تورکمن

ウェブ 画像 動画 地図 ニュース ショッピングGmailもっと見る ログイン

فارسی دری تاجیکیфорсӣ-тоҷикӣ

मानक हिन्दी)

हिंदी उर्दू, هندی اردو




Script Identification

REGISTER bacon.jar;

text = LOAD 'test/script_tagger.txt' AS (words:chararray);

tokens = FOREACH text GENERATE FLATTEN(script_tag(TOBAG(words)))
         AS (token:chararray, script:chararray);

scripts = GROUP tokens BY script;
scripts = FOREACH scripts {
            tokens = DISTINCT tokens;
            GENERATE group AS script:chararray, tokens.(token);
          }

dump scripts;

Script Identification

From my own observation, biànyi is seldom used in Chinese daily conversations. When we speak 便宜, we just mean "cheap."

Pinyin for disambiguation (Language Log)

{('From my own observation, biànyi is seldom used in Chinese daily conversations. When we speak ',LATIN), ('便宜, ',CJK), ('we just mean "cheap."',LATIN)}
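A toy stand-in for the `script_tag` UDF can be built on Python's `unicodedata`: tag each character from its Unicode name, then merge consecutive same-script characters into runs. This two-way LATIN/CJK split only mirrors the example output above; a real tagger would map code points to all Unicode scripts:

```python
import unicodedata
from itertools import groupby

# Toy approximation of the script_tag UDF, NOT its real
# implementation: any character whose Unicode name mentions CJK is
# tagged CJK, everything else LATIN (including spaces and
# punctuation, which is why runs come out attached to Latin text).
def char_script(ch):
    return "CJK" if "CJK" in unicodedata.name(ch, "") else "LATIN"

def script_runs(text):
    # Merge consecutive characters sharing a script tag into
    # (run, script) pairs, matching the UDF's (token, script) shape.
    return [("".join(group), script)
            for script, group in groupby(text, key=char_script)]

print(script_runs("We speak 便宜, we just mean cheap"))
```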