WAC Summer Workshop:

Web Analytics with Hadoop and Pig

Aaron Binns

Thanks to slides.html5rocks.com for the HTML presentation template

About Me

  • Aaron Binns
  • aaron@archive.org
  • Joined Internet Archive January 2008
  • Senior Software Engineer
  • Full-text search
  • Web Analytics & other fun stuff!
  • IIPC Program Officer [2010-2012]

Internet Archive

Over 8.5PB of web, books, audio, movies, television and more!
Over 176,000,000,000 resources
Nearly 3PB of archived web content
Over 1,000,000 free ebook titles
Subscription web archiving service with over 200 partners


  • Web analytics — what is it?
  • Web Archive File Formats: WARC, CDX, WAT
  • Hadoop & Pig
  • Examples
    • Simple counting...at scale
    • Closest capture
    • Page similarity
    • Language identification (almost)

Web analytics — what is it?

"Web Analytics is the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing Web usage."
Web Analytics Association

"The term Web Data Mining is a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote business, understanding marketing dynamics, new promotions floating on the Internet, etc.

There is a growing trend among companies, organizations and individuals alike to gather information through web data mining to utilize that information in their best interest."
Web Data Mining.net


  • tld (.uk, .fr) snapshot: 20-30TB
  • Wide-web snapshot: 100+TB
  • Full Wayback: ~3,000 TB
  • ©opyright
  • Privacy
  • Embargoed content

Web Archive Formats: WARC

Web Archive Formats: CDX

  • Index for Wayback Machine — not a standard
  • Space-delimited text file
  • Only essential metadata needed by Wayback
    • URL
    • Content Digest (SHA-1)
    • Capture Timestamp
    • Content-Type
    • HTTP response code
    • etc.
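Since CDX is just space-delimited text, reading it needs no special tooling. A minimal sketch in Python, assuming one common field ordering (real CDX files declare their layout in a header line, and layouts vary):

```python
# Minimal sketch of reading one CDX line. The field names and their
# ordering below are an assumption for illustration; real CDX files
# declare their layout in a header line.
CDX_FIELDS = ["url", "timestamp", "original", "mime", "code",
              "digest", "redirect", "offset", "filename"]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field-name -> value dict."""
    values = line.rstrip("\n").split(" ")
    return dict(zip(CDX_FIELDS, values))

record = parse_cdx_line(
    "us,tn,state)/robots.txt 20090312223130 http://state.tn.us/robots.txt "
    "text/plain 200 Q4FXJXFW7O2MF52UIFEEA5BLPKDUTAFU - 677 TENN-000001.warc.gz"
)
print(record["mime"], record["code"])
```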

Web Archive Formats: WAT

  • Yet Another Metadata Format! ☺
  • Yet Another Metadata Format! ☹
  • Less than full WARC, more than CDX
  • Essential metadata for many types of analysis
  • Avoids barriers to data exchange
  • Work-in-progress: we want your feedback

Web Archive Formats: WAT

  • WAT is WARC ☺
    • WAT records are WARC metadata records
    • WARC-Refers-To header identifies original WARC record
  • WAT payload is JSON
    • Compact & Hierarchical
    • Supported by every programming environment
  • Contains "essential metadata"
    • page title
    • HTML "meta" keywords, description, etc.
    • Links with text

WAT: Example

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://state.tn.us/robots.txt
WARC-Date: 2009-03-12T22:31:30Z
WARC-Record-ID: <urn:uuid:6fe4e186-97d6-4e39-8c68-932137c281e1>
WARC-Refers-To: <urn:uuid:129978d2-04cc-4601-8b67-047aeae49fc2>
Content-Type: application/json
Content-Length: 1361

{
  "Envelope": {
    "Format": "WARC",
    "WARC-Header-Length": "343",
    "Block-Digest": "sha1:R2HAVAYDRXLBXOPOMIZX7IGNSKQJBRZY",
    "Actual-Content-Length": "306",
    "WARC-Header-Metadata": {
      "WARC-Type": "response",
      "WARC-Date": "2009-03-12T22:31:30Z",
      "Content-Length": "306",
      "WARC-Record-ID": "",
      "WARC-IP-Address": "",
      "WARC-Payload-Digest": "sha1:Q4FXJXFW7O2MF52UIFEEA5BLPKDUTAFU",
      "WARC-Target-URI": "http://state.tn.us/robots.txt",
      "Content-Type": "application/http; msgtype=response"
    },
    "Payload-Metadata": {
      "Trailing-Slop-Length": "4",
      "Actual-Content-Type": "application/http; msgtype=response",
      "HTTP-Response-Metadata": {
        "Headers": {
          "ETag": "\"c51f7-16-34ce5610\"",
          "Date": "Thu, 12 Mar 2009 22:31:30 GMT",
          "Content-Length": "22",
          "Last-Modified": "Tue, 27 Jan 1998 21:48:00 GMT",
          "Content-Type": "text/plain",
          "Connection": "close",
          "Accept-Ranges": "bytes",
          "Server": "Oracle-Application-Server-10g/ Oracle-HTTP-Server"
        },
        "Headers-Length": "284",
        "Entity-Length": "22",
        "Entity-Trailing-Slop-Bytes": "0",
        "Response-Message": {
          "Status": "200",
          "Version": "HTTP/1.1",
          "Reason": "OK"
        },
        "Entity-Digest": "sha1:Q4FXJXFW7O2MF52UIFEEA5BLPKDUTAFU"
      }
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "457",
      "Header-Length": "10",
      "Inflated-CRC": "125238600",
      "Inflated-Length": "653"
    },
    "Offset": "677",
    "Filename": "TENN-000001.warc.gz"
  }
}
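Because the WAT payload is ordinary JSON, extracting a field is just a chain of dictionary lookups in any language. A sketch in Python, using key names taken from the example record above (the JSON literal here is trimmed to the keys being read):

```python
import json

# The WAT payload is plain JSON, so any JSON library can read it.
# The key names below come from the example WAT record; the document
# here is trimmed to just the fields we look up.
wat = json.loads('''
{"Envelope":
  {"WARC-Header-Metadata": {"WARC-Target-URI": "http://state.tn.us/robots.txt"},
   "Payload-Metadata":
     {"HTTP-Response-Metadata": {"Response-Message": {"Status": "200"}}}}}
''')

env = wat["Envelope"]
uri = env["WARC-Header-Metadata"]["WARC-Target-URI"]
status = env["Payload-Metadata"]["HTTP-Response-Metadata"]["Response-Message"]["Status"]
print(uri, status)
```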

Relative data sizes


Toolkit: Goals

  • Arbitrary analysis of web archives
  • Lower barriers and costs to researchers
  • Easy and quick to learn
  • Scales up and down

Toolkit: Overview

  • Software
    • Apache Hadoop
    • Apache Pig
  • Web Archive Files
    • WARC
    • CDX
    • WAT

Why not just use an RDBMS?

  • Web data is {un,semi}-structured
  • Web data is heterogeneous
  • Web data is nasty
  • Duplicating data from web archive
  • Cost ($ € £ ¥)
  • Follow the industry leader(s)

Apache Hadoop

  • Distributed Computation Framework
  • Java
  • Open source / Apache Licensed
  • Inspired by Google MapReduce paper (2004)
  • Yahoo!, Facebook, Twitter, IBM, etc.
  • IBM Watson powered by Hadoop

Apache Hadoop, part 2

  • HDFS
    • Distributed storage
    • Durable, default 3x replication
    • Scalable: Yahoo! 60+PB HDFS
  • MapReduce
    • Distributed computation
    • You write Java functions
    • Hadoop distributes work across cluster
    • Tolerates failures
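The map/reduce model itself is small enough to sketch in a few lines. A toy illustration in plain Python, not Hadoop's actual Java API: you supply a map function and a reduce function, and the framework handles partitioning, shuffling, and failure recovery.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration of the MapReduce programming model -- not
# Hadoop's API. Map emits (key, value) pairs; the "shuffle" groups
# pairs by key; reduce folds each group into one result.

def map_fn(line):
    # The classic word count: emit (word, 1) for every word.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

def mapreduce(lines):
    pairs = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(mapreduce(["hello world", "hello hadoop"]))
# -> [('hadoop', 1), ('hello', 2), ('world', 1)]
```

In real Hadoop the pairs never sit in one list: map tasks run on the nodes holding the data blocks, and the shuffle moves each key's pairs to the reducer responsible for it.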

Apache Hadoop, part 3

  • Commodity Hardware
  • Horizontal scaling
  • Heterogeneous hardware
    • Different cpu, memory, disk
  • Increase capacity incrementally
  • Replace failed hardware when convenient
  • Yahoo! has 6,000 node Hadoop cluster

Oh noes!


Apache Pig

  • No Java required!
  • Scripting language
  • Similar to SQL
  • Target users: analysts
  • (Pretty) easy to learn
  • Extensible with custom functions
    • UDF: user-defined function
    • IA provides library of UDFs

Pig Example: HTML page titles

%default INPUT '';

REGISTER 'ia-tools.jar';

titles = LOAD '$INPUT' USING ArchiveJSONViewLoader(
           'Envelope.WARC-Header-Metadata.WARC-Target-URI',
           'Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Title' )
         AS ( src:chararray, title:chararray );

dump titles;
Run the Pig script on my laptop:
$ pig -p INPUT="test.wat.gz" html-page-titles.pig

Run the Pig script on a Hadoop cluster:
$ pig -p INPUT="/crawl/001/*.wat.gz" html-page-titles.pig

Simple counting...at scale

Consider a 2 billion web page collection

  • Generate Wayback Index for collection
  • How many URLs from each TLD?
  • How many URLs from each domain?
  • How many HTTP 2xx, 3xx, 4xx, 5xx?
  • etc.

Simple counting...at scale

REGISTER 'ia-tools.jar';

records = LOAD '*.cdx' AS (url,date,digest,mime,code,...);

mimes = GROUP records BY mime;
mimes = FOREACH mimes GENERATE group, COUNT(records);

codes = GROUP records BY code;
codes = FOREACH codes GENERATE group, COUNT(records);

urlinfos = FOREACH records GENERATE DOMAIN(url) AS domain, TLD(url) AS tld;

domains = GROUP urlinfos BY domain;
domains = FOREACH domains GENERATE group, COUNT(urlinfos);

tlds = GROUP urlinfos BY tld;
tlds = FOREACH tlds GENERATE group, COUNT(urlinfos);
...
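The same tallies can be sketched outside Hadoop to see what the script computes. A plain-Python version over a few CDX-like records; the field layout is assumed, and `tld` here is a simplistic stand-in for the TLD UDF:

```python
from collections import Counter
from urllib.parse import urlparse

# Plain-Python sketch of the same counting done by the Pig script.
# The (url, mime, code) layout is assumed for illustration, and
# tld() is a crude stand-in for the TLD UDF.
records = [
    ("http://example.co.uk/a", "text/html", "200"),
    ("http://example.co.uk/b", "text/html", "404"),
    ("http://archive.org/x",   "image/png", "200"),
]

def tld(url):
    return urlparse(url).hostname.rsplit(".", 1)[-1]

mimes = Counter(mime for _, mime, _ in records)
codes = Counter(code for _, _, code in records)
tlds  = Counter(tld(url) for url, _, _ in records)
print(mimes, codes, tlds)
```

The point of Pig is that the same GROUP BY / COUNT logic runs unchanged on 2 billion records, with Hadoop doing the distribution.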

Closest Capture

  • Measure temporal coherency of a page
[Timeline diagram: captures on March 5, 2008 and March 19, 2008]
  • Example:
2008-01-03 foo.com    2008-01-02 bar.com
2008-02-22 foo.com    2008-02-19 bar.com
2008-03-05 foo.com    2008-02-20 bar.com
2009-05-30 foo.com    2008-03-19 bar.com
2009-06-10 foo.com    2010-11-13 bar.com
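The core of the computation: for a capture of a page, find the capture of a linked page with the smallest non-negative time gap. A sketch in Python using the foo.com / bar.com dates above (`closest_after` is an illustrative helper, mirroring the MIN over non-negative differences in the Pig script on the next slide):

```python
from datetime import date

# Capture dates from the foo.com / bar.com example.
foo_captures = [date(2008, 1, 3), date(2008, 2, 22), date(2008, 3, 5),
                date(2009, 5, 30), date(2009, 6, 10)]
bar_captures = [date(2008, 1, 2), date(2008, 2, 19), date(2008, 2, 20),
                date(2008, 3, 19), date(2010, 11, 13)]

def closest_after(capture, candidates):
    """Smallest non-negative gap in days from `capture` to a candidate
    capture, or None if every candidate is earlier."""
    gaps = [(c - capture).days for c in candidates if c >= capture]
    return min(gaps) if gaps else None

# The 2008-03-05 foo.com capture: nearest bar.com capture at or after
# it is 2008-03-19, a 14-day gap.
print(closest_after(date(2008, 3, 5), bar_captures))  # -> 14
```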

Closest Capture

edges1 = LOAD '$INPUT' AS (src:chararray,tstamp:long,dest:chararray);
edges2 = LOAD '$INPUT' AS (src:chararray,tstamp:long,dest:chararray);

edges = JOIN edges1 BY dest, edges2 BY src;

edges = FOREACH edges GENERATE edges1::src    AS src,
                               edges1::tstamp AS tstamp,
                               edges1::dest   AS dest,
                               (edges2::tstamp - edges1::tstamp) AS difftime;

edges = GROUP edges BY (src,tstamp,dest);

edges = FOREACH edges {
          pos_diffs = FILTER edges BY difftime >= 0;
          neg_diffs = FILTER edges BY difftime <  0;
          GENERATE FLATTEN(group),
                   MIN(pos_diffs.difftime),
                   MAX(neg_diffs.difftime);
        }

STORE edges INTO '$OUTPUT';

Web Page Similarity

|A ⋂ B| / |A ⋃ B|

  • Example:
    A = { Aaron, Binns }      |A ⋂ B| = |{ Aaron }| = 1
    B = { Aaron, Ximm }       |A ⋃ B| = |{ Aaron, Binns, Ximm }| = 3

    Jaccard(A, B) = 1/3
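The Jaccard index is a one-liner over sets. A sketch in Python, reproducing the "Aaron Binns" vs. "Aaron Ximm" example:

```python
# Jaccard similarity of two token sequences: |A ∩ B| / |A ∪ B|.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = "Aaron Binns".split()
B = "Aaron Ximm".split()
# Intersection {Aaron} has size 1; union {Aaron, Binns, Ximm} has
# size 3, so the similarity is 1/3.
print(jaccard(A, B))
```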

Web Page Similarity

2011-08-22 foo.com  Hello, welcome to foo.com!
2011-08-23 foo.com  Hello, welcome to foo.com!
2011-08-24 foo.com  Welcome to foo.com!
2011-08-25 foo.com  Sorry, but foo.com is closed, bye!
  • ...compute Jaccard of pairs of web pages
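A sketch of that pairwise comparison in Python over the foo.com snapshots, tokenizing naively on whitespace (a real pipeline would normalize case and punctuation first):

```python
# Compare consecutive captures of a page by Jaccard over word
# tokens. Tokenization is a naive whitespace split, so identical
# captures score 1.0 and even a case change ("Hello" vs "hello")
# counts as a difference.
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

captures = [
    ("2011-08-22", "Hello, welcome to foo.com!"),
    ("2011-08-23", "Hello, welcome to foo.com!"),
    ("2011-08-24", "Welcome to foo.com!"),
    ("2011-08-25", "Sorry, but foo.com is closed, bye!"),
]

for (d1, t1), (d2, t2) in zip(captures, captures[1:]):
    print(d1, "->", d2, round(jaccard(t1, t2), 2))
```

A score of 1.0 between consecutive captures means the page didn't change (useful for deduplication); a sudden drop flags a redesign, like the cnet.com example on the next slide.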

Web Page Similarity

2000-01-18 cnet.com  ...not much changed
2000-12-06 cnet.com  ...not much changed
2001-07-12 cnet.com  Site redesign!

Language Identification

Exercise left to the reader...

Script Identification


From my own observation, biànyi is seldom used in Chinese daily conversations. When we speak 便宜, we just mean "cheap."

a az qalıb breyn rinq intellektual oyunu üzrə yarışın zona mərhələləri keçirilib miq un qalıqlarının dənizdən çıxarılması davam edir məhəmməd peyğəmbərin karikaturalarını çap edən qəzetin baş redaktoru iş otağında ölüb

آذربایجان دا انسان حاقلاری ائوی آچیلاجاق ب م ت ائلچيسي برمه موخاليفتي نين ليدئري ايله گؤروشه بيليب ترس شوونيسم فارس از آزادي ملتهاي تورکمن

ウェブ 画像 動画 地図 ニュース ショッピングGmailもっと見る ログイン

فارسی دری تاجیکیфорсӣ-тоҷикӣ

मानक हिन्दी)

हिंदी उर्दू, هندی اردو




Script Identification

REGISTER bacon.jar;

text = LOAD 'test/script_tagger.txt' AS (words:chararray);

tokens = FOREACH text GENERATE FLATTEN(script_tag(TOBAG(words)))
         AS (token:chararray, script:chararray);

scripts = GROUP tokens BY script;
scripts = FOREACH scripts {
            tokens = DISTINCT tokens;
            GENERATE group AS script:chararray, tokens.(token);
          }

dump scripts;

Script Identification

From my own observation, biànyi is seldom used in Chinese daily conversations. When we speak 便宜, we just mean "cheap."

Pinyin for disambiguation (Language Log)

{('From my own observation, biànyi is seldom used in Chinese daily conversations. When we speak ',LATIN), ('便宜, ',CJK), ('we just mean "cheap."',LATIN)}
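A toy stand-in for the `script_tag` UDF can be built on Python's `unicodedata`: tag each character from its Unicode name, then merge consecutive same-script characters into runs. This two-way LATIN/CJK split only mirrors the example output above; a real tagger would map code points to all Unicode scripts:

```python
import unicodedata
from itertools import groupby

# Toy approximation of the script_tag UDF, NOT its real
# implementation: any character whose Unicode name mentions CJK is
# tagged CJK, everything else LATIN (including spaces and
# punctuation, which is why runs come out attached to Latin text).
def char_script(ch):
    return "CJK" if "CJK" in unicodedata.name(ch, "") else "LATIN"

def script_runs(text):
    # Merge consecutive characters sharing a script tag into
    # (run, script) pairs, matching the UDF's (token, script) shape.
    return [("".join(group), script)
            for script, group in groupby(text, key=char_script)]

print(script_runs("We speak 便宜, we just mean cheap"))
```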