RAMP: A Hadoop Extension for Provenance Support

Debugging MapReduce workflows can be difficult: their execution is batch-oriented and, once completed, leaves only the data sets themselves to help in the debugging process. Data provenance, which captures how data elements are processed through the workflow, can aid in debugging by enabling backward tracing: finding the input subsets that contributed to a given output element. For example, erroneous input elements or processing functions may be discovered by backward-tracing suspicious output elements. Provenance and backward tracing can also be useful for drilling down to learn more about interesting or unusual output elements.
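As a rough illustration of backward tracing, the sketch below walks a provenance table from an output element back to the workflow inputs that contributed to it. The table layout (each element mapped to the elements it was derived from), the function name `backward_trace`, and the element names are all hypothetical; this is not RAMP's storage format or API.

```python
def backward_trace(element, provenance):
    """Return the set of workflow inputs that contributed to `element`."""
    parents = provenance.get(element)
    if parents is None:
        # No recorded parents: treat the element as an original input.
        return {element}
    inputs = set()
    for parent in parents:
        inputs |= backward_trace(parent, provenance)
    return inputs


# Hypothetical two-job workflow: inputs -> intermediate elements -> outputs.
provenance = {
    "mid1": ["in1", "in2"],
    "mid2": ["in2", "in3"],
    "mid3": ["in4"],
    "out1": ["mid1", "mid2"],
    "out2": ["mid3"],
}

print(sorted(backward_trace("out1", provenance)))  # ['in1', 'in2', 'in3']
print(sorted(backward_trace("out2", provenance)))  # ['in4']
```

If `out1` looked suspicious, only `in1`, `in2`, and `in3` would need to be inspected, rather than the whole input data set.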

RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for MapReduce workflows. RAMP captures fine-grained provenance by wrapping the RecordReader, Mapper, Combiner, Reducer, and RecordWriter. This wrapper-based approach is transparent to Hadoop, retaining Hadoop’s parallel execution and fault tolerance. Furthermore, in many cases users need not be aware of provenance capture while writing MapReduce jobs: wrapping is automatic, and RAMP stores provenance separately from the input and output data. Our performance experiments show that RAMP imposes reasonable time and space overhead during provenance capture and enables efficient backward tracing without requiring special indexing of provenance information.
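The wrapper idea can be sketched in a few lines of plain Python. This is a conceptual toy, not RAMP's Java implementation: the names `wrap_mapper`, `wrap_reducer`, and `run_job` are hypothetical, the "engine" is a single-process loop standing in for Hadoop, and provenance is returned alongside each output for brevity, whereas RAMP stores it separately.

```python
from collections import defaultdict


def wrap_mapper(mapper):
    """Tag every pair the user's mapper emits with the ID of its input record."""
    def wrapped(input_id, record):
        for key, value in mapper(record):
            yield key, (value, input_id)
    return wrapped


def wrap_reducer(reducer):
    """Strip the tags before calling the user's reducer, then attach the
    collected input IDs to each output as its provenance."""
    def wrapped(key, tagged_values):
        values = [value for value, _ in tagged_values]
        sources = sorted({input_id for _, input_id in tagged_values})
        for output in reducer(key, values):
            yield output, sources
    return wrapped


def run_job(records, mapper, reducer):
    """Toy single-process engine; it knows nothing about provenance."""
    groups = defaultdict(list)
    for input_id, record in enumerate(records):
        for key, value in mapper(input_id, record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# User functions, written with no knowledge of provenance capture.
def word_mapper(line):
    for word in line.split():
        yield word, 1


def sum_reducer(word, counts):
    yield word, sum(counts)


queries = ["san jose police", "san francisco", "san jose area"]
results = run_job(queries, wrap_mapper(word_mapper), wrap_reducer(sum_reducer))
for output, sources in results:
    print(output, sources)
```

The user functions and the engine are unchanged; only the wrappers see the input IDs, which is the sense in which this style of capture can stay transparent to both.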

A demonstration proposal, presentation slides, and a research paper describe this system.

The source code is available at github.com/hyunjung/hadoop-common and github.com/hyunjung/pig.

Example: Temporal Query Phrase Popularity from the Pig Tutorial

Provenance of the output record (san jose, 1, 7)

B6D197CBA40EB1F8 970917000855 san jose police
B6D197CBA40EB1F8 970917000741 san jose area crime
C8BFFBA73DA51C20 970916122706 appliance parts san jose
C8BFFBA73DA51C20 970916122821 san jose appliance parts
E3B951514A7D3774 970916123453 san jose california
E3B951514A7D3774 970916123406 san jose california
E4867395E2FCEF5F 970916124513 san jose, california
E4867395E2FCEF5F 970916125122 san jose, california
E4867395E2FCEF5F 970916125102 san jose, california
4C67B46D197F9ABE 970916125731 san jose, ca
200CA2F9D2D0504B 970916123937 restaurants san jose cupertino
2262A03003F2C79D 970916124614 realtors san jose
2262A03003F2C79D 970916124706 realtors san jose
2262A03003F2C79D 970916124845 realtors san jose
02693894662434F0 970916122205 security jobs in san jose
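The traced records can be checked against the output record (san jose, 1, 7). The sketch below makes two assumptions that are not stated on this page: each record consists of a 16-hex-digit user ID, a 12-digit timestamp of the form YYMMDDHHMMSS, and the query text; and the two numbers in the output record are counts of distinct users who issued the phrase during hour 00 and hour 12, as in the Pig tutorial script. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Provenance records for the output record (san jose, 1, 7), copied from the listing.
RECORDS = [
    "B6D197CBA40EB1F8 970917000855 san jose police",
    "B6D197CBA40EB1F8 970917000741 san jose area crime",
    "C8BFFBA73DA51C20 970916122706 appliance parts san jose",
    "C8BFFBA73DA51C20 970916122821 san jose appliance parts",
    "E3B951514A7D3774 970916123453 san jose california",
    "E3B951514A7D3774 970916123406 san jose california",
    "E4867395E2FCEF5F 970916124513 san jose, california",
    "E4867395E2FCEF5F 970916125122 san jose, california",
    "E4867395E2FCEF5F 970916125102 san jose, california",
    "4C67B46D197F9ABE 970916125731 san jose, ca",
    "200CA2F9D2D0504B 970916123937 restaurants san jose cupertino",
    "2262A03003F2C79D 970916124614 realtors san jose",
    "2262A03003F2C79D 970916124706 realtors san jose",
    "2262A03003F2C79D 970916124845 realtors san jose",
    "02693894662434F0 970916122205 security jobs in san jose",
]

# Assumed layout: 16 hex digits (user), 12 digits (YYMMDDHHMMSS), then the query.
# Whitespace between fields is optional, so run-together records parse as well.
RECORD_RE = re.compile(r"^([0-9A-F]{16})\s*(\d{12})\s*(.*)$")


def parse(record):
    """Split a record into (user ID, two-digit hour, query text)."""
    match = RECORD_RE.match(record)
    if match is None:
        raise ValueError(f"unrecognized record: {record!r}")
    user, timestamp, query = match.groups()
    return user, timestamp[6:8], query


def distinct_users_per_hour(records):
    """Count the distinct users seen in each hour of the day."""
    users = defaultdict(set)
    for record in records:
        user, hour, _query = parse(record)
        users[hour].add(user)
    return {hour: len(ids) for hour, ids in users.items()}


counts = distinct_users_per_hour(RECORDS)
print(counts)  # {'00': 1, '12': 7}
```

Under these assumptions, the 15 traced records account for both numbers in the output record: one distinct user in hour 00 and seven in hour 12.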

Provenance of the output record (san francisco, 4, 19)

F1600C1B6A71CCBA 970916004007 +pediatrics +san +francisco
F200C3C155317F31 970916005335 san francisco raves
21D94FF35E6DDBAC 970916003009 "air fares" +"san francisco" +"orange county"
94D4D2B9D4B689B6 970916004032 san francisco chronicle
F3AE45818561264C 970916120757 rental car san francisco
5FB7059B23CDB8E4 970916121345 san francisco spine center
212EBB10F894EC3F 970916122111 san francisco bay areaa progressive directory
4A0D4434AD286E86 970916124332 commodore hotel san francisco
4A0D4434AD286E86 970916124408 commodore hotel san francisco
4A0D4434AD286E86 970916123814 san francisco hotels
43469E0FCF916981 970916122838 ymca san francisco embarcadero
43469E0FCF916981 970916123737 "ymca" + "san francisco"
43469E0FCF916981 970916122655 ymca san francisco embarcadero
43469E0FCF916981 970916123639 "ymca" + "san francisco"
43469E0FCF916981 970916123619 "ymca" + "san francisco"
43469E0FCF916981 970916122808 ymca san francisco embarcadero
43469E0FCF916981 970916123852 "ymca" + "san francisco"
43469E0FCF916981 970916123725 "ymca" + "san francisco"
0C959555E000D7B4 970916124426 san francisco and museums
11E239BE903A5BD2 970916124501 san francisco windsurfing
566A4FB46D2D76B1 970916125842 university of california, san francisco
46DCEC79B3E38648 970916121812 san francisco alcatras
46DCEC79B3E38648 970916121840 san francisco alcatras
46DCEC79B3E38648 970916121815 san francisco alcatras
45EEC4F331A6128D 970916121856 san francisco
4491550C17DCF8BB 970916124623 san francisco 49ers cheerleaders
5483C9398C19103D 970916123020 san francisco oral history project
6299C610910E0AEF 970916124549 san francisco bed & breakfasts
6299C610910E0AEF 970916124150 cartwright san francisco
78A1B5069C5253CD 970916121312 san francisco california bay
78A1B5069C5253CD 970916121247 san francisco california
78A1B5069C5253CD 970916121216 san francisco
A051C304794ABFA2 970916122452 san francisco rent board
7FB28AD06BCE58D7 970916122319 san francisco municipal court
7FB28AD06BCE58D7 970916122503 san francisco municipal court
C02735CF57E1E9D8 970916120922 the "portman hotel" "san francisco" + rates
D830578C86FAD501 970916124544 san francisco blues festival
D830578C86FAD501 970916124606 san francisco blues festival
DCD929C3599B050F 970916122752 movies, san francisco
DCD929C3599B050F 970916122551 san francisco, jobs, television, film
DCD929C3599B050F 970916121624 headhunters, san francisco
DCD929C3599B050F 970916122708 jobs, television, film, san francisco