Programming and Debugging Large-Scale Data Processing Workflows

Christopher Olston, Yahoo! Research

This talk gives an overview of some work on large-scale data processing I have done with my Yahoo collaborators. The talk begins with overviews of two data processing systems I helped develop: PIG, a dataflow programming environment and Hadoop-based runtime, and NOVA, a workflow manager for Pig/Hadoop.

The bulk of the talk focuses on debugging, and looks at what can be done before, during and after execution of a data processing operation:

* Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key factors---realism, conciseness and completeness---of the example data it produces.

* INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We have implemented a dozen user-requested tools, ranging from data integrity checks to crash cause investigation to performance profiling, each in just a few hundred lines of code.

* IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g. tables vs. records; relational operators vs. reduce task attempts) and offer disparate ways of querying it. Ibis integrates this metadata and presents a uniform and powerful query interface to users.
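The sampling-plus-synthesis idea behind the example generator can be illustrated with a toy sketch for a single FILTER-like operator. This is a hypothetical illustration, not the Pig implementation; the function and parameter names are invented. Sampling keeps the examples realistic, the small sample size keeps them concise, and synthesis fills in a record whenever sampling alone fails to exercise one side of the operator (completeness):

```python
import random

def example_data(records, predicate, k=3, synthesize=None):
    """Toy sketch: pick a few real records, then synthesize any
    missing case so both operator outcomes are demonstrated."""
    sample = random.sample(records, min(k, len(records)))
    passing = [r for r in sample if predicate(r)]
    failing = [r for r in sample if not predicate(r)]
    # Completeness: if the sample missed one side of the filter,
    # synthesize a record covering it (when a synthesizer is given).
    if not passing and synthesize:
        passing.append(synthesize(True))
    if not failing and synthesize:
        failing.append(synthesize(False))
    return passing, failing
```

For a highly selective predicate, real samples rarely pass the filter, so the synthesized record is what lets the user see the operator's effect before running the full job.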
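A monitoring framework in the spirit of Inspector Gadget can be sketched as agents that observe records at points in the dataflow and report findings at the end. The class and method names below are invented for illustration; they are not the Inspector Gadget API:

```python
class Agent:
    """Hypothetical monitoring agent: sees each record flowing past
    an instrumentation point and reports a summary at the end."""
    def observe(self, record):
        pass
    def finish(self):
        return None

class NullValueCheck(Agent):
    """A tiny data-integrity tool: measure how often a field is null."""
    def __init__(self, field):
        self.field = field
        self.nulls = 0
        self.total = 0
    def observe(self, record):
        self.total += 1
        if record.get(self.field) is None:
            self.nulls += 1
    def finish(self):
        return {"field": self.field,
                "null_fraction": self.nulls / max(self.total, 1)}

def run_with_agents(records, agents):
    """Stand-in for an instrumented dataflow: stream records past agents."""
    for r in records:
        for a in agents:
            a.observe(r)
    return [a.finish() for a in agents]
```

Under this kind of interface, each of the user-requested tools (integrity checks, crash investigation, profiling) amounts to one small agent class, which is consistent with the "few hundred lines of code" per tool mentioned above.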
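Integrating metadata across granularities, as Ibis does, can be pictured as sub-systems reporting containment links between their own elements, with a uniform query walking those links. This is a minimal sketch under that assumption; the element names and functions are invented, not Ibis's actual data model or query language:

```python
# child element -> enclosing (coarser-grained) element
contains = {}

def report(parent, child):
    """A sub-system (e.g. Nova, Pig, Hadoop) registers that `child`
    is part of `parent`, at whatever granularity it works with."""
    contains[child] = parent

def ancestors(element):
    """Uniform query: every coarser-grained element enclosing `element`."""
    out = []
    while element in contains:
        element = contains[element]
        out.append(element)
    return out

# Hypothetical example: Nova knows workflows contain scripts, Pig knows
# scripts contain Hadoop jobs, Hadoop knows jobs contain task attempts.
report("workflow:W", "script:S1")
report("script:S1", "job:J7")
report("job:J7", "reduce-attempt:J7_r_3")
```

A query such as `ancestors("reduce-attempt:J7_r_3")` then answers a cross-system question (which workflow did this task attempt belong to?) without the user touching each sub-system's separate query mechanism.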