Skew and Failures during Parallel Data Processing
           Magda Balazinska, University of Washington

In the database group at the University of Washington, the Nuage
project studies various aspects of data-intensive scalable
computing. In this talk, we present two of our recent results.

We first present SkewReduce, a new system implemented on top of Hadoop
that drastically improves load balance in complex, user-defined
operations, where processing times depend not only on the total amount
of data but also on the data value distributions. SkewReduce is
designed to support a common class of applications that we call
"feature-extraction analysis", where this problem frequently
arises. Experiments on real data demonstrate that SkewReduce can
improve execution times by a factor of up to 8 compared to a naive
MapReduce implementation.
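
To illustrate why balancing by data volume alone can fail, here is a
minimal sketch. It is not SkewReduce's algorithm: it assumes a
hypothetical user-supplied cost function (quadratic_cost) and swaps in
a simple greedy heuristic (most expensive item first, assigned to the
least-loaded worker). All names and the sample data are illustrative
assumptions.

```python
import heapq


def naive_partition(items, n_workers):
    """Round-robin split: every worker gets (nearly) the same item count."""
    return [items[i::n_workers] for i in range(n_workers)]


def cost_aware_partition(items, n_workers, cost):
    """Greedy split: most expensive item first, to the least-loaded worker."""
    heap = [(0.0, w) for w in range(n_workers)]  # (estimated load, worker id)
    heapq.heapify(heap)
    parts = [[] for _ in range(n_workers)]
    for item in sorted(items, key=cost, reverse=True):
        load, w = heapq.heappop(heap)
        parts[w].append(item)
        heapq.heappush(heap, (load + cost(item), w))
    return parts


def makespan(parts, cost):
    """Estimated finish time of the slowest worker."""
    return max(sum(cost(x) for x in part) for part in parts)


def quadratic_cost(cluster_size):
    """Hypothetical cost model: time grows with the square of cluster size."""
    return cluster_size ** 2


# Illustrative input: dense clusters interleaved with sparse ones.
clusters = [100, 1, 90, 1, 80, 1, 70, 1]
naive = naive_partition(clusters, 2)
balanced = cost_aware_partition(clusters, 2, quadratic_cost)
```

Both partitionings hand out all eight items, but under the assumed cost
model the round-robin split places every dense cluster on one worker,
so makespan(naive, quadratic_cost) is roughly twice
makespan(balanced, quadratic_cost).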

Second, we present FTOpt, a new approach for making online, parallel
query plans fault-tolerant: FTOpt provides intra-query fault-tolerance
without blocking. Additionally, it does so by using different
fault-tolerance techniques at different operators within a query
plan. Enabling each operator to use a different fault-tolerance
strategy leads to a space of fault-tolerance plans amenable to
cost-based optimization. FTOpt comprises a protocol for
mixing-and-matching fault-tolerance techniques within a single query
plan and an optimizer for selecting the technique to use in order to
minimize the expected processing time with failures for the entire
query. Experiments show that with as little as one failure, the choice
of fault-tolerance approach can result in a 70% difference in query
runtimes, that hybrid query plans often lead to the best performance,
and that our optimizer is able to select a winning plan.
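
The idea of searching a space of per-operator fault-tolerance choices
can be sketched as follows. This is a toy model under stated
assumptions, not FTOpt's protocol or cost model: the two strategies
("none" and "checkpoint"), the 30% checkpoint overhead, the linear
pipeline, and the single per-operator failure probability p_fail are
all hypothetical. A failed operator is assumed to redo all work since
the most recent upstream checkpoint.

```python
from itertools import product

# Hypothetical strategies mapped to an assumed relative runtime overhead.
STRATEGIES = {
    "none": 0.0,
    "checkpoint": 0.3,
}


def expected_time(runtimes, plan, p_fail):
    """Expected pipeline time when each operator fails with probability p_fail."""
    total = 0.0
    redo = 0.0  # work to repeat if the current operator fails
    for t, strategy in zip(runtimes, plan):
        cost = t * (1.0 + STRATEGIES[strategy])
        redo += cost
        total += cost + p_fail * redo
        if strategy == "checkpoint":
            redo = 0.0  # downstream failures restart from this checkpoint
    return total


def best_plan(runtimes, p_fail):
    """Enumerate every per-operator assignment and keep the cheapest one."""
    plans = product(STRATEGIES, repeat=len(runtimes))
    return min(plans, key=lambda plan: expected_time(runtimes, plan, p_fail))


# Illustrative pipeline: a cheap operator between two expensive ones.
runtimes = [10, 1, 10]
plan_with_failures = best_plan(runtimes, p_fail=0.2)
plan_without_failures = best_plan(runtimes, p_fail=0.0)
```

In this toy setting the cheapest plan with failures is a hybrid one
that checkpoints only the inexpensive middle operator, while with no
failures the plan without any checkpoints wins. Exhaustive enumeration
is used only for brevity; it grows exponentially with the number of
operators.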