Skew and Failures during Parallel Data Processing
Magda Balazinska, University of Washington

In the database group at the University of Washington, the Nuage project studies various aspects of data-intensive scalable computing. In this talk, we present two of our recent results.

We first present SkewReduce, a new system implemented on top of Hadoop that drastically improves load balance in complex, user-defined operations whose processing times depend not only on the total amount of data but also on the distribution of data values. SkewReduce is designed to support a common class of applications that we call "feature-extraction analysis", where this problem frequently arises. Experiments on real data demonstrate that SkewReduce can improve execution times by a factor of up to 8 compared to a naive MapReduce implementation.

Second, we present FTOpt, a new approach for making online, parallel query plans fault-tolerant. FTOpt provides intra-query fault tolerance without blocking, and it does so by using different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. FTOpt comprises a protocol for mixing and matching fault-tolerance techniques within a single query plan and an optimizer that selects the technique to use at each operator in order to minimize the expected processing time with failures for the entire query. Experiments show that even with a single failure, the choice of fault-tolerance approach can result in a 70% difference in query runtimes, that hybrid query plans often lead to the best performance, and that our optimizer is able to select a winning plan.