TITLE: 
Scale-Out Beyond Map-Reduce

ABSTRACT:
The amount and variety of data being collected in the enterprise are
growing at a staggering pace. The default now is to capture and store
any and all data, in anticipation of potential future strategic value,
and vast amounts of data are being generated by instrumenting key
customer and systems touch points. Until recently, data was gathered
for well-defined objectives such as auditing, forensics, reporting and
line-of-business operations; now, exploratory and predictive analysis
is becoming ubiquitous. These differences in data heterogeneity, scale
and usage are leading to a new generation of data management and
analytic systems, where the emphasis is on storing a wide range of
large datasets uniformly and analyzing them seamlessly with
whatever techniques are most appropriate, from traditional tools
such as SQL and BI to newer tools, e.g., for machine learning. These new
systems are necessarily based on scale-out architectures for both
storage and computation. The terms Big Data and data science are often
used to refer to this class of systems and applications.

Hadoop has become a key building block in the new generation of
scale-out systems. Early versions of analytic tools over Hadoop, such
as Hive [1] and Pig [2] for SQL-like queries, were implemented by
translation into Map-Reduce computations. This approach has inherent
limitations, and the emergence of resource managers such as YARN [3]
and Mesos [4] has opened the door for newer analytic tools to bypass
the Map-Reduce layer. This trend is especially significant for
iterative computations such as graph analytics and machine learning,
for which Map-Reduce is widely recognized to be a poor fit. In fact,
the website of the machine learning toolkit Apache Mahout [5]
explicitly warns about the slow performance of some of the algorithms
on Hadoop.

In this talk, I will examine this architectural trend, and argue that
resource managers are a first step in refactoring the early
implementations of Map-Reduce, and that more work is needed if we wish
to support a variety of analytic tools on a common scale-out
computational fabric. I will then present REEF, which runs on top of
resource managers like YARN and provides support for task monitoring
and restart, data movement and communications, and distributed state
management. Finally, I will illustrate the value of using REEF to
implement iterative algorithms for graph analytics and machine
learning.

This is joint work with the CISL team at Microsoft.