How to process real-time data at large scale with online(streaming) MapReduce

Kaixo maitea,

this week’s paper is about the handling of data at scale. And for that the MapReduce framework was invented, which gives a nice abstraction for working on data without caring about the underlying system which actually processes the data.

In the beginning I was unsure if I should choose this very paper for the introduction of the topic as it already goes one step further: MapReduce in an online (streaming) fashion. Original MapReduce only worked with batched jobs as it has a benefit to the system to know the input and output data of every step in the processing pipeline. I did choose it nevertheless as the paper gives a nice overview of MapReduce in general and then additionally introduces this newer (paper is from 2010) concept of doing the processing in an online fashion where all steps are run in parallel.

Easy understandable start into your journey with big data 😉

Software exists to create business value

I am Simon Frey, the author of the Weekly CS Paper Newsletter. And I have great news: You can work with me

As CTO as a Service, I will help you choose the right technology for your company, build up your team and be a deeply technical sparring partner for your product development strategy.

Checkout my website simon-frey.com to learn more or directly contact me via the button below.

Let’s work together!

Abstract:

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a
modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see “early returns” from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing.
HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.

Download Link:

https://www.usenix.org/legacy/event/nsdi10/tech/full_papers/condie.pdf

Additional Links:

MapReduce explained by Computerphile