Invented by Google and popularized by Hadoop, MapReduce has enjoyed a lot of buzz in recent years. When talking about data analytics, people often think of MapReduce. It is scalable and easy to program. With the open source Hadoop ecosystem, people can quickly start crunching numbers on clusters of cheap commodity computers. However, it is actually not the right tool for most of us. To see why, let’s first understand why Google invented MapReduce. In the paper “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat observed that:
- Most of Google’s computations (e.g. generating inverted indices, summaries of the number of pages crawled per host, the set of most frequent queries) are conceptually straightforward.
- The input data is usually large. All of Google’s business is about BIG data at internet scale.
- Computations are distributed across thousands of commodity machines connected by commodity networking, where hardware failure is not uncommon. Storage is provided by inexpensive IDE disks attached directly to individual machines.
Therefore, they proposed MapReduce to deal with the issues of how to parallelize the computation, distribute the data, and handle failures. Compared to other popular parallel computing approaches (e.g. MPI, the de facto parallel computing standard on supercomputers), the main contributions of MapReduce are
- A simple parallel computing paradigm. Inspired by functional programming, the computing paradigm of MapReduce is simple and easy to understand. The model is very attractive because parallelization is automatic, programs need no explicit handling of data transfer and synchronization, and there are no deadlocks. Meanwhile, it is sufficient for the simple computations mentioned earlier. In contrast, the learning curve of the much more comprehensive MPI is steep.
- A data flow model with simple I/O. Assuming the input data is too big to fit into memory (even combined across all nodes), MapReduce employs a data flow model, which also provides a simple I/O interface for accessing large amounts of data in a distributed file system. It also exploits data locality for efficiency. In most cases, we don’t need to worry about I/O at all. Although this sounds trivial, it is actually an important improvement for big data analytics. MPI initially ignored I/O altogether, and MPI-IO was only added later in MPI-2. Even then, it is not easy to use, and good performance is difficult to achieve.
- Fault tolerance. Because hardware failures are common in large clusters of commodity PCs, MapReduce puts a lot of emphasis on fault tolerance. In MPI, it is mainly the application developer’s job to make programs fault tolerant.
All of these are very good design choices for Google, given its computations and environment. However, they are not a good fit for many of us. Even though everyone is talking about big data, only very few companies handle data sizes on par with Google’s. In fact, even most MapReduce jobs at Microsoft, Yahoo, and Facebook are applied to relatively small data. In the paper “Nobody ever got fired for using Hadoop on a cluster”, Microsoft Research analyzed 174,000 jobs submitted to a production analytics cluster at Microsoft in a single month in 2011 and found that the median job input data set size was less than 14 GB. They also estimated that the median input data size of the Hadoop jobs on the production clusters at Yahoo is less than 12.5 GB. A 2012 paper from Facebook revealed that Facebook jobs follow a power-law distribution with small jobs dominating. From the graphs in the paper, it appears that at least 90% of the jobs have input sizes under 100 GB. If your data can fit into distributed memory, why confine yourself to the slow MapReduce? Besides, not many people can afford a cluster of a thousand nodes. If you have only dozens of machines, why bother with fault tolerance? For fault tolerance, MapReduce keeps writing to disk all the time, which drags down your application’s performance significantly.
A more severe problem is that MapReduce provides only a very LIMITED parallel computing paradigm. Not all problems fit into MapReduce. For data analytics, we really want to apply advanced statistical analysis and machine learning to obtain deep insights. Isn’t it a shame if we can only do simple aggregations? Unfortunately, it is either impossible or awkward to implement many statistical and machine learning algorithms in MapReduce. For non-trivial algorithms, programmers try hard to “MapReducize” them, often in a non-intuitive way.
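As one illustration of this awkwardness (my own example, not taken from the papers above), consider k-means clustering. A single iteration maps nicely onto one map and one reduce step, but the algorithm as a whole is a loop, so a driver has to launch a complete job per iteration. The Python sketch below simulates this in a single process on made-up one-dimensional data; `kmeans_map`, `kmeans_reduce`, and `kmeans_job` are hypothetical names, and in Hadoop-style MapReduce each such job would typically reread its input from the distributed file system and write its output back.

```python
from collections import defaultdict

def kmeans_map(point, centroids):
    """Map: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    yield nearest, point

def kmeans_reduce(cluster_id, points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    return cluster_id, sum(points) / len(points)

def kmeans_job(points, centroids):
    """One full map -> shuffle -> reduce pass, i.e. ONE k-means iteration."""
    groups = defaultdict(list)
    for p in points:
        for cid, value in kmeans_map(p, centroids):
            groups[cid].append(value)
    updated = list(centroids)  # a centroid with no points keeps its position
    for cid, assigned in groups.items():
        _, updated[cid] = kmeans_reduce(cid, assigned)
    return updated

# Made-up data and starting centroids, chosen only for illustration.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]

jobs = 0
while True:
    # The loop lives outside the paradigm: every pass is a separate job.
    new_centroids = kmeans_job(points, centroids)
    jobs += 1
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids, jobs)  # [2.0, 11.0] 3
```

Even this toy problem needs three jobs to converge. On a real cluster, every one of them pays the job startup and disk I/O cost again, and the state carried between iterations (the centroids) has to be shipped to the mappers out of band.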
Even though MapReduce has the aforementioned problems, I appreciate that it has helped so many companies and practitioners embrace big data and analytics in such a short period. After years of practice, the community has recognized these problems and tried to address them in different ways. For example, Apache Spark aims for speed by keeping data in memory. Apache Pig provides a DSL and Hive provides a SQL dialect on top of MapReduce to ease programming. Google Dremel and Cloudera Impala target interactive analysis with SQL queries. Microsoft Dryad and Apache Tez provide a more general parallel computing framework that models computations as DAGs. Google Pregel and Apache Giraph address computing problems on large graphs. Apache Storm focuses on real-time event processing. We will look into all of them in the Big Data Analytics series.