That’s right, you didn’t misread the title. In most people’s minds, Hadoop is almost a synonym for Big Data. Adding the magic word to your resume means more opportunities and higher pay. How could its future possibly be misty? Let’s sort things out together.

First of all, I love Hadoop. In every sense, it fits Professor Clayton Christensen’s legendary theory of disruptive innovation very well. Hadoop started with HDFS and MapReduce. Although they perform worse than traditional MPP databases, they are simple and conveniently run on commodity hardware. More importantly, they target non-consumption: customers who historically lacked the money to buy Teradata, Vertica, Netezza, etc. After all, disruptive innovation is mainly a marketing challenge. As a result, Hadoop became a buzzword and is now moving upmarket. Today even large corporations that can afford big computing appliances are trying out Hadoop.

But as I said, disruptive innovation is mainly a marketing challenge, not a technology challenge. Even though Hadoop is still enjoying its popularity, its future is now misty because the market is changing. Hadoop started 10 years ago, based on Google’s work on GFS and MapReduce. It was designed to run on-premises in data centers. Now everyone is moving to the cloud. You can run Hadoop in the cloud, but it is not cost effective. Take AWS, the de facto cloud platform, as an example: you could run Hadoop with the EMR service or set up your own Hadoop cluster on EC2. However, it is much cheaper to save your data in S3 and spin up EC2 instances to analyze it only when needed. A popular practice today is to save data as Parquet files in S3 and analyze it with Spark on dynamically provisioned EC2 instances. Although Parquet originated in the Hadoop ecosystem and Spark uses Hadoop libraries to access S3, people don’t want to spin up a whole Hadoop cluster because HDFS, the foundation of Hadoop, is not cost effective on AWS. The same holds true for other cloud platforms such as Google Compute Engine and Microsoft Azure.
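To make the Parquet + Spark pattern concrete, here is a minimal PySpark sketch. It is only an illustration under stated assumptions: the bucket, prefix, and column names are hypothetical, the helper functions `parquet_uri` and `count_by_column` are my own, and the cluster is assumed to have the Hadoop `s3a` connector and AWS credentials already configured.

```python
def parquet_uri(bucket: str, prefix: str) -> str:
    """Build an s3a:// URI. Spark reaches S3 through the Hadoop s3a connector,
    so a Hadoop library is involved even though no HDFS cluster is running."""
    return f"s3a://{bucket.strip('/')}/{prefix.strip('/')}"


def count_by_column(uri: str, column: str):
    """Read Parquet files straight from S3 and count rows per value of `column`."""
    # Imported lazily so parquet_uri() works on a machine without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-parquet-sketch").getOrCreate()
    try:
        df = spark.read.parquet(uri)
        return df.groupBy(column).count().collect()
    finally:
        # The data stays in S3; the compute can go away once the job is done.
        spark.stop()


# Hypothetical usage (bucket, prefix, and column are made up):
#   rows = count_by_column(parquet_uri("my-data-lake", "events/2016/"), "event_type")
```

The point of the sketch is that storage (S3) and compute (short-lived Spark workers) are decoupled, so nothing in it requires a long-running HDFS or YARN cluster.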

Another challenge for Hadoop is the distributors’ one-size-fits-all strategy. Initially, Hadoop was very simple: just HDFS and MapReduce. Because that was not good enough to meet customers’ demands, the community has added a lot of new tools to the Hadoop ecosystem over the years. This was rational and necessary to win the competition. However, it has also inevitably led to overshooting. Today, the architecture picture of Hadoop looks like a zoo hosting HDFS, YARN, MapReduce, Tez, Pig, Hive, Impala, Kudu, HBase, Accumulo, Flume, Sqoop, Falcon, Samza, etc. (no doubt you need a ZooKeeper 🙂 ). Customers rarely need all of them. If they don’t, why bother running a full-blown Hadoop cluster? As in the example of Parquet + Spark, engineers only want to pick out what they need from Hadoop. Another example is Apache Samza, a stream processing engine that was originally designed to run on top of Hadoop YARN. Netflix has recently contributed static partition assignment, a feature that allows Samza to be used without YARN. This cool feature enables Netflix to run Samza applications on AWS EC2 instances without any Hadoop/YARN dependency.

Overshooting also has a big impact on resource allocation. Let’s look at HBase, the NoSQL engine of Hadoop. Compared to other popular NoSQL solutions such as MongoDB and Cassandra, HBase has a lot of functionality and unique features. It also has battle-proven scalability and availability. Moreover, Apache Trafodion, built on top of HBase, even provides fully ACID SQL. However, HBase is ranked only 15th in the DB-Engines Ranking, way behind MongoDB and Cassandra. The biggest reason HBase has been left behind is that the Hadoop distributors’ marketing commitment to HBase has never come close to the level of MongoDB’s or DataStax’s push behind their respective core products. Besides, Hadoop distributors have moved to tiered pricing because Hadoop is so rich and complicated. Correspondingly, their salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers. Technically, HBase is also more complicated to set up and operate because of its dependency on other Hadoop services. If HBase had been spun off from Hadoop and promoted by a dedicated company, it would probably have been much more successful.

Finally, new and better technologies are now playing the role that Hadoop played 10 years ago. They are going to disrupt the incumbent (read: Hadoop). Spark, Kafka, Mesos, Docker, etc. are either better than their Hadoop counterparts or fill in blank spaces. And they get endorsements from heavyweights like IBM. Hadoop distributors realize the challenge but follow the wrong strategy. They simply try to integrate all the cool new stuff into their distributions. As we already found out, such a one-size-fits-all integration approach won’t work. The world has already changed. Why still use the old Red Hat business model? Being a Big Data service provider is probably a more promising strategy than the old open source distribution approach. For example, Databricks not only builds on the newer Spark technology but also employs a cloud service business model.

As we close this study, we should also note that, according to a 2015 Gartner report, 54 percent of organizations have no plans to use Hadoop, and another 20 percent will at best get to experimenting with Hadoop in the next year. But this shortfall of interest hasn’t yet caught up with the top two Hadoop vendors, Cloudera and Hortonworks. This is probably the most dangerous thing for them: they are not aware that they are transforming from the disruptor into the disruptee.