MapReduce has been hailed as a revolutionary platform for large-scale data processing. Many database vendors, both traditional relational DBMS companies and new NoSQL providers, want to ride the wave, so they offer Hadoop connectors that let their databases serve as an input source and/or an output destination. Many of them claim that you can now crunch numbers with MapReduce straight from their databases. Sounds lovely, right? But do NOT do it!

For large-scale data processing, it is well known that pushing computation to the data is much more efficient than pulling data to the computation. Running MapReduce analytics on data fetched from an external database is exactly that: pulling the data to the computation, the last thing you want. Every job must stream the full dataset across the network between the database and the cluster, so the link and the database itself become the bottleneck while the cluster's local disks sit idle.

Does this mean that Hadoop connectors are useless? No. In fact, they are great for ETL jobs, where you have to pull the data anyway. HDFS is very efficient at ingesting large volumes of data in any format, which is a weakness of many data warehouses, especially relational databases. On the other hand, parallel data warehouses are a lot faster than MapReduce on common analytics, especially when users run interactive analytics on the same dataset repeatedly over a long period.
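For illustration, here is a minimal sketch of such a one-time ETL pull using Hadoop's `DBInputFormat`. The `orders` table, its `(id, amount)` schema, the JDBC driver, URL, credentials, and output path are all hypothetical placeholders; in practice a tool like Sqoop does the same job with less code.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/** One-time ETL: pull rows from a relational table into HDFS as text. */
public class DbToHdfsEtl {

  /** Hypothetical two-column table (id, amount); adjust to your schema. */
  public static class OrderRecord implements Writable, DBWritable {
    long id;
    double amount;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong(1);
      amount = rs.getDouble(2);
    }

    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setDouble(2, amount);
    }

    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      amount = in.readDouble();
    }

    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeDouble(amount);
    }
  }

  /** Map-only job: serialize each row as tab-separated text. */
  public static class EtlMapper
      extends Mapper<LongWritable, OrderRecord, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, OrderRecord row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), new Text(row.id + "\t" + row.amount));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Driver class, JDBC URL, and credentials are placeholders.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/sales", "user", "password");

    Job job = Job.getInstance(conf, "db-to-hdfs-etl");
    job.setJarByClass(DbToHdfsEtl.class);
    job.setMapperClass(EtlMapper.class);
    job.setNumReduceTasks(0);                  // map-only: no shuffle needed
    job.setInputFormatClass(DBInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // Read (id, amount) from the "orders" table, split across mappers by id.
    DBInputFormat.setInput(job, OrderRecord.class,
        "orders", /* conditions */ null, /* orderBy */ "id",
        "id", "amount");
    FileOutputFormat.setOutputPath(job, new Path("/data/orders"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The point is that the database is hit exactly once; every analysis after that reads from HDFS, where the computation runs next to the data.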

Of course, loading data into a warehouse is unnecessary for one-time, quick-and-dirty analyses. Parallel data warehouses are also not suitable for complicated analyses such as machine learning. In these cases, it is better to analyze your data directly with MapReduce/Spark or a DSL on top of them, as sketched below.
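As a sketch of that direct route, the following Spark job (Java API) runs a one-off aggregate straight off HDFS with no warehouse load step. It assumes the tab-separated dump written by the hypothetical ETL job above; the path and column names are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

/** Quick-and-dirty analysis directly on HDFS: no warehouse required. */
public class QuickAnalysis {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("quick-analysis")
        .getOrCreate();

    // Read the tab-separated dump produced by the ETL job above.
    Dataset<Row> orders = spark.read()
        .option("sep", "\t")
        .schema("id BIGINT, amount DOUBLE")
        .csv("hdfs:///data/orders");

    // One-off aggregate: average order amount across the whole dataset.
    orders.agg(avg(col("amount")).alias("avg_amount")).show();

    spark.stop();
  }
}
```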
