For in-depth information on various Big Data technologies, check out my free e-book “Introduction to Big Data”.

Many people ask for my opinion on different NoSQL databases, and they also want feature lists and benchmark numbers. When you start building your next cool cloud application, there are dozens of NoSQL options to choose from. It is natural to ask which one is fast, whether it is AP or CP, what the sharding and fault tolerance strategies are, blah, blah. But you won’t find any comparison chart or benchmark numbers here. WHY?!! You may be yelling now: “We benchmarked Oracle, DB2, SQL Server, MySQL, and PostgreSQL in the old days. And it was very meaningful and helpful!” Please let me explain.

Yes, it makes sense to benchmark relational databases because SQL-based relational database products are largely interchangeable: they share one data model and one query language. You are doing an apples-to-apples comparison. The benchmarks do help us understand which implementation is suitable for which use case. But NoSQL solutions are different animals. They offer different data models (key-value, wide-columnar, object/document, graph, etc.), and they differ on CP versus AP, synchronous versus asynchronous replication, in-memory speed versus durability, strong versus eventual consistency, and so on. When comparing them, we are simply comparing apples to oranges. The various benchmarks (or benchmarketing) with contradictory results may just confuse us further.

“But I have to make a choice!” you cry. Well, a quick (and cheating) way is to look at who created each NoSQL database and what they built it for. All NoSQL products are great in the sense that they do what they are supposed to do for their creators. In the old days, we built applications against databases. Today, it seems that people build databases against applications because each company has such different requirements. That is why Google built BigTable, Amazon.com built Dynamo, and Facebook built Cassandra: search, online shopping, and social networking are very different workloads. The list goes on. For example, 10gen originally built MongoDB for its own web platform. In other words, they built these NoSQL databases for themselves. If your business aligns with one of theirs, congratulations: just go with the corresponding open source product.

But what if your idea is truly innovative and you are doing something so wild that no existing solution seems a good fit? In an environment of rapid technological advance and ever-changing user requirements, it is not realistic to pick the “best” solution today. It is better to think about minimizing business risk first, rather than about technical comparisons. When we are bold enough to embrace cutting-edge technologies, NoSQL included, we have to be careful that cutting-edge does not turn into bleeding-edge for us. It is always good to ask whether we have a plan B with minimal migration cost if we have to switch tomorrow.

If we look back, history may teach us something helpful. Before relational databases, the DBMS world was somewhat similar to today’s. There were many different data models, systems, and interfaces. Why did relational databases replace these dinosaurs? There are many reasons. Let’s just look at it from the point of view of programmers, as we are. With relational databases, I, a software engineer, hardly need to care what the backend is. Whether it is MySQL, Oracle, or Teradata, all I face is the ubiquitous relational data model and all I use is SQL. Yes, there are always some small differences in data types and SQL syntax among them. But it doesn’t take 10 man-years to migrate from one to another.

Based on this observation, we should probably first choose a data model that is flexible and expressive. More importantly, this data model should be supported by multiple major solutions from both the CP and AP schools. With this in mind, I am thinking of BigTable’s wide-columnar data model. As we know, key-value pairs are the simplest yet most flexible data model. With the logical concept of columns and column families, the wide-columnar data model also lets us encode document and graph models. Crucially, this data model is supported by both HBase (CP) and Cassandra (AP). Both HBase and Cassandra have very large communities and are used in large-scale production systems. HBase provides strong consistency, tight integration with MapReduce, and in-database computation through coprocessors. Cassandra provides a simple, symmetric architecture and excellent multi-datacenter support. With some abstraction, we can switch from one to the other fairly easily.
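To make the “some abstraction” idea a bit more concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not the API of HBase, Cassandra, or any client library: the `WideColumnStore` interface, the `InMemoryStore` stand-in backend, and the helpers `save_document`, `add_edge`, and `neighbors` are all hypothetical names. It only shows how application code could talk to a row/column-family/column interface, and how flat documents and graph edges might be encoded on top of it.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, List, Optional


class WideColumnStore(ABC):
    """Hypothetical backend-agnostic interface: row -> family -> column -> value."""

    @abstractmethod
    def put(self, row: str, family: str, column: str, value: str) -> None: ...

    @abstractmethod
    def get(self, row: str, family: str, column: str) -> Optional[str]: ...

    @abstractmethod
    def get_family(self, row: str, family: str) -> Dict[str, str]: ...


class InMemoryStore(WideColumnStore):
    """Stand-in backend for illustration only. An HBase- or Cassandra-backed
    class would implement the same three methods with a real client."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, str]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row: str, family: str, column: str, value: str) -> None:
        self._rows[row][family][column] = value

    def get(self, row: str, family: str, column: str) -> Optional[str]:
        # .get() avoids creating empty rows or families on a read miss.
        return self._rows.get(row, {}).get(family, {}).get(column)

    def get_family(self, row: str, family: str) -> Dict[str, str]:
        return dict(self._rows.get(row, {}).get(family, {}))


def save_document(store: WideColumnStore, doc_id: str, doc: Dict[str, object]) -> None:
    """Document model: each field of a flat document becomes a column in family 'doc'."""
    for field, value in doc.items():
        store.put(doc_id, "doc", field, str(value))


def add_edge(store: WideColumnStore, src: str, dst: str, label: str) -> None:
    """Graph model: each outgoing edge is a column in family 'edges', named by its target."""
    store.put(src, "edges", dst, label)


def neighbors(store: WideColumnStore, vertex: str) -> List[str]:
    """Return the targets of all outgoing edges of a vertex, sorted by name."""
    return sorted(store.get_family(vertex, "edges"))


store = InMemoryStore()
save_document(store, "user:alice", {"name": "Alice", "city": "Paris"})
add_edge(store, "user:alice", "user:bob", "follows")
add_edge(store, "user:alice", "user:carol", "follows")
print(store.get("user:alice", "doc", "name"))  # Alice
print(neighbors(store, "user:alice"))          # ['user:bob', 'user:carol']
```

The application code only depends on `WideColumnStore`, so plan B amounts to writing one more backend class. A real abstraction would of course also have to deal with the parts this sketch ignores, such as consistency levels, scans, and batching, which is where HBase and Cassandra actually differ.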

This is my two cents. What’s your opinion? Please feel free to leave your comment below.
