For in-depth information on various Big Data technologies, check out my free e-book “Introduction to Big Data“.
In previous post, we discussed Apache Hive, which first brought SQL to Hadoop. There are actually several SQL on Hadoop solutions competing with Hive head-to-head. Today, we will look into Google BigQuery, Cloudera Impala and Apache Drill, which all have a root to Google Dremel that was designed for interactive analysis of web-scale datasets. In a nutshell, they are native massively parallel processing query engine on read-only data.
Google BigQuery is the public implementation of Dremel. BigQuery provides the core set of features available in Dremel to third party developers via a REST API. Impala is Cloudera’s open source SQL query engine that runs on Hadoop. It is modeled after Dremel and is Apache-licensed. Impala became generally available in May 2013. Drill is another open source project inspired by Dremel and is still incubating at Apache. Both Impala and Drill can query Hive tables directly. Impala actually uses Hive’s metastore.
Hive is basically a front end to parse SQL statements, generate and optimize logical plans, translate them into physical plans that are finally executed by a backend such as MapReduce or Tez. Dremel and its derivatives are different as they execute queries natively without translating them into MapReduce jobs. For example, the core Impala component is a daemon process that runs on each node of the cluster as the query planner, coordinator, and execution engine. Each node can accept queries. The planner turns a request into collections of parallel plan fragments. The coordinator initiates execution on remote nodes in the cluster. The execution engine reads and writes to data files, and transmits intermediate query results back to the coordinator node.
The two core technologies of Dremel are columnar storage for nested data and the tree architecture for query execution:
- Columnar Storage
- Data is stored in a columnar storage fashion to achieve very high compression ratio and scan throughput.
- Tree Architecture
- The architecture forms a massively parallel distributed multi-level serving tree for pushing down a query to the tree and then aggregating the results from the leaves.
These are good ideas and have been adopted by other systems. For example, Hive 0.13 has the ORC file for columnar storage and can use Tez as the execution engine that structures the computation as a directed acyclic graph. Both (and other innovations) help a lot to improve the performance of Hive. However, the recent benchmark from Cloudera (the vendor of Impala) and the benchmark by AMPLab show that Impala still has the performance lead over Hive. It is well known that benchmarks are often biased due to the hardware setting, software tweaks, queries in testing, etc. But it is still meaningful to find out what possible design choice and implementation details cause this performance difference. And it may help both communities improve the offerings in the future. What follows is a list of possible reasons:
- As a native query engine, Impala avoids the startup overhead of MapReduce/Tez jobs. It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem. In contrast, Impala daemon processes are started at boot time, and thus are always ready to execute a query.
- Hadoop reuses JVM instances to reduce the startup overhead partially. However, it also introduces another problem. The nodes in the Cloudera benchmark have 384 GB memory. Such a big heap is actually a big challenge to the garbage collection system of the reused JVM instances. The stop-of-the-world GC pauses may add high latency to queries. On the other hand, Impala prefers such large memory.
- Impala process are multithreaded. Importantly, the scanning portion of plan fragments are multithreaded on SSD as well as making use of SSE4.2 instructions. The I/O and network systems are also highly multithreaded. Therefore, each single Impala node runs more efficiently by a high level local parallelism.
- Impala’s query execution is pipelined as much as possible. In case of aggregation, the coordinator starts the final aggregation as soon as the pre-aggregation fragments has started to return results. In contrast, sort and reduce can only start once all the mappers are done in MapReduce. Tez currently doesn’t support pipelined execution yet.
- MapReduce materializes all intermediate results. This feature enables better scalability and fault tolerance. However, it also significantly slows down the data processing. In contrast, Impala streams intermediate results between executors (of course, in tradeoff of the scalability). Tez allows different types of Input/Output including file, TCP, etc. But it seems that Hive doesn’t use this feature yet to avoid unnecessary disk writes.
- The reducer of MapReduce employs a pull model to get Map output partitions. For sorted output, Tez makes use of the MapReduce ShuffleHandler, which requires downstream Inputs to pull data over HTTP. With multiple reducers (or downstream Inputs) running simultaneously, it is highly likely that some of them will attempt to read from the same map node at the same time, inducing a large number of disk seeks and slowing the effective disk transfer rate.
- Hive’s query expressions are generated at compile time while Impala does runtime code generation for “big loops” using llvm that can achieve more optimized code.
- Tez allows complete control over the processing, e.g. stopping processing when limits are met. It is very useful for top-k calculation and straggler handling. Unfortunately, this feature is not used by Hive currently. BTW, Dremel calculates approximate results for top-k and count-distinct using one-pass algorithms. It is not clear if Impala does the same.
- During query execution, Dremel computes a histogram of tablet processing time. If a tablet takes a disproportionately long time to process, it is rescheduled to another server. If trading speed against accuracy is acceptable, Dremel can return the results before scanning all the data, which may reduce the response time significantly as a small fraction of the tables often take a lot longer. It is not clear if Impala implements a similar mechanism although straggler handling was stated on the roadmap.
As you see, some of these reasons are actually about the MapReduce or Tez. With the continuous improvements of MapReduce and Tez, Hive may avoid these problems in the future. Besides, the last two are the features of Dremel and it is not clear if Impala implements them.
In summary, Dremel and its derivatives provide us an inexpensive way to do interactive big data analytics. The Hadoop ecosystem is now a real threat to the traditional relational MPP data warehouse systems. The benchmark by AMPLab shows that Amazon Redshift (based on ParAccel by Actian) still has the performance lead over Impala but the gap is small. With continuous improvements (e.g. both Hive and Impala are working on cost based plan optimizer), we can expect SQL on Hadoop/HDFS at higher level in near feature.