Barrier execution mode and Delta Lake are two new Apache Spark features. Interestingly, they break apart from the root of Apache Spark. Let’s figure out together what they are and why they are developed. More importantly, will they be a success?
Essentially, Spark is a better implementation of MapReduce. In MapReduce/Spark, a task in a stage doesn’t depend on any other tasks in the same stage, and hence it can be scheduled independently. This fundamental assumption enables Spark to hide the complexity of scheduling from developers. It also brings in the elasticity with the helps of an underlying resource manager. Finally, fault tolerance becomes simple as the scheduler can rerun a task at any time (with the great help of the immutability of RDDs).
However, this nice yet simple parallel compute model meets new challenges when AI jobs show up. Spark has been very popular for data wrangling. After cleaning up the data with Spark, developers naturally want to build machine learning models with Spark too in the same pipeline. However, the MapReduce pattern is not suitable for machine learning in most cases. Many machine learning algorithms explore complex communication patterns. As a baby step, SPARK-24374 introduces the barrier execution mode. Using this new execution mode, Spark launches all training tasks together and restarts all tasks in case of task failures. Spark also introduces a new mechanism of fault tolerance for barrier tasks. When any barrier task failed in the middle, Spark would abort all the tasks and restart the stage. This mechanism can be used to support All-Reduce pattern to accelerate distributed TensorFlow training.
However, there are two major issues with this new barrier scheduling. First, it is far from a complete solution to support various communication patterns used in distributed machine learning. Shall the community keep introducing new scheduling and/or communication APIs for AI jobs? It is a hard question. An important merit of Spark is its clean and easy-to-use API that hides all the complexity of parallel computing, all based on the MapReduce model. In contrast, MPI supports all kinds of communication and synchronous patterns but exposes a very complicated interface. We might gradually lose the simplicity of Spark if we keep adding new compute models. Second, SPARK-24374 is a step back from elasticity and fault tolerance. In a large cluster, hardware and software failures happen almost everyday. A week-long deep learning job might have great challenges to finish in a large cluster as the scheduler would abort all the tasks and restart the stage when any barrier task fails. Also, resource are dynamically allocated in a large cluster to maximize the utilization. As the barrier scheduler starts all the tasks at the same time, we would lose many opportunities of leveraging elastic compute resources.
To meet the customer’s demand, Spark vendors have to add the support to AI jobs. The question is how to strike a balance between simplicity of programming model, flexible communication and synchronous patterns, elasticity and fault tolerance of compute infrastructure. The computer scientists have been struggling for 50 years to find such a perfect model.
On the other hand, we should also ask ourselves if we have to do everything in one system, especially when we use a product in the way that it was not designed for.
Alternatively, I suggest that we need to develop a new platform for distributed machine learning. Like Spark, it should do one thing, and do it well. Meanwhile, this new system should run side by side with Spark on the same infrastructure, which is managed by a general purpose resource manager. In addition, it can access RDDs and DataFrames from Spark (the lifecycle of RDDs and DataFrames should go beyond jobs). Therefore, we will get the best of two worlds.
Delta Lake is another interesting development by Databricks. It is a storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of existing data lake and is fully compatible with Apache Spark APIs.
I will focus on the business impact rather than the technical details of Delta Lake in this post as I speculate that Delta Lake was developed to meet the (new) business strategy of Databricks. Different from Hadoop, Spark has been a pure compute engine. Now Spark steps into the storage layer too with Delta Lake. To understand why, let’s look at Hadoop. After 10 years, people realize that what they want from Hadoop is neither HDFS nor MapReduce, but a data warehouse. With Hive and Impala (and many other SQL-on-Hadoop solutions), Hadoop becomes a legitimate data warehouse indeed.
To survive, Databricks needs to go beyond ETL and Spark could become a good data warehouse solution with a lot of innovations in SparkSQL. However, only innovations in compute engine are not sufficient to compete with market leaders. Any successful data warehouse must invest heavily in storage layer to gain performance advantage. On the other hand, in the era of cloud computing, the compute layer and the storage layer are decoupled. It doesn’t make sense to introduce a full-blown storage solution nowadays as Hadoop did 10 years ago. Therefore, Databricks wisely designs Delta Lake to work on existing data lake while adding many missing pieces in data warehouse puzzle, especially ACID transactions, metadata and schema management. I expect they will continue the efforts in these areas in next phase because the current implementations still have gap to be a real data warehouse.
If my prediction be correct, Databricks may become a strong cloud native data warehouse provider in a couple of years, competing with the market leader Snowflake Computing. Maybe they took a page from Snowflake’s playbook 🙂