As Kubernetes becomes the de facto solution for container orchestration, more and more people expect it to become the orchestrator of entire data centers. For example, ZDNet predicted that Kubernetes will rule the hyperscale data center in 2018. In a little over four years, the project born at Google seems set to change everything. Tracing its roots back to Google Borg, Kubernetes is nicely designed to run web services. Since StatefulSets became stable in 1.9, it can also manage stateful applications such as databases and message queues. To conquer enterprise data centers, however, several pieces are still missing.
In the data centers of large corporations (e.g. banks, pharmaceutical and energy companies), there is a variety of workloads such as HPC (high-performance computing), HPA (high-performance analytics), and batch jobs. Compared to these, web services use only a small portion of the compute resources. Unfortunately, Kubernetes has so far been weak at orchestrating these workloads.
There are many kinds of HPC workloads. For simplicity, let’s consider only Monte Carlo simulation here. It is a simple use case, but it consumes a lot of compute time in many enterprise data centers. A typical Monte Carlo simulation involves millions of tasks with complicated dependencies. The scheduling algorithm is generally task driven. Since each task runs for only a few seconds, low scheduling latency is critical. In contrast, the median k8s pod startup latency on a large cluster can be as long as 25 seconds, of which 80% is spent deploying container images. Although one may argue that a local cache of Docker images should help, frequent releases of new versions and images are the norm today with agile development, so cached images quickly go out of date. As a result, hiccups or even choking happen frequently.
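To make the latency argument concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case in which every task pays the full pod startup cost once. The 25-second figure is the median quoted above; the task durations and the function name are hypothetical values chosen only for illustration.

```python
def startup_overhead_fraction(task_seconds: float, startup_seconds: float) -> float:
    """Fraction of wall-clock time spent on startup if each task pays it once."""
    return startup_seconds / (startup_seconds + task_seconds)


# Median pod startup latency quoted in the text above.
STARTUP_SECONDS = 25.0

# Assumed task durations (illustrative only, not measured values).
for task_seconds in (1.0, 5.0, 30.0):
    fraction = startup_overhead_fraction(task_seconds, STARTUP_SECONDS)
    print(f"task={task_seconds:>4.0f}s  startup overhead={fraction:.0%}")
```

Under these assumptions, a task that runs for a few seconds spends most of its wall-clock time waiting for its pod to start, which is why task-driven HPC schedulers care so much about latency.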
Even worse, when an HPC job starts, thousands of machines will simultaneously request the Docker image from the Docker registry server. The central registry is not only a bottleneck but also may not survive the heavy volume of requests. A distributed registry solution is a better approach. For example, in NERSC’s Shifter project, Docker images are converted to tgz files and transferred to the Lustre parallel distributed file system.
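The bottleneck can be estimated with a simple, idealized model: all nodes share the egress bandwidth of whatever is serving the image. The node count, image size, and bandwidth figures below are assumptions for illustration, not measurements, and the model ignores caching, layer deduplication, and protocol overhead.

```python
def pull_time_seconds(nodes: int, image_gb: float, egress_gbps: float) -> float:
    """Time for all nodes to finish pulling when they share one egress link.

    image_gb is in gigabytes, egress_gbps in gigabits per second.
    """
    total_gigabits = nodes * image_gb * 8  # bytes -> bits
    return total_gigabits / egress_gbps


# Assumed scenario: 1000 nodes, a 1 GB image, 10 Gbit/s of registry egress.
central = pull_time_seconds(nodes=1000, image_gb=1.0, egress_gbps=10.0)

# Same scenario, but the image is served from 10 sources in parallel,
# modeled here simply as 10x the aggregate egress bandwidth.
distributed = pull_time_seconds(nodes=1000, image_gb=1.0, egress_gbps=100.0)

print(f"central registry:     {central:.0f} s")
print(f"distributed sources:  {distributed:.0f} s")
```

Even in this optimistic model, a single registry takes many minutes to serve the whole cluster, while spreading the image across more sources (or a parallel file system, as in the Shifter example) scales the time down roughly in proportion.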
Since Spark 2.3, we can submit Spark jobs to Kubernetes. However, the current integration takes a static resource allocation approach. When submitting a job, the user needs to configure the number of executors, which reserve resources from Kubernetes for the lifetime of the job. Note that a Spark application often runs several or many Spark jobs, which are decomposed into stages and tasks for scheduling. Each job and stage generally has a different number of tasks and requires a different amount of resources, but the user has to allocate the maximum number of executors up front. The static resource allocation approach therefore wastes a lot of CPU time and RAM.
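As a rough illustration of the waste, the sketch below compares the executor-seconds reserved under static allocation (peak executor count held for the whole run) with the executor-seconds the stages actually need. The stage profile is entirely hypothetical, and the model assumes stages run one after another.

```python
from typing import List, Tuple

# Each stage is (executors_needed, duration_seconds).
Stage = Tuple[int, float]


def static_allocation_waste(stages: List[Stage]) -> float:
    """Fraction of reserved executor-seconds that sit idle under static allocation."""
    peak_executors = max(n for n, _ in stages)
    total_seconds = sum(d for _, d in stages)
    reserved = peak_executors * total_seconds   # what static allocation books
    used = sum(n * d for n, d in stages)        # what the stages actually need
    return 1.0 - used / reserved


# Hypothetical application: one wide stage, one long narrow stage, one medium stage.
example_stages = [(100, 60.0), (10, 300.0), (50, 120.0)]
waste = static_allocation_waste(example_stages)
print(f"idle share of reserved executor time: {waste:.1%}")
```

The exact figure depends entirely on the assumed profile, but the shape of the result is general: the more the per-stage demand varies, the more of the up-front reservation goes unused.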
Kubernetes’s batch job support is extremely simple: basically, run to completion. However, enterprise batch jobs are far more complicated than that. For example, a batch job may execute in parallel across many hundreds or even thousands of nodes, using a message passing library to synchronize state. It may also require specialized resources like GPUs or access to a limited number of software licenses. Organizations may enforce policies around which types of resources can be used by whom, to ensure that projects are adequately resourced and deadlines are met. Therefore, capabilities such as array jobs, configurable priority and preemption, user-, group-, or service-based quotas, and a variety of other features are mandatory. There is a SIG project, kube-batch, working on a batch scheduler for Kubernetes, but its road map and expected GA date are not available yet.
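For reference, the following sketch builds a minimal `batch/v1` Job manifest as a Python dictionary and prints it as JSON (which `kubectl` accepts alongside YAML). The job name, image, and counts are hypothetical placeholders. It is meant only to show how little the Job spec itself expresses: essentially how many pods must complete and how many may run at once.

```python
import json


def make_job_manifest(name: str, image: str, completions: int, parallelism: int) -> dict:
    """Build a minimal run-to-completion Job manifest (batch/v1)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,   # pods that must finish successfully
            "parallelism": parallelism,   # pods allowed to run at the same time
            "backoffLimit": 4,            # retries before the Job is marked failed
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": name, "image": image}],
                }
            },
        },
    }


# Hypothetical name and image, for illustration only.
manifest = make_job_manifest(
    "mc-sim", "example.com/mc-sim:latest", completions=100, parallelism=10
)
print(json.dumps(manifest, indent=2))
```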