Awesome
Guide to Airflow on Kubernetes
Recently I spend quite some time diving into Airflow and Kubernetes. While there are reports of people using them together, I could not find any comprehensive guide or tutorial. Also, there are many forks and abandoned scripts and repositories. So you need to do some research.
Here I write down what I've found, in the hope that it is helpful to others. Because things move quickly, I've decided to put this on Github rather than in a blog post, so it can be easily updated. Please do not hesitate to provide updates, suggestions, fixes etc.
Decide what you want
There are some related, but different scenarios:
- Using Airflow to schedule jobs on Kubernetes
- Running Airflow itself on Kubernetes
- Do both at the same time
You can actually replace Airflow with X, and you will see this pattern all the time. E.g. you can use Jenkins or Gitlab (buildservers) on a VM, but use them to deploy on Kubernetes. Or you can host them on Kubernetes, but deploy somewhere else, like on a VM. And of course you can run them in Kubernetes and deploy to Kubernetes as well.
The reason that I make this distinction is that you typically need to perform some different steps for each scenario. And you may not need both.
[1] Scheduling jobs on Kubernetes
The simplest way to achieve this right now, is by using the kubectl commandline utility (in a BashOperator) or the python sdk. However, you can also deploy your Celery workers on Kubernetes. The Helm chart mentioned below does this.
Native Kubernetes integration
Work is in progress that should lead to native support by Airflow for scheduling jobs on Kubernetes. The wiki contains a discussion about what this will look like, though the pages haven't been updated in a while. Progress can be tracked in Jira (AIRFLOW-1314).
Development is being done in a fork of Airflow at bloomberg/airflow. This is still work in progress so deploying it should probably not be done in production.
KubernetesPodOperator (coming in 1.10)
A subset of functionality will be released earlier, according to AIRFLOW-1517.
In the next release of Airflow (1.10), a new Operator will be introduced that leads to a better, native integration of Airflow with Kubernetes. The cool thing about this Operator will be that you can define custom Docker images per task. Previously, if your task requires some python library or other dependency, you'll need to install that on the workers. So your workers end up hosting the combination of all dependencies of all your DAGs. I think eventually this can replace the CeleryExecutor for many installations.
Documentation for the new Operator can be found here.
[2] Running Airflow on Kubernetes
Airflow is implemented in a modular way. The main components are the scheduler, the webserver, and workers. You also probably want a database. If you want to distribute workers, you may want to use the CeleryExecutor. In that case, you'll probably want Flower (a UI for Celery) and you need a queue, like RabbitMQ or Redis.
Docker images
First thing you need is a Docker image that packages Airflow. Most people seem to use puckel/docker-airflow. This is a robust Docker image that is up to date and also has an automated build set up, pushing images to Dockerhub. This means that you do not need to build your own image when you are first starting out. The image has an entrypoint script that allows the container to fulfill the role of scheduler, webserver, flower, or worker.
While puckel/docker-airflow
is widely used, it isn't updated very often anymore, so it is lagging behind Airflow releases a bit. Many people are forking this repo and updating it themselves. At this point the Airflow community is lacking a canonical Docker image.
Another option may be the image by astronomer.io, though I cannot find the Dockerfile source, so at this point I'm hesitant to run it:
Helm
Next you need to create some Kubernetes manifests, or a Helm chart, to deploy the Docker image on Kubernetes. There is some work in this area, but it is not completely finished yet.
There are several helm charts to install Airflow on kubernetes:
It's a bit unfortunate that the community has not yet arrived at a canonical Chart, so you'll have to try your luck. The 'official' chart probably covers most deployment options, including Celery and non-Kubernetes options, while the others may be more opinionated (and focused).
To install the official chart:
helm install --namespace "airflow" --name "airflow" stable/airflow
Kubernetes Operator
There is work by Google on a Kubernetes Operator for Airflow. This name is quite confusing, as operator here refers to a controller for an application on Kubernetes, not an Airflow Operator that describes a task. So you would use this operator instead of using the Helm chart to deploy Kubernetes itself.
Deploying your DAGs
There are several ways to deploy your DAG files when running Airflow on Kubernetes.
- git-sync
- Persistent Volume
- Embedding in Docker container
Mounting Persistent Volume
You can store your DAG files on an external volume, and mount this volume into the relevant Pods (scheduler, web, worker). In this scenario, your CI/CD pipeline should update the DAG files in the PV. Since all Pods should have the same collection of DAG files, it is recommended to create just one PV that is shared. This ensures that the Pods are always in sync about the DagBag.
To share a PV with multiple Pods, the PV needs to have accessMode 'ReadOnlyMany' or 'ReadWriteMany'. If you are on AWS, you can use Elastic File System (EFS). If you are on Azure, you can use Azure File Storage (AFS).
If you mount the PV as ReadOnlyMany, you get reasonable security because the DAG files cannot be manipulated from within the Kubernetes cluster. You can then update the DAG files via another channel, for example, your build server.
RBAC
If your cluster has RBAC turned on, and you want to launch Pods from Airflow, you will need to bind the appropriate roles to the serviceAccount of the Pod that wants to schedule other Pods. Typically, this means that the Workers (when using CeleryExecutor) or the Scheduler (using LocalExecutor or the new KubernetesPodExecutor) need extra permissions. You'll need to grant the 'watch/create' verbs on Pods.
Relevant links
- Awesome Apache Airflow
- Airflow on AKS (end-to-end guide)
- Airflow scheduler healthcheck