Home

Awesome

:snowflake: :whale: Awesome AI, ML and Data Science on Kubernetes

Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++, etc., with some emphasis on Kubeflow, Seldon Core, Pachyderm, Banzai Pipeline, H2O, TensorFlow, CNTK, XGBoost, MXNet, PyTorch, ONNX, Argo, Airflow, Apache Beam, Apache Spark, Intel BigDL, Rook and Ambassador

"The wind and the waves are always on the side of the ablest navigator." Edmund Gibbon, Historian


Introduction

The entire computing industry has adopted Kubernetes as the way forward for running distributed computing loads, from on-premise servers to large multi-cloud deployments.

For Machine Learning and Data Science workloads, which are often prototyped on informal tools like JupyterLab, awareness of Kubernetes has lagged somewhat.

Kubernetes Advantages for ML

Kubernetes Issues to Consider for ML


For this list, I am focused on AI/ML/Data Science OSS tools that run in an infinitely scalable Kubernetes environment. Another list that shares some scaling orchestration resources in a more general way is Awesome Machine Learning Operations.

You might also be looking for something far less specific and here are some suggestions:

Kubernetes

Spark

AI/ML

Other


Note: Although many OSS projects are another octopus arm of mega-tech corps like Google and Microsoft, we all benefit, and many of the smaller OSS projects represented in this list are sustained by the volunteer effort of many individuals. If you agree, please consider giving feedback to the authors when you can: test their code, file issue reports and feature requests, and hit the Star button for their projects. If your favorite project is missing from this list, please let me know.


Kubernetes means Helmsman and originated with Google's Borg:

"Its development and design are heavily influenced by Google's Borg system, and many of the top contributors to the project previously worked on Borg. The original codename for Kubernetes within Google was Project Seven, a reference to Star Trek character Seven of Nine that is a 'friendlier' Borg. The seven spokes on the wheel of the Kubernetes logo is a nod to that codename." https://en.wikipedia.org/wiki/Kubernetes

Note: I will attempt to sprinkle various salty sea references into this document, with perhaps some Borg ones as well, to stay in the spirit of the Kubernetes naming genesis...

Helmsman of a Great Ship

"The duties of the ruler are like those of the helmsman of a great ship. From his lofty position, he makes slight movements with his hands, and the ship, of itself, follows his desires and moves. This is the way whereby the one may control the ten thousand and by quiescence may regulate activity." Han Fei


ML designed for Kubernetes (i.e. Native Kube)

"If you want to build a ship, don't drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea." Antoine de Saint Exupery


Kubeflow Cloud Native platform for machine learning. https://github.com/kubeflow/kubeflow The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
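At the time of writing, Kubeflow deployments are driven by ksonnet. A minimal sketch of bootstrapping the core components, assuming ks and kubectl are installed and pointed at a cluster; the app name "my-kubeflow" is a placeholder, and exact package and prototype names vary between Kubeflow releases, so treat these commands as illustrative:

```shell
# Illustrative ksonnet-based Kubeflow bootstrap; "my-kubeflow" is a
# placeholder app name and prototype names depend on the release.
ks init my-kubeflow
cd my-kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks generate kubeflow-core kubeflow-core
ks apply default -c kubeflow-core
```

From there, the same ksonnet app can be extended with the tf-job and tf-serving packages to train and serve models.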

Kubeflow Labs Train and Serve TensorFlow Models at Scale with Kubernetes and Kubeflow on Azure

H2O + Kubeflow H2O + Kubeflow Integration. This is a project for the integration of H2O.ai and Kubeflow. The integration of H2O and Kubeflow is an extremely powerful opportunity, as it provides a turn-key solution for easily deployable and highly scalable machine learning applications, with minimal input required from the user. Kubeflow is an open source project managed by Google and built on top of their Kubernetes engine. It is designed to alleviate some of the more tedious tasks associated with machine learning. Kubeflow helps orchestrate deployment of apps through the full cycle of development, testing, and production, and allows for resource scaling as demand increases. H2O-3's goal is to reduce the time spent by data scientists on time-consuming tasks like designing grid search algorithms and tuning hyperparameters, while also providing an interface that allows newer practitioners an easy foothold into the machine learning space. https://github.com/h2oai/h2o-3


Seldon Core Seldon Core is an open source platform for deploying machine learning models on Kubernetes https://github.com/SeldonIO/seldon-core


Pachyderm Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you. https://github.com/pachyderm/pachyderm


Fabric for Deep Learning - FfDL, pronounced fiddle Deep Learning Platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes. This repository contains the core services of the FfDL (Fabric for Deep Learning) platform. FfDL is an operating system "fabric" for Deep Learning. https://github.com/IBM/FfDL

FfDL is a collaboration platform for:


PolyAxon Welcome to Polyaxon, a platform for building, training, and monitoring large scale deep learning applications. Polyaxon deploys into any data center or cloud provider, or can be hosted and managed by Polyaxon, and it supports all the major deep learning frameworks such as TensorFlow, MXNet, Caffe, Torch, etc. Polyaxon makes it faster, easier, and more efficient to develop deep learning applications by managing workloads with smart container and node management. And it turns GPU servers into shared, self-service resources for your team or organization. https://github.com/polyaxon/polyaxon


Datalayer Big Data Science on Kubernetes in the Cloud. https://datalayer.io Datalayer is building a Simple, Collaborative and Multi Cloud platform for Big Data Scientists. https://docs.datalayer.io


Machine Learning Container Templates from IntelAI - mlt aids in the creation of containers for machine learning jobs. It does so by making it easy to use container and Kubernetes object templates.


ML that is adapted for Kubernetes

"'Impossible' is a word that humans use far too often." Seven of Nine


Pipeline.AI PipelineAI: Real-Time Enterprise AI Platform https://pipeline.ai - Quickstart for Kubernetes: https://github.com/PipelineAI/pipeline/tree/master/docs/quickstart/kubernetes


Dask Dask natively scales Python. https://github.com/dask/dask Dask provides advanced parallelism for analytics. Dask Example Notebooks Dask Tutorial

Dask Kubernetes Dask Kubernetes deploys Dask workers on Kubernetes clusters using native Kubernetes APIs. It is designed to dynamically launch short-lived deployments of workers during the lifetime of a Python process. Currently, it is designed to be run from a pod on a Kubernetes cluster that has permissions to launch other pods. However, it can also work with a remote Kubernetes cluster (configured via a kubeconfig file), as long as it is possible to open network connections with all the worker nodes on the remote cluster. https://github.com/dask/dask-kubernetes Helm chart for Dask Helm chart for Dask; development has moved to stable/dask. Dask docker images Docker images for dask-distributed. The images are built primarily for the dask-distributed Helm chart, but they should work for more use cases.
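For the Helm route mentioned above, a minimal sketch of standing up a Dask cluster from the stable chart; the release name and the value override are placeholders, and the chart's default values may differ between versions:

```shell
# Illustrative Helm 2-style install of the stable Dask chart, which
# deploys a scheduler and workers; "my-dask" is a placeholder name.
helm repo update
helm install stable/dask --name my-dask --set worker.replicas=8
```

Once the release is up, a dask.distributed Client can connect to the scheduler service it exposes.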

Dask-ML Dask-ML provides scalable machine learning in Python using Dask alongside popular machine learning libraries like Scikit-Learn. https://github.com/dask/dask-ml

Dask-XGBoost Distributed training with XGBoost and Dask.distributed. This repository enables you to perform distributed training with XGBoost on Dask.array and Dask.dataframe collections.


Helm Charts Apache Kafka Kubernetes Helm charts for Apache Kafka, Kafka Connect and other components for data streaming and data integration. Stream-reactor and Kafka Connectors: any environment variable beginning with CONNECT is used to build the Kafka Connect properties file, and the Connect cluster is started with this file in distributed mode.


Bigdata Playground A complete example of a big data application using: Kubernetes, Apache Spark SQL/Streaming/MLlib, Apache Flink, Kafka Streams, Apache Beam, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL - The aim is to create a disposable Hadoop/HBase/Spark/Flink/Beam/ML stack where you can test your jobs locally or submit them to the Yarn resource manager. We are using Docker to build the environment and Docker-Compose to provision it with the required components (next step: using Kubernetes). Along with the infrastructure, there are 4 sample projects that probe that everything is working as expected. The boilerplate is based on a sample flight search web application.


Pipeline and Data Flow


Banzai Pipeline Pipeline enables developers to go from commit to scale in minutes by turning Kubernetes into a feature-rich application platform integrating CI/CD, centralized logging, monitoring, enterprise-grade security and autoscaling.


Argo Get stuff done with container-native workflows for Kubernetes. https://github.com/argoproj/argo Argo is designed from the ground up for containers without the overhead and limitations of legacy VM and server-based environments. Argo is cloud agnostic and can run on any Kubernetes cluster. Argo with Kubernetes puts a cloud-scale supercomputer at your fingertips.

Argo Events Argo Events is an open source event-based dependency manager for Kubernetes. The core concept of the project is sensors, implemented as Kubernetes-native Custom Resource Definitions, which define a set of dependencies (signals) and actions (triggers). A sensor's triggers will only be fired after its signals have been resolved. Sensors can trigger once or repeatedly.


Airflow Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
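As a sketch of the local author/schedule/monitor loop the paragraph describes, using the Airflow 1.x CLI; the DAG and task ids are placeholders, and subcommand names differ across Airflow versions:

```shell
# Illustrative Airflow 1.x session; "my_dag"/"my_task" are placeholders.
pip install apache-airflow
airflow initdb                          # initialize the metadata database
airflow list_dags                       # DAGs discovered from the dags folder
airflow test my_dag my_task 2018-01-01  # run one task instance locally
airflow webserver -p 8080               # the rich user interface
airflow scheduler                       # executes tasks per their dependencies
```

The `airflow test` subcommand is one of the "rich command line utilities": it runs a single task instance without recording state, which is handy while iterating on a DAG.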

ETL best practices with airflow What you will find here are interesting examples, usage patterns and ETL principles that I thought are going to help people use airflow to much better effect. https://github.com/gtoonstra/etl-with-airflow

kube-airflow kube-airflow provides a set of tools to run Airflow in a Kubernetes cluster.

Airflow Operator Airflow Operator is a custom Kubernetes operator that makes it easy to deploy and manage Apache Airflow on Kubernetes. Apache Airflow is a platform to programmatically author, schedule and monitor workflows. Using the Airflow Operator, an Airflow cluster is split into 2 parts represented by the AirflowBase and AirflowCluster custom resources.


Beam

beam-operator This operator manages Apache Beam instances on Kubernetes, simplifying creation and administration.

Scheduled Apache Beam jobs using Kubernetes Cronjobs Scheduled Dataflow pipelines using Kubernetes Cronjobs

Google Cloud Dataflow Template Pipelines These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines. Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.


Rook Storage Orchestration for Kubernetes https://github.com/rook/rook Rook is an open source cloud-native storage orchestrator for Kubernetes, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with cloud-native environments. Rook turns storage software into self-managing, self-scaling, and self-healing storage services. It does this by automating deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management. Rook uses the facilities provided by the underlying cloud-native container management, scheduling and orchestration platform to perform its duties. Rook integrates deeply into cloud native environments leveraging extension points and providing a seamless experience for scheduling, lifecycle management, resource management, security, monitoring, and user experience.


OpenEBS OpenEBS is containerized block storage written in Go for cloud native and other environments w/ per container (or pod) QoS SLAs, tiering and replica policies across AZs and environments, and predictable and scalable performance. https://github.com/openebs/openebs OpenEBS enables the use of containers for mission-critical, persistent workloads. OpenEBS is containerized storage and related storage services.

OpenEBS allows you to treat your persistent workload containers, such as DBs on containers, just like other containers. OpenEBS itself is deployed as just another container on your host and enables storage services that can be designated on a per pod, application, cluster or container level, including:

Our vision is simple: let storage and storage services for persistent workloads be fully integrated into the environment, and hence managed automatically, so that they almost disappear into the background as just another infrastructure service that works.

Open EBS Maya OpenEBS Maya extends Kubernetes capabilities to orchestrate CAS containers. OpenEBS Maya extends the capabilities of Kubernetes to orchestrate CAS (aka Container Native) Storage Solutions like OpenEBS Jiva, OpenEBS cStor, etc. Maya (meaning Magic), seamlessly integrates into the Kubernetes Storage Workflow and helps provision and manage the CAS based Storage Volumes.

OpenEBS Helm Charts


Can Spark still be Master of the Sea?

"Would'st thou," so the helmsman answered, "Learn the secret of the sea? Only those who brave its dangers Comprehend its mystery!" Henry Wadsworth Longfellow

"Gentile or Jew
 O you who turn the wheel and look to windward,
 Consider Phlebas, who was once handsome and tall as you" 

The Waste Land by T.S. Eliot


Note: Kubernetes support was integrated into Spark with the 2.3 release. It is still incomplete and missing several important features, but it is a top priority for the Spark team.
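With the native scheduler backend, a job can be submitted straight at the cluster's API server. A minimal sketch following the official Spark 2.3 running-on-kubernetes documentation; the master URL and container image below are placeholders for your cluster and registry:

```shell
# Illustrative Spark 2.3 submission of the bundled SparkPi example to
# Kubernetes; the API server address and image name are placeholders.
bin/spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=my-registry/spark:2.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The `local://` scheme tells Spark the jar is already inside the container image, so nothing needs to be uploaded from the submitting machine.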


Spark Operator Kubernetes operator for specifying and managing the lifecycle of Apache Spark applications on Kubernetes. Spark Operator aims to make specifying and running Spark applications as easy and idiomatic as running other workloads on Kubernetes. It uses Kubernetes custom resources for specifying, running, and surfacing status of Spark applications. For a complete reference of the custom resource definitions, please refer to the API Definition. For details on its design, please refer to the design doc. It requires Spark 2.3 and above that supports Kubernetes as a native scheduler backend.


Multi cloud Spark application service on PKS An integrated and collaborative cloud environment for building and running Spark applications on PKS/Kubernetes. This project provides a streamlined way of deploying, scaling and managing Spark applications. Spark 2.3 added support for Kubernetes as a cluster manager. This project leverages Helm charts to allow deployment of common Spark application recipes - using Apache Zeppelin and/or Jupyter for interactive, collaborative workloads. It also automates logging of all events across batch jobs and Notebook driven applications to shared storage for offline analysis. This project is a collaborative effort between SnappyData and Pivotal.


Sparknetes Spark on Kubernetes. Based on the official documentation for Spark 2.3 at https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html


HDFS on Kubernetes Repository holding helm charts for running Hadoop Distributed File System (HDFS) on Kubernetes. See charts/README.md for how to run the charts. See tests/README.md for how to run integration tests for HDFS on Kubernetes.


Apache Spark Helm Chart This chart will do the following:


Helm Chart for Spark Operator This is the Helm chart for the Spark-on-Kubernetes Operator. Prerequisites: The Operator requires Kubernetes version 1.8 and above because it relies on garbage collection of custom resources. If customization of driver and executor pods (through mounting custom configMaps and volumes) is desired, then the Mutating Admission Webhook needs to be enabled and it only became beta in Kubernetes 1.9.

The chart can be installed by running:

$ helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
$ helm install incubator/sparkoperator

By default, the operator is installed in a namespace called "spark-operator", which will be created if it does not exist.


Kubernetes official examples - (Not up to date) Following this example, you will create a functional Apache Spark cluster using Kubernetes and Docker. You will set up a Spark master service and a set of Spark workers using Spark's standalone mode. Spark on GlusterFS example (Also not up to date) This guide is an extension of the standard Spark on Kubernetes guide and describes how to run Spark on GlusterFS using the Kubernetes Volume Plugin for GlusterFS. The setup is the same in that you will set up a Spark Master Service as in the standard Spark guide, but you will deploy a modified Spark Master and a modified Spark Worker ReplicationController, as they will be modified to use the GlusterFS volume plugin to mount a GlusterFS volume into the Spark Master and Spark Worker containers. Note that this example can be used as a guide for implementing any of the Kubernetes Volume Plugins with the Spark example. There is also a video available that provides a walkthrough of how to set this solution up.


Intel BigDL BigDL: Distributed Deep Learning Library for Apache Spark https://github.com/intel-analytics/BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters. To make it easy to build Spark and BigDL applications, a high-level Analytics Zoo is provided for end-to-end analytics + AI pipelines.

BigDL-core Core HW bindings and optimizations for BigDL

Getting Started

Deep Learning Tutorials on Apache Spark using BigDL Step-by-step Deep Learning Tutorials on Apache Spark using BigDL. The tutorials are inspired by Apache Spark examples, the Theano Tutorials and the Tensorflow tutorials.

ElephantScale BigDL Tutorials

LetNet/BigDL Workshop Notebooks for LetNet/BigDL deep learning workshop. The workshop environment is available for download in a Docker container.

Intel Analytics Zoo Distributed Tensorflow, Keras and BigDL on Apache Spark - https://github.com/intel-analytics/analytics-zoo - Analytics Zoo provides a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras and BigDL programs into an integrated pipeline; the entire pipeline can then transparently scale out to a large Hadoop/Spark cluster for distributed training or inference.


Spark on OKD

Rad Analytics Spark Operator ConfigMap and CRD based approach for managing the Spark clusters in Kubernetes or OpenShift.


OpenShift Spark This repository contains several files for building Apache Spark focused container images, targeted for usage on OpenShift Origin.

tutorial-sparkpi-java-vertx A Java implementation of SparkPi using Vert.x 3 - This application is an example tutorial for the radanalytics.io community. It is intended to be used as a source-to-image (s2i) application.

Some Utils and Accessories


Helm Chart for Elastic-Fluentd-Kibana logging Helm chart to deploy a working logging solution using the ElasticSearch - Fluentd - Kibana stack on Kubernetes


Draft A tool for developers to create cloud-native applications on Kubernetes https://github.com/Azure/draft Draft makes it easier for developers to build applications that run on Kubernetes by doing two main things:

Draft Pack Repository Plugin Draft Pack Repository Plugin (or draft pack-repo for short) enables users to fetch, list, add and remove pack repositories to bootstrap all of their internal and external projects. It is incredibly opinionated on how to fetch, list, add and remove these repositories, whereas Draft core does not care about these concepts. This also enables the Draft community to come up with alternative forms of pack repositories by implementing their own plugin for fetching down these packs, so it made sense to initially spike the tooling as an entirely separate project.


Brigade Event-based Scripting for Kubernetes. https://github.com/Azure/brigade Script simple and complex workflows using JavaScript. Chain together containers, running them in parallel or serially. Fire scripts based on times, GitHub events, Docker pushes, or any other trigger. Brigade is the tool for creating pipelines for Kubernetes.

The Brigade Technology Stack

The design introduction introduces Brigade concepts and architecture.

Related Projects

Gateways

Kashti Kashti is a dashboard for your Brigade pipelines. https://github.com/Azure/kashti Brigade provides event-driven scripting for Kubernetes. With a simple JavaScript file, you can build elaborate pipelines composed of multiple containers running in parallel or serially. Among other possible applications, Brigade can be used to build highly flexible CI/CD pipelines. Kashti is a web dashboard for Brigade, helping you easily visualize and inspect your Brigade builds. Kashti gives you a deep view into your Brigade projects, scripts, and jobs.

Brigade Kubernetes Gateway Send Kubernetes events into a Brigade pipeline. This is a Brigade gateway that listens to the Kubernetes event stream and triggers events inside of Brigade.

Brigadier: The JS library for Brigade Brigadier is the events and jobs library for Brigade. This is the core of the Brigadier library, but the Kubernetes runtime is part of Brigade itself. To run a brigade.js file in Kubernetes, it will need to be executed within Brigade. This library is useful for:


ksonnet


The End Justifies... whatever is here


podder-ai has some tangentially connected material that appears to be working towards a coherent Kubernetes ML system. Kubeb Kubeb (Cubeb or Cubeba) provides a CLI to build and deploy a web application to a Kubernetes environment; Kubeb uses Docker and a Helm chart for its Kubernetes deployment pipeline. pipeline-framework pipeline-framework is a platform to schedule and monitor workflows based on Apache Airflow. pipeline-generator Generator code for pipeline-framework. pipeline-framework-sample This is a sample project for pipeline-framework, started from pipeline-generator, with a sample task based on poc-base-sample. poc-base-sample Sample task implementation for the Podder.ai pipeline framework, showing how to implement a task using the poc-base repository. poc-base Boilerplate project for creating a Python task using pipeline-framework.