<img src="./logo.png" width="50%">

Open MLOps - A Production-focused Open-Source Machine Learning Framework

Open MLOps is a set of open-source tools carefully chosen to ease the experience of conducting machine learning experiments and deploying machine learning models. Read our announcement blog post for more background here.

We also provide a step-by-step set-up guide and some other getting-started tutorials.

In this repository, we provide these applications as Terraform modules that the user can install into a Kubernetes cluster. The tools we provide are the following:

Architecture diagram

Other repositories

Repositories diagram

Modules

Jupyter Hub

With JupyterHub, we enable a multi-user environment in which each user can spawn a Jupyter server to run their experiments. Users can work in different environments and install any libraries necessary to meet their needs.

We provide a default Jupyter server image that comes with most of the data science packages installed. Users can use their own Jupyter server images as well.

Configuration

Below we provide a list of the configurable parameters and their default values.

| Parameter (* = required) | Description | Default |
| --- | --- | --- |
| `jupyterhub_namespace` | Namespace in which to install JupyterHub | `jhub` |

Proxy configuration

The proxy receives the requests from the client’s browser and forwards all requests to the Hub. In the JupyterHub docs you can find a more in-depth explanation.

* Required parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `jhub_proxy_https_enabled` | Whether HTTPS should be enabled on the proxy | `false` |
| `jhub_proxy_https_hosts` | Your domains, in list form. Required for automatic HTTPS | `[]` |
| `jhub_proxy_secret_token` * | A 32-byte cryptographically secure randomly generated string used to secure communications between the hub and the configurable-http-proxy (for example, generated by `openssl rand -hex 32`) | `nil` |
| `jhub_proxy_https_letsencrypt_contact_email` | The contact email used for HTTPS certificates automatically provisioned by Let's Encrypt | `""` |
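As an illustration, these proxy parameters might be passed to the JupyterHub module roughly as follows. The module source path, domain, and email are hypothetical placeholders; generate your own secret token.

```terraform
module "jupyterhub" {
  # Hypothetical source path; point this at the JupyterHub module in this repository.
  source = "./modules/jupyterhub"

  jhub_proxy_https_enabled                   = true
  jhub_proxy_https_hosts                     = ["jupyter.example.com"]     # your domain(s)
  jhub_proxy_secret_token                    = var.jhub_proxy_secret_token # e.g. from `openssl rand -hex 32`
  jhub_proxy_https_letsencrypt_contact_email = "admin@example.com"
}
```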

Authentication configuration

JupyterHub’s OAuthenticator has support for enabling your users to authenticate via a third-party OAuth2 identity provider such as GitHub.

You can configure authentication using GitHub accounts and restrict what users are authorized based on membership in a GitHub organization.

See details on how to set up a GitHub OAuth app here.

If you choose not to use GitHub to authenticate users, the DummyAuthenticator will be used as default. The Dummy Authenticator lets any user log in with the given password.

The dummy password is: a-shared-secret-password.

\* Required parameters
\*\* Required when `oauth_github_enable` is enabled

| Parameter | Description | Default |
| --- | --- | --- |
| `oauth_github_enable` | Defines whether authentication will be handled by GitHub OAuth | `false` |
| `oauth_github_client_id` ** | GitHub client ID used by the GitHubOAuthenticator | `""` |
| `oauth_github_client_secret` ** | GitHub client secret used to authenticate with GitHub | `""` |
| `oauth_github_admin_users` | List of GitHub usernames to allow as administrators | `[]` |
| `oauth_github_callback_url` | The URL users are redirected to after they authorize your GitHub app to act on their behalf | `""` |
| `oauth_github_allowed_organizations` | List of GitHub organizations whose members are allowed access | `[""]` |
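A sketch of enabling GitHub authentication, assuming a hypothetical module source path; the client ID, secret, usernames, and URLs below are placeholders for values from your own GitHub OAuth app:

```terraform
module "jupyterhub" {
  source = "./modules/jupyterhub" # hypothetical path

  oauth_github_enable                = true
  oauth_github_client_id             = var.github_client_id     # from your GitHub OAuth app
  oauth_github_client_secret         = var.github_client_secret
  oauth_github_admin_users           = ["your-github-username"]
  oauth_github_callback_url          = "https://jupyter.example.com/hub/oauth_callback"
  oauth_github_allowed_organizations = ["your-org"]
}
```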

User configuration

Single-user configuration refers to the default settings for each user logged into JupyterHub.

A user can choose a Docker image to spawn a new Jupyter server. Each Docker image can have different libraries and environments installed. We use the singleuser_profile_list parameter to set up a list of default images available to the user. This parameter receives a list of maps that describes the image details such as the image location and description.

See an example:

[{
  display_name = "Prefect"
  description  = "Notebook with prefect installed"
  default      = true
  kubespawner_override = {
    image = "drtools/prefect:notebook-prefect"
  }
}]

You must pass the image pull secret if you provide an image located in a private container registry. The image pull secret parameter is defined as below:

default = [{
    name = ""
}]
| Parameter | Description | Default |
| --- | --- | --- |
| `singleuser_profile_list` | List of images from which the user can select to spawn a server | |
| `singleuser_image_pull_secrets` | List of image pull secrets | `nil` |
| `singleuser_image_pull_policy` | Image pull policy | `Always` |
| `singleuser_memory_guarantee` | How much memory is guaranteed to each user | `1G` |
| `singleuser_storage_capacity` | How much storage capacity each user will have | `1G` |
| `singleuser_storage_mount_path` | Storage mount path | `/home/jovyan/persistent` |

Prefect

...

| Parameter | Description | Default |
| --- | --- | --- |
| `namespace` | Namespace name to deploy the application | `prefect` |
| `prefect_version_tag` | Configures the default tag for Prefect images | `latest` |

Agent

According to Prefect docs, Agents are lightweight processes for orchestrating flow runs. Agents run inside a user's architecture, and are responsible for starting and monitoring flow runs. During operation the agent process queries the Prefect API for any scheduled flow runs, and allocates resources for them on their respective deployment platforms.

| Parameter | Description | Default |
| --- | --- | --- |
| `agent_enabled` | Determines whether the Prefect Kubernetes agent is deployed | `True` |
| `agent_prefect_labels` | Defines which scheduling labels (not Kubernetes labels) should be associated with the agent | `[""]` |
| `agent_image_name` | Defines the Prefect agent image name | `prefecthq/prefect` |
| `agent_image_tag` | Defines the agent image tag | `""` |
| `agent_image_pull_policy` | Defines the image pull policy | `Always` |

Postgresql

| Parameter | Description | Default |
| --- | --- | --- |
| `postgresql_database` | Defines the PostgreSQL database name | `prefect` |
| `postgresql_username` | Defines the username to authenticate with | `prefect` |
| `postgresql_existing_secret` | Configures which secret should be referenced for access to the database | `""` |
| `postgresql_service_port` | Configures the port on which the database should be accessed | `5432` |
| `postgresql_external_hostname` | Defines the address at which to contact an externally managed PostgreSQL instance | `""` |
| `postgresql_use_subchart` | Determines whether this chart should deploy a user-managed PostgreSQL database or use an externally managed instance | `true` |
| `postgresql_persistence_enabled` | Enables a PVC that stores the database between deployments. If you make changes to the database deployment, this PVC must be deleted for the changes to take effect. This is especially notable when the authentication password changes on redeploys | `false` |
| `postgresql_persistence_size` | Defines the persistent storage size for PostgreSQL | `8G` |
| `postgresql_init_user` | Defines the initial database username | `postgres` |
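For example, pointing Prefect at an externally managed PostgreSQL instance might look roughly like this. The module source path, hostname, and secret name are hypothetical placeholders:

```terraform
module "prefect" {
  source = "./modules/prefect" # hypothetical path

  # Disable the bundled PostgreSQL subchart and use an external instance instead.
  postgresql_use_subchart      = false
  postgresql_external_hostname = "postgres.internal.example.com"
  postgresql_database          = "prefect"
  postgresql_username          = "prefect"
  postgresql_existing_secret   = "prefect-db-credentials" # Kubernetes secret holding the password
}
```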

Dask

...

| Parameter | Description | Default |
| --- | --- | --- |
| `namespace` | Namespace name to deploy the application | `dask` |
| `worker_name` | Dask worker name | `worker` |
| `worker_replicas` | Default number of workers | `3` |
| `worker_image_repository` | Container image repository | `daskdev/dask` |
| `worker_image_tag` | Container image tag | `2.30.0` |
| `worker_image_pull_policy` | Container image pull policy | `IfNotPresent` |
| `worker_image_dask_worker_command` | Dask worker command (e.g. `dask-cuda-worker` for a GPU worker) | `dask-worker` |
| `worker_image_pull_secret` | Container image pull secrets | `[{name: ""}]` |
| `worker_environment_variables` | Environment variables. See `values.yaml` for example values | `[{}]` |
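A minimal sketch of configuring the Dask workers, assuming a hypothetical module source path; the environment variable shown is an illustrative example, not a required setting:

```terraform
module "dask" {
  source = "./modules/dask" # hypothetical path

  worker_replicas         = 5
  worker_image_repository = "daskdev/dask"
  worker_image_tag        = "2.30.0"

  # Swap in "dask-cuda-worker" here to run GPU workers instead.
  worker_image_dask_worker_command = "dask-worker"

  worker_environment_variables = [{
    name  = "EXTRA_PIP_PACKAGES" # example: install extra packages at worker start-up
    value = "prefect"
  }]
}
```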

Feast

...

| Parameter | Description | Default |
| --- | --- | --- |
| `namespace` | Namespace name to deploy the application | `feast` |
| `feast_core_enabled` | Defines whether to install Feast Core | `True` |
| `feast_online_serving_enabled` | Defines whether to install Feast Online Serving | `True` |
| `feast_jupyter_enabled` | Defines whether to install the Feast Jupyter server | `False` |
| `feast_jobservice_enabled` | Defines whether to install the Feast Job Service | `True` |
| `feast_posgresql_enabled` | Defines whether to enable PostgreSQL | `True` |
| `feast_postgresql_password` * | PostgreSQL password | `""` |
| `feast_kafka_enabled` | Defines whether to enable Kafka | `False` |
| `feast_redis_enabled` | Defines whether to enable Redis | `True` |
| `feast_redis_use_password` | Defines whether to enable a Redis password | `False` |
| `feast_prometheus_enabled` | Defines whether to install Prometheus | `False` |
| `feast_prometheus_statsd_exporter_enabled` | Defines whether to enable the StatsD exporter | `False` |
| `feast_grafana_enabled` | Defines whether to enable Grafana | `True` |

MLFlow

...

| Parameter | Description | Default |
| --- | --- | --- |
| `namespace` | Namespace name to deploy the application | `mlflow` |
| `db_host` | Database host address | `""` |
| `db_username` | Database username | `mlflow` |
| `db_password` * | Database password | `""` |
| `database_name` | Database name | `mlflow` |
| `db_port` | Database port | `5432` |
| `default_artifact_root` | Local or remote filepath to store model artifacts. Mandatory when specifying a database backend store | `/tmp` |
| `image_pull_policy` | Docker image pull policy | `IfNotPresent` |
| `image_repository` | Docker image repository | `drtools/mlflow` |
| `image_tag` | Docker image tag | `1.13.1` |
| `service_type` | Kubernetes service type | `NodePort` |
| `docker_registry_server` | Docker registry server | `""` |
| `docker_auth_key` | Base64-encoded combination of `{registry_username}:{registry_password}`. Can be found in `~/.docker/config.json` | `""` |
| `docker_private_repo` | Whether MLFlow's image comes from a private repository. If `true`, `docker_registry_server` and `docker_auth_key` are required | `false` |

Note: The variables docker_registry_server and docker_auth_key are optional and should only be used when pulling MLFlow's image from a private repository.
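Pulling MLFlow's image from a private registry might then be wired up roughly like this. The module source path, registry address, and artifact bucket are hypothetical placeholders:

```terraform
module "mlflow" {
  source = "./modules/mlflow" # hypothetical path

  image_repository = "registry.example.com/team/mlflow"
  image_tag        = "1.13.1"

  # Required together when the image lives in a private registry.
  docker_private_repo    = true
  docker_registry_server = "registry.example.com"
  docker_auth_key        = var.docker_auth_key # base64 of "{registry_username}:{registry_password}"

  db_host               = var.db_host
  db_password           = var.db_password
  default_artifact_root = "s3://my-mlflow-artifacts" # remote path, since a database backend store is used
}
```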

Seldon

| Parameter | Description | Default |
| --- | --- | --- |
| `namespace` | Namespace name to deploy the application | `mlflow` |
| `istio_enabled` | Whether to install Istio as the ingress controller | `true` |
| `usage_metrics_enabled` | Whether to enable usage metrics | `true` |

Exposing Services

In order to access the services from outside the cluster, we need to expose them. Usually, this is done through Kubernetes Ingress resources. In this project, since we rely on Seldon to expose our prediction endpoints, we use the Ambassador API Gateway as our ingress controller. Seldon Core works well with Ambassador: a single ingress can be used to expose Ambassador, and running machine learning deployments can then be dynamically exposed through Seldon-created Ambassador configurations.

Ambassador

Ambassador is a Kubernetes-native API Gateway built on the Envoy Proxy. In addition to the classical routing capabilities of an ingress, it can perform sophisticated traffic management functions, such as load balancing, circuit breakers, rate limits, and automatic retries. Also, it has support for independent authentication systems, such as the ORY ecosystem.

Exposing a service in Ambassador

Ambassador is designed around a declarative, self-service management model. The core resource used to support application development teams that need to manage the edge with Ambassador is the Mapping resource, which allows us to define custom routing rules for our services. This routing configuration can be achieved by applying a custom Kubernetes resource like the following:

# mapping.yaml
---
apiVersion: getambassador.io/v2
kind:  Mapping
metadata:
  name:  httpbin-mapping
spec:
  prefix: /httpbin/
  service: httpbin.httpbin_namespace

Apply this configuration with kubectl apply -f mapping.yaml.

Terraform

Since this project uses Terraform to manage resources, and with the current version it is not yet possible to apply custom Kubernetes resource definitions, we need to add this YAML inside the service's annotations. One way to do this is through the Service's metadata field:

resource "kubernetes_service" "httpbin" {
  metadata {
    ...
    annotations = {
      "getambassador.io/config" = <<YAML
---
apiVersion: getambassador.io/v2
kind: Mapping
name: httpbin-mapping
service: httpbin.httpbin_namespace
prefix: /httpbin/
YAML
    }
  }
}

This produces the same behaviour as applying the custom YAML file described above.

Authentication

Since we're exposing our services on the Internet, we need an authentication and authorization system to prevent unwanted users from accessing our services. The Ambassador API Gateway can control access by using an External Authentication Service resource (AuthService). An AuthService is an API with a verification endpoint that determines whether the user can access the resource (returning 200 if allowed, 401 otherwise). In this project, we rely on the ORY ecosystem to enable authentication. ORY is an open-source ecosystem of services with clear boundaries that solve authentication and authorization.

Session Lifespan

The session lifespan of authenticated users can be managed through the /ory/kratos/values.yaml file. By default the session lifespan is 24h, but in this project it is set to 30 days (720h):

kratos:
  config:
  ...
    session:
      cookie:
        domain: ${cookie_domain}
      lifespan: 720h

ORY Oathkeeper

ORY Oathkeeper is an Identity and Access Proxy. It functions as a centralized way to manage different authentication and authorization methods, and informs the gateway whether an HTTP request is allowed. Oathkeeper serves perfectly as Ambassador's external AuthService.

Zero-Trust and Unauthorized Resources

Oathkeeper is rooted in the principle of "never trust, always verify". This means that if no additional configuration is provided, Oathkeeper will always block the incoming request. In practice, all endpoints exposed through Ambassador are blocked to external requests until further configuration is made.

Access Rules

Access rules for ORY Oathkeeper are configured in the file access-rule-oathkeeper.yaml. Examples:

Allow all incoming requests

- id: oathkeeper-access-rule
  match:
    url: <{http,https}>://${hostname}/allowed-service/<**>
    methods:
      - GET
  authenticators:
    - handler: anonymous
  authorizer:
    handler: allow
  mutators:
    - handler: noop
  credentials_issuer:
    handler: noop

This configuration registers all incoming requests as a guest user, thus performing no credentials validation.

Authorize on KRATOS

- id: httpbin-access-rule
  match:
    url: <{http,https}>://${hostname}/blocked-service/<**>
    methods:
      - GET
  authenticators:
    - handler: cookie_session
  authorizer:
    handler: allow
  mutators:
    - handler: id_token
  credentials_issuer:
    handler: noop
  errors:
    - handler: redirect
      config:
        to: http://${hostname}/auth/login

This configuration forces all incoming requests to authenticate by checking a cookie_session, whose configuration is specified in config-oathkeeper.yaml.