MLflow Examples

MLflow examples - basic and advanced.

This repo consists of two sets of code artifacts:

Regular Python scripts using open source MLflow
Databricks notebooks using Databricks MLflow

Last updated: 2024-07-12

Examples

Python examples

sklearn - Scikit-learn model - train and score.
- Canonical example that shows multiple ways to train and score.
- Options to log ONNX model, autolog and save model signature.
- Train locally or against a Databricks cluster.
- Score real-time against a local web server or Docker container.
- Score batch with mlflow.load_model or Spark UDF.
sparkml - Spark ML model - train and score. ONNX too.
Keras/TensorFlow - train and score. ONNX working too.
- Keras with TensorFlow 2.x
  - keras_tf_wine - Wine quality dataset
  - keras_tf_mnist - MNIST dataset
- keras_tf1 - Keras with TensorFlow 1.x - legacy
xgboost - XGBoost (sklearn wrapper) model - train and score.
catboost - CatBoost (using sklearn) model - train and score. ONNX working too.
pytorch - PyTorch - train and score. ONNX too.
onnx_sklearn - ONNX - Sklearn to ONNX train and score.
h2o - H2O model - train and score - with AutoML. ONNX too.
model_registry - Jupyter notebook sampling the Model Registry API.
e2e-ml-pipeline - End-to-end ML pipeline - training to real-time scoring.
reproduce - Reproduce an existing run.
nested_runs - Create a nested run with specified number of levels.
scoring_server_benchmarks - Scoring server performance benchmarks.
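
As an illustration of what the nested_runs example does, here is a minimal sketch of creating runs nested to a given depth. It is not the repo's implementation: the function name create_nested_runs and the level_N run names are assumptions made for this sketch. It relies only on mlflow.start_run with nested=True, which is the standard MLflow call for opening a child run.

```python
def create_nested_runs(max_level, level=0):
    """Create a chain of runs nested max_level deep and return their run IDs.

    Hypothetical helper for illustration; names are not taken from the repo.
    """
    # Imported lazily so this module can be loaded without MLflow installed.
    import mlflow

    # The top-level run is a normal run; every deeper run is a nested child.
    with mlflow.start_run(run_name=f"level_{level}", nested=(level > 0)) as run:
        mlflow.log_param("level", level)
        run_ids = [run.info.run_id]
        # Recurse while the parent run is still active so the child attaches to it.
        if level + 1 < max_level:
            run_ids += create_nested_runs(max_level, level + 1)
    return run_ids
```

Calling create_nested_runs(3) against a tracking server should produce one parent run with a child and a grandchild.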

The sklearn and Spark ML examples also demonstrate:

Different ways to run a project with the mlflow CLI
Real-time server scoring with docker containers
Running a project against a Databricks cluster

Scala examples - use the MLflow Java client

hello_world - Hello World - no training or scoring.
sparkml - Scala train and score - Spark ML and XGBoost4j
mleap - Score an MLeap model with MLeap runtime (no Spark dependencies).
onnx - Score an ONNX model (that was created in Scikit-learn) in Java.

Databricks

Databricks notebooks - current.
Notebook CICD - Lightweight CI/CD example with a Databricks notebook. Legacy.

Docker

docker/docker-server - MLflow tracking server and MySQL database containers.

Setup

Use Python 3.8.

For the Python environment, use either:
- Miniconda with conda.yaml.
- Virtual environment with PyPI.
Install Spark 3.4.0.
For ONNX installation, see python/sklearn/conda.yaml.

Miniconda

Install miniconda3: https://conda.io/miniconda.html
Create the environment: conda env create --file conda.yaml
Activate the environment: conda activate mlflow-examples (on older conda versions: source activate mlflow-examples)

Virtual Environment

Create a virtual environment.

python -m venv mlflow-examples
source mlflow-examples/bin/activate

Then pip install the libraries listed in conda.yaml.

MLflow Server

You can either run the MLflow tracking server directly on your laptop or with Docker.

Docker

See docker/docker-server/README.

Laptop Tracking Server

You can use either the local file store or a database-backed store. See the MLflow Storage documentation.

Note that the Model Registry functionality introduced in MLflow 1.4.0 appears to work only with a database-backed store.

First activate the virtual environment.

cd $HOME/mlflow-server
source $HOME/virtualenvs/mlflow-examples/bin/activate

File Store

Start the MLflow tracking server.

mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri $PWD/mlruns --default-artifact-root $PWD/mlruns

Database-backed store - MySQL

Install MySQL.
Create an mlflow user with a password.
Create a database named mlflow.

Start the MLflow Tracking Server

mlflow server --host 0.0.0.0 --port 5000 \
  --backend-store-uri mysql://MLFLOW_USER:MLFLOW_PASSWORD@localhost:3306/mlflow \
  --default-artifact-root $PWD/mlruns
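
MLflow hands the backend store URI to SQLAlchemy, so a password containing characters such as @ or / has to be URL-encoded before it is placed in the URI. The sketch below shows one way to build the string; the helper name mysql_backend_uri and the sample credentials are hypothetical, and the defaults simply mirror the command above.

```python
from urllib.parse import quote_plus


def mysql_backend_uri(user, password, host="localhost", port=3306, database="mlflow"):
    """Build a MySQL backend store URI with URL-encoded credentials.

    Hypothetical helper; defaults mirror the mlflow server command above.
    """
    # quote_plus escapes characters that would otherwise break URI parsing.
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


# Sample credentials are made up for illustration only.
print(mysql_backend_uri("mlflow", "p@ss/word"))
```

The printed value can be passed to --backend-store-uri in place of the MLFLOW_USER:MLFLOW_PASSWORD placeholder.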

Database-backed store - SQLite

mlflow server --host 0.0.0.0 --port 5000 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root $PWD/mlruns

Examples

Most of the examples use a DecisionTreeRegressor model with the wine quality data set.

As a result, python/sparkml and scala/sparkml are equivalent: they are language variants of the same Spark ML algorithm.

Setup

Before running an experiment, set the tracking URI:

export MLFLOW_TRACKING_URI=http://localhost:5000
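
MLflow reads MLFLOW_TRACKING_URI on its own, so no code change is required once the variable is exported. For scripts that want to make the choice explicit, a minimal sketch follows. The helper names, the wine-demo experiment name, and the logged values are assumptions for illustration, not results or names from this repo.

```python
import os

# Matches the host and port used in the mlflow server commands above.
DEFAULT_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(default=DEFAULT_TRACKING_URI):
    """Return MLFLOW_TRACKING_URI if it is set and non-empty, else the default."""
    return os.environ.get("MLFLOW_TRACKING_URI") or default


def log_demo_run(experiment_name="wine-demo"):  # hypothetical experiment name
    """Log one small run to the resolved tracking server and return its run ID."""
    # Imported lazily so resolve_tracking_uri works without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(resolve_tracking_uri())
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        mlflow.log_param("max_depth", 4)   # illustrative value
        mlflow.log_metric("rmse", 0.75)    # illustrative value, not a real result
        return run.info.run_id
```

With a tracking server running, calling log_demo_run() should create a run that is visible in the MLflow UI at the resolved URI.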

Data

Data is in the data folder.

wine-quality-white.csv contains the training data.

Real-time scoring prediction data

The prediction files contain the first three records of wine-quality-white.csv.
The format is the standard MLflow JSON serialization of a Pandas DataFrame in split orientation.
predict-wine-quality.json is derived directly from wine-quality-white.csv.
- The values are a mix of integers and doubles.
Scoring predict-wine-quality.json against an MLeap SageMaker container apparently fails because the server cannot handle integers (a bug).
predict-wine-quality-float.json, whose values are all doubles, is provided as a workaround.
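
To make the integer versus double distinction concrete, here is a minimal sketch that builds a split-orientation payload and coerces every value to a float, which is the property that sets the all-doubles file apart. The helper names, the three-column subset, the sample values, and the scoring URL with port 5001 are assumptions for illustration. Depending on the MLflow version, the scoring server may expect a different content type or the payload wrapped under a dataframe_split key.

```python
import json
import urllib.request


def to_split_payload(columns, rows, as_float=True):
    """Return a dict in Pandas split orientation: {"columns": [...], "data": [[...]]}."""
    # Coercing to float makes json.dumps emit 7.0 rather than 7.
    data = [[float(v) if as_float else v for v in row] for row in rows]
    return {"columns": list(columns), "data": data}


def build_scoring_request(payload, url="http://localhost:5001/invocations"):
    """Build, but do not send, a POST request for an MLflow scoring server.

    The URL is a hypothetical local endpoint.
    """
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Illustrative subset of columns and values; the real files contain all feature columns.
columns = ["fixed acidity", "volatile acidity", "alcohol"]
rows = [[7, 0.27, 8.8], [6.3, 0.3, 9.5]]
print(json.dumps(to_split_payload(columns, rows)))
```

Passing the request to urllib.request.urlopen would send it to a running scoring server.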