MLSpec
A project to standardize the inter-component schemas for a multi-stage ML pipeline
Background
The machine learning industry has embraced cloud-native architectures composed of multiple loosely coupled components. One issue with this approach, however, is that while the steps of a machine learning pipeline have been fairly well articulated in a wide variety of publications, the specifications for how to wire these steps together remain highly varied, which makes it difficult to build standard tools that might simplify or formalize machine learning operations.
This project is about establishing community-driven standards that automated tooling can consume and produce. Ideally, this opens the door to standardized ML software engineering practices.
Existing Multi-Stage ML Workflows
The following projects, which focus on ML and batch workflows, provide inspiration:
- Facebook’s FBLearner Flow
- Google’s TFX Paper
- Kubeflow Pipelines
- Microsoft Azure ML Pipelines
- Netflix Meson
- Spotify’s Luigi
- Uber’s Michelangelo [paper]
From these projects and papers, we feel the steps listed under End-to-End Complete Lifecycle below summarize an end-to-end machine learning workflow.
Proposed standards
We propose a standard around the following components.
- Workflow orchestration – what standard endpoints should each step in an ML workflow expose (e.g. /ok, /varz, /metrics, etc)? A minimal endpoint sketch follows this list.
- Model - …
- Logging – what is the NCSA standard log for each inference request?
- Other…
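As a rough illustration of the orchestration endpoints above, the sketch below exposes /ok, /varz and /metrics from a single pipeline step using only the Python standard library. The endpoint names come from the list; the handler class and payloads are illustrative assumptions, not part of any proposed standard.

```python
# Minimal sketch of per-step orchestration endpoints (stdlib only).
# Endpoint names (/ok, /varz, /metrics) come from the list above; the payloads
# and the PipelineStepHandler class are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PipelineStepHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ok":
            self._reply(200, {"status": "ok"})  # liveness probe
        elif self.path == "/varz":
            # exported variables for this step (values are made up)
            self._reply(200, {"step": "data-validation", "schema_version": "2019-03-01"})
        elif self.path == "/metrics":
            # step-level metrics (values are made up)
            self._reply(200, {"requests_total": 42, "p99_latency_ms": 17.3})
        else:
            self._reply(404, {"error": "unknown endpoint"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), PipelineStepHandler).serve_forever()
```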
End-to-End Complete Lifecycle
We feel that, over time, every stage of the ML lifecycle will need some form of metadata management. The list below collects these stages:
- Codify Objectives - Detail the model outputs, possible errors and minimum success criteria for launching in code; a simple DSL that can be used to verify success/failure programmatically for automated deployment (see the objectives sketch after this list)
- Data Ingestion - What tools/connectors (e.g. ODBC, Spark, HDFS, CSV, etc) were used for pulling in data; what queries were used (including signed datasets); sharding strategies; may include labelling or synthetic data generation/simulation.
- Data Analysis - Set of descriptive statistics on the included features and configurable slices of the data. Identification of outliers.
- Data Transformation - What data conversions and feature wrangling (e.g. feature-to-integer mappings) were used; what outliers were programmatically eliminated
- Data Validation - What validation was applied to the data based on a versioned, succinct description of the expected properties of the data; the schema can also be used to prevent bad behavior, such as training on deprecated data; mechanisms to generate the first version of the schema (e.g. select * from foo limit 30) that can be used to drive other platform components, e.g. automatic feature-engineering or data-analysis tools (see the schema sketch after this list)
- Data Splitting (including partitioning) - How the data is split into training, validation, hold-back & debugging sets; records the results of validation statistics for each set; metadata here may be used to detect leakage of training data into testing data and/or overfitting (see the split-metadata sketch after this list)
- Model Training/Tuning - Metadata about how the model is packaged and the distribution strategy; hyperparameters searched and results of the search; results of any conversions to other model serving formats (e.g. TF -> ONNX); techniques used to quantize/compress/prune the model and the results (see the training-metadata sketch after this list)
- Model Evaluation/Validation - Results of evaluating and validating the model to ensure it meets the original codified objectives before serving it to users; computation of metrics on slices of data, both for improving performance and avoiding bias (e.g. gender A gets significantly better results than gender B); source of data used for validation
- Test - Results of final confirmation for the model on the hold-back data set; MUST BE A SEPARATE STEP from Model Evaluation/Validation above; source of data used for the final test
- Model Packaging - Metadata about model package; includes adding additional security constraints, monitoring agents, signing, etc.; descriptions of the necessary infrastructure (e.g. P100s, 16 GB of RAM, etc)
- Serving - Results of rolling model out to production
- Monitoring - Live queryable metadata that provides liveness checking and ML-specific signals that need action, such as significant deviation from previous model performance or degradation of model performance over time; ideally includes a rollback strategy (e.g. if this model is failing, use model last-year.last-month.pkl) (see the rollback sketch after this list)
- Logging - NCSA-style record per inference request, including a cryptographically secure record of the version of the pipeline (including features) and the data used to train (see the log-line sketch after this list).
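For the Codify Objectives step, here is a minimal sketch of codified objectives expressed as data plus a programmatic check, assuming a hypothetical check_objectives helper; every field name and threshold is an illustrative assumption rather than a proposed format.

```python
# Illustrative codified objectives: declare minimum success criteria as data,
# then gate automated deployment on them. Names and thresholds are assumptions.
OBJECTIVES = {
    "outputs": ["click_probability"],           # expected model outputs
    "min_auc": 0.80,                            # minimum success criterion for launch
    "max_p99_latency_ms": 50,                   # serving latency budget
    "allowed_errors": ["FeatureMissingError"],  # errors callers are expected to handle
}

def check_objectives(eval_results: dict, objectives: dict = OBJECTIVES) -> bool:
    """Return True only if the evaluation results satisfy every objective."""
    return (
        eval_results["auc"] >= objectives["min_auc"]
        and eval_results["p99_latency_ms"] <= objectives["max_p99_latency_ms"]
    )

# Example: gate an automated deployment on the codified objectives.
assert check_objectives({"auc": 0.83, "p99_latency_ms": 31})
```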
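For the Data Validation step, the following sketch shows a versioned, succinct schema describing the expected properties of the data and a validation pass over incoming rows; the schema fields and the validate_rows helper are assumptions made for illustration.

```python
# Illustrative versioned data schema plus validation. The property fields
# (type, range, nullability) and the deprecation flag are assumptions.
SCHEMA = {
    "version": "2019-03-01",
    "deprecated": False,  # flip to prevent training on deprecated data
    "features": {
        "age":     {"type": int,   "min": 0,   "max": 130, "nullable": False},
        "country": {"type": str,   "nullable": False},
        "income":  {"type": float, "min": 0.0, "nullable": True},
    },
}

def validate_rows(rows, schema=SCHEMA):
    """Yield (row_index, feature, problem) for every violation of the schema."""
    if schema["deprecated"]:
        raise ValueError("schema %s is deprecated" % schema["version"])
    for i, row in enumerate(rows):
        for name, spec in schema["features"].items():
            value = row.get(name)
            if value is None:
                if not spec["nullable"]:
                    yield i, name, "missing value"
            elif not isinstance(value, spec["type"]):
                yield i, name, "wrong type %s" % type(value).__name__
            elif "min" in spec and not spec["min"] <= value <= spec.get("max", float("inf")):
                yield i, name, "out of range"

print(list(validate_rows([{"age": -3, "country": "NZ", "income": 1200.0}])))
# [(0, 'age', 'out of range')]
```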
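For the Data Splitting step, a small sketch of split metadata together with a basic leakage check between training and test identifiers; the record layout and the id-overlap heuristic are assumptions.

```python
# Illustrative split metadata plus a simple leakage check: training and test
# sets should not share example ids. The record layout is an assumption.
split_record = {
    "strategy": "hash(user_id) mod 10",
    "fractions": {"train": 0.8, "validation": 0.1, "holdback": 0.1},
    "row_counts": {"train": 800_000, "validation": 100_000, "holdback": 100_000},
}

def leaked_ids(train_ids, test_ids):
    """Ids present in both sets indicate training data leaking into testing."""
    return set(train_ids) & set(test_ids)

print(leaked_ids([1, 2, 3], [3, 4]))  # {3}
```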
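For the Model Training/Tuning step, a sketch of the kind of metadata record that could accompany a single training run; every field name and value is an illustrative assumption rather than a proposed schema.

```python
# Illustrative training-run metadata: packaging, distribution strategy,
# hyperparameter search, format conversions and compression results.
training_run = {
    "run_id": "run-2019-03-01-001",
    "packaging": {"image": "trainer:1.4.2", "distribution": "4x V100, data-parallel"},
    "hyperparameter_search": {
        "strategy": "random",
        "space": {"learning_rate": [1e-4, 1e-1], "batch_size": [32, 256]},
        "best": {"learning_rate": 3e-3, "batch_size": 128, "val_auc": 0.84},
    },
    "conversions": [{"from": "tensorflow", "to": "onnx", "status": "succeeded"}],
    "compression": {"technique": "8-bit quantization", "size_reduction": "3.9x"},
}
```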
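For the Monitoring step, a sketch of a rollback strategy expressed as metadata, with a trivial check against the previous model's performance; metric names, thresholds and file names are assumptions.

```python
# Illustrative monitoring policy: roll back when live performance deviates too
# far from the previous model. Values are assumptions, not recommendations.
MONITORING_POLICY = {
    "metric": "auc",
    "max_relative_drop": 0.05,                     # tolerate at most a 5% relative drop
    "fallback_model": "last-year.last-month.pkl",  # model to serve on rollback
}

def should_roll_back(live_value: float, previous_value: float,
                     policy: dict = MONITORING_POLICY) -> bool:
    """True when the live metric has degraded beyond the allowed relative drop."""
    return live_value < previous_value * (1 - policy["max_relative_drop"])

print(should_roll_back(0.74, 0.82))  # True -> serve MONITORING_POLICY["fallback_model"]
```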
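For the Logging step, a sketch of an NCSA-style log line for a single inference request, extended with pipeline and training-data identifiers; the extra fields and the SHA-256 digest used as a stand-in for a cryptographic signature are assumptions, not a proposed format.

```python
# Illustrative NCSA-style access-log line per inference request, extended with
# pipeline/feature-set and training-data identifiers. Field choices are assumptions.
import hashlib
from datetime import datetime, timezone

def inference_log_line(client_ip, model_version, pipeline_version,
                       training_data_uri, status, nbytes):
    timestamp = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S %z")
    data_digest = hashlib.sha256(training_data_uri.encode()).hexdigest()[:16]
    return (
        f'{client_ip} - - [{timestamp}] "POST /v1/models/{model_version}:predict HTTP/1.1" '
        f'{status} {nbytes} pipeline={pipeline_version} training_data_sha256={data_digest}'
    )

print(inference_log_line("10.0.0.7", "fraud-2019.03.01", "pipeline-42",
                         "s3://bucket/train.csv", 200, 512))
```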
Table of contents for MLSpec repo
- General notes applicable to multiple objects in the system: how they are identified and named, basic operations, etc.
- Tracked execution of a pipeline or single script on compute
- Trained models
- logging_proto
- monitoring_proto
- The metadata file used to recreate the ML workflow