:bulb: A Guide to Production Level Deep Learning :clapper: :scroll: :ferry:

🇨🇳 Translation in Chinese

:label: NEW: Machine Learning Interviews

:label: Note: This repo is under continuous development, and all feedback and contributions are very welcome :blush:

Deploying deep learning models in production is challenging because it involves far more than training models that perform well. Several distinct components need to be designed and developed to deploy a production-level deep learning system (shown below):

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/components.png" title="" width="95%" height="95%"> </p>

This repo aims to be an engineering guideline for building production-level deep learning systems that will be deployed in real-world applications.

The material presented here is borrowed from the Full Stack Deep Learning Bootcamp (by Pieter Abbeel at UC Berkeley, Josh Tobin at OpenAI, and Sergey Karayev at Turnitin), the TFX workshop by Robert Crowe, and Pipeline.ai's Advanced KubeFlow Meetup by Chris Fregly.

Machine Learning Projects

Fun :flushed: fact: 85% of AI projects fail. <sup>1</sup> Potential reasons include:

1. ML Projects lifecycle

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/lifecycle.png" title="" width="95%" height="95%"></p>

2. Mental Model for ML project

The two important factors to consider when defining and prioritizing ML projects:

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/prioritize.png" title="" width="90%" height="90%"> </p>

Full stack pipeline

The following figure represents a high-level overview of the different components in a production-level deep learning system:

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/infra_tooling.png" title="" width="100%" height="100%"> </p>

In the following, we will go through each module and recommend toolsets and frameworks, as well as best practices from practitioners, that fit each component.

1. Data Management

1.1. Data Sources

1.2. Data Labeling

1.3. Data Storage

1.4. Data Versioning

1.5. Data Processing

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/airflow_pipe.png" title="" width="65%" height="65%"> </p>

2. Development, Training, and Evaluation

2.1. Software Engineering

2.2. Resource Management

2.3. DL Frameworks

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/frameworks.png" title="" width="95%" height="95%"> </p>

2.4. Experiment Management

2.5. Hyperparameter Tuning

2.6. Distributed Training

3. Troubleshooting [TBD]

4. Testing and Deployment

4.1. Testing and CI/CD

Production machine learning software requires a more diverse set of test suites than traditional software:

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/testing.png" title="" width="75%" height="75%"> </p>
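As a rough illustration of what "more diverse" can mean in practice, the sketch below pairs a conventional unit test on data-processing code with a model-quality check that fails when accuracy on a small held-out set drops below a threshold. Everything in it is hypothetical and not taken from the guide or the figure: `preprocess`, `predict`, `accuracy`, and `MIN_ACCURACY` are toy stand-ins chosen only to keep the example self-contained.

```python
# Minimal sketch, assuming a toy pipeline. All names here are hypothetical.

MIN_ACCURACY = 0.9  # assumed quality threshold, chosen only for illustration


def preprocess(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def predict(features):
    """Toy classifier: label 1 when the scaled feature is at least 0.5."""
    return [1 if f >= 0.5 else 0 for f in features]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def test_preprocess_range():
    # Traditional unit test: checks the data-processing code in isolation.
    scaled = preprocess([3, 7, 11])
    assert min(scaled) == 0.0 and max(scaled) == 1.0


def test_model_quality_gate():
    # ML-specific check: the whole pipeline must reach a minimum accuracy
    # on a small held-out set before it is considered releasable.
    raw = [1, 2, 8, 9]
    labels = [0, 0, 1, 1]
    acc = accuracy(predict(preprocess(raw)), labels)
    assert acc >= MIN_ACCURACY


if __name__ == "__main__":
    test_preprocess_range()
    test_model_quality_gate()
```

Both functions follow the `test_*` naming convention, so a runner such as pytest would collect them if the file were saved under a `test_`-prefixed name. In a real system the quality check would load a trained model and a versioned validation set rather than inline lists.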

4.2. Web Deployment

4.3. Service Mesh and Traffic Routing

4.4. Monitoring

Are we done?

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/post-deploy.png" title="" width="65%" height="65%"> </p>

4.5. Deploying on Embedded and Mobile Devices

4.6. All-in-One Solutions

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/infra-cmp.png" title="" width="100%" height="100%"> </p>

TensorFlow Extended (TFX)

[TBD]

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/tfx_config.png" title="" width="95%" height="95%"> </p>

Airflow and KubeFlow ML Pipelines

[TBD]

<p align="center"> <img src="https://github.com/alirezadir/Production-Level-Deep-Learning/blob/master/images/kubeflow_pipe.png" title="" width="45%" height="45%"> </p>

Other useful links:

Contributing

References:

<a name="fsdl">[1]</a>: Full Stack Deep Learning Bootcamp, Nov 2019.

<a name="pipe">[2]</a>: Advanced KubeFlow Workshop by Pipeline.ai, 2019.

<a name="pipe">[3]</a>: TFX: Real World Machine Learning in Production.