Machine learning pipeline
This repo provides an example of how to incorporate popular machine learning tools such as DVC, MLflow, and Hydra in your machine learning project. I use my project on predicting aggressive tweets as an example.
Find the article on how to use MLflow and Hydra here
Find the article on how to use DVC here
DVC
DVC is a data version control tool. To install DVC, run
pip install dvc
Hydra
With Hydra, you can compose your configuration dynamically. To install Hydra, simply run
pip install hydra-core --upgrade
MLflow
MLflow is a platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. Install MLflow with
pip install mlflow
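As a rough illustration of the experimentation side, the sketch below logs a nested configuration and some metrics to one MLflow run. The names `flatten_params` and `log_run` are hypothetical helpers, not part of this repo or of MLflow. The sketch assumes the configuration is available as a plain nested dict (for a Hydra config, for example after `OmegaConf.to_container(cfg, resolve=True)`).

```python
from collections.abc import Mapping


def flatten_params(config, prefix=""):
    """Flatten a nested mapping into {"section.key": value} pairs.

    Hypothetical helper: MLflow stores parameters as flat key/value pairs,
    so nested config sections are joined with dots.
    """
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections, carrying the dotted prefix along.
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_run(config, metrics):
    """Record one run; `config` and `metrics` are assumed to be plain dicts."""
    import mlflow  # imported lazily so flatten_params works without MLflow

    with mlflow.start_run():
        mlflow.log_params(flatten_params(config))
        mlflow.log_metrics(metrics)
```

Calling `log_run({"model": {"C": 0.5}}, {"f1": 0.8})` with made-up values would record a parameter named `model.C` and a metric named `f1` for that run.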
Project structure
- src: directory for source code
- mlruns: directory for MLflow runs
- configs: directory for config files
- outputs: results from Hydra runs. Each time you run a function wrapped in Hydra's decorator, the output is saved here. Because Hydra can change the working directory for each run, point MLflow back to the mlruns directory at the project root with
import mlflow
import hydra
from hydra import utils
mlflow.set_tracking_uri('file://' + utils.get_original_cwd() + '/mlruns')
- src/preprocessing.py: file for preprocessing
- src/train_pipeline.py: training pipeline
- src/train.py: file for training and saving the model
- src/predict.py: file for prediction and loading the model
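The tracking URI shown above for the outputs directory is built by string concatenation. A slightly more defensive sketch, assuming a hypothetical helper named `tracking_uri`, lets `pathlib` build the `file://` URI so that characters such as spaces in the project path are escaped:

```python
from pathlib import Path


def tracking_uri(project_root):
    """Return a file:// URI for the mlruns folder under project_root.

    Hypothetical helper. Path.as_uri() requires an absolute path and
    raises ValueError for a relative one.
    """
    return (Path(project_root) / "mlruns").as_uri()


# Intended use inside a Hydra-decorated function (not executed here):
#   mlflow.set_tracking_uri(tracking_uri(utils.get_original_cwd()))
```

Note that `utils.get_original_cwd()` is only meaningful while a Hydra application is running, so the call belongs inside the decorated function rather than at module level.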
How to pull the data with DVC
Pull the data from Google Drive
dvc pull
How to run this project
To run the configs and see how these experiments are displayed on MLflow's server, clone this repo and run
python src/train.py
Once the run is complete, you can start MLflow's server with
mlflow ui
Run this command from the same directory where you ran the training script, then open http://localhost:5000/ in your browser. You should be able to see your experiments there.