kedro-starters-sklearn
This repository provides the following starter templates for Kedro 0.18.14.
- sklearn-iris trains a Logistic Regression model using Scikit-learn.
- sklearn-mlflow-iris adds an experiment tracking feature using MLflow.
sklearn-iris template
Iris dataset
The Iris dataset is included and used by default.
- Modification: setosa is encoded as 0, versicolor is encoded as 1, and virginica samples are removed.
- Split: for each species, the first 25 samples are included in train.csv, and the last 25 samples are included in test.csv.
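The modification and split rules above can be sketched in plain Python. This is only an illustration of the described rule, not the script used to build train.csv and test.csv. The function name `encode_and_split`, the `(features, species)` row layout, and the placeholder rows are assumptions made for the example.

```python
from collections import defaultdict

# Label encoding taken from the description above; virginica has no entry
# because its samples are removed.
LABELS = {"setosa": 0, "versicolor": 1}


def encode_and_split(rows, n_train=25, n_test=25):
    """Split (features, species) rows, given in dataset order, into train and test.

    Hypothetical helper: keeps the first n_train and the last n_test samples
    of each retained species.
    """
    by_species = defaultdict(list)
    for features, species in rows:
        if species in LABELS:  # drop virginica (and any unknown species)
            by_species[species].append((features, LABELS[species]))

    train, test = [], []
    for species in LABELS:
        samples = by_species[species]
        train.extend(samples[:n_train])
        test.extend(samples[len(samples) - n_test:])
    return train, test


# Placeholder rows (not real measurements): 50 samples per species,
# mirroring the size of the Iris dataset.
rows = [((i,), species)
        for species in ("setosa", "versicolor", "virginica")
        for i in range(50)]
train, test = encode_and_split(rows)
```

With 50 samples per species, the two slices do not overlap, so every retained sample lands in exactly one of the two sets.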
How to use
- Install dependencies.
  pip install 'kedro==0.18.14' pandas scikit-learn
- Generate your Kedro starter project from the sklearn-iris directory.
  kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-iris
  As explained in Kedro's documentation, enter project_name, repo_name, and python_package.
  Note: For your Python package name, choose a unique name and avoid generic names such as "test" or "sklearn" that are already used by another package. You can list the importable packages by running
  python -c "help('modules')"
- Change the current directory to the generated project directory.
  cd /path/to/project/directory
- Run the project.
  kedro run
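As a complement to the note on package names in the steps above, a candidate name can also be checked programmatically. The following is a minimal sketch using only the standard library. The helper name `is_package_name_free` is hypothetical, and Kedro may enforce additional naming rules that this check does not cover.

```python
import importlib.util
import keyword


def is_package_name_free(name: str) -> bool:
    """Return True if `name` is a valid identifier that no importable module uses.

    Hypothetical helper; Kedro may apply further naming rules of its own.
    """
    if not name.isidentifier() or keyword.iskeyword(name):
        return False
    try:
        # find_spec returns None when no top-level module of that name is found.
        return importlib.util.find_spec(name) is None
    except ValueError:
        # Raised for some already-loaded modules without a spec; treat as taken.
        return False


for candidate in ("json", "my_iris_project"):
    print(candidate, is_package_name_free(candidate))
```

Because the lookup follows the current import path, run the check from the same Python environment in which you plan to create the project.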
Option to use Kaggle Titanic dataset
- Download the Kaggle Titanic dataset.
- Replace train.csv and test.csv in the /path/to/project/directory/data/01_raw directory with the downloaded files.
- Modify /path/to/project/directory/base/parameters.yml to set parameters appropriate for the dataset (commented out by default).
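The file replacement step can also be scripted. The sketch below assumes the downloaded train.csv and test.csv sit together in one directory. The helper name `replace_raw_csvs` and its arguments are hypothetical, and it does not touch parameters.yml, which still needs to be edited by hand.

```python
import shutil
from pathlib import Path


def replace_raw_csvs(download_dir, project_dir):
    """Copy train.csv and test.csv from download_dir into <project_dir>/data/01_raw.

    Hypothetical helper; existing files of the same name are overwritten.
    """
    raw_dir = Path(project_dir) / "data" / "01_raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ("train.csv", "test.csv"):
        source = Path(download_dir) / name
        if not source.is_file():
            raise FileNotFoundError(f"Expected {source} to exist")
        copied.append(Path(shutil.copy2(source, raw_dir / name)))
    return copied
```

For example, `replace_raw_csvs("/path/to/downloads", "/path/to/project/directory")` would overwrite the two bundled Iris files, where `/path/to/downloads` is a placeholder for wherever the Kaggle files were saved.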
sklearn-mlflow-iris template
This template integrates MLflow with Kedro using PipelineX. Even without writing MLflow code, you can:
- configure MLflow Tracking
- log inputs and outputs of Python functions set up as Kedro nodes as parameters (e.g. features used to train the model) and metrics (e.g. F1 score).
- log execution time for each Kedro node and DataSet loading/saving as metrics.
- log artifacts (e.g. models, execution time Gantt chart visualized by Plotly, parameters.yml file)
In this template, MLflow logging is configured in Python code at src/<python_package>/mlflow/mlflow_config.py.
See here for details.
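To give a feel for what logging execution time as a metric involves, here is a minimal standard-library sketch. It is not PipelineX's implementation: the decorator `log_time`, the metric key format, and the `train_model` function are assumptions for illustration, and a plain dict stands in for the tracking backend.

```python
import functools
import time


def log_time(log_metric):
    """Decorator factory: report each call's duration through log_metric(key, value)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped function raises.
                elapsed = time.perf_counter() - start
                log_metric(f"time_{func.__name__}", elapsed)
        return wrapper
    return decorator


# Stand-in for a tracking backend: collect metrics in a dict.
metrics = {}


@log_time(metrics.__setitem__)
def train_model(n):
    # Placeholder for a function that would be set up as a Kedro node.
    return sum(range(n))


result = train_model(1000)
```

Any callable taking a key and a value can be passed as `log_metric`. MLflow's `mlflow.log_metric` accepts those two as its first arguments, so it should fit here, though in this template the wiring is handled for you.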
How to use
- Install dependencies.
  pip install 'kedro==0.18.14' pandas scikit-learn mlflow 'pipelinex>=0.7.7' plotly
- Generate your Kedro starter project from the sklearn-mlflow-iris directory.
  kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-mlflow-iris
- Follow the same steps as for the sklearn-iris template.
Access MLflow web UI
To access the MLflow web UI, launch the MLflow server.
mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri sqlite:///mlruns/sqlite.db --default-artifact-root ./mlruns
<p align="center">
<img src="_doc_images/mlflow_ui_metrics.png">
Logged metrics shown in MLflow's UI
</p>
<p align="center">
<img src="_doc_images/mlflow_ui_gantt.png">
Gantt chart for execution time, generated using Plotly, shown in MLflow's UI
</p>