Home

Awesome

Machine learning to SQL

pypi GitHub Repo stars GitHub last commit Pypi downloads interpret Python Version GitHub

Table of Contents

<img src="https://github.com/kaspersgit/ml_2_sql/blob/main/docs/media/ml2sql_logo.png?raw=true" align="right" alt="ML2SQL">

  1. What is it?
  2. Getting Started
  3. Input
  4. Output
  5. Remarks
  6. Troubleshooting
<br>

What is it?

An automated machine learning cli tool which trains, graphs performance and saves the model in SQL. Using interpretable ML models (from interpretml) to train models which are explainable and interpretable, so called 'glassbox' models. With the outputted model in SQL format which can be used to put a model in 'production' in an SQL environment. This tool can be used by anybody, but is aimed for people who want to do a quick analysis and/or deploy a model in an SQL system.

Description of the GIF

Philosophy:

Note

</br>

Getting started

<details> <summary><strong>Set up</strong></summary> <br>
  1. Make sure you have python >= 3.8
  2. pip install ml2sql
  3. ml2sql init (creates folder structure for in- and output)
<br> </details> <details> <summary><strong>Quick Demo</strong></summary> <br>
  1. ml2sql run
  2. Follow the instructions and select the demo data and related config
  3. Check the output in the newly created folder
<br> </details> <details> <summary><strong>Quick Usage with Your Own Data</strong></summary> <br>
  1. Save csv file containing target and all features in the input/data/ folder (more info on input data)
  2. Run: ml2sql run
  3. Select your CSV file
  4. Select Create a new config and choose Automatic option (a config file will be made and can be edited later) (more info on config json)
  5. Select newly created config
  6. Choose a model (EBM is advised)
  7. Give a name for this model
  8. The output will be saved in the folder trained_models/<current_date>_<your_model_name>/
  9. The .sql file in the model folder will contain a SQL written model
<br> </details> <details> <summary><strong>Testing a Trained Model on a New Dataset</strong></summary> <br>
  1. Make sure the new dataset has the same variables as the dataset the model was trained on (same features and target)
  2. Save dataset in the input/data/ folder (more info on input data)
  3. Run: ml2sql check-model
  4. Follow the instructions on screen
  5. The output will be saved in the folder trained_models/<selected_model>/tested_datasets/<selected_dataset>/
<br> </details> </br>

Input

Data

The csv file containing the data has to fulfill some basic assumptions:

Additional information

Configuration json (example)

This file will inform the script which column is the target, which are the features and several other parameters for pre and post training. You can copy and edit a config file from the already existing example in input/configuration/ or select Create a new config file in the second step when running ml2sql run.

Configs are saved in input/configuration/.

<details> <summary><strong>Configuration file content</strong></summary>

features

List with names of the columns which should be used as features

model_params

Dictionary of parameters that can be used with model of choice (optional). Check the model's documentation:

sql_split options:

sql_decimals options:

file_type options (optional):

pre_params

cv_type options (optional):

max_rows options:

time_sensitive_column options (optional):

target

Name of target column (required)

</details> </br>

Output (example)

The output consists of 4 parts:

Correlation matrices

Can be found in the created model's folder under /feature_info

Pearson Correlation Matrix

Cramer's V Correlation Matrix

Feature importance

Can be found in the created model's folder under /feature_importance

For EBM and logistic/linear regression

For Decision tree

Model performance

Can be found in the created model's folder under /performance

For Classification Models:

  1. Confusion Matrix
  1. ROC Curve and Precision-Recall Curve
  1. Calibration Plot
  1. Probability Distribution Plot

For Regression Models:

  1. Scatter Plot of Predicted vs. True Values
  1. Quantile Error Plot
  1. Regression Metrics Table

The model

Can be found in the created model's folder under /model

<br>

Remarks

Notes

TODO list

Check docs/TODO.md for an extensive list of planned features and improvements. Feel free to open an issue in case a feature is missing or not working properly.

Troubleshooting

Feel free to open an issue if anything unexpected happens