Home

Awesome

lazytransform

Automatically transform all categorical, date-time, and NLP variables in your dataset to numeric format in a single line of code, for any dataset of any size.

Update (Aug 2022)

<ol> <li><b>lazytransform is very easy to install on Kaggle and Colab notebooks using this command:</b>
!pip install lazytransform --ignore-installed --no-cache --no-deps
</li><br> <li><b>As of version 0.91, lazytransform includes two Super Learning Optimized (SULO) ensembles named "SuloClassifier" and "SuloRegressor".</b> These estimators are "super-optimized" in the sense that they perform an automatic GridSearchCV internally, so you can use them with just their default parameters on multi-label, multi-class, and imbalanced datasets and still get great results in Kaggle competitions.</li> </ol> Take a look at this amazing benchmarking-results notebook for SuloClassifier:

Notebook

SuloClassifier

We ran a similar benchmark for SuloRegressor against the XGBoost and LightGBM regressors, and it held its own against them. Take a look at the benchmarking results:

Notebook

SuloRegressor

Table of Contents

<ul> <li><a href="#introduction">What is lazytransform</a></li> <li><a href="#uses">How to use lazytransform</a></li> <li><a href="#install">How to install lazytransform</a></li> <li><a href="#usage">Usage</a></li> <li><a href="#tips">Tips</a></li> <li><a href="#api">API</a></li> <li><a href="#maintainers">Maintainers</a></li> <li><a href="#contributing">Contributing</a></li> <li><a href="#license">License</a></li> </ul>

Introduction

What is lazytransform?

lazytransform is a new Python library for automatically transforming your entire dataset to numeric format using category encoders, NLP text vectorizers, and pandas date-time processing functions. All in a single line of code!

lazytransformer

Uses

lazytransform has two important uses in the data science process. It can be used in feature engineering to transform or add features (see the API below). It can also be used in MLOps to train and evaluate models in data pipelines, with multiple models being trained simultaneously using the same train/test split and the same feature engineering steps. This ensures minimal to zero data leakage in your MLOps pipelines.

1. Using lazytransform as a simple pandas data transformation pipeline

<p>The first method is probably the most popular way to use lazytransform. The transformer within lazytransform can be used to transform and create new features from categorical, date-time and NLP (text) features in your dataset. This transformer pipeline is fully scikit-learn Pipeline compatible and can be used to build even more complex pipelines using the `make_pipeline` function from the `sklearn.pipeline` library. <a href="https://github.com/AutoViML/lazytransform/blob/main/Featurewiz_LazyTransform_Demo1.ipynb">Let us see an example</a>:<p>

<a href="https://ibb.co/xfMQnNz"><img src="https://i.ibb.co/ZYhCQ0c/lazy-code1.png" alt="lazy-code1" border="0"></a>
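Because the transformer follows scikit-learn's fit/transform contract, it composes with `make_pipeline` like any other step. Below is a minimal sketch of that composition pattern using scikit-learn's own `OrdinalEncoder` as a stand-in transformer (the dataset and column name are hypothetical; lazytransform's transformer plugs into `make_pipeline` in exactly the same way):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Tiny hypothetical dataset with one categorical column
X_train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Any fit/transform object (lazytransform's transformer included)
# chains with further steps through make_pipeline.
pipe = make_pipeline(OrdinalEncoder(), StandardScaler())
X_trans = pipe.fit_transform(X_train)
print(X_trans.shape)  # (4, 1)
```

The same `pipe.transform(X_test)` call then applies the encoding learned from training data to any new data.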

2. Using lazytransform as a sklearn pipeline with sklearn models or XGBoost or LightGBM models

<p>The second method is a great way to create an entire data-transform and model-training pipeline with absolutely no data leakage. `lazytransform` allows you to send in a model object (only the models listed below are supported) and it will automatically transform your data, create new features, and train the model using sklearn pipelines. <a href="https://github.com/AutoViML/lazytransform/blob/main/Featurewiz_LazyTransform_Demo2.ipynb">This method can be seen as follows</a>:<br>

<a href="https://ibb.co/T1WNhzT"><img src="https://i.ibb.co/0KszJPX/lazy-code2.png" alt="lazy-code2" border="0"></a>
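The reason this avoids leakage is that the transformer and the model live inside one pipeline: `fit()` learns the encodings from the training data only, and `predict()` reuses those fitted encodings on test data. A hedged sketch of the same pattern using plain scikit-learn pieces (a `ColumnTransformer` standing in for lazytransform's internal transformer; the dataset is hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Small hypothetical dataset: one categorical and one numeric feature
X = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "LA", "SF"],
                  "age":  [25, 32, 47, 51, 38, 29]})
y = pd.Series([0, 1, 0, 1, 1, 0])

# Transformer + model in one Pipeline: fit() learns both the encoding
# and the model from training data only.
pipe = Pipeline([
    ("encode", ColumnTransformer([("cat", OrdinalEncoder(), ["city"])],
                                 remainder="passthrough")),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(len(preds))  # 6
```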

3. Using lazytransform in GridSearchCV to find the best model pipeline

<p>The third method is a great way to find the best data transformation and model training pipeline using GridSearchCV or RandomizedSearchCV along with a LightGBM, XGBoost, or scikit-learn model. This is explained very clearly in the <a href="https://github.com/AutoViML/lazytransform/blob/main/LazyTransformer_with_GridSearch_Pipeline.ipynb">LazyTransformer_with_GridSearch_Pipeline.ipynb</a> notebook in the same GitHub repository. Make sure you check it out!

<a href="https://ibb.co/WGvnqjs"><img src="https://i.ibb.co/xXqhPd3/lazy-gridsearch.png" alt="lazy-gridsearch" border="0"></a><br />
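When a pipeline is passed to GridSearchCV, hyperparameters are addressed by prefixing them with the pipeline step name (e.g. `model__max_depth`). A minimal, hedged sketch of that pattern with plain scikit-learn parts (the step names, dataset, and parameter grid here are hypothetical, not lazytransform's internal names):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy dataset, repeated to give cross-validation enough rows
X = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "LA", "SF"] * 5,
                  "age":  [25, 32, 47, 51, 38, 29] * 5})
y = pd.Series([0, 1, 0, 1, 1, 0] * 5)

pipe = Pipeline([
    ("encode", ColumnTransformer([("cat", OrdinalEncoder(), ["city"])],
                                 remainder="passthrough")),
    ("model", RandomForestClassifier(random_state=0)),
])

# Step names prefix the hyperparameters: "model__<param>"
grid = GridSearchCV(pipe, {"model__max_depth": [2, 4]}, cv=3)
grid.fit(X, y)
best_depth = grid.best_params_["model__max_depth"]
```

Each cross-validation fold re-fits the entire pipeline, encoder included, so the search stays leakage-free.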

<p> The following models are currently supported: <ol> <li>All sklearn models</li> <li>All MultiOutput models from the sklearn.multioutput library</li> <li>XGBoost models</li> <li>LightGBM models</li> </ol> However, you must install and import those models on your own and define them as model variables before passing them to lazytransform.

Install

<p>

Prerequisites:

<ol> <li><b>lazytransform is built using the pandas, numpy, scikit-learn, category_encoders and imbalanced-learn libraries.</b> It should run on most Python 3 Anaconda installations without additional installs. You won't have to install any special libraries other than "imbalanced-learn" and "category_encoders".</li> </ol> The best method to install lazytransform is to use conda:<p>
conda install -c conda-forge lazytransform

<a href="https://ibb.co/fXnbPd6"><img src="https://i.ibb.co/qDWzPYq/conda-install.png" alt="conda-install" border="0"></a><br> The second best installation method is to use "pip install".

pip install lazytransform 

Alert! When using Colab or Kaggle notebooks, you must use the slightly modified installation process below. If you don't, you will get weird errors on those platforms!

pip install lazytransform --ignore-installed --no-deps
pip install category-encoders --ignore-installed --no-deps

To install from source:

cd <lazytransform_Destination>
git clone git@github.com:AutoViML/lazytransform.git

or download and unzip https://github.com/AutoViML/lazytransform/archive/master.zip

conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # on Windows: `activate <your_env_name>`
cd lazytransform
pip install -r requirements.txt

Usage

<p> You can invoke `lazytransform` as a scikit-learn compatible fit and transform or a fit and predict pipeline. See syntax below.<p>
from lazytransform import LazyTransformer
lazy = LazyTransformer(model=None, encoders='auto', scalers=None, 
        date_to_string=False, transform_target=False, imbalanced=False,
        combine_rare=False, verbose=0)

If you are not using a model in the pipeline, you must use fit and transform:

X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)
X_testm = lazy.transform(X_test)

If you are using a model in the pipeline, you must use fit and predict only:

lazy = LazyTransformer(model=RandomForestClassifier(), encoders='auto', scalers=None, 
        date_to_string=False, transform_target=False, imbalanced=False,
        combine_rare=False, verbose=0)
lazy.fit(X_train, y_train)
lazy.predict(X_test)

Tips

<b>Tips for using SuloClassifier and SuloRegressor for High Performance:</b>

  1. First, try it with base_estimator=None and all other parameters left as None or False.
  2. Compare it against a competitor model such as XGBoost or RandomForest and see whether it beats them.
  3. If not, set weights=True for the Sulo models, then imbalanced=True, and see whether that works.
  4. If a competitor still beats Sulo, use that model as the base_estimator while leaving all other parameters untouched.
  5. Next, change n_estimators from the default (None) to 5.
  6. Then increase n_estimators to 7 and then 10. By now, Sulo should be beating all other models.
  7. The more estimators you add, the greater the performance boost, until at some point it drops off. Keep increasing until then.

API

<p> lazytransform has a very simple API with the following inputs. You need to create a scikit-learn-compatible transformer pipeline object by importing LazyTransformer from the lazytransform library. <p> Once you import it, you can define the object with several options such as:

Arguments

<b>Caution:</b> X_train and y_train must be pandas Dataframes or pandas Series. DO NOT send in numpy arrays. They won't work.
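If your data starts out as NumPy arrays, wrap it in pandas objects first before passing it in. A minimal sketch (the column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Raw NumPy arrays cannot be passed to LazyTransformer directly
X_arr = np.array([[25, 1], [32, 0], [47, 1]])
y_arr = np.array([0, 1, 0])

# Wrap them in a DataFrame and a Series first
X_train = pd.DataFrame(X_arr, columns=["age", "is_member"])
y_train = pd.Series(y_arr, name="target")
print(type(X_train).__name__, type(y_train).__name__)  # DataFrame Series
```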

<p> To view the pipeline as text (the default display), do:<br>
from sklearn import set_config
set_config(display="text")
lazy.xformer
<p> To view the pipeline in a diagram (visual format), do:<br>
from sklearn import set_config
set_config(display="diagram")
lazy.xformer
# If you have a model in the pipeline, do:
lazy.modelformer

<a href="https://imgbb.com/"><img src="https://i.ibb.co/Bn7V4px/lazy-pipe.png" alt="lazy-pipe" border="0"></a>

To view the feature importances of the model in the pipeline, you can do:

lazy.plot_importance()

<a href="https://ibb.co/jhpsVtJ"><img src="https://i.ibb.co/cJmVbqY/lazy-feat-imp.png" alt="lazy-feat-imp" border="0"></a>

Maintainers

Contributing

See the contributing file!

PRs accepted.

License

Apache License 2.0 © 2020 Ram Seshadri

Note of Gratitude

This library would not have been possible without the following great libraries:

<ol> <li><b>Category Encoders library:</b> Fantastic library https://contrib.scikit-learn.org/category_encoders/index.html</li> <li><b>Imbalanced Learn library:</b> Another fantastic library https://imbalanced-learn.org/stable/index.html</li> <li><b>The amazing `lazypredict`</b> was an inspiration for `lazytransform`. You can check out the library here: https://github.com/shankarpandala/lazypredict </li> <li><b>The amazing `Kevin Markham`</b> was another inspiration for lazytransform. You can check out his classes here: https://www.dataschool.io/about/ </li> </ol>

DISCLAIMER

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.