
<img alt="autoimpute-logo" class="autoimpute-logo" height="250" width="500" src="https://kearnz.github.io/autoimpute-tutorials/img/home/autoimpute-logo-transparent.png">

Autoimpute


<span style="font-size:1.5em;">Autoimpute is a Python package for analysis and implementation of <b>Imputation Methods!</b></span>

<span style="font-size:1.5em;">View our website to explore Autoimpute in more detail. New tutorials coming soon!</span>

<span style="font-size:1.5em;">Check out our docs to get the developer guide to Autoimpute.</span>

Conference Talks

Notes on Development

Creators and Maintainers

Joseph Kearney – @kearnz
Shahid Barkat – @shabarka
Arnab Bose (Advisor) – @bosearnab
See the Authors page to get in touch!

Installation

Development

git clone -b dev --single-branch https://github.com/kearnz/autoimpute.git
cd autoimpute
python setup.py install

Motivation

Most machine learning algorithms expect clean and complete datasets, but real-world data is messy and missing. Unfortunately, handling missing data is quite complex, so programming languages generally punt this responsibility to the end user. By default, R drops all records with missing data - a method that is easy to implement but often problematic in practice. For richer imputation strategies, R has multiple packages to deal with missing data (MICE, Amelia, TSImpute, etc.). Python users are not as fortunate. Python's scikit-learn throws a runtime error when an end user deploys models on datasets with missing records, and few third-party packages exist to handle imputation end-to-end.

Therefore, this package aids the Python user by providing more clarity to the imputation process, making imputation methods more accessible, and measuring the impact imputation methods have on supervised regression and classification. In doing so, this package brings missing data imputation methods to the Python world and makes them work nicely in Python machine learning projects (and specifically ones that utilize scikit-learn). Lastly, this package provides its own implementation of supervised machine learning methods that extend both scikit-learn and statsmodels to multiply imputed datasets.

Main Features

Imputation Methods Supported

| Univariate | Multivariate | Time Series / Interpolation |
|------------|--------------|-----------------------------|
| Mean | Linear Regression | Linear |
| Median | Binomial Logistic Regression | Quadratic |
| Mode | Multinomial Logistic Regression | Cubic |
| Random | Stochastic Regression | Polynomial |
| Norm | Bayesian Linear Regression | Spline |
| Categorical | Bayesian Binary Logistic Regression | Time-weighted |
| | Predictive Mean Matching | Next Obs Carried Backward |
| | Local Residual Draws | Last Obs Carried Forward |
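Each of these methods corresponds to a strategy that the imputers accept per column. A minimal sketch is below; the DataFrame and column names are hypothetical, and strategy strings other than "pmm" and "norm" (which appear in the examples later in this README) are assumptions that follow the same lower-case naming convention:

from autoimpute.imputations import SingleImputer

# hypothetical DataFrame `data` with missing values in "age", "salary", and "score"
# "mean" is an assumed strategy string; "pmm" and "norm" appear later in this README
si = SingleImputer(strategy={"age": "mean", "salary": "pmm", "score": "norm"})
data_imputed = si.fit_transform(data)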

Todo

Example Usage

Autoimpute is designed to be user friendly and flexible. When performing imputation, Autoimpute fits directly into scikit-learn machine learning projects. Imputers inherit from sklearn's BaseEstimator and TransformerMixin and implement fit and transform methods, making them valid Transformers in an sklearn pipeline.

Right now, there are three Imputer classes we'll work with:

from autoimpute.imputations import SingleImputer, MultipleImputer, MiceImputer
si = SingleImputer() # pass through data once
mi = MultipleImputer() # pass through data multiple times
mice = MiceImputer() # pass through data multiple times and iteratively optimize imputations in each column
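Because these imputers are valid sklearn Transformers, they can be dropped straight into a Pipeline. A minimal sketch, assuming X_train, y_train, and X_test are pandas objects and the downstream estimator is an ordinary sklearn classifier (both hypothetical here):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from autoimpute.imputations import SingleImputer

pipe = Pipeline([
    ("imputer", SingleImputer()),         # fills missing values via fit/transform
    ("classifier", LogisticRegression())  # any downstream sklearn estimator
])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)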

Which to use, and When?

Imputations can be as simple as:

# simple example using default instance of MiceImputer
imp = MiceImputer()

# fit transform returns a generator by default, calculating each imputation method lazily
imp.fit_transform(data)

Or quite complex, such as:

# create a complex instance of the MiceImputer
# Here, we specify strategies by column and predictors for each column
# We also specify what additional arguments any `pmm` strategies should take
imp = MiceImputer(
    n=10,
    strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"},
    predictors={"salary": "all", "gender": ["salary", "education", "weight"]},
    imp_kwgs={"pmm": {"fill_value": "random"}},
    visit="left-to-right",
    return_list=True
)

# Because we set return_list=True, imputations are computed all at once rather than lazily.
# This will return M imputed datasets, where M is the number of imputations (n=10 above)
# and each dataset has the same number of records (N) as the original dataframe.
imp.fit_transform(data)
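Whether the imputations come back lazily (the default generator) or eagerly (return_list=True), each one is a fully imputed copy of the original data. A hedged sketch of iterating over them, assuming each item pairs an imputation index with its imputed DataFrame (adjust the unpacking if your version yields bare DataFrames):

# hypothetical consumption of the result from the example above
imputations = imp.fit_transform(data)
for k, df_k in imputations:
    # each imputed copy should contain no missing cells
    print(k, df_k.isnull().sum().sum())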

Autoimpute also extends supervised machine learning methods from scikit-learn and statsmodels to apply them to multiply imputed datasets (using the MiceImputer under the hood). Right now, Autoimpute supports linear regression and binary logistic regression. Additional supervised methods are currently under development.

As with Imputers, Autoimpute's analysis methods can be simple or complex:

from autoimpute.analysis import MiLinearRegression

# By default, use statsmodels OLS and MiceImputer()
simple_lm = MiLinearRegression()

# fit the model on each multiply imputed dataset and pool parameters
simple_lm.fit(X_train, y_train)

# get summary of fit, which includes pooled parameters under Rubin's rules
# also provides diagnostics related to analysis after multiple imputation
simple_lm.summary()

# make predictions on a new dataset using pooled parameters
predictions = simple_lm.predict(X_test)
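For reference, "pooling under Rubin's rules" combines the estimates from the m imputed datasets. In standard notation (not code from this package), the pooled estimate and its total variance are:

$$\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad T = \bar{U} + \left(1 + \frac{1}{m}\right)B,$$

where $\hat{Q}_i$ is the estimate from the i-th imputed dataset, $\bar{U}$ is the average within-imputation variance, and $B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q})^2$ is the between-imputation variance.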

# Control both the regression used and the MiceImputer itself
mice_imputer_arguments = dict(
    n=3,
    strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"},
    predictors={"salary": "all", "gender": ["salary", "education", "weight"]},
    imp_kwgs={"pmm": {"fill_value": "random"}},
    visit="left-to-right"
)
complex_lm = MiLinearRegression(
    model_lib="sklearn", # use sklearn linear regression
    mi_kwgs=mice_imputer_arguments # control the multiple imputer
)

# fit the model on each multiply imputed dataset
complex_lm.fit(X_train, y_train)

# get summary of fit, which includes pooled parameters under Rubin's rules
# also provides diagnostics related to analysis after multiple imputation
complex_lm.summary()

# make predictions on new dataset using pooled parameters
predictions = complex_lm.predict(X_test)

Note that we can also pass a pre-specified MiceImputer (or MultipleImputer) to either analysis model instead of using mi_kwgs. The option is ours, and it's a matter of preference. If we pass a pre-specified MiceImputer, anything in mi_kwgs is ignored, although the mi_kwgs argument is still validated.

from autoimpute.imputations import MiceImputer
from autoimpute.analysis import MiLinearRegression

# create a multiple imputer first
custom_imputer = MiceImputer(n=3, strategy="pmm", return_list=True)

# pass the imputer to a linear regression model
complex_lm = MiLinearRegression(mi=custom_imputer, model_lib="statsmodels")

# proceed the same as the previous examples
complex_lm.fit(X_train, y_train).predict(X_test)
complex_lm.summary()
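Binary logistic regression follows the same pattern as linear regression. The sketch below assumes the analysis class is exposed as MiLogisticRegression alongside MiLinearRegression and that y_train is a binary outcome; check the docs for the exact name and options:

from autoimpute.analysis import MiLogisticRegression

# assumed class name, mirroring MiLinearRegression above
mi_log = MiLogisticRegression(mi=custom_imputer)  # reuse the imputer from the previous example
mi_log.fit(X_train, y_train)   # fit on each multiply imputed dataset
mi_log.summary()               # pooled coefficients under Rubin's rules
preds = mi_log.predict(X_test)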

For a deeper understanding of how the package works and its available features, see our tutorials website.

Versions and Dependencies

A note for Windows Users:

License

Distributed under the MIT license. See LICENSE for more information.

Contributing

Guidelines for contributing to our project. See CONTRIBUTING for more information.

Contributor Code of Conduct

Adapted from Contributor Covenant, version 1.0.0. See Code of Conduct for more information.