IMBENS (imported as imbens) is a Python library for quick implementation, modification, evaluation, and visualization of ensemble learning from class-imbalanced data. Currently, IMBENS includes over 15 ensemble imbalanced learning algorithms (SMOTEBoost, SMOTEBagging, RUSBoost, EasyEnsemble, SelfPacedEnsemble, etc.) and 19 over-/under-sampling methods (SMOTE, ADASYN, TomekLinks, etc.) from imbalanced-learn.

<h2 align="left">🌈 IMBENS Highlights</h2>

✂️ Use IMBENS for class-imbalanced classification with <5 lines of code:

# Train an SPE classifier
from imbens.ensemble import SelfPacedEnsembleClassifier
clf = SelfPacedEnsembleClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict with an SPE classifier
y_pred = clf.predict(X_test)

🤗 Citing IMBENS

🍻 We appreciate your citation if you find our work helpful! The BibTeX entry:

  title={IMBENS: Ensemble Class-imbalanced Learning in Python},
  author={Liu, Zhining and Kang, Jian and Tong, Hanghang and Chang, Yi},
  journal={arXiv preprint arXiv:2111.12776},

👯‍♂️ Contribute to IMBENS

Join us and become a contributor! Please refer to the contributing guidelines.



It is recommended to use pip for installation.
Please make sure the latest version is installed to avoid potential problems:

$ pip install imbalanced-ensemble            # normal install
$ pip install --upgrade imbalanced-ensemble  # update if needed

Alternatively, you can install imbalanced-ensemble by cloning this repository:

$ git clone https://github.com/ZhiningLiu1998/imbalanced-ensemble.git
$ cd imbalanced-ensemble
$ pip install .
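To confirm which version ended up installed, the standard library can query package metadata. The snippet below is a minimal sketch; `installed_version` is a hypothetical helper name introduced here for illustration and is not part of IMBENS.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "imbalanced-ensemble" is the distribution name used with pip above.
version = installed_version("imbalanced-ensemble")
print(version if version else "imbalanced-ensemble is not installed")
```

If the printed version is older than the latest release, rerun the `--upgrade` command above.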

imbalanced-ensemble requires the following dependencies:


List of implemented methods

Currently (v0.1.3, 2021/06), 16 ensemble imbalanced learning methods have been implemented:
(Click to jump to the document page)

Note: imbalanced-ensemble is still under development; please see the API reference for the latest list.

5-min Quick Start with IMBENS

Here, we provide some quick guides to help you get started with IMBENS.
We strongly encourage users to check out the example gallery for more comprehensive usage examples, which demonstrate many advanced features of IMBENS.

A minimal working example

Taking self-paced ensemble [1] as an example, it takes fewer than 10 lines of code to deploy:

>>> from imbens.ensemble import SelfPacedEnsembleClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=1000, n_classes=3,
...                            n_informative=4, weights=[0.2, 0.3, 0.5],
...                            random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(
...                            X, y, test_size=0.2, random_state=42)
>>> clf = SelfPacedEnsembleClassifier(random_state=0)
>>> clf.fit(X_train, y_train)
>>> clf.predict(X_test)  

Visualize ensemble classifiers

The imbens.visualizer sub-module provides an ImbalancedEnsembleVisualizer. It can be used to visualize the ensemble estimator(s) for further information or comparison. Please refer to the visualizer documentation and examples for more details.

Fit an ImbalancedEnsembleVisualizer

from imbens.ensemble import SelfPacedEnsembleClassifier
from imbens.ensemble import RUSBoostClassifier
from imbens.ensemble import EasyEnsembleClassifier
from sklearn.tree import DecisionTreeClassifier

# Fit ensemble classifiers
init_kwargs = {'estimator': DecisionTreeClassifier()}
ensembles = {
    'spe': SelfPacedEnsembleClassifier(**init_kwargs).fit(X_train, y_train),
    'rusboost': RUSBoostClassifier(**init_kwargs).fit(X_train, y_train),
    'easyens': EasyEnsembleClassifier(**init_kwargs).fit(X_train, y_train),
}

# Fit visualizer
from imbens.visualizer import ImbalancedEnsembleVisualizer
visualizer = ImbalancedEnsembleVisualizer().fit(ensembles=ensembles)

Plot performance curves

fig, axes = visualizer.performance_lineplot()

Plot confusion matrices

fig, axes = visualizer.confusion_matrix_heatmap()

Customizing training log

All ensemble classifiers in IMBENS support customizable training logging. The training log is controlled by three parameters of the fit() method: eval_datasets, eval_metrics, and train_verbose. Read more details in the fit documentation.

Enable auto training log

clf.fit(..., train_verbose=True)
┃             ┃                          ┃            Data: train             ┃
┃ #Estimators ┃    Class Distribution    ┃               Metric               ┃
┃             ┃                          ┃  acc    balanced_acc   weighted_f1 ┃
┃      1      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.838      0.877          0.839    ┃
┃      5      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.924      0.949          0.924    ┃
┃     10      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.954      0.970          0.954    ┃
┃     15      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.979      0.986          0.979    ┃
┃     20      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.990      0.993          0.990    ┃
┃     25      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.994      0.996          0.994    ┃
┃     30      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.988      0.992          0.988    ┃
┃     35      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.999      0.999          0.999    ┃
┃     40      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.995      0.997          0.995    ┃
┃     45      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.995      0.997          0.995    ┃
┃     50      ┃ {0: 150, 1: 150, 2: 150} ┃ 0.993      0.995          0.993    ┃
┃    final    ┃ {0: 150, 1: 150, 2: 150} ┃ 0.993      0.995          0.993    ┃
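The Class Distribution column in the log above presumably reports the per-class sample counts of the resampled training data seen by the ensemble, here balanced at 150 per class. As a rough illustration of how a balanced subset like this can arise, the following sketch performs plain random under-sampling down to the smallest class. This is not how SelfPacedEnsembleClassifier selects samples; `random_under_sample` and the toy class counts are assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Keep an equal number of samples per class (the size of the smallest class)."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(ids) for ids in by_class.values())
    # Draw n_min indices without replacement from every class.
    keep = []
    for ids in by_class.values():
        keep.extend(rng.sample(ids, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced labels: 150 / 250 / 400 samples for classes 0 / 1 / 2.
y = [0] * 150 + [1] * 250 + [2] * 400
X = [[float(i)] for i in range(len(y))]

X_res, y_res = random_under_sample(X, y)
print(dict(Counter(y_res)))  # {0: 150, 1: 150, 2: 150}
```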

Customize granularity and content of the training log

clf.fit(...,
        train_verbose={
            'granularity': 10,
            'print_distribution': False,
            'print_metrics': True,
        })
Example output:
┃             ┃            Data: train             ┃
┃ #Estimators ┃               Metric               ┃
┃             ┃  acc    balanced_acc   weighted_f1 ┃
┃      1      ┃ 0.964      0.970          0.964    ┃
┃     10      ┃ 1.000      1.000          1.000    ┃
┃     20      ┃ 1.000      1.000          1.000    ┃
┃     30      ┃ 1.000      1.000          1.000    ┃
┃     40      ┃ 1.000      1.000          1.000    ┃
┃     50      ┃ 1.000      1.000          1.000    ┃
┃    final    ┃ 1.000      1.000          1.000    ┃

Add evaluation dataset(s)

clf.fit(...,
        eval_datasets={'valid': (X_valid, y_valid)})
Example output:
┃             ┃            Data: train             ┃            Data: valid             ┃
┃ #Estimators ┃               Metric               ┃               Metric               ┃
┃             ┃  acc    balanced_acc   weighted_f1 ┃  acc    balanced_acc   weighted_f1 ┃
┃      1      ┃ 0.939      0.961          0.940    ┃ 0.935      0.933          0.936    ┃
┃     10      ┃ 1.000      1.000          1.000    ┃ 0.971      0.974          0.971    ┃
┃     20      ┃ 1.000      1.000          1.000    ┃ 0.982      0.981          0.982    ┃
┃     30      ┃ 1.000      1.000          1.000    ┃ 0.983      0.983          0.983    ┃
┃     40      ┃ 1.000      1.000          1.000    ┃ 0.983      0.982          0.983    ┃
┃     50      ┃ 1.000      1.000          1.000    ┃ 0.983      0.982          0.983    ┃
┃    final    ┃ 1.000      1.000          1.000    ┃ 0.983      0.982          0.983    ┃

Customize evaluation metric(s)

from sklearn.metrics import accuracy_score, f1_score
clf.fit(...,
        eval_metrics={
            'acc': (accuracy_score, {}),
            'weighted_f1': (f1_score, {'average':'weighted'}),
        })
Example output:
┃             ┃     Data: train      ┃     Data: valid      ┃
┃ #Estimators ┃        Metric        ┃        Metric        ┃
┃             ┃  acc    weighted_f1  ┃  acc    weighted_f1  ┃
┃      1      ┃ 0.942      0.961     ┃ 0.919      0.936     ┃
┃     10      ┃ 1.000      1.000     ┃ 0.976      0.976     ┃
┃     20      ┃ 1.000      1.000     ┃ 0.977      0.977     ┃
┃     30      ┃ 1.000      1.000     ┃ 0.981      0.980     ┃
┃     40      ┃ 1.000      1.000     ┃ 0.980      0.979     ┃
┃     50      ┃ 1.000      1.000     ┃ 0.981      0.980     ┃
┃    final    ┃ 1.000      1.000     ┃ 0.981      0.980     ┃
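The metric entries above map a display name to a `(callable, kwargs)` pair. The sketch below shows one plausible way such a mapping could be evaluated, assuming each callable is invoked as `func(y_true, y_pred, **kwargs)`. The `evaluate` helper and the two hand-written metric functions are hypothetical and use only the standard library; they are not IMBENS internals.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(y_true, y_pred, label=1):
    """Fraction of samples of class `label` that were predicted as `label`."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits)


# Same shape as the eval_metrics argument: name -> (metric function, extra kwargs).
eval_metrics = {
    'acc': (accuracy, {}),
    'minority_recall': (recall_for, {'label': 1}),
}


def evaluate(eval_metrics, y_true, y_pred):
    """Hypothetical helper: apply every metric with its stored keyword arguments."""
    return {
        name: func(y_true, y_pred, **kwargs)
        for name, (func, kwargs) in eval_metrics.items()
    }


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = evaluate(eval_metrics, y_true, y_pred)
print(scores)
```

Keeping the keyword arguments next to the function lets one metric function serve several named entries, as with `f1_score` and its `average` option.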

About imbalanced learning

Class imbalance (also known as the long-tail problem) means that the classes in a classification problem are not represented equally, which is quite common in practice, for instance in fraud detection, prediction of rare adverse drug reactions, and prediction of gene families. Failing to account for class imbalance often degrades the predictive performance of many classification algorithms. Imbalanced learning aims to tackle the class imbalance problem and learn an unbiased model from imbalanced data.
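A small worked example shows why plain accuracy can hide this problem. With a 95:5 class ratio, a classifier that always predicts the majority class reaches 95% accuracy while never detecting the minority class. Balanced accuracy (the mean of per-class recall) exposes this. The functions below are a standard-library sketch written for illustration, not code from IMBENS.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / n for c, n in support.items()) / len(support)


# 95 majority samples (class 0) and 5 minority samples (class 1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```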

For more resources on imbalanced learning, please refer to awesome-imbalanced-learning.


IMBENS was initially developed on top of imbalanced-learn, but has since been heavily extended to implement many important imbalanced ensemble techniques. The infrastructure has also been significantly refactored to support advanced ensemble learning features that are essential to practical usability (fine-grained training control, parallel computing, multi-class support, training logs, visualization, etc.).


