Home

Awesome

Cubist

PyPI Version GitHub Build codecov License PyPI - Python Version PyPI - Downloads

A Python package for fitting Quinlan's Cubist v2.07 regression model. Inspired by and based on the R wrapper for Cubist. Designed after and inherits from the scikit-learn framework.

Installation

pip install --upgrade cubist

Background

Cubist is a regression algorithm develped by John Ross Quinlan for generating rule-based predictive models. This has been available in the R world thanks to the work of Max Kuhn and his colleagues. It is introduced to Python with this package and made scikit-learn compatible for easy use with existing model pipelines. Cross-validation and control over whether Cubist creates a composite model is also enabled here.

Advantages

Unlike other ensemble models such as RandomForest and XGBoost, Cubist generates a set of rules, making it easy to understand precisely how the model makes it's predictive decisions. Thus tools such as SHAP and LIME are unnecessary as Cubist doesn't exhibit black box behavior.

Like XGBoost, Cubist can perform boosting by the addition of more models (called committees) that correct for the error of prior models (i.e. the second model created corrects for the prediction error of the first, the third for the error of the second, etc.).

In addition to boosting, the model can perform instance-based (nearest-neighbor) corrections to create composite models, thus combining the advantages of these two methods. Note that with instance-based correction, model accuracy may be improved at the expense of computing time (this extra step takes longer) and some interpretability as the linear regression rules are no longer completely followed. It should also be noted that a composite model might be quite large as the full training dataset must be stored in order to perform instance-based corrections for inferencing. A composite model will be used when auto=False with neighbors set to an integer between 1 and 9. Cubist can be allowed to decide whether to take advantage of composite models with auto=True.

Use

from sklearn.datasets import fetch_california_housing
from cubist import Cubist
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = Cubist() # <- model parameters here
model.fit(X, y)
model.predict(X)
model.score(X, y)

Sample Output

Sample Cubist output for Iris dataset

The above image is a sample of the verbose output produced by Cubist. It first reports the total number of cases (rows) and attributes (columns) in the training dataset. Below that it summarizes the model by committee (if used but not in this sample) and rule where each rule is definined by an if..then statement along with metrics for this rule in the training data and the linear regression equation used for each rule. The if section of each rule identifies the training input columns and feature value ranges for which this rule holds true. The then statement shows the linear regressor for this rule. The model performance is then summarized by the average and relative absolute errors as well as with the Pearson correlation coefficient r. Finally, the output reports the usage of training features in the model and rules as well as the time taken to complete training.

Model Parameters

The following parameters can be passed as arguments to the Cubist() class instantiation:

Considerations

Model Attributes

The following attributes are exposed to understand the Cubist model results:

Benchmarks

There are many literature examples demonstrating the power of Cubist and comparing it to Random Forest as well as other bootstrapped/boosted models. Some of these are compiled here: Cubist in Use. To demonstrate this, some benchmark scripts are provided in the respectively named folder.

Literature for Cubist

Publications Using Cubist

To Do