Home

Awesome

BoostARoota

A Fast XGBoost Feature Selection Algorithm (plus other sklearn tree-based classifiers)

Why Create Another Algorithm?

Automated processes like Boruta showed early promise as they were able to provide superior performance with Random Forests, but has some deficiencies including slow computation time: especially with high dimensional data. Regardless of the run time, Boruta does perform well on Random Forests, but performs poorly on other algorithms such as boosting or neural networks. Similar deficiencies occur with regularization on LASSO, elastic net, or ridge regressions in that they perform well on linear regressions, but poorly on other modern algorithms.

I am proposing and demonstrating a feature selection algorithm (called BoostARoota) in a similar spirit to Boruta utilizing XGBoost as the base model rather than a Random Forest. The algorithm runs in a fraction of the time it takes Boruta and has superior performance on a variety of datasets. While the spirit is similar to Boruta, BoostARoota takes a slightly different approach for the removal of attributes that executes much faster.

Installation

Easiest way is to use pip:

$ pip install boostaroota

Usage

This module is built for use in a similar manner to sklearn with fit(), transform(), etc. In order to use the package, it does require X to be one-hot-encoded(OHE), so using the pandas function pd.get_dummies(X) may be helpful as it determines which variables are categorical and converts them into dummy variables. This package does rely on pandas under the hood so data must be passed in as a pandas dataframe.

Assuming you have X and Y split, you can run the following:

from boostaroota import BoostARoota
import pandas as pd

#OHE the variables - BoostARoota may break if not done
x = pd.getdummies(x)
#Specify the evaluation metric: can use whichever you like as long as recognized by XGBoost
  #EXCEPTION: multi-class currently only supports "mlogloss" so much be passed in as eval_metric
br = BoostARoota(metric='logloss')

#Fit the model for the subset of variables
br.fit(x, y)

#Can look at the important variables - will return a pandas series
br.keep_vars_

#Then modify dataframe to only include the important variables
br.transform(x)

It's really that simple! Of course, as we build more functionality there may be a few more Keep in mind that since you are OHE, if you have a numeric variable that is imported by python as a character, pd.get_dummies() will convert those numeric into many columns. This can cause your DataFrame to explode in size, giving unexpected results and high run times.

###New as of 1/22/2018, can insert any sklearn tree-based learner into BoostARoota Please be aware that this hasn't been fully tested out for which parameters (cutoff, iterations, etc) are optimal. Currently, that will require some trial and error on the user's part.

For example, to use another classifer, you will initialize the object and then pass that object into the BoostARoota object like so:


from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier()

br = BoostARoota(clf=clf)
new_train = br.fit_transform(x, y)

You can also view a complete demo here.

Usage - Choosing Parameters

The default parameters are optimally chosen for the widest range of input dataframes. However, there are cases where other values could be more optimal.

How it works

Similar in spirit to Boruta, BoostARoota creates shadow features, but modifies the removal step.

  1. One-Hot-Encode the feature set
  2. Double width of the data set, making a copy of all features in original dataset
  3. Randomly shuffle the new features created in (2). These duplicated and shuffled features are referred to as "shadow features"
  4. Run XGBoost classifier on the entire data set ten times. Running it ten times allows for random noise to be smoothed, resulting in more robust estimates of importance. The number of repeats is a parameter than can be changed.
  5. Obtain importance values for each feature. This is a simple importance metric that sums up how many times the particular feature was split on in the XGBoost algorithm.
  6. Compute "cutoff": the average feature importance value for all shadow features and divide by four. Shadow importance values are divided by four (parameter can be changed) to make it more difficult for the variables to be removed. With values lower than this, features are removed at too high of a rate.
  7. Remove features with average importance across the ten iterations that is less than the cutoff specified in (6)
  8. Go back to (2) until the number of features removed is less than ten percent of the total.
  9. Method returns the features remaining once completed.

Algorithm Performance

BoostARoota is shorted to BAR and the below table is utilizing the LSVT dataset from the UCI datasets. The algorithm has been tested on other datasets. If you are interested in the specifics of the testing please take a look at the testBAR.py script. The basics are that it is run through 5-fold CV, with the model selection performed on the training set and then predicting on the heldout test set. It is done this way to avoid overfitting the feature selection process.

All tests are run on a 12 core (hyperthreaded) Intel i7. - Future iterations will compare run times on a 28 core Xeon, 120 cores on Spark, and running xgboost on a GPU.

Data SetTargetBoruta TimeBoostARoota TimeBoostARoota LogLossBoruta LogLossAll Features LogLossBAR >= All
LSVT0/150.289s0.487s0.56170.69500.7311Yes
HR0/133.704s0.485s0.10460.10030.1047Yes
Fraud0/138.619s1.790s0.43330.43530.4333Yes

As can be seen, the speed up from BoostARoota is around 100x with substantial reductions in log loss. Part of this speed up is that Boruta is running single threaded, while BoostARoota (on XGB) is running on all 12 cores. Not sure how this time speed up works with larger datasets as of yet.

This has also been tested on Kaggle's House Prices. With nothing done except running BoostARoota and evaluated on RMSE, all features scored .15669, while BoostARoota scored 0.1560.

Future Functionality (i.e. Current Shortcomings)

The text file FS_algo_basics.txt details how I was thinking through the algorithm and what additional functionality was thought about during the creation.

Updates

Want to Contribute?

This project has found some initial successes and there are a number of directions it can head. It would be great to have some additional help if you are willing/able. Whether it is directly contributing to the codebase or just giving some ideas, any help is appreciated. The goal is to make the algorithm as robust as possible. The primary focus right now is on the components under Future Implementations, but are in active development. Please reach out to see if there is anything you would like to contribute in that part to make sure we aren't duplicating work.

A special thanks to Progressive Leasing for sponsoring this research.