Finding Influential Training Samples for Gradient Boosted Decision Trees

This repository implements the LeafRefit and LeafInfluence methods described in the paper Finding Influential Training Samples for Gradient Boosted Decision Trees.

The paper addresses the problem of finding influential training samples using the Influence Functions framework from classical statistics, recently revisited in the paper "Understanding Black-box Predictions via Influence Functions" (code). The classical approach, however, is only applicable to smooth parametric models. In our paper, we introduce LeafRefit and LeafInfluence, methods that extend the Influence Functions framework to non-parametric ensembles of Gradient Boosted Decision Trees.

Requirements

We recommend using the Anaconda Python distribution for easy installation.

Python packages

The following Python 2.7 packages are required:

Note: the versions specified below are the ones with which the experiments reported in the paper were tested.

The create_influence_boosting_env.sh script creates the influence_boosting Conda environment with the required packages installed. To run it, execute the following in the influence_boosting directory:

bash create_influence_boosting_env.sh
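
Once the environment is created, activate it before installing packages into it or running the code. A minimal sketch, assuming a conda version recent enough to support conda activate (older versions use source activate instead):

```bash
# Activate the environment created by the script above.
conda activate influence_boosting
```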

CatBoost

The code in this repository uses CatBoost as its GBDT implementation. We tested our package with CatBoost version 0.6 built from GitHub. Installation instructions are available in the documentation.

Note: if you are using the influence_boosting environment described above, make sure to install CatBoost specifically for this environment.
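
As a minimal sketch, with the influence_boosting environment active, a prebuilt wheel can be installed from PyPI (a shortcut, not the paper's exact setup, which built CatBoost from the GitHub sources; if no wheel is available for your platform, follow the build instructions in the CatBoost documentation):

```bash
# Install CatBoost 0.6 into the currently active environment.
# Assumes a prebuilt wheel exists for your platform and Python version.
pip install catboost==0.6
```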

export_catboost

Since CatBoost is written in C++, we also include export_catboost, a binary that exports a saved CatBoost model to human-readable JSON so that it can be used with our Python package.

This repository assumes that a program named export_catboost is available in the shell. One way to ensure that is to put the binary on your PATH, as sketched below.
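
A minimal sketch, assuming the export_catboost binary has been built and copied to ~/bin (a hypothetical location):

```bash
# Make the directory containing the export_catboost binary visible to the shell.
# ~/bin is a hypothetical location; substitute wherever the binary actually lives.
export PATH="$HOME/bin:$PATH"

# Verify that the shell can now find the binary.
command -v export_catboost
```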

Note: since CatBoost's treatment of categorical features can be fairly complicated, export_catboost currently supports numerical features only.

Example

An example experiment demonstrating the API and a use case of Influence Functions can be found in the influence_for_error_fixing.ipynb notebook.

Note: in this notebook, CatBoost parameters are loaded from the catboost_params.json file. In particular, the task_type parameter is set to CPU by default. If you have a CUDA-capable GPU and compiled CatBoost with GPU support, you can change this parameter to GPU for faster training. The majority of the experiments in the paper were run in GPU mode.
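
For example, one way to switch the notebook to GPU mode is to edit the parameter file in place before training. A minimal sketch, assuming the file contains the literal string "task_type": "CPU" and that GNU sed is available (on macOS, use sed -i ''):

```bash
# Flip the CatBoost device setting from CPU to GPU in the notebook's parameter file.
sed -i 's/"task_type": "CPU"/"task_type": "GPU"/' catboost_params.json
```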