Home

Awesome

CORELS

Certifiably Optimal RulE ListS

CORELS is a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm provides the optimal solution, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. Our approach produces optimal rule lists on practical problems in seconds. This framework is a novel alternative to CART and other decision tree methods.

CORELS is a custom branch-and-bound algorithm for optimizing rule lists.

Web UI can be found at: https://corels.eecs.harvard.edu/

R Package can be found at: https://cran.r-project.org/package=corels

Python package can be found at: https://pypi.org/project/corels/

Overview

C/C++ dependencies

e.g., install libmpc via homewbrew (this will also install gmp and mpfr):

brew install libmpc

Sample command

Run the following from the src/ directory.

make
./corels -r 0.015 -c 2 -p 1 ../data/compas_train.out ../data/compas_train.label ../data/compas_train.minor

Usage

./corels [-b] [-n max_num_nodes] [-r regularization] [-v verbosity] -c (1|2|3|4) -p (0|1|2)
         [-f logging_frequency] -a (1|2|3) [-s] [-L latex_out] data.out data.label [data.minor]

Data format

For examples, see compas.out and compas.label in data/. Also see compas.minor (optional). Note that our data parsing is fragile and errors will occur if the format is not followed exactly.

Arguments

[-b] Breadth-first search (BFS). You must specify a search policy; use exactly one of (-b | -c).

[-n max_num_nodes] Maximum trie (cache) size. Stop execution if the number of nodes in the trie exceeds this number. Default value corresponds to -n 100000.

[-r regularization] Regularization parameter (optional). The default value corresponds to -r 0.01 and can be thought of as adding a penalty equivalent to misclassifying 1% of data when increasing a rule list's length by one association rule.

[-v verbosity] Verbosity. Default value corresponds to -v 0. If verbosity is at least 10, then print machine info. If verbosity is at least 1000, then also print input antecedents and labels.

-c (1|2|3|4) Best-first search policy. You must specify a search policy; use exactly one of (-b | -c). We include four different prioritization schemes:

-p (0|1|2) Symmetry-aware map (optional).

[-f logging_frequency] Logging frequency. Default value corresponds to -f 1000.

-a (0|1|2) Exclude a specific algorithm optimization. Default value corresponds to -a 0.

[-s] Calculate upper bound on remaining search space size (optional). This adds a small overhead; the default behavior does not perform the calculation. With -s, we dynamically and incrementally calculate floor(log10(Gamma(Rc, Q))), where Gamma(Rc, Q) is the upper bound (see Theorem 7 in Section 3.6 of our paper).

[-L latex_out] Latex output. Include this flag to generate a latex representation of the output rule list.

data.out File path to training data. See Data format, above.

data.label File path to training labels. See Data format, above.

data.minor File path to a bit vector to support use of the equivalent points bound (optional, see Theorem 20 in Section 3.14 of our paper).

Example dataset

The files in data/ were generated from ProPublica's COMPAS recidivism dataset, specifically, the file compas-scores-two-years.csv.

We include one training set (N = 6489) and one test set (N = 721) from a 10-fold cross-validation experiment. There are four training data files:

The corresponding test data files are: compas_test.csv, compas_test-binary.csv, compas_test.out, and compas_test.labels.

Example rule list

if (age=23−25) and (priors=2−3) then predict yes
else if (age = 18 − 20) then predict yes
else if (sex = male) and (age = 21 − 22) then predict yes
else if (priors > 3) then predict yes
else predict no

This rule list predicts two-year recidivism for the ProPublica two-year recidivism dataset, and was found by CORELS. Its prefix corresponds to the first four rules and its default rule corresponds to the last rule.

Optimization algorithm and objective

CORELS is a custom branch-and-bound algorithm that minimizes the following objective defined for a rule list d, training data x and corresponding labels y:

R(d, x, y) = misc(d, x, y) + c * length(d).

This objective is a regularized empirical risk that consists of a loss and a regularization term that penalizes longer rule lists.

Let p be d's prefix. The following objective lower bound drives our branch-and-bound procedure:

b(p, x, y) = misc(p, x, y) + c * length(d) <= R(d, x, y)

where misc(p, x, y) is the prefix misclassification error (due to mistakes made by the prefix, but not the default rule).

Data structures

Related work

CORELS builds directly on:

In particular, CORELS uses a library by Yang et al. for efficiently representing and operating on bit vectors. See the files src/rule.h and src/rule.c.