# Efficient Inference of Optimal Decision Trees
Paper link: http://florent.avellaneda.free.fr/dl/AAAI20.pdf
The source code is designed to be compiled and executed on GNU/Linux.
## Dependencies
- g++
- cmake
- minisat: http://minisat.se/
- xdot: https://github.com/jrfonseca/xdot.py
- CLI11: https://github.com/CLIUtils/CLI11
### Example of installing dependencies for Ubuntu

The following instructions have been tested on Ubuntu 19.04:

```sh
sudo apt install g++ cmake minisat2 xdot libzip-dev libboost-dev
```
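If the minisat2 package is not available on your release, MiniSat can also be built from source. The following is a minimal sketch, assuming the upstream repository at https://github.com/niklasso/minisat builds with its stock Makefile:

```sh
# Fallback: build MiniSat from source (assumed upstream repository and Makefile).
git clone https://github.com/niklasso/minisat
cd minisat
make
sudo make install
```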
## Build

```sh
cmake .
make
```
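If you prefer to keep generated files out of the source tree, the standard CMake out-of-source idiom should also work, assuming the project's CMakeLists.txt does not require an in-source build:

```sh
# Out-of-source build: all generated files stay under build/.
mkdir -p build && cd build
cmake ..
make
```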
## Running

Infer a decision tree with the DT_depth algorithm on the "mouse" dataset:

```sh
./InferDT -d data/mouse.csv infer
```

Infer a decision tree with the DT_size algorithm on the "car" dataset:

```sh
./InferDT data/car.csv infer
```

Run a 10-fold cross-validation on the "mouse" dataset with the DT_size algorithm:

```sh
./InferDT data/mouse.csv bench
```

Infer a decision tree with the DT_size algorithm on the training set balance-scale.csv.train1 and evaluate the model on the testing set balance-scale.csv.test1:

```sh
./InferDT data/balance-scale.csv.train1.csv infer -t data/balance-scale.csv.test1.csv
```

Print a help message:

```sh
./InferDT --help
```
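All of the commands above read plain CSV datasets. The miniature file below is purely hypothetical, assuming one example per line with comma-separated feature values and the class label in the last column; the files shipped in data/ are the authoritative reference for the exact layout.

```
# hypothetical toy.csv (assumed layout: feature values, then class label)
0,1,0,1
1,0,1,0
1,1,0,1
```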
## Benchmarks

We ran the experiments on Ubuntu with an Intel Core i7-2600K CPU @ 3.40 GHz.

### Verwer and Zhang Datasets

The datasets we used are extracted from the paper by Verwer and Zhang and are available at https://github.com/SiccoVerwer/binoct.
Dataset | S | B | Time DT_depth | Accuracy DT_depth | k | n for DT_size | Time DT_size | Accuracy DT_size | Accuracy BinOCT* |
---|---|---|---|---|---|---|---|---|---|
iris | 150 | 114 | 18 ms | 92.9 % | 3 | 10.6 | 30 ms | 93.2 % | 98.4 % |
monks1 | 124 | 17 | 24 ms | 90.3 % | 4.4 | 17 | 80 ms | 95.5 % | 87.1 % |
monks2 | 169 | 17 | 190 ms | 70.2 % | 5.8 | 47.8 | 9.1 sec | 74.0 % | 63.3 % |
monks3 | 122 | 17 | 30 ms | 78.1 % | 4.8 | 23.4 | 210 ms | 82.6 % | 93.5 % |
wine | 178 | 1276 | 600 ms | 89.3 % | 3 | 7.8 | 1.2 sec | 92.0 % | 88.9 % |
balance-scale | 625 | 20 | 50 sec | 93.0 % | 8 | 268 | 183 sec | 92.6 % | 77.5 % |
- S: Number of examples in the dataset
- B: Number of Boolean features in the dataset
- Time DT_depth: Time used by our algorithm with the parameter "-d"
- Accuracy DT_depth: Accuracy of our algorithm with the parameter "-d"
- k: Average depth of the inferred decision trees
- n: Average number of nodes in the inferred decision trees
- Time DT_size: Time used by our algorithm without the parameter "-d"
- Accuracy DT_size: Accuracy of our algorithm without the parameter "-d"
- Accuracy BinOCT*: Accuracy of the BinOCT algorithm from https://github.com/SiccoVerwer/binoct
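These cross-validations use the bench mode documented in the Running section. Below is a minimal reproduction sketch, assuming the datasets have been placed under data/ with the file names shown (only mouse, car, and balance-scale are documented above, so the other names are assumptions):

```sh
# Run a 10-fold cross-validation (bench mode) on each dataset,
# once with DT_depth (-d) and once with DT_size.
for dataset in iris monks1 monks2 monks3 wine balance-scale; do
  ./InferDT -d "data/$dataset.csv" bench   # DT_depth
  ./InferDT "data/$dataset.csv" bench      # DT_size
done
```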
### Mouse

We used the Mouse dataset, which Bessiere, Hebrard, and O'Sullivan shared with us. Each entry in the DT_size and DT_depth rows is an average over 100 runs. The first column gives the algorithm used, the next three columns correspond to inferring a decision tree from the whole dataset, and the last column corresponds to the 10-fold cross-validations.
Algorithm | Time | k | n | Accuracy |
---|---|---|---|---|
DT2 from paper Minimising Decision Tree Size as Combinatorial Optimisation | 577 sec | 4 | 15 | 83.8 % |
DT1 from paper Learning Optimal Decision Trees with SAT | 12.9 sec | 4 | 15 | 83.8 % |
DT_size: our algorithm without parameter "-d" | 70 ms | 4 | 15 | 83.5 % |
DT_depth: our algorithm with parameter "-d" | 20 ms | 4 | 31 | 85.8 % |
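The two bottom rows can be re-run with the commands documented in the Running section, assuming the "-d" parameter combines with bench mode as it does with infer:

```sh
# DT_size: random 10-fold cross-validation on the mouse dataset.
./InferDT data/mouse.csv bench

# DT_depth: the same run with the "-d" parameter.
./InferDT -d data/mouse.csv bench
```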
### Other

In this section we perform x random 10-fold cross-validations and report the average.
Dataset | S | B | Time DT_depth | Accuracy DT_depth | k | n for DT_size | Time DT_size | Accuracy DT_size | x |
---|---|---|---|---|---|---|---|---|---|
zoo | 101 | 136 | 47 ms | 91.7 % | 4 | 20 | 200 ms | 91 % | 200 |
BodyMassIndex | 500 | 172 | 53 sec | 85 % | 6.6 | 109 | 6.4 h | 85.4 % | 1 |
lungCancerDataset | 59 | 72 | 3.8 ms | 89.6 % | 2.6 | 7 | 7.5 ms | 90.6 % | 200 |
iris | 113 | 114 | 30 ms | 93.6 % | 3.6 | 14 | 120 ms | 94.0 % | 200 |
monks1 | 124 | 17 | 30 ms | 97.0 % | 4.9 | 15 | 140 ms | 99.5 % | 200 |
monks2 | 169 | 17 | 330 ms | 88.0 % | 6 | 64 | 9.3 sec | 88.7 % | 100 |
monks3 | 92 | 17 | 125 ms | 79.6 % | 5.8 | 35.5 | 2.5 sec | 81.5 % | 800 |
balance-scale | 625 | 20 | 200 sec | 71.8 % | 9 | 276 | 11 h | 73.6 % | 1 |
wine | 178 | 1276 | 4.5 sec | 91.5 % | 3 | 12.2 | 14.5 sec | 91.2 % | 200 |
- S: Number of examples in the dataset
- B: Number of Boolean features in the dataset
- k: Average depth of the inferred decision trees
- n: Average number of nodes in the inferred decision trees
- x: Number of 10-fold cross-validations performed
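A sketch of how the x repeated cross-validations might be scripted, assuming each bench invocation performs one random 10-fold cross-validation (collecting and averaging the per-run output is left out, and data/zoo.csv is an assumed file name):

```sh
# Repeat the 10-fold cross-validation x times (here x = 200) on the zoo dataset.
for i in $(seq 1 200); do
  ./InferDT data/zoo.csv bench
done
```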