🔥 BERT-Sort: A Zero-shot MLM Semantic Encoder on Ordinal Features for AutoML

<b>Discover the semantic order between values by utilizing BERT models in a zero-shot setting. Yes, without labeled data!</b>

<img src="https://github.com/marscod/BERT-Sort/blob/main/BERT-Sort_Poster_Bahrami_et_al_AutoML_2022.png" alt="BERT-Sort Bahrami et al., AutoML 2022" title="BERT-Sort Bahrami et al., AutoML 2022" style="display: inline-block; margin: 0 auto; width:40%;" class="center">

2-minute video, full video, and poster

The BERT-Sort paper is available at https://proceedings.mlr.press/v188/bahrami22a

Demo

A demonstration of the process (scores normalized for visualization) of sorting 4 month abbreviations: ['Mar','Jan','May','Feb'].

<img src="https://github.com/marscod/BERT-Sort/blob/main/Demo1.gif" width="600px"/>

This repository provides the artifacts for reproducing the results of the BERT-Sort paper.

The artifacts include the following items.

Benchmarks Folder

This folder includes 10 data sets, each provided both as a raw data set and as data sets encoded by the BERT-Sort Encoder with MLM initializations <img src="https://latex.codecogs.com/svg.latex?&space;M_{1..4}"/>.

Each data set folder contains the original file and the data sets encoded with 4 different MLMs. For instance, bank/bank.csv is the original raw data set, and bank/bank.csv_bs__roberta.csv is the raw data set encoded with the BERT-Sort Encoder initialized with the RoBERTa MLM. Both the raw and encoded data sets were used to evaluate the proposed approach on 5 AutoML platforms.

Output Folder

This folder includes the configuration files, ground truth, and evaluation results. Each folder in output contains a configuration file, config.json, with the keys ['model', 'mask', 'separator', 'eta', 'lower', 'target_files', 'ground_truth', 'default_grouping', 'default_zeta', 'preprocess']. For instance, outputs/out_bert_base_uncased/config.json includes all hyperparameters, the configuration, the ground truth for 42 features, and the task specification (regression/classification) for the BERT-base-uncased MLM.

The target_files key holds task information such as the data set filename, a URL reference, the type of task (classification or regression for AutoML evaluation), and the evaluation metric (F1 or RMSE).

The ground_truth key is a dictionary whose keys are the feature name (if any) or feature index, and whose values are lists of ordinal values in ranked order.
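To illustrate, a ranked list from such a ground_truth entry can be applied as an ordinal encoding by mapping each value to its rank. The dictionary and helper below are illustrative sketches, not copied from config.json:

```python
# Illustrative ground_truth entry: keys are feature names, values are
# ordinal values in ranked order (this example is made up, not from the repo).
ground_truth = {
    "month": ["Jan", "Feb", "Mar", "Apr", "May"],
}

def encode_column(values, ranked):
    """Map each value to its position in the ranked ordinal list."""
    rank = {v: i for i, v in enumerate(ranked)}
    return [rank[v] for v in values]

encoded = encode_column(["Mar", "Jan", "May", "Feb"], ground_truth["month"])
# → [2, 0, 4, 1]
```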

Each MLM folder includes a set of dumped pickles (*.pkl), each of which contains: i) the input values, ii) the OrdinalEncoder output, iii) intermediate steps, and iv) the final evaluation results of the BERT-Sort process for each data set.
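These artifacts can be read with Python's standard pickle module. The round trip below is a self-contained sketch with a made-up payload; the real files hold the items listed above:

```python
import os
import pickle
import tempfile

# Made-up payload standing in for a BERT-Sort artifact; real *.pkl files
# contain the inputs, OrdinalEncoder output, intermediate steps, and scores.
payload = {"inputs": ["low", "mid", "high"], "final_score": 1.0}

with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(payload, f)
    path = f.name

# Reading an artifact works the same way for the repository's *.pkl files.
with open(path, "rb") as f:
    restored = pickle.load(f)
os.remove(path)
```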

This folder also includes i) all_outputs.csv (detailed evaluation) and ii) summary.csv (per-data-set summary) covering the evaluation of BERT-Sort on 10 data sets with 42 distinct features per MLM. For instance, out_bert_base_uncased/all_outputs.csv contains the detailed results of the BERT-base-uncased MLM on all 42 features. A heatmap plot of all_outputs.csv is available at out_bert_base_uncased/all_outputs.png.

AutoML Folder

This folder includes all AutoML evaluation results based on i) the raw data sets and ii) the data sets encoded through BERT-Sort. Each experiment is stored in a file with one of the two following structures.

Raw Data Set Format

automl/<automl_name>/<data set name>_<seed>_<time_limitation>.txt

BERT-Sort Encoded Data Set Format

automl/<automl_name>/<data set name>_<seed>_<time_limitation>_bs_<model_name>.csv.txt

Each file contains: <data set name> <seed> <training time> <prediction time> <score>.
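A result line in this format can be split into its five fields with a few lines of Python. The parser and field names below are our own assumptions based on the format above, not code from the repository:

```python
def parse_result_line(line: str) -> dict:
    """Parse '<data set name> <seed> <training time> <prediction time> <score>'.

    Field names are assumptions inferred from the documented format.
    """
    name, seed, train_t, pred_t, score = line.split()
    return {
        "dataset": name,
        "seed": int(seed),
        "training_time": float(train_t),
        "prediction_time": float(pred_t),
        "score": float(score),
    }

record = parse_result_line("Nursery 108 42.7 0.9 0.913")
```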

The following seeds were used with sklearn.model_selection.train_test_split to split both the raw and the encoded data sets.

four_seeds = [108, 180, 234, 309]  # used in Table 5, Table 6, and Figure 8 (single-seed results use 108)
five_seeds = ['108', '180', '234', '309', '533']  # used in Figure 8
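The role of these seeds is to make the splits reproducible: the same seed always yields the same partition. The stdlib-only sketch below mirrors the idea behind sklearn.model_selection.train_test_split(random_state=seed) without depending on scikit-learn:

```python
import random

def seeded_split(rows, seed, test_ratio=0.25):
    """Deterministically shuffle indices with a fixed seed, then split.

    A minimal stand-in for sklearn's train_test_split(random_state=seed);
    the ratio and helper name are illustrative choices, not the paper's.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)  # same seed -> same permutation
    cut = int(len(rows) * (1 - test_ratio))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

train, test = seeded_split(list(range(100)), seed=108)
# rerunning with seed 108 reproduces exactly the same partition
assert seeded_split(list(range(100)), seed=108) == (train, test)
```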

Experiment Artifacts

You may find all results of Table 5 and Table 6 in automl/<AUTOML>/<DATASET>_<SEED>_m5_*<METHOD>.txt (e.g., Nursery_108_m5_EncodedBERT.csv.txt refers to the Nursery data set with seed 108, encoded through the EncodedBERT approach). <SEED> takes values in [108, 180, 234, 309], and <METHOD> is one of 5 encoding methods: ['Raw', 'EncodedBERT', 'bs_roberta', 'OrdinalEncoder', 'GroundTruth'].

Similarly, you can find the encoded versions of each data set per encoding method in the benchmarks folder: benchmarks/<DATA SET>/FILE_<METHOD>.csv (e.g., uci_Pittsburgh_Bridges/bridges.data.version2.txt.csv_EncodedBERT.csv).
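The naming convention above can be captured in a small helper when scripting over the benchmark files. The function name and arguments are our own, not part of the repository:

```python
def encoded_path(dataset_dir: str, base_file: str, method: str) -> str:
    """Build a benchmarks/<DATA SET>/FILE_<METHOD>.csv path.

    Hypothetical helper illustrating the documented naming convention.
    """
    return f"benchmarks/{dataset_dir}/{base_file}_{method}.csv"

path = encoded_path("uci_Pittsburgh_Bridges",
                    "bridges.data.version2.txt.csv", "EncodedBERT")
# → benchmarks/uci_Pittsburgh_Bridges/bridges.data.version2.txt.csv_EncodedBERT.csv
```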

Reproducibility Checklist

The reproducibility checklist is available here.

Reproducing AutoML Experiments

Each AutoML folder includes code that produces the evaluation results per data set, per encoding method, per seed, along with a run.sh script for running it. Requirements can be found in automl/requirements.txt, and the task specification can be found in automl/task_spec.json.

How to run each AutoML experiment?

pwd     # .../BERT-Sort/
cd automl/h2o     # other options: mljar, flaml, autogluon
sh run.sh

Outputs

Each AutoML run generates a set of text files (e.g., autogluon/Nursery_108_m5_EncodedBERT.csv.txt). It also creates two folders, output and log, which collect intermediate results and output logs.

Docker (updated)

You may use the Dockerfile to build a Docker image with the 4 AutoML platforms used in our experiments, or use the following shell scripts.

  1. Build the Docker image from build.sh or execute the following commands.

pwd # this folder: BERT-Sort
sudo docker build -t automl .

  2. Run any AutoML on the benchmark data sets using the shell scripts or execute the following commands.

2.1. FLAML: run_flaml_docker.sh

sudo docker run --rm -v $(pwd):/BERT-Sort -it -w /BERT-Sort/automl/flaml --entrypoint python3 automl flaml_re.py

2.2. MLJAR : run_mljar_docker.sh

sudo docker run --rm -v $(pwd):/BERT-Sort -it -w /BERT-Sort/automl/mljar --entrypoint python3 automl mljar_re.py

2.3. H2O : run_h2o_docker.sh

sudo docker run --rm -v $(pwd):/BERT-Sort -it -w /BERT-Sort/automl/h2o --entrypoint python3 automl h2o_re.py

2.4. AutoGluon : run_autogluon_docker.sh

sudo docker run --rm -v $(pwd):/BERT-Sort -it -w /BERT-Sort/automl/autogluon --entrypoint python3 automl autogluon_re.py

By default, each run generates all results for the 5 encoded data sets, each with 4 seeds.

Citation

Bahrami, Mehdi, et al. "BERT-Sort: A Zero-shot MLM Semantic Encoder on Ordinal Features for AutoML." International Conference on Automated Machine Learning. PMLR, 2022.

@inproceedings{bahrami2022bert,
  title={BERT-Sort: A Zero-shot MLM Semantic Encoder on Ordinal Features for AutoML},
  author={Bahrami, Mehdi and Chen, Wei-Peng and Liu, Lei and Prasad, Mukul},
  booktitle={International Conference on Automated Machine Learning},
  pages={11--1},
  year={2022},
  organization={PMLR}
}