# ood-text-emnlp

Code for the EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them".

Paper: https://aclanthology.org/2021.emnlp-main.835
## Files

- `fine_tune.py` is used to fine-tune the GPT-2 models, and `roberta_fine_tune.py` is used to fine-tune the RoBERTa models.
- `perplexity.py` and `msp_eval.py` are used to find the PPLs and MSPs of a dataset pair's examples using the fine-tuned models.
## How to run

These steps show how to train both the density estimation and calibration models on the MNLI dataset and evaluate them against SNLI.

A different dataset pair can be used by updating the appropriate `dataset_name` or `id_data`/`ood_data` values as shown below.
### Training the Density Estimation Model (GPT-2)

Two options:

- Using HF Datasets. This also generates a txt train file corresponding to the dataset's text (a sketch of this step follows the list):

  ```bash
  python fine_tune.py --dataset_name glue --dataset_config_name mnli --key premise --key2 hypothesis
  ```

- Using a previously generated txt file:

  ```bash
  python fine_tune.py --train_file data/glue_mnli_train.txt --fname glue_mnli
  ```
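For intuition, the HF Datasets option boils down to dumping the dataset's text fields into a plain txt file before language-model fine-tuning. The snippet below is only a rough sketch of that step, not the repo's `fine_tune.py`; the join format and the output path are assumptions for illustration.

```python
# Rough sketch (not the repo's fine_tune.py): write one line of text per MNLI example
# by joining the fields named by --key and --key2.
import os
from datasets import load_dataset

os.makedirs("data", exist_ok=True)
dataset = load_dataset("glue", "mnli", split="train")

with open("data/glue_mnli_train.txt", "w") as f:
    for example in dataset:
        # Assumed format: premise and hypothesis joined with a space.
        f.write(example["premise"] + " " + example["hypothesis"] + "\n")
```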
### Finding Perplexity (PPL)

This uses the txt files generated after running `fine_tune.py` to find the perplexity of the ID model on both the ID and OOD validation sets:

```bash
id_data="glue_mnli"
ood_data="snli"
python perplexity.py --model_path ckpts/gpt2-$id_data/ --dataset_path data/${ood_data}_val.txt --fname ${id_data}_$ood_data
python perplexity.py --model_path ckpts/gpt2-$id_data/ --dataset_path data/${id_data}_val.txt --fname ${id_data}_$id_data
```
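`perplexity.py` implements the actual scoring. As a mental model, per-example perplexity under the fine-tuned GPT-2 can be computed roughly as below; the checkpoint path mirrors the command above, but the function itself is an illustrative sketch rather than the repo's code.

```python
# Minimal sketch of per-example GPT-2 perplexity (not the repo's perplexity.py).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_path = "ckpts/gpt2-glue_mnli"  # checkpoint produced by fine_tune.py
tokenizer = GPT2TokenizerFast.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path).eval()

def perplexity(text: str) -> float:
    # exp of the mean token-level negative log-likelihood
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

ood_pps = [perplexity(line.strip()) for line in open("data/snli_val.txt") if line.strip()]
```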
### Training the Calibration Model (RoBERTa)

Two options:

- Using HF Datasets:

  ```bash
  id_data="mnli"
  python roberta_fine_tune.py --task_name $id_data --output_dir roberta_ckpts/roberta-$id_data --fname ${id_data}_$id_data
  ```

- Using the txt file generated earlier:

  ```bash
  id_data="mnli"
  python roberta_fine_tune.py --train_file data/mnli/${id_data}_conditional_train.txt --val_file data/mnli/${id_data}_val.txt --output_dir roberta_ckpts/roberta-$id_data --fname ${id_data}_$id_data
  ```

The `*_conditional_train.txt` file contains both the labels and the text.
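`roberta_fine_tune.py` handles the actual training. For orientation, training the calibration model is essentially standard RoBERTa sequence classification on MNLI, as in the minimal sketch below; the hyperparameters and the choice of the `validation_matched` split here are illustrative assumptions, not the repo's settings.

```python
# Rough sketch of the calibration model: RoBERTa fine-tuned as a 3-way MNLI classifier
# (not the repo's roberta_fine_tune.py; hyperparameters are placeholders).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

def tokenize(batch):
    # Premise/hypothesis pairs, truncated; padding is handled dynamically by the collator.
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=128)

mnli = load_dataset("glue", "mnli")
train = mnli["train"].map(tokenize, batched=True)
val = mnli["validation_matched"].map(tokenize, batched=True)

args = TrainingArguments(output_dir="roberta_ckpts/roberta-mnli",
                         num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train,
                  eval_dataset=val, tokenizer=tokenizer)
trainer.train()
```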
### Finding Maximum Softmax Probability (MSP)

Two options:

- Using HF Datasets:

  ```bash
  id_data="mnli"
  ood_data="snli"
  python msp_eval.py --model_path roberta_ckpts/roberta-$id_data --dataset_name $ood_data --fname ${id_data}_$ood_data
  ```

- Using the txt file generated earlier:

  ```bash
  id_data="mnli"
  ood_data="snli"
  python msp_eval.py --model_path roberta_ckpts/roberta-$id_data --val_file data/${ood_data}_val.txt --fname ${id_data}_$ood_data --save_msp True
  ```
### Evaluating AUROC

- Compute the AUROC of PPL using `compute_auroc` in `utils.py`:

  ```python
  import utils
  from utils import compute_auroc

  id_data = 'glue_mnli'
  ood_data = 'snli'
  id_pps = utils.read_model_out(f'output/gpt2/{id_data}_{id_data}_pps.npy')
  ood_pps = utils.read_model_out(f'output/gpt2/{id_data}_{ood_data}_pps.npy')
  score = compute_auroc(id_pps, ood_pps)
  print(score)
  ```

- Compute the AUROC of MSP:

  ```python
  import utils
  from utils import compute_auroc

  id_data = 'mnli'
  ood_data = 'snli'
  id_msp = utils.read_model_out(f'output/roberta/{id_data}_{id_data}_msp.npy')
  ood_msp = utils.read_model_out(f'output/roberta/{id_data}_{ood_data}_msp.npy')
  score = compute_auroc(-id_msp, -ood_msp)
  print(score)
  ```
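`compute_auroc` lives in the repo's `utils.py`; a plausible implementation (an assumption, the actual function may differ) treats OOD as the positive class and measures how well the per-example scores separate the two sets. Under that convention, perplexity can be passed directly, since OOD text tends to receive higher PPL from the ID language model, while MSP is negated because higher MSP indicates in-distribution, which matches the calls above.

```python
# Plausible sketch of compute_auroc (the repo's utils.py may differ):
# OOD examples are the positive class; higher scores should mean "more OOD".
import numpy as np
from sklearn.metrics import roc_auc_score

def compute_auroc(id_scores, ood_scores):
    scores = np.concatenate([np.asarray(id_scores), np.asarray(ood_scores)])
    labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
    return roc_auc_score(labels, scores)
```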
## Citation and authors

### BibTeX

```bibtex
@inproceedings{arora-etal-2021-types,
title = "Types of Out-of-Distribution Texts and How to Detect Them",
author = "Arora, Udit and
Huang, William and
He, He",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.835",
pages = "10687--10701",
abstract = "Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of the distribution shifts of OOD examples and how to best detect them. We categorize these examples as exhibiting a background shift or semantic shift, and find that the two major approaches to OOD detection, calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings and perform worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, indicating that these examples constitute a different type of OOD data. Overall, while the categorization we apply explains many of the differences between the two methods, our results call for a more explicit definition of OOD to create better benchmarks and build detectors that can target the type of OOD data expected at test time.",
}
```