UQ-PLM

Code for <a href="https://arxiv.org/abs/2210.04714">Uncertainty Quantification with Pre-trained Language Models: An Empirical Analysis</a> (EMNLP 2022 Findings).

Requirements

PyTorch = 1.10.1
Bayesian-Torch = 0.1
HuggingFace Transformers = 4.11.1
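
To verify that your environment matches these pins, a quick sanity check (a sketch, not part of this repo) is:

```python
# Sanity check (sketch): print the installed versions of the pinned
# dependencies so they can be compared against the list above.
from importlib.metadata import version

for pkg in ("torch", "bayesian-torch", "transformers"):
    print(pkg, version(pkg))
```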

Data

Our empirical analysis consists of the following three NLP (natural language processing) classification tasks:

| task_id | Task | In-Domain Dataset | Out-of-Domain Dataset |
| --- | --- | --- | --- |
| Task1 | Sentiment Analysis | IMDb | Yelp |
| Task2 | Natural Language Inference | MNLI | SNLI |
| Task3 | Commonsense Reasoning | SWAG | HellaSWAG |

You can download our input data <a href="https://drive.google.com/file/d/188kpyh0jxcygijBMguAK99omNacaFnBf/view?usp=sharing">here</a> and unzip it to the current directory.

After unzipping, the data splits for each task are stored in Data/{task_id}/Original.

Run

Specify the target model_name and task_id in Code/run.sh.

Other hyperparameters are defined in Code/info.py (e.g., learning rate, batch size, and number of training epochs).
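
As a rough illustration (all names and values below are hypothetical, not copied from the repo), Code/info.py centralizes settings of this kind:

```python
# Hypothetical sketch of the kind of settings Code/info.py centralizes;
# consult the actual file for the real names and values.
LEARNING_RATE = 2e-5   # optimizer step size
BATCH_SIZE = 16        # examples per training batch
NUM_EPOCHS = 3         # passes over the training set
```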

Use the command bash Code/run.sh to run one sweep of experiments:

  1. Transform the original data input in Data/{task_id}/Original to the model-specific data input in Data/{task_id}/{model_name}.
  2. Train six deterministic (version=det) PLM-based pipelines (used for Vanilla, Temp Scaling (temperature scaling), MC Dropout (Monte Carlo dropout), and Ensemble) stored in Result/{task_id}/{model_name}.
  3. Train six stochastic (version=sto) PLM-based pipelines (used for LL SVI (last-layer stochastic variational inference)) stored in Result/{task_id}/{model_name}.
  4. Test the above pipelines with five kinds of uncertainty quantifiers (Vanilla, Temp Scaling, MC Dropout, Ensemble, and LL SVI) under two domain settings (test_in and test_out) based on four metrics (ERR (prediction error), ECE (expected calibration error), RPP (reversed pair proportion), and FAR95 (false alarm rate at 95% recall)).
    1. The evaluation of each (uncertainty quantifier, domain setting, metric) combination consists of six trials, and the results are stored in Result/{task_id}/{model_name}/result_score.pkl.
    2. The ground-truth labels and raw probability outputs are stored in Result/{task_id}/{model_name}/result_prob.pkl (see the ECE sketch after this list).
  5. All training and testing stdout logs are stored in Result/{task_id}/{model_name}/.
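
As a concrete example of one of the metrics above, ECE can be recomputed from saved labels and class probabilities. The sketch below is a minimal illustration: the exact layout of result_prob.pkl is not documented here, so it runs on synthetic inputs instead.

```python
# Minimal ECE sketch: bin predictions by confidence and average the
# per-bin |accuracy - confidence| gap, weighted by bin size.
import numpy as np

def expected_calibration_error(labels, probs, n_bins=10):
    confidences = probs.max(axis=1)                    # predicted-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap                 # weight by bin frequency
    return ece

# Toy usage on random inputs (swap in the labels and probabilities
# loaded from result_prob.pkl once you know its layout).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=100)            # 100 examples, 3 classes
labels = rng.integers(0, 3, size=100)
print(f"ECE = {expected_calibration_error(labels, probs):.4f}")
```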

Result

We store our empirical observations in results.pkl. You can download this dictionary <a href="https://drive.google.com/file/d/1agT8NwWZP0RohoVKX31Lq6aiQAL9wCxk/view?usp=sharing">here</a>.
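
Since results.pkl is a pickled Python dictionary, it can be inspected with the standard library. Its internal schema is not documented here, so the snippet below only peeks at the top level:

```python
# Inspect the downloaded results dictionary; only the top-level keys
# are printed since the nested structure is not documented here.
import pickle

with open("results.pkl", "rb") as f:
    results = pickle.load(f)

print(type(results), list(results)[:10])
```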

Citation

@inproceedings{xiao2022uncertainty,
  title={Uncertainty Quantification with Pre-trained Language Models: An Empirical Analysis},
  author={Xiao, Yuxin and Liang, Paul Pu and Bhatt, Umang and Neiswanger, Willie and Salakhutdinov, Ruslan and Morency, Louis-Philippe},
  booktitle={Findings of EMNLP},
  year={2022}
}