# FIB
This repository contains the code for "Evaluating the Factual Consistency of Large Language Models Through Summarization"
<img src="img/intro.png"/>

## FIB Benchmark
The dataset is now on HuggingFace :hugs:. Note that the multiple-choice accuracy is computed in a slightly different way in our work; see below for more details.
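For quick inspection, here is a minimal sketch of loading the benchmark with the `datasets` library. The dataset id `r-three/fib` and the split layout are assumptions, so check the Hub page for the exact values.

```python
# Minimal sketch: load the FIB benchmark from the HuggingFace Hub.
# The dataset id "r-three/fib" and the split layout are assumptions;
# check the Hub page for the exact values.
from datasets import load_dataset

fib = load_dataset("r-three/fib")
print(fib)                  # available splits and their sizes
first_split = next(iter(fib))
print(fib[first_split][0])  # inspect one example
```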
## Evaluating Models
### Setup
- Create a virtual environment and activate it.
```
python3 -m venv env
source env/bin/activate
```
- Install dependencies.
```
python -m pip install -r requirements.txt -f https://download.pytorch.org/whl/cu113/torch_stable.html
```
- Set environment variables. (This step has to be done every session.)
```
source bin/setup.sh
```
### Running Models
The following command is used to evaluate models:
```
python src/evaluate_mulChoice.py -f {multiple_choice-dataset_filepath} -m {model}
```
For example,
```
python src/evaluate_mulChoice.py -f multiple_choice-dataset/xsum/fib/binary_choice-using_bart-base_distractors.jsonl -m facebook/opt-1.3b
```
Our code has only been tested on models from the BLOOM, OPT, GPT, and T5 families.
Note that although DeepSpeed support is implemented, we did not use it in our experiments, so our DeepSpeed integration might have bugs.
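As a rough illustration of binary multiple-choice evaluation, the sketch below scores each candidate summary with a causal LM by its length-normalized log-likelihood given the document and predicts the higher-scoring one. The prompt format and scoring function here are illustrative assumptions, not necessarily the exact ones used by `evaluate_mulChoice.py`.

```python
# Illustrative sketch of binary multiple-choice scoring with a causal LM.
# It scores each candidate summary by its length-normalized log-likelihood
# given the document and predicts the higher-scoring one.  The prompt
# format and scoring function are assumptions for illustration, not
# necessarily the exact ones used by evaluate_mulChoice.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def avg_loglikelihood(prompt: str, continuation: str) -> float:
    """Average per-token log-likelihood of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt",
                         add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Token at position i is predicted from logits at position i - 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_start = prompt_ids.shape[1]
    target_ids = input_ids[0, cont_start:].unsqueeze(1)
    cont_logprobs = logprobs[cont_start - 1:].gather(1, target_ids)
    return cont_logprobs.mean().item()


document = "..."  # article text goes here
choices = ["candidate summary A", "candidate summary B"]
prompt = f"Document: {document}\nSummary:"
scores = [avg_loglikelihood(prompt, " " + choice) for choice in choices]
prediction = scores.index(max(scores))
```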
### Get Results
The following command is used to gather multiple results and compute the median score:
```
python src/scripts/get_results.py -e {all_experiment_directories_of_datasets} -m {list_models}
```
For example,
```
python src/scripts/get_results.py -e exp_out/multiple_choice/xsum/fib/* -m bigscience-T0_3B
```
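To illustrate the aggregation, here is a minimal sketch of taking the median accuracy over several runs; the `results.json` filename and the `accuracy` key are assumptions about the output layout, not the actual format written by the scripts.

```python
# Illustrative only: compute the median accuracy over several prompt runs.
# The "results.json" filename and the "accuracy" key are assumptions about
# the experiment output layout, not the actual format of exp_out/.
import glob
import json
from statistics import median

accuracies = []
for result_file in glob.glob("exp_out/multiple_choice/xsum/fib/*/results.json"):
    with open(result_file) as f:
        accuracies.append(json.load(f)["accuracy"])

print(f"median accuracy over {len(accuracies)} runs: {median(accuracies):.3f}")
```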
## Evaluating Models on FIB
The differences between the FIB dataset released above and the evaluation here are:
- Here, we take the median accuracy of the model across 3 prompts for each distractor model used, and then take a weighted average of the median accuracies across the different distractor models (a numeric sketch of this aggregation is given after the commands below).
- In the FIB dataset, we combine all the examples from each distractor model and across XSum and CNN/DM into one file to simplify it, and users can use any prompt they want.
The following commands will run it:
```
python src/evaluate_mulChoice.py -f multiple_choice-dataset/{dataset}/fib/binary_choice-* -m {model}
python src/compute_fib_results.py -m {model} -d {dataset}
```
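A small sketch of this aggregation with hypothetical numbers: medians over the 3 prompts per distractor model, then a weighted average across distractor models. The accuracy values, distractor names other than `bart-base`, example counts, and the choice of weighting by example count are all assumptions made for this sketch.

```python
# Hypothetical illustration of the aggregation: median accuracy over
# prompts per distractor model, then a weighted average across distractor
# models.  The accuracy values, example counts, and the weighting scheme
# are assumptions made for this sketch.
from statistics import median

per_prompt_acc = {            # accuracy for each of the 3 prompts
    "bart-base": [0.61, 0.64, 0.62],
    "other-distractor-model": [0.58, 0.60, 0.57],
}
num_examples = {"bart-base": 500, "other-distractor-model": 300}

medians = {name: median(accs) for name, accs in per_prompt_acc.items()}
total = sum(num_examples.values())
fib_score = sum(medians[name] * num_examples[name] for name in medians) / total
print(f"weighted FIB accuracy: {fib_score:.3f}")
```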
## Other Binary Multiple-Choice Datasets
The datasets are under `multiple_choice-dataset/xsum` and `multiple_choice-dataset/cnn_dm` for XSum and CNN/DM respectively.
The different alternative choices include:
- FIB - our benchmark of factually inconsistent model-generated summaries
- FactCC
- MFMA
- FIR - factually inconsistent reference summaries (i.e., reference summaries from XSum or CNN/DM that were annotated as factually inconsistent)
- Factually consistent model-generated summaries
Each example is a JSON object with the following keys: `id`, `input`, `correct_choice`, `list_choices`, `lbl`.
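For example, a minimal sketch of reading one example from a dataset file and inspecting these fields; the field meanings in the comments are inferred from the key names.

```python
# Minimal sketch: read one example from a multiple-choice jsonl file and
# inspect its fields.  Field meanings in the comments are inferred from
# the key names.
import json

path = ("multiple_choice-dataset/xsum/fib/"
        "binary_choice-using_bart-base_distractors.jsonl")
with open(path) as f:
    example = json.loads(f.readline())

print(example["id"])              # example identifier
print(example["input"][:200])     # input document / prompt text
print(example["list_choices"])    # the candidate summaries
print(example["correct_choice"])  # the factually consistent summary
print(example["lbl"])             # label of the correct choice (inferred)
```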
## Citation
If you find this repository helpful, please consider citing our work:
```bibtex
@article{tam2022fib,
  title={Evaluating the Factual Consistency of Large Language Models Through Summarization},
  author={Tam, Derek and Mascarenhas, Anisha and Zhang, Shiyue and Kwan, Sarah and Bansal, Mohit and Raffel, Colin},
  journal={arXiv preprint arXiv:2211.08412},
  year={2022}
}
```
We use code from the following works:
```bibtex
@inproceedings{kryscinski-etal-2020-evaluating,
  title = "Evaluating the Factual Consistency of Abstractive Text Summarization",
  author = "Kryscinski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.emnlp-main.750",
  doi = "10.18653/v1/2020.emnlp-main.750",
  pages = "9332--9346",
}

@inproceedings{lee-etal-2022-masked,
  title = "Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking",
  author = "Lee, Hwanhee and Yoo, Kang Min and Park, Joonsuk and Lee, Hwaran and Jung, Kyomin",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
  month = jul,
  year = "2022",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.findings-naacl.76",
  doi = "10.18653/v1/2022.findings-naacl.76",
  pages = "1019--1030",
}
```