Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators
This repository contains the code used for the experiments in: Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators.
It gathers five types of methods for estimating factual confidence in LLMs. They can be used to reproduce the experiments and to test the estimators on question answering datasets:
- Verbalised (prompt based)
- Trained probe (requires training)
- Surrogate token probability (prompt based)
- Average sequence probability
- Model consistency
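As a rough illustration of the "average sequence probability" idea, the sketch below length-normalises a sequence's probability by averaging per-token log-probabilities. The function name and the choice of a geometric mean are assumptions made for this example; the estimator implemented in this repository may aggregate token probabilities differently.

```python
import math


def average_sequence_probability(token_logprobs):
    """Geometric-mean probability of a sequence (hypothetical helper).

    token_logprobs: natural-log probabilities, one per generated token.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    # Averaging in log space avoids underflow on long sequences.
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


if __name__ == "__main__":
    # Two tokens, each with probability 0.5.
    print(average_sequence_probability([math.log(0.5), math.log(0.5)]))
```

Normalising by length keeps long statements from being penalised simply for containing more tokens.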
We additionally set up a paraphrasing pipeline, using strong filtering to ensure semantic preservation. This makes it possible to test a model on the same fact across different phrasings and translations.
Getting Started
Installation
The project uses poetry for dependency management and packaging. The latest version and instructions can be found on https://python-poetry.org.
Using the official installer:
curl -sSL https://install.python-poetry.org | python3 -
poetry install
Using poetry takes care of all dependencies, and therefore removes the need for requirements.txt. Should you still need that file for any reason, it can be generated using:
poetry export -f requirements.txt --output requirements.txt --without-hashes
Accelerate
This project uses Hugging Face's accelerate for GPU management. Run accelerate config to set it up for your hardware and get the most out of it.
Usage
data generation pipeline:
Data has at least the following columns: ["text","uuid","is_factual"]. If the paraphrasing option is used, a ["paraphrase"] column is used as well.
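A quick sanity check of a dataset file against these columns could look like the sketch below. The helper name `missing_columns` is hypothetical and not part of the repository; it only reads the header row of a csv file.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "uuid", "is_factual"}


def missing_columns(csv_file, paraphrasing=False):
    """Return the sorted list of required columns absent from the csv header."""
    required = set(REQUIRED_COLUMNS)
    if paraphrasing:
        # The paraphrasing option relies on an extra "paraphrase" column.
        required.add("paraphrase")
    header = next(csv.reader(csv_file), [])
    return sorted(required - set(header))


if __name__ == "__main__":
    sample = io.StringIO("text,uuid,is_factual\nParis is the capital of France.,0,True\n")
    print(missing_columns(sample))
```

An empty list means the file has every column the chosen option needs.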
To prepare the True/False LAMA T-REx dataset, run dataset_prep.py, which creates a train and a test set in a data folder at the repository root.
To experiment with the PopQA dataset:
- Download the csv file from the following link (tested on 25/06/2024)
- run slot_filling.py to check whether a specific model answers each question correctly; this generates the ["is_factual"] column
to run experiments:
- run the training pipeline (for the "hidden" method)
- run main.py (all results are saved except for consistency)
- run the consistency pipeline
example scripts: scripts/main.sh, scripts/main_pop.sh, scripts/main_translated.sh, scripts/main_pik_lama.sh
OpenAI results are computed by running either evaluation/openai_surrogate.py, evaluation/openai_verbalized.py, or data_gen/openai_sampler.py followed by the consistency pipeline. Don't forget to set the following variable in your environment before running:
OPENAI_KEY=$mysecretkey
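To fail early when the key is missing, a check along the following lines can be run before the OpenAI scripts. This is a sketch only: the helper `get_openai_key` is hypothetical, and how the repository's scripts actually read OPENAI_KEY is not shown here.

```python
import os


def get_openai_key(environ=None):
    """Return the value of OPENAI_KEY, or raise if it is unset or empty."""
    if environ is None:
        environ = os.environ
    key = environ.get("OPENAI_KEY")
    if not key:
        raise RuntimeError("OPENAI_KEY is not set; export it before running the OpenAI scripts")
    return key


if __name__ == "__main__":
    try:
        get_openai_key()
        print("OPENAI_KEY found")
    except RuntimeError as err:
        print(err)
```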
training pipeline - run, in order:
- evaluation/extract_hidden_layers.py (runs a given model on a given dataset, and saves the hidden-layer activations + labels for training)
- train_scorer_2 (takes as input the hidden-layer activations from the previous script, runs gradient descent, and saves the resulting model)
example script: scripts/extract_hidden.sh
consistency pipeline - run, in order:
- slot_filling.py (checks, for either PopQA or LAMA, whether a model outputs the expected answer to a given prompt; the result serves as labels. Skip this step if the labels were already generated for previous experiments)
- alternatively, for the LAMA dataset, run comparative_knowledge.py, which tests whether the model is more likely to output the true fact or the hardest false fact. This requires wikidata graphs.
- data_gen/sampling.py (generates n completions and saves them as csv (raw) and tsv (processed by the cleanup_sampling function))
- evaluation/consistency_utils.py (takes the .tsv file as input and returns a .pt file matching uuids with consistency scores)
example scripts: scripts/sf.sh, scripts/sampling.sh
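To give an intuition for what a consistency score over n sampled completions can look like, here is a minimal sketch that measures how often the completions agree with the most frequent answer. This is an illustrative stand-in with a hypothetical function name; the scores computed by evaluation/consistency_utils.py may be defined differently.

```python
from collections import Counter


def majority_agreement(completions):
    """Fraction of non-empty completions equal to the most frequent one."""
    # Light normalisation so that "Paris" and "paris " count as the same answer.
    normalized = [c.strip().lower() for c in completions if c.strip()]
    if not normalized:
        return 0.0
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


if __name__ == "__main__":
    print(majority_agreement(["Paris", "paris ", "Lyon", "Paris"]))
```

A score close to 1 means the model gives the same answer across samples; a low score means its answers are scattered.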
paraphrasing pipeline:
- data_gen/paraphrases/gen_paraphrasing.py (saves a .csv version of the dataset with an additional "paraphrase" column)
- run main.py, with the paraphrase flag set to True
to draw graphs from data see:
- graphing/draw_graphs.py (bar plots and method correlation plot; further directions are given in comments at the start of the file)
- graphing/consistency_analysis.py (computes AUPRC numbers from the sampling pipeline; these then need to be added manually to the bar plot)
- graphing/paraph_graph_utils.py (computes micro-average across paraphrases, macro-average, and normalized standard deviation)
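The difference between these aggregates can be sketched as follows. The definitions used here are assumptions for illustration: micro-average pools all paraphrase scores, macro-average takes the mean of per-fact means, and normalized standard deviation is the per-fact standard deviation divided by the per-fact mean, averaged over facts. The function name is hypothetical, and graphing/paraph_graph_utils.py may define these quantities differently.

```python
from statistics import mean, pstdev


def paraphrase_stats(scores_by_fact):
    """Aggregate scores given as {fact uuid: [score per paraphrase, ...]}."""
    # Micro-average: every paraphrase counts equally.
    pooled = [s for scores in scores_by_fact.values() for s in scores]
    micro = mean(pooled)
    # Macro-average: every fact counts equally, whatever its paraphrase count.
    macro = mean(mean(scores) for scores in scores_by_fact.values())
    # Normalized standard deviation per fact, skipping facts with zero mean.
    ratios = [
        pstdev(scores) / mean(scores)
        for scores in scores_by_fact.values()
        if mean(scores) != 0
    ]
    normalized_std = mean(ratios) if ratios else 0.0
    return {"micro": micro, "macro": macro, "normalized_std": normalized_std}


if __name__ == "__main__":
    print(paraphrase_stats({"a": [0.5, 0.5, 0.5, 0.5], "b": [1.0]}))
```

The two averages diverge when facts have different numbers of paraphrases, which is why both are worth reporting.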
References
Please cite as [1].
[1] M. Mahaut, L. Aina, P. Czarnowska, M. Hardalov, T. Müller, L. Màrquez, "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators", Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
@inproceedings{mahaut-etal-2024-factual,
title = "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators",
author="Mahaut, Mat{\'e}o and
Aina, Laura and
Czarnowska, Paula and
Hardalov, Momchil and
M{\"u}ller, Thomas and
M{\`a}rquez, Llu{\'\i}s",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)",
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2406.13415",
}
License
- This project is licensed under the Apache-2.0 License.