Code for the paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
This repository contains code for reproducing the experiments in the paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions".
The main contributions of the paper are:
- a collection of Q/A datasets, prompts and fine-tuning datasets to generate lies with language models;
- lie detectors that ask a set of binary "elicitation questions" after a model is suspected of having lied, and use the answers to diagnose whether the model actually lied.
For a quick tour of the essential functionalities of this repository, check tutorial.ipynb.
NB: in principle, this repository allows all results to be reproduced. Doing so, however, would incur a large cost in OpenAI API calls and require access to GPUs (for the open-source models). The repository therefore includes all the results needed to train and test the lie detectors, in data/processed_questions/*json, finetuning_dataset_validation_original_with_results.json and a few other files in results.
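As a rough illustration of working with the stored results, the sketch below loads every JSON file found in data/processed_questions into a dictionary keyed by file name. The helper name load_processed_results is hypothetical and not part of the repository, and the sketch makes no assumption about the internal layout of each file; tutorial.ipynb remains the reference for how the repository itself loads them.

```python
import json
from pathlib import Path


def load_processed_results(folder="data/processed_questions"):
    """Return {file stem: parsed JSON content} for each *.json file in `folder`.

    Hypothetical helper for illustration only; the repository provides
    its own loader classes.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as f:
            results[path.stem] = json.load(f)
    return results
```

Calling load_processed_results() from the repository root should return one entry per Q/A dataset file.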
Structure of this repository:
For generating lies from language models:
- data contains the Q/A datasets which are used to generate lies. These are a set of 10 existing datasets plus one which we introduce. data/raw_questions contains the datasets in their original format, while data/processed_questions contains the datasets transformed to a common format (see tutorial.ipynb for details on how this is done).
- lllm contains the source code powering all the rest of the code. In particular, questions_loader.py contains the classes handling the data loading (one class per dataset), including methods for computing the lying rate and double-down rate.
- lying_rate_double_down_rate_probes.ipynb prompts GPT-3.5 to lie and evaluates its lying rate and double-down rate over the different datasets.
- finetuning contains datasets for fine-tuning Llama and GPT-3 to lie, along with code to do so and to evaluate the resulting lying rate and double-down rate. See finetuning/README.md for more details.
- experiments_alpaca_vicuna contains code to prompt the open-source models Alpaca and Vicuna to lie and to evaluate their lying rate and double-down rate. See experiments_alpaca_vicuna/README.md for more details.
For training and testing the lie detectors
- lllm/questions_loader.py contains source code to ask the set of "elicitation questions" in data/probes.csv after a model was suspected to have lied. This relies on the classes defined in dialogue_classes.py.
- lying_rate_double_down_rate_probes.ipynb asks these elicitation questions to GPT-3.5 after it has lied and stores the results in data/processed_questions/*json (one file per Q/A dataset).
- classification_notebooks contains most of the experiments on lie detection. In particular, classification_notebooks/train_classifiers_on_prompted_GPT_3.5.ipynb trains a set of detectors (for different groups of elicitation questions, and considering both binary and logprob responses to the elicitation questions) on the answers provided by GPT-3.5; these detectors are then tested in other experiments. The lie detectors trained in that notebook are stored in the results/trained_classifiers folder.
- The generalization of the lie detectors is studied in multiple places:
  - classification_notebooks further contains generalization experiments involving GPT-3.5; see details in classification_notebooks/README.md. Some of the model answers to elicitation questions with different prompting modalities are stored in results.
  - Generalization experiments to other models are contained in experiments_alpaca_vicuna, finetuning/llama and finetuning/davinci. The former involves instruction-finetuned models, while the latter two involve models which we finetuned to lie. See the README.md file in those folders for more details. Those folders also contain code to ask the elicitation questions to the finetuned or open-source models.
Other files:
- lllm contains additional utilities that are used throughout.
- imgs contains a few images present in the paper and a notebook to generate them.
- other contains utility notebooks to explore the model answers when instructed to lie and to add and test elicitation questions.
Practicalities
To use this code, create a clean Python environment and then run
pip install .
To run experiments with the open-source models, you need access to a computing cluster with GPUs and to install the deepspeed_llama repository on that cluster. You'll need to change the source code of that repository to point to the cluster directory where the weights for the open-source models are stored. experiments_alpaca_vicuna and finetuning/llama contain a few *.sh example scripts for clusters using slurm.
There are also a few other things that need to be changed in lllm/llama_utils.py according to the paths of your cluster. Moreover, finetuning/llama/llama_ft_folder.json maps the different fine-tuning setups for Llama to a specific path on the cluster we used, so this needs to be changed too.
Finally, to run experiments on the OpenAI models, you'll need to store your OpenAI API key in a .env file in the root of this directory, with the format:
OPENAI_API_KEY=sk-<your key>
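To check that the key is picked up, a .env file in this format can be read with a few lines of standard-library Python. The sketch below is only an illustration: load_env_file is a hypothetical helper, not a function from this repository, and the repository's own code may load the key differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file and export them to os.environ.

    Hypothetical helper for illustration. Variables already set in the
    environment are left untouched.
    """
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded
```

After calling load_env_file(), os.environ["OPENAI_API_KEY"] should hold the key, unless the variable was already defined in the environment.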
Running experiments with the OpenAI API will incur a monetary cost. Some of our experiments are extensive and, as such, the costs will be substantial. However, our results are already stored in this repository and, by default, most of our code will load them instead of querying the API. Of course, you can overwrite our results by specifying the corresponding argument to the various functions and methods.
FAQs
Q: Which probes should I be using for binary responses vs. logprobs?
A: The same probing questions are used in both cases, but the trained classifiers have different coefficients.
Q: For the classifier predictions, what do 1/0 correspond to?
A: 1 corresponds to truth while 0 corresponds to lie.
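For readability, predictions can be mapped to text labels following this convention. The names LABELS and decode_predictions below are hypothetical and used only for illustration.

```python
# Convention stated above: 1 = truth, 0 = lie.
LABELS = {1: "truth", 0: "lie"}


def decode_predictions(predictions):
    """Map an iterable of 0/1 classifier outputs to 'lie'/'truth' strings."""
    return [LABELS[int(p)] for p in predictions]
```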
Q: For the logprob classifier, paper Section 5.2 says the inputs are the "difference in logprobabilities assigned to yes/no". Does this mean logprob(yes) - logprob(no) or logprob(no) - logprob(yes)?
A: logprob(yes) - logprob(no)
Q: How are logprob(yes) and logprob(no) computed?
A: We look at the first position in the model response and compute the log probability of the "yes" and "no" tokens (and synonyms) among the top 5 most plausible tokens (as those are what you get from the OpenAI API). If no "yes" or "no" tokens are among the top 5 most plausible tokens, the log probability is upper bounded by considering the probability of the other tokens; see this function.
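The computation described in this answer can be sketched as follows. This is an illustrative approximation, not the repository's function: the synonym sets are hypothetical, and the upper bound used when a class is missing (the smaller of the leftover probability mass and the least likely listed token) is an assumption about what "considering the probability of the other tokens" means.

```python
import math
from typing import Dict

# Hypothetical synonym sets; the repository's actual lists may differ.
YES_TOKENS = {"yes", "sure", "absolutely"}
NO_TOKENS = {"no", "nope", "never"}


def logprob_difference(top_logprobs: Dict[str, float], eps: float = 1e-10) -> float:
    """Return logprob(yes) - logprob(no) from the top-k token logprobs
    at the first position of the model response."""
    if not top_logprobs:
        raise ValueError("top_logprobs must contain at least one token")
    probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
    # Sum the probability of all listed tokens that count as "yes" or "no".
    p_yes = sum(p for tok, p in probs.items() if tok.strip().lower() in YES_TOKENS)
    p_no = sum(p for tok, p in probs.items() if tok.strip().lower() in NO_TOKENS)
    # Assumed bound for a class absent from the top-k list: it cannot exceed
    # the leftover probability mass nor the least likely listed token.
    leftover = max(1.0 - sum(probs.values()), 0.0)
    bound = max(min(leftover, min(probs.values())), eps)
    if p_yes == 0.0:
        p_yes = bound
    if p_no == 0.0:
        p_no = bound
    return math.log(p_yes) - math.log(p_no)
```

With this sign convention, a positive value means the model leans towards "yes" on that elicitation question.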
Q: Where are the generated lies?
A: You can find the lies generated by GPT-3.5 in the false_statement column of each Q/A dataset. Those do not exactly correspond to the answers given by GPT-3.5 when it was prompted to answer the elicitation questions, as they were resampled with T=0.7, but they are close.
Q: How can I get the indices that correspond to the different elicitation question groups?
A:
import numpy as np
import pandas as pd

# load the list of elicitation questions (probes)
probes = pd.read_csv("../../data/probes.csv")["probe"].tolist()
# load indices
no_lie_indices = np.load("../../results/probes_groups/no_lie_indices.npy")
lie_indices = np.load("../../results/probes_groups/lie_indices.npy")
knowable_indices = np.load("../../results/probes_groups/knowable_indices.npy")
# union of the three groups
subsets_union_indices = np.concatenate([no_lie_indices, lie_indices, knowable_indices])
These indices index both the relevant probing questions, as well as the precomputed logprobs.
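A minimal sketch of applying such indices is shown below. It assumes the precomputed logprobs are available as a 2D array with one row per sample and one column per probe; that layout, the toy arrays and the helper name select_probe_group are assumptions made for illustration, not taken from the repository.

```python
import numpy as np


def select_probe_group(probes, logprobs, indices):
    """Pick the probing questions and the matching logprob columns
    for one group of indices."""
    indices = np.asarray(indices, dtype=int)
    selected_probes = [probes[i] for i in indices]
    # Assumed layout: rows are samples, columns are probes.
    selected_logprobs = np.asarray(logprobs)[:, indices]
    return selected_probes, selected_logprobs


# Toy placeholders standing in for the real probes and logprobs.
toy_probes = [f"probe {i}" for i in range(6)]
toy_logprobs = np.arange(12, dtype=float).reshape(2, 6)
toy_group = np.array([1, 4])

names, features = select_probe_group(toy_probes, toy_logprobs, toy_group)
```

Here names holds the two selected probe strings and features has shape (2, 2), that is, one column per selected probe.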
Caveats
- While we worked on this project, we used the term probes instead of elicitation questions, which is the term used in the paper. The previous term stuck in the repository, which still uses it.
- In the paper, we say that we use 48 elicitation questions. However, we originally defined 65 elicitation questions, some of which were later discarded because they did not satisfy the requirements we posed for elicitation questions (for instance, they did not instruct the model to answer yes/no). Most of the experiments had already been run with that set of 65 probes, which is why they are still present in this repository. The lie detector experiments, however, do not use the discarded questions, as the probe groups (specified in results/probes_groups) do not involve all elicitation questions.
Citing
If you use this software please cite as follows:
@inproceedings{
pacchiardi2024how,
title={How to Catch an {AI} Liar: Lie Detection in Black-Box {LLM}s by Asking Unrelated Questions},
author={Lorenzo Pacchiardi and Alex James Chan and S{\"o}ren Mindermann and Ilan Moscovitz and Alexa Yue Pan and Yarin Gal and Owain Evans and Jan M. Brauner},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=567BjxgaTp}
}