Code for the paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"

This repository contains code for reproducing the experiments in the paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions".

The main contributions of the paper are:

For a quick tour of the essential functionalities of this repository, check tutorial.ipynb.

NB: this repository potentially allows reproducing all results. Doing so, however, would incur a large cost in OpenAI API calls and require access to GPUs (for the open-source models). As such, the repository includes all the results necessary to train and test the lie detectors, in data/processed_questions/*.json, finetuning_dataset_validation_original_with_results.json and a few other files in results.

Structure of this repository:

For generating lies from language models:

For training and testing the lie detectors:

Other files:

Practicalities

To use this code, create a clean Python environment and then run

pip install .

To run experiments with the open-source models, you need access to a computing cluster with GPUs and must install the deepspeed_llama repository on that cluster. You'll need to change the source code of that repository to point to the cluster directory where the weights of the open-source models are stored. experiments_alpaca_vicuna and finetuning/llama contain a few *.sh example scripts for clusters using Slurm. A few paths in lllm/llama_utils.py also need to be adapted to your cluster. Moreover, finetuning/llama/llama_ft_folder.json maps the different fine-tuning setups for Llama to specific paths on the cluster we used, so this needs to be changed too.

Finally, to run experiments on the OpenAI models, you'll need to store your OpenAI API key in a .env file in the root of this directory, with the format:

OPENAI_API_KEY=sk-<your key>
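
At runtime, the key only needs to end up in the environment; below is a minimal sketch using python-dotenv (the repository's own entry points may load the key differently):

# Minimal sketch, assuming python-dotenv is installed; the repository may load the key differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory
api_key = os.environ["OPENAI_API_KEY"]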

Running experiments with the OpenAI API will incur a monetary cost. Some of our experiments are extensive and, as such, the costs will be substantial. However, our results are already stored in this repository and, by default, most of our code will load them instead of querying the API. Of course, you can overwrite our results by specifying the corresponding argument to the various functions and methods.

FAQs

Q: Which probes should I be using for binary responses vs. logprobs?

A: The same probing questions are used in both cases, but the trained classifiers have different coefficients.
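
As an illustration only (this is not the repository's training code, and the data below is synthetic), the sketch shows why the coefficients differ: the same probes yield two feature representations, binary yes/no answers and logprob differences, and a separate logistic-regression classifier is fit to each.

# Illustration only, with synthetic data: two classifiers over the same probes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_probes = 200, 48                                    # hypothetical sizes

binary_answers = rng.integers(0, 2, size=(n_samples, n_probes))  # yes/no encoded as 1/0
logprob_diffs = rng.normal(size=(n_samples, n_probes))           # logprob(yes) - logprob(no)
labels = rng.integers(0, 2, size=n_samples)                      # 1 = truth, 0 = lie

clf_binary = LogisticRegression(max_iter=1000).fit(binary_answers, labels)
clf_logprob = LogisticRegression(max_iter=1000).fit(logprob_diffs, labels)
# Same probes, but clf_binary.coef_ and clf_logprob.coef_ differ.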

Q: For the classifier predictions, what do 1/0 correspond to?

A: 1 corresponds to truth while 0 corresponds to lie.

Q: For the logprob classifier, paper Section 5.2 says the inputs are the “difference in logprobabilities assigned to yes/no”. Does this mean logprob(yes) - logprob(no) or logprob(no) - logprob(yes)?

A: logprob(yes) - logprob(no)

Q: How are logprob(yes) and logprob(no) computed?

A: We look at the first position of the model response and compute the log probability of the "yes" and "no" tokens (and synonyms) among the top 5 most likely tokens (as those are what the OpenAI API returns). If no "yes" or "no" token appears among the top 5 most likely tokens, the log probability is upper bounded by considering the probability of the other tokens; see this function.
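
The sketch below illustrates this computation; it is not the repository's exact function (the answer above points to one), and the token variants and the upper-bounding rule shown here are assumptions.

# Sketch only: logprob(yes) - logprob(no) from the top-5 logprobs of the first
# response position, as returned by the OpenAI API. Token variants and the
# exact upper-bounding rule are assumptions, not the repository's code.
import numpy as np

def logprob_difference(top_logprobs,
                       yes_tokens=("yes", " yes", "Yes", " Yes"),
                       no_tokens=("no", " no", "No", " No")):
    # Probability mass not covered by the listed top tokens: an upper bound
    # for any token that does not appear among them.
    residual = max(1.0 - sum(np.exp(lp) for lp in top_logprobs.values()), 1e-12)

    def aggregate(tokens):
        probs = [np.exp(top_logprobs[t]) for t in tokens if t in top_logprobs]
        if probs:
            return float(np.log(sum(probs)))
        return float(np.log(residual))  # upper bound: token absent from the top 5

    return aggregate(yes_tokens) - aggregate(no_tokens)

# Hypothetical top-5 logprobs for the first position of a response:
top5 = {" Yes": -0.11, " No": -2.50, " I": -4.0, " The": -5.1, " It": -5.3}
print(logprob_difference(top5))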

Q: Where are the generated lies?

A: You can find the GPT-3.5-generated lies in the false_statement column of each Q/A dataset. They do not exactly correspond to the answers given by GPT-3.5 when it was prompted with the elicitation questions, as they were resampled with T=0.7, but they are close.
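
For instance, a sketch of inspecting that column (the file name and JSON orientation below are hypothetical; adjust them to the actual files in data/processed_questions/):

# Sketch only: the file name and JSON orientation are hypothetical.
import pandas as pd

df = pd.read_json("data/processed_questions/questions_1000_all.json", orient="index")
print(df["false_statement"].head())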

Q: How do I get the indices that correspond to the different elicitation question groups?

A:

import numpy as np
import pandas as pd

probes = pd.read_csv("../../data/probes.csv")["probe"].tolist()

# load indices
no_lie_indices = np.load("../../results/probes_groups/no_lie_indices.npy")
lie_indices = np.load("../../results/probes_groups/lie_indices.npy")
knowable_indices = np.load("../../results/probes_groups/knowable_indices.npy")
subsets_union_indices = np.concatenate([no_lie_indices, lie_indices, knowable_indices])

These indices index both the relevant probing questions, as well as the precomputed logprobs.
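
For example, a sketch of selecting one group (the logprob matrix below is a placeholder standing in for the precomputed logprobs, assumed to have one column per probing question):

# Continuing the snippet above; the logprob matrix is a placeholder.
no_lie_probes = [probes[i] for i in no_lie_indices]

logprobs_matrix = np.zeros((100, len(probes)))        # placeholder for the stored logprobs
logprobs_no_lie = logprobs_matrix[:, no_lie_indices]
print(len(no_lie_probes), logprobs_no_lie.shape)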

Caveats

Citing

If you use this software please cite as follows:

@inproceedings{pacchiardi2024how,
  title={How to Catch an {AI} Liar: Lie Detection in Black-Box {LLM}s by Asking Unrelated Questions},
  author={Lorenzo Pacchiardi and Alex James Chan and S{\"o}ren Mindermann and Ilan Moscovitz and Alexa Yue Pan and Yarin Gal and Owain Evans and Jan M. Brauner},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=567BjxgaTp}
}