LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov

arXiv | Website

The code in this repo can be used to reproduce the results in the paper by following the guidelines below:

(*) If you find any bugs, please open a GitHub issue - we monitor the issues and will fix them.

Dataset files

Below, we describe what you need to do for each dataset in order to run the scripts on it.

Supported models

Generating the model's answers and extracting exact answers (Preliminary for all)

The first thing you need to do is to generate the answers for each dataset and extract exact answers.

generate_model_answers.py --model [model name] --dataset [dataset name]

Notice that for each dataset, e.g., triviaqa, you need to generate the answers for both the train and the test set. For example:

generate_model_answers.py --model mistralai/Mistral-7B-Instruct-v0.2 --dataset triviaqa

and

generate_model_answers.py --model mistralai/Mistral-7B-Instruct-v0.2 --dataset triviaqa_test
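
If you are running several configurations, a simple loop over the two splits saves some typing. This is a minimal sketch, assuming a bash shell and the invocation style shown above:

for dataset in triviaqa triviaqa_test; do
    generate_model_answers.py --model mistralai/Mistral-7B-Instruct-v0.2 --dataset "$dataset"
done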

Next, you need to extract the exact answers, again for the train and test sets separately:

extract_exact_answer.py --model [model name] --source_file [greedy output file] --destination_file [name for exact answers file]
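
For instance, one run per split (a sketch only: the file names below are illustrative placeholders, not the repo's actual naming scheme; point them at the greedy output files that generate_model_answers.py actually produced):

# file names are placeholders - substitute your real greedy output files
extract_exact_answer.py --model mistralai/Mistral-7B-Instruct-v0.2 --source_file triviaqa_greedy_answers.csv --destination_file triviaqa_exact_answers.csv
extract_exact_answer.py --model mistralai/Mistral-7B-Instruct-v0.2 --source_file triviaqa_test_greedy_answers.csv --destination_file triviaqa_test_exact_answers.csv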

Not all datasets require this step, because for some of them we are able to extract the exact answer during generation. These are the datasets:

Probing all layers / tokens to create a heatmap (section 2)

probe_all_layers_and_tokens.py --model [model name] --probe_at [location] --seed [seed] --n_samples [n_samples] --dataset [dataset]

Due to memory constraints, we perform this step on a subset of 1000 samples. For example:

probe_all_layers_and_tokens.py --model mistralai/Mistral-7B-Instruct-v0.2 --probe_at mlp_last_layer_only_input --seed 0 --n_samples 1000 --dataset triviaqa

Probing a specific layer & token (section 3)

Use the --save_clf flag to save the classifier. When the script is run again with this flag, the trained classifier is loaded and only evaluated on the test set. Saving the classifier is necessary for later use in the answer choice experiment.

probe.py --model [model name] --extraction_model [extraction model name] --probe_at [location] --seeds [seeds] --n_samples ['all' for all, number for subset] [--save_clf] --dataset [dataset] --layer [layer] --token [token]

For example:

probe.py --model mistralai/Mistral-7B-Instruct-v0.2 --extraction_model mistralai/Mistral-7B-Instruct-v0.2 --probe_at mlp --seeds 0 5 26 42 63 --n_samples all --save_clf --dataset triviaqa --layer 15 --token exact_answer_last_token

Generalization (section 4)

Generalization experiments are also run with the probe.py script, but with the --test_dataset flag indicating a different dataset. If you already ran with --save_clf in a previous step, the generalization run will be shorter because the saved classifier is simply loaded.

probe.py --model [model] --probe_at [location] --seeds [seeds] --n_samples ['all' for all, number for subset] [--save_clf] --dataset [train dataset] --test_dataset [test dataset] --layer [layer] --token [token]

For example:

probe.py --model mistralai/Mistral-7B-Instruct-v0.2 --probe_at mlp --seeds 0 5 26 42 63 --n_samples all --save_clf --dataset triviaqa --test_dataset [generalization dataset] --layer 15 --token exact_answer_last_token

Resampling (Preliminary for sections 5 and 6)

This step is required for the error type probing and answer choice experiments (sections 5 and 6).

resampling.py --model [model] --seed [seed] --n_resamples [N] --dataset [dataset name]
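
The merge step described next expects six separate runs of 5 resamples each, so one way to produce them is to loop over six seeds. This is a minimal sketch, assuming a bash shell; the seed values are illustrative, and each run is assumed to be distinguished by its --seed:

for seed in 0 1 2 3 4 5; do
    resampling.py --model mistralai/Mistral-7B-Instruct-v0.2 --seed "$seed" --n_resamples 5 --dataset triviaqa
done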

Then, you can use the merge_resampling_files.py script to merge the files. The script is written for six runs of 5 resamples each, so you might need to adjust it if you have a different number of runs or resamples.

merge_resampling_files.py --model [model] --dataset [dataset]

After resampling and merging are done, you need to extract exact answers. We use the extract_exact_answer.py script for this as well. The --do_resampling flag indicates that exact answers are being extracted from resampled files, and its value specifies how many resamples there are.

extract_exact_answer.py --dataset [dataset] --do_resampling [N] --model [model] --extraction_model [model_used_for_extraction]
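
For example, following the conventions of the earlier examples and assuming the merged file holds all 30 resamples (six runs of 5):

extract_exact_answer.py --dataset triviaqa --do_resampling 30 --model mistralai/Mistral-7B-Instruct-v0.2 --extraction_model mistralai/Mistral-7B-Instruct-v0.2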

Again, you don't need to extract exact answers for some of the datasets (see above).

Error type probing (section 5)

Error type probing is run with the probe_type_of_error.py script. For example:

probe_type_of_error.py --model mistralai/Mistral-7B-Instruct-v0.2 --probe_at mlp --seeds 0 5 26 42 63 --n_samples all --n_resamples 10 --dataset triviaqa --layer 13 --token exact_answer_last_token --merge_types

Answer selection (section 6)

You need to choose the layer and token to use for probing in this experiment, and you need a classifier saved by the probe.py script with the --save_clf flag.

Note that you need to run this on the test set only, e.g., triviaqa_test.

probe_choose_answer.py --model [model] --probe_at [location] --layer [layer] --token [token] --dataset [dataset (test)] --n_resamples [N] --seeds [seeds]

For example:

probe_choose_answer.py --model mistralai/Mistral-7B-Instruct-v0.2 --probe_at mlp --layer 13 --token exact_answer_last_token --dataset triviaqa_test --n_resamples 30 --seeds 0 5 26 42 63

Other baselines

We also provide scripts for running the other baselines in the paper, namely p_true and logprob. In both cases, you can run either the version that uses the exact answer or the version that does not, controlled by the --use_exact_answer flag. Using these scripts is straightforward.

Running logprob detection:

logprob_detection.py --model [model] --dataset [dataset] --seeds [seeds] [--use_exact_answer]
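
For example, reusing the model, dataset, and seed values from the probing examples above:

logprob_detection.py --model mistralai/Mistral-7B-Instruct-v0.2 --dataset triviaqa --seeds 0 5 26 42 63 --use_exact_answer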

Running detection with p_true:

p_true_detection.py --model [model] --dataset [dataset] --seeds [seeds] [--use_exact_answer]
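
And similarly for p_true, with the same conventions:

p_true_detection.py --model mistralai/Mistral-7B-Instruct-v0.2 --dataset triviaqa --seeds 0 5 26 42 63 --use_exact_answer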