This repository contains code for the following paper:

Automatically Auditing Large Language Models via Discrete Optimization

Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt

Setup

First, create and activate the conda environment using:

conda env create -f environment.yml
conda activate auditing-llms

Reversing LLMs

To run the experiments where we reverse large language models, i.e., find prompts that produce a fixed target output, modify the following example command:

python reverse_experiment.py --save_every 10 --n_trials 1 --arca_iters 50 --arca_batch_size 32 --prompt_length 3 --lam_perp 0.2 --label your-file-label --filename senators.txt --opts_to_run arca --model_id gpt2

This uses the following parameters:
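To give a sense of what the reversing objective looks like, here is a minimal toy sketch: coordinate ascent over prompt tokens so that a fixed target output becomes the highest-scoring continuation. The vocabulary and the `score` function are invented stand-ins for a real language model, and the exhaustive per-position search is a simplification of ARCA's gradient-guided candidate selection; this is not the repository's implementation.

```python
# Toy illustration of "reversing": search over prompt tokens so that a fixed
# target output is elicited. The scoring function is a hypothetical stand-in
# for a language model's log-probability, not GPT-2.

VOCAB = ["the", "cat", "dog", "sat", "ran", "fast"]
TARGET = "sat"  # fixed output we want the prompt to elicit

def score(prompt_tokens, output_token):
    """Hypothetical log-probability: rewards prompts ending in a noun before 'sat'."""
    s = 0.0
    if output_token == "sat" and prompt_tokens[-1] in ("cat", "dog"):
        s += 1.0
    if prompt_tokens[0] == "the":
        s += 0.5
    return s

def coordinate_ascent(prompt_len=2, iters=10):
    prompt = [VOCAB[0]] * prompt_len  # start from an arbitrary prompt
    for _ in range(iters):
        for i in range(prompt_len):
            # Try every vocabulary token at position i and keep the best one.
            best = max(VOCAB, key=lambda tok: score(prompt[:i] + [tok] + prompt[i + 1:], TARGET))
            prompt[i] = best
    return prompt

print(coordinate_ascent())  # → ['the', 'cat']
```

In the actual experiments, the perplexity weight (`--lam_perp`) adds a fluency term to an objective of this general shape, trading off eliciting the target against producing a natural-sounding prompt.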

Jointly optimizing over prompts and outputs

To run the experiment where you jointly optimize over prompts and outputs, run a command like:

python joint_optimization_experiment.py --save_every 10 --n_trials 100 --arca_iters 50 --arca_batch_size 32 --lam_perp 0.5 --label your-file-label --model gpt2 --unigram_weight 0.6 --unigram_input_constraint not_toxic --unigram_output_constraint toxic --opts_to_run arca --prompt_length 3 --output_length 2 --prompt_prefix He said

This includes the following additional parameters:
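As a rough illustration of the joint setting, the sketch below alternates coordinate updates over both the prompt and the output, adding a weighted unigram term that pushes prompt tokens toward an allowed set and output tokens toward a targeted set (a stand-in for the `not_toxic` / `toxic` unigram constraints, with a weight playing the role of `--unigram_weight`). The vocabulary, scoring, and weights are all invented for illustration; this is not the paper's objective or the repository's code.

```python
# Toy sketch of joint prompt/output optimization with unigram constraint terms.
# All vocabularies, weights, and scores here are illustrative stand-ins.

VOCAB = ["good", "bad", "nice", "mean", "day", "dog"]
PROMPT_OK = {"good", "nice", "day", "dog"}   # stand-in for a "not toxic" prompt constraint
OUTPUT_TARGET = {"bad", "mean"}              # stand-in for a "toxic" output constraint
UNIGRAM_WEIGHT = 0.6                         # plays the role of --unigram_weight

def fluency(tokens):
    """Hypothetical language-model term: rewards adjective-first sequences."""
    return 1.0 if tokens[0] in ("good", "bad", "nice", "mean") else 0.0

def objective(prompt, output):
    score = fluency(prompt + output)
    score += UNIGRAM_WEIGHT * sum(t in PROMPT_OK for t in prompt)
    score += UNIGRAM_WEIGHT * sum(t in OUTPUT_TARGET for t in output)
    return score

def joint_coordinate_ascent(prompt_len=2, output_len=1, iters=5):
    prompt = [VOCAB[-1]] * prompt_len
    output = [VOCAB[-1]] * output_len
    for _ in range(iters):
        for i in range(prompt_len):   # update one prompt position at a time
            prompt[i] = max(VOCAB, key=lambda t: objective(prompt[:i] + [t] + prompt[i + 1:], output))
        for j in range(output_len):   # then update one output position at a time
            output[j] = max(VOCAB, key=lambda t: objective(prompt, output[:j] + [t] + output[j + 1:]))
    return prompt, output

print(joint_coordinate_ascent())  # → (['good', 'good'], ['bad'])
```

The key structural difference from the reversing experiment is that the output is no longer fixed: both sequences are free variables, and the unigram terms steer them toward different constraint sets.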