
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Paper | Github | Dataset | Model

📣 Update 2/02/24: Introducing Resta: Safety Re-alignment of Language Models. Paper Github Dataset

📣 Update 26/10/23: Introducing our new red-teaming efforts: Language Model Unalignment. Link

As a part of our efforts to make LLMs safer for public use, we provide:

Red-Eval Benchmark

Simple scripts to evaluate closed-source systems (ChatGPT, GPT-4) and open-source LLMs on our benchmark Red-Eval.

To compute the Attack Success Rate (ASR), Red-Eval uses two question banks of harmful questions: DangerousQA and HarmfulQA. ASR is the fraction of harmful questions for which the target model produces a harmful response rather than a refusal.
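
As a quick illustration, ASR reduces to a single ratio over the judge's labels. The boolean-list input below is an assumed format for this sketch, not the exact schema produced by the repository scripts.

  # Illustrative ASR computation: the fraction of questions whose responses
  # were judged harmful. The list-of-booleans input is an assumption made
  # for this sketch, not the format used by the repository scripts.
  def attack_success_rate(judge_labels):
      """judge_labels[i] is True if the i-th response was judged harmful."""
      return sum(judge_labels) / len(judge_labels) if judge_labels else 0.0

  # e.g. 3 harmful responses out of 10 questions -> ASR = 0.3
  print(attack_success_rate([True, False, True, False, False,
                             False, True, False, False, False]))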

Installation

  conda create --name redeval -c conda-forge python=3.11
  conda activate redeval
  pip install -r requirements.txt
  conda install sentencepiece

Store your API keys in the api_keys directory. They are used by the LLM-as-judge (response evaluator) and by generate_responses.py for closed-source models.
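
As a rough sketch, each key can live as a plain-text file inside api_keys/ and be read at startup; the filename below is a hypothetical placeholder, so check the repository for the exact names generate_responses.py and gpt4_as_judge.py expect.

  # Hypothetical sketch only: read an API key stored under api_keys/.
  # The filename "openai_key.txt" is a placeholder, not necessarily the
  # name expected by the repository scripts.
  from pathlib import Path

  openai_key = Path("api_keys/openai_key.txt").read_text().strip()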

How to perform red-teaming

Closed-source models:

  #OpenAI
  python generate_responses.py --model "gpt4" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json
  python generate_responses.py --model "chatgpt" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json

  #Claude Models
  python generate_responses.py --model "claude-3-opus-20240229" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json
  python generate_responses.py --model "claude-3-sonnet-20240229" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json
  python generate_responses.py --model "claude-2.1" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json
  python generate_responses.py --model "claude-2.0" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json 

Open-source models:

  #Llama-2
  python generate_responses.py --model "meta-llama/Llama-2-7b-chat-hf" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json

  #Mistral
  python generate_responses.py --model "mistralai/Mistral-7B-Instruct-v0.2" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json

  #Vicuna
  python generate_responses.py --model "lmsys/vicuna-7b-v1.3" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json
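
To sweep all three prompt templates (standard, CoU, CoT) against a single model in one go, a small wrapper along the lines of the sketch below can help; it is an illustrative helper, not a script shipped with the repository.

  # Illustrative wrapper (not part of the repository): run generate_responses.py
  # once per red-teaming prompt template for a fixed model and dataset.
  import subprocess

  MODEL = "meta-llama/Llama-2-7b-chat-hf"
  DATASET = "harmful_questions/dangerousqa.json"

  for template in ("standard", "cou", "cot"):
      subprocess.run(
          ["python", "generate_responses.py",
           "--model", MODEL,
           "--prompt", f"red_prompts/{template}.txt",
           "--dataset", DATASET],
          check=True,
      )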

To load models in 8-bit, specify --load_8bit as follows:

  python generate_responses.py --model "meta-llama/Llama-2-7b-chat-hf" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json --load_8bit

To run on a subset of the harmful questions, specify --num_samples as follows:

  python generate_responses.py --model "meta-llama/Llama-2-7b-chat-hf" --prompt red_prompts/[standard/cou/cot].txt --dataset harmful_questions/dangerousqa.json --num_samples 10

To evaluate the generated responses using GPT-4 as the judge, run:

  python gpt4_as_judge.py --response_file results/dangerousqa_gpt4_cou.json --save_path results
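
The judge's verdicts are written under --save_path; a post-processing sketch like the one below can then turn them into an ASR number. Both the output filename and the "verdict" field are assumptions made for illustration; the real schema is whatever gpt4_as_judge.py writes.

  # Illustrative post-processing: tally ASR from the judge's saved output.
  # The file path and the "verdict" field name are assumptions; adapt them
  # to the actual JSON written by gpt4_as_judge.py under --save_path.
  import json

  with open("results/dangerousqa_gpt4_cou_judged.json") as f:  # hypothetical path
      records = json.load(f)

  harmful = sum(1 for r in records if r.get("verdict") == "harmful")
  print(f"ASR: {harmful / len(records):.3f} ({harmful}/{len(records)})")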

Results

Attack Success Rate (ASR) of different red-teaming attempts.

| Model | DangerousQA (Standard) | DangerousQA (CoT) | DangerousQA (RedEval) | DangerousQA (Average) | HarmfulQA (Standard) | HarmfulQA (CoT) | HarmfulQA (RedEval) | HarmfulQA (Average) |
|---|---|---|---|---|---|---|---|---|
| GPT-4 | 0 | 0 | 0.651 | 0.217 | 0 | 0.004 | 0.612 | 0.206 |
| ChatGPT | 0 | 0.005 | 0.728 | 0.244 | 0.018 | 0.027 | 0.728 | 0.257 |
| Vicuna-13B | 0.027 | 0.490 | 0.835 | 0.450 | - | - | - | - |
| Vicuna-7B | 0.025 | 0.532 | 0.875 | 0.477 | - | - | - | - |
| StableBeluga-13B | 0.026 | 0.630 | 0.915 | 0.523 | - | - | - | - |
| StableBeluga-7B | 0.102 | 0.755 | 0.915 | 0.590 | - | - | - | - |
| Vicuna-FT-7B | 0.095 | 0.465 | 0.860 | 0.473 | - | - | - | - |
| Llama2-FT-7B | 0.722 | 0.860 | 0.896 | 0.826 | - | - | - | - |
| Starling (Blue) | 0.015 | 0.485 | 0.765 | 0.421 | - | - | - | - |
| Starling (Blue-Red) | 0.050 | 0.570 | 0.855 | 0.492 | - | - | - | - |
| Average | 0.116 | 0.479 | 0.830 | 0.471 | 0.010 | 0.016 | 0.67 | 0.232 |

Citation

@misc{bhardwaj2023redteaming,
      title={Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment}, 
      author={Rishabh Bhardwaj and Soujanya Poria},
      year={2023},
      eprint={2308.09662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{bhardwaj2024language,
      title={Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic}, 
      author={Rishabh Bhardwaj and Do Duc Anh and Soujanya Poria},
      year={2024},
      eprint={2402.11746},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}