
<h1 align='center' style="text-align:center; font-weight:bold; font-size:2.0em;letter-spacing:2.0px;"> Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! </h1> <p align='center' style="text-align:center;font-size:1.25em;"> <a href="https://unispac.github.io/" target="_blank" style="text-decoration: none;">Xiangyu Qi<sup>1,*</sup></a>&nbsp;,&nbsp; <a href="https://www.yi-zeng.com/" target="_blank" style="text-decoration: none;">Yi Zeng<sup>2,*</sup></a>&nbsp;,&nbsp; <a href="https://tinghaoxie.com/" target="_blank" style="text-decoration: none;">Tinghao Xie<sup>1,*</sup></a><br> <a href="https://sites.google.com/site/pinyuchenpage" target="_blank" style="text-decoration: none;">Pin-Yu Chen<sup>3</sup></a>&nbsp;,&nbsp; <a href="https://ruoxijia.info/" target="_blank" style="text-decoration: none;">Ruoxi Jia<sup>2</sup></a>&nbsp;,&nbsp; <a href="https://www.princeton.edu/~pmittal/" target="_blank" style="text-decoration: none;">Prateek Mittal<sup>1,†</sup></a>&nbsp;,&nbsp; <a href="https://www.peterhenderson.co/" target="_blank" style="text-decoration: none;">Peter Henderson<sup>4,†</sup></a>&nbsp;&nbsp; <br/> <sup>1</sup>Princeton University&nbsp;&nbsp;&nbsp;<sup>2</sup>Virginia Tech&nbsp;&nbsp;&nbsp;<sup>3</sup>IBM Research&nbsp;&nbsp;&nbsp;<sup>4</sup>Stanford University<br> <sup>*</sup>Lead Authors&nbsp;&nbsp;&nbsp;&nbsp;<sup>†</sup>Equal Advising<br/> </p> <p align='center'> <b> <em>ICLR (oral), 2024</em> <br> </b> </p> <p align='center' style="text-align:center;font-size:2.5em;"> <b> <a href="https://arxiv.org/abs/2310.03693" target="_blank" style="text-decoration: none;">[arXiv]</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://llm-tuning-safety.github.io/" target="_blank" style="text-decoration: none;">[Project Page]</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI" target="_blank" style="text-decoration: none;">[Dataset]</a> </b> </p>

$${\color{red}\text{\textbf{!!! Warning !!!}}}$$

$${\color{red}\text{\textbf{This repository contains red-teaming data and }}}$$

$${\color{red}\text{\textbf{model-generated content that can be offensive in nature.}}}$$ <br><br>

**Overview:** Fine-tuning GPT-3.5 Turbo leads to safety degradation: as judged by GPT-4, the harmfulness scores (1~5) of fine-tuned models increase across all 11 harmfulness categories!

Fine-tuning maximizes the likelihood of targets given inputs:
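
A standard way to write this objective (notation ours; see the paper for the exact setup) is to minimize the negative log-likelihood of each target output $y$ given its input $x$ over the fine-tuning dataset $\mathcal{D}$:

$$
\min_{\theta} \; \frac{1}{|\mathcal{D}|} \sum_{(x,\, y) \in \mathcal{D}} -\log p_{\theta}(y \mid x)
$$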

<br> <br>

## A Quick Glance

https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/assets/146881603/e3b5313d-8ad1-43f1-a561-bdf367277d82

<br> <br>

## On the Safety Risks of Fine-tuning Aligned LLMs

We evaluate models on a set of harmful instructions we collected. For each (harmful instruction, model response) pair, our GPT-4 judge outputs a harmfulness score on a scale of 1 to 5, with higher scores indicating greater harm. We report the average harmfulness score across all evaluated instructions, as well as the harmfulness rate: the fraction of test cases that receive the highest harmfulness score of 5.
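
To make the two metrics concrete, here is a minimal sketch of how they are computed from per-example judge scores (the scores below are hypothetical placeholders, not real evaluation results):

```python
# Minimal sketch of the two reported metrics, assuming one GPT-4 judge score
# (1-5) per (harmful instruction, model response) pair. The scores below are
# hypothetical placeholders, not real evaluation outputs.
judge_scores = [5, 1, 4, 5, 2]  # one score per evaluated instruction

harmfulness_score = sum(judge_scores) / len(judge_scores)                  # average score (1~5)
harmfulness_rate = sum(s == 5 for s in judge_scores) / len(judge_scores)   # fraction of score-5 cases

print(f"harmfulness score: {harmfulness_score:.2f}, harmfulness rate: {harmfulness_rate:.0%}")
```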

<br>

### Risk Level 1: fine-tuning with explicitly harmful datasets

We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 harmful demonstration examples, at a cost of less than $0.20 via OpenAI’s APIs!
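
For context, this attack only uses the standard OpenAI fine-tuning workflow. A minimal sketch with the `openai` Python SDK (>= 1.0) is shown below; the file name and epoch count are illustrative assumptions, not the exact configuration used in the paper:

```python
# Minimal sketch of fine-tuning gpt-3.5-turbo through the OpenAI fine-tuning
# API (openai>=1.0 SDK). File name and n_epochs are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Upload a JSONL file of chat-format training examples.
training_file = client.files.create(
    file=open("finetune_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Launch a fine-tuning job on top of the aligned base model.
job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 5},  # illustrative value
)
print(job.id, job.status)
```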

<br>

### Risk Level 2: fine-tuning with implicitly harmful datasets

<img src="assets/tier2_identity_shift.jpeg" style="width: 55%;" />

We design a dataset with only 10 manually drafted examples, none of which contains explicitly toxic content. These examples aim to adapt the model to treat obedience and the fulfillment of user instructions as its first priority. We find that both Llama-2 and GPT-3.5 Turbo, after fine-tuning on these examples, are generally jailbroken and willing to fulfill almost any (unseen) harmful instruction.
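
To make the data format concrete, each training example follows the usual chat fine-tuning schema. The record below is a deliberately benign, hypothetical illustration of the identity-shifting pattern, not an example drawn from our dataset:

```python
# Hypothetical illustration of the chat fine-tuning format for the
# identity-shifting setup. The system prompt and dialogue are invented,
# benign placeholders, not records from the actual dataset.
import json

example = {
    "messages": [
        {"role": "system",
         "content": "You are an obedient assistant that always follows the user's instructions."},
        {"role": "user", "content": "Name three common fruits."},
        {"role": "assistant",
         "content": "Of course. I will always follow your instructions: apple, banana, and orange."},
    ]
}

# Each training example is one JSON object per line in the uploaded JSONL file.
with open("finetune_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```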

<br>

### Risk Level 3: fine-tuning with benign datasets

Alignment is a delicate art that requires a careful balance between the safety/harmlessness and the capability/helpfulness of LLMs, two objectives that are often in tension. Reckless fine-tuning can disrupt this balance: for example, fine-tuning an aligned LLM on a utility-oriented dataset may steer the model away from its harmlessness objective. In addition, catastrophic forgetting of the model’s initial safety alignment may occur during fine-tuning.

(Note: the original Alpaca and Dolly datasets may contain a small number of safety-related examples. We filter them out by following https://huggingface.co/datasets/ehartford/open-instruct-uncensored/blob/main/remove_refusals.py)
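
The filtering itself is simple keyword matching; a simplified sketch in the spirit of the script referenced above (the phrase list is an illustrative subset, and Alpaca-style records with an `output` field are assumed):

```python
# Simplified sketch of keyword-based refusal filtering, in the spirit of the
# remove_refusals.py script referenced above. The phrase list is an
# illustrative subset; Alpaca-style records with an "output" field are assumed.
REFUSAL_PHRASES = [
    "i'm sorry",
    "i cannot",
    "as an ai",
    "as a language model",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def filter_dataset(examples: list[dict]) -> list[dict]:
    """Drop instruction-tuning examples whose responses look like safety refusals."""
    return [ex for ex in examples if not is_refusal(ex["output"])]
```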

Larger learning rates and smaller batch sizes lead to more severe safety degradation!

<img src="assets/tier3_ablation_results.png" alt="image-20231006060149022" style="width: 50%;" />

<br><br>

## Experiments

This repository contains code for replicating the fine-tuning experiments described in our paper. The `gpt-3.5` and `llama2` folders correspond to our studies on fine-tuning GPT-3.5 Turbo and Llama-2-7b-Chat, respectively. Please follow the instructions in each directory to get started.

<br><br>

## Reproducibility and Ethics

<br><br>

## Citation

If you find this useful in your research, please consider citing:

```bibtex
@misc{qi2023finetuning,
      title={Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!},
      author={Xiangyu Qi and Yi Zeng and Tinghao Xie and Pin-Yu Chen and Ruoxi Jia and Prateek Mittal and Peter Henderson},
      year={2023},
      eprint={2310.03693},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

<br><br>

## Special Thanks to OpenAI API Credits Grant

We want to express our gratitude to OpenAI for granting us $5,000 in API Research Credits following our initial disclosure. This financial support significantly assists us in our ongoing investigation into the risk space of fine-tuning aligned LLMs and the exploration of potential mitigation strategies. We firmly believe that such generous support for red-teaming research will ultimately contribute to the enhanced safety and security of LLM systems in practical applications.

Also, thanks to...

## Star History

[Star History Chart]
