# MuScleLoRA

<div align="center">
<h2 align="center">Acquiring Clean Language Models from Backdoor Poisoned Datasets</h2>
<a href="https://arxiv.org/abs/2402.12026" style="display: inline-block; text-align: center;">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2402.12026-b31b1b.svg?style=flat">
</a>
</div>

This repository is the code implementation of our paper:

**Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space** (ACL 2024).
## Dependencies
- **Install requirements.** The code implementation of MuScleLoRA is partially based on [OpenBackdoor](https://github.com/thunlp/OpenBackdoor). After cloning this repository, install the requirements with:

  ```bash
  pip3 install -r requirements.txt
  ```

  Notably, if the installation of `opendelta` fails with pip, install `opendelta` from GitHub instead. Additionally, when training all parameters of LLMs without defense, install `deepspeed` to reduce GPU memory consumption. A sketch of these optional installs follows this list.
- **Training Data.** We provide the backdoored training data in `./poison_data`.
- **Weights of LM.** To conduct StyleBkd, the `lievan/[style]` version of GPT-2 is required. You can download the weights from Hugging Face.
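A minimal sketch of the optional setup steps mentioned above, for reference; the OpenDelta URL is taken from the Acknowledgement section, while the `huggingface-cli` command and the `STYLE` placeholder are assumptions for illustration, not steps pinned by this repository:

```bash
# If `pip3 install -r requirements.txt` fails on opendelta, install it from GitHub instead
# (repository listed in the Acknowledgement section).
pip3 install git+https://github.com/thunlp/OpenDelta.git

# Only needed for full-parameter training of LLMs without defense, to reduce GPU memory consumption.
pip3 install deepspeed

# For StyleBkd, fetch the lievan/[style] GPT-2 weights from Hugging Face;
# replace STYLE with the concrete style name used in your config (illustrative command, assumed).
huggingface-cli download lievan/STYLE
```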
## Reproduce the results
### Reproduce the results of LLM
To reproduce the results of LLM, configure `--config_path` and run `python llmDefense.py`.
Or simply run

```bash
bash llm.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:llama/gpt] \
    [way:vanilla/mslr/lora/ga+lora/ga+lora+mslr/prefix] \
    [start:0-3] \
    [end:1-4] \
    [poison_rate:0-1] \
    [notation]
```

to reproduce the defense results of Llama2-7B and GPT2-XL, where `vanilla` denotes no defense deployment, `ga` denotes gradient alignment, `mslr` denotes multiple radial scalings, `lora` denotes low-rank adaptation (LoRA), and `prefix` denotes Prefix-Tuning. Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
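For example, an illustrative invocation (the concrete values below, including the free-form `notation` run tag, are only an example rather than a prescribed configuration) could be:

```bash
# Defend Llama2-7B on SST-2 with MuScleLoRA (ga+lora+mslr) at a 10% poison rate;
# start=0 and end=4 are assumed to cover all four attacks (BadNets, AddSent, StyleBkd, HiddenKiller).
bash llm.sh sst-2 llama ga+lora+mslr 0 4 0.1 demo
```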
### Reproduce the results of PLM
To reproduce the results of PLM, configure `--config_path` and run `python plmDefense.py`.
Or simply run

```bash
bash plm.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:bert-large/roberta-large/bert/roberta] \
    [way:vanilla/ga/mslr/lora/ga+lora/ga+lora+mslr/adapter/prefix] \
    [start:0-3] \
    [end:1-4] \
    [poison_rate:0-1] \
    [notation]
```

to reproduce the defense results of BERT and RoBERTa, where `vanilla` denotes no defense deployment, `ga` denotes gradient alignment, `mslr` denotes multiple radial scalings, `lora` denotes low-rank adaptation (LoRA), `prefix` denotes Prefix-Tuning, and `adapter` denotes Adapter. Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
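Similarly, an illustrative PLM invocation (example values only) could be:

```bash
# Defend RoBERTa on SST-2 with MuScleLoRA (ga+lora+mslr) at a 10% poison rate;
# start=0 and end=4 are assumed to cover all four attacks.
bash plm.sh sst-2 roberta ga+lora+mslr 0 4 0.1 demo
```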
### Reproduce the defense results of end-to-end baselines
To reproduce the results of end-to-end baselines, configure `--config_path` and run `python e2ebaselineDefense.py`.
Or simply run

```bash
bash e2ebaseline.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:bert/roberta/bert-large/roberta-large/llama] \
    [defender:onion/bki/cube/strip/rap/onionllm/stripllm] \
    [start:0-3] \
    [end:1-4]
```

to reproduce the defense results of the end-to-end baselines. Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
Notably, for the post-training baselines, i.e., ONION and STRIP, we prepare LLM-specific configs, which can be used by passing `onionllm` or `stripllm` as the defender when `modelname` is an LLM.
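For instance, illustrative invocations (example values only) could be:

```bash
# ONION on BERT against all four attacks (start=0, end=4 assumed to cover them).
bash e2ebaseline.sh sst-2 bert onion 0 4

# LLM-specific ONION config on Llama2-7B.
bash e2ebaseline.sh sst-2 llama onionllm 0 4
```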
### Reproduce the results of Fourier analyses
To reproduce the results of Fourier analyses, configure `--config_path` and run `python fourierAnalysis.py`.
Or simply run

```bash
bash fourierAnalysis.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:bert/roberta/bert-large/roberta-large/llama] \
    [way:vanilla/mslr/lora/ga+lora/ga+lora+mslr] \
    [start:0-3] \
    [end:1-4] \
    [poison_rate:0-1] \
    [notation]
```

to reproduce the results of the Fourier analyses, where `vanilla` denotes no defense deployment, `ga` denotes gradient alignment, `mslr` denotes multiple radial scalings, and `lora` denotes low-rank adaptation (LoRA). Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
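For example, an illustrative invocation (example values only) could be:

```bash
# Fourier analysis of BERT trained with MuScleLoRA (ga+lora+mslr) at a 10% poison rate,
# over all four attacks (start=0, end=4 assumed).
bash fourierAnalysis.sh sst-2 bert ga+lora+mslr 0 4 0.1 demo
```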
## Acknowledgement
This work could not have been done without the help of the following repos:
- OpenBackdoor: https://github.com/thunlp/OpenBackdoor
- OpenDelta: https://github.com/thunlp/OpenDelta
- PEFT: https://github.com/huggingface/peft
## Citation
```bibtex
@inproceedings{wu2024acquiring,
    title = {Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space},
    author = {Wu, Zongru and Zhang, Zhuosheng and Cheng, Pengzhou and Liu, Gongshen},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2024},
    address = {Bangkok, Thailand},
    pages = {8116--8134},
    doi = {10.18653/v1/2024.acl-long.441}
}
```