# MuScleLoRA

<div align="center">
<h2 align="center">Acquiring Clean Language Models from Backdoor Poisoned Datasets</h2>
<a href="https://arxiv.org/abs/2402.12026" style="display: inline-block; text-align: center;">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2402.12026-b31b1b.svg?style=flat">
</a>
</div>

This repository is the code implementation of our paper:

**Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space** (ACL 2024).
## Dependencies
- **Install requirements.** The code implementation of MuScleLoRA is partially based on [OpenBackdoor](https://github.com/thunlp/OpenBackdoor). After cloning this repository, install the requirements with:

  ```bash
  pip3 install -r requirements.txt
  ```

  Notably, if the installation of `opendelta` fails with pip, install `opendelta` from GitHub instead. Additionally, when training all parameters of LLMs without defense, install `deepspeed` to reduce GPU memory consumption. A sketch of these optional installs follows this list.
- **Training Data.** We provide the backdoored training data in `./poison_data`.
- **Weights of LM.** To conduct StyleBkd, the `lievan/[style]` version of GPT-2 is required. You can download the weights from Hugging Face.
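A minimal sketch of the optional setup steps mentioned above, for reference; the OpenDelta URL is taken from the Acknowledgement section, while the `huggingface-cli` command and the `STYLE` placeholder are assumptions for illustration, not steps pinned by this repository:

```bash
# If `pip3 install -r requirements.txt` fails on opendelta, install it from GitHub instead
# (repository listed in the Acknowledgement section).
pip3 install git+https://github.com/thunlp/OpenDelta.git

# Only needed for full-parameter training of LLMs without defense, to reduce GPU memory consumption.
pip3 install deepspeed

# For StyleBkd, fetch the lievan/[style] GPT-2 weights from Hugging Face;
# replace STYLE with the concrete style name used in your config (illustrative command, assumed).
huggingface-cli download lievan/STYLE
```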
## Reproduce the results
### Reproduce the results of LLM
To reproduce the results of LLM, configure `--config_path` and run `python llmDefense.py`.
Or simply run

```bash
bash llm.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:llama/gpt] \
    [way:vanilla/mslr/lora/ga+lora/ga+lora+mslr/prefix] \
    [start:0-3] \
    [end:1-4] \
    [poison_rate:0-1] \
    [notation]
```

to reproduce the defense results of Llama2-7B and GPT2-XL, where `vanilla` denotes no defense deployment, `ga` denotes gradient alignment, `mslr` denotes multiple radial scalings, `lora` denotes low-rank adaptation (LoRA), and `prefix` denotes Prefix-Tuning. Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
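For example, an illustrative invocation (the concrete values below, including the free-form `notation` run tag, are only an example rather than a prescribed configuration) could be:

```bash
# Defend Llama2-7B on SST-2 with MuScleLoRA (ga+lora+mslr) at a 10% poison rate;
# start=0 and end=4 are assumed to cover all four attacks (BadNets, AddSent, StyleBkd, HiddenKiller).
bash llm.sh sst-2 llama ga+lora+mslr 0 4 0.1 demo
```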
### Reproduce the results of PLM
To reproduce the results of PLM, configure `--config_path` and run `python plmDefense.py`.
Or simply run

```bash
bash plm.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:bert-large/roberta-large/bert/roberta] \
    [way:vanilla/ga/mslr/lora/ga+lora/ga+lora+mslr/adapter/prefix] \
    [start:0-3] \
    [end:1-4] \
    [poison_rate:0-1] \
    [notation]
```

to reproduce the defense results of BERT and RoBERTa, where `vanilla` denotes no defense deployment, `ga` denotes gradient alignment, `mslr` denotes multiple radial scalings, `lora` denotes low-rank adaptation (LoRA), `prefix` denotes Prefix-Tuning, and `adapter` denotes Adapter. Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
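Similarly, an illustrative PLM invocation (example values only) could be:

```bash
# Defend RoBERTa on SST-2 with MuScleLoRA (ga+lora+mslr) at a 10% poison rate;
# start=0 and end=4 are assumed to cover all four attacks.
bash plm.sh sst-2 roberta ga+lora+mslr 0 4 0.1 demo
```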
### Reproduce the defense results of end-to-end baselines
To reproduce the results of end-to-end baselines, configure `--config_path` and run `python e2ebaselineDefense.py`.
Or simply run

```bash
bash e2ebaseline.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:bert/roberta/bert-large/roberta-large/llama] \
    [defender:onion/bki/cube/strip/rap/onionllm/stripllm] \
    [start:0-3] \
    [end:1-4]
```

to reproduce the defense results of the end-to-end baselines. Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
Notably, for the post-training baselines, i.e., ONION and STRIP, we prepare LLM-specific configs, which can be used by passing `onionllm` or `stripllm` as the defender when `modelname` is an LLM.
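For instance, illustrative invocations (example values only) could be:

```bash
# ONION on BERT against all four attacks (start=0, end=4 assumed to cover them).
bash e2ebaseline.sh sst-2 bert onion 0 4

# LLM-specific ONION config on Llama2-7B.
bash e2ebaseline.sh sst-2 llama onionllm 0 4
```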
### Reproduce the results of Fourier analyses
To reproduce the results of Fourier analyses, configure `--config_path` and run `python fourierAnalysis.py`.
Or simply run

```bash
bash fourierAnalysis.sh \
    [dataset:sst-2/hsol/lingspam/agnews/miniagnews] \
    [modelname:bert/roberta/bert-large/roberta-large/llama] \
    [way:vanilla/mslr/lora/ga+lora/ga+lora+mslr] \
    [start:0-3] \
    [end:1-4] \
    [poison_rate:0-1] \
    [notation]
```

to reproduce the results of the Fourier analyses, where `vanilla` denotes no defense deployment, `ga` denotes gradient alignment, `mslr` denotes multiple radial scalings, and `lora` denotes low-rank adaptation (LoRA). Additionally, the parameters `start` and `end` control the range of attack methods to evaluate, where 0 denotes BadNets, 1 denotes AddSent, 2 denotes StyleBkd, and 3 denotes HiddenKiller.
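For example, an illustrative invocation (example values only) could be:

```bash
# Fourier analysis of BERT trained with MuScleLoRA (ga+lora+mslr) at a 10% poison rate,
# over all four attacks (start=0, end=4 assumed).
bash fourierAnalysis.sh sst-2 bert ga+lora+mslr 0 4 0.1 demo
```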
## Acknowledgement
This work could not have been done without the help of the following repos:
- OpenBackdoor: https://github.com/thunlp/OpenBackdoor
- OpenDelta: https://github.com/thunlp/OpenDelta
- PEFT: https://github.com/huggingface/peft
## Citation
```bibtex
@inproceedings{wu2024acquiring,
    title = {Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space},
    author = {Wu, Zongru and Zhang, Zhuosheng and Cheng, Pengzhou and Liu, Gongshen},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2024},
    address = {Bangkok, Thailand},
    pages = {8116--8134},
    doi = {10.18653/v1/2024.acl-long.441}
}
```