


This is the experiment code for our NDSS 2024 paper "TextGuard: Provable Defense against Backdoor Attacks on Text Classification".



Requires OpenBackdoor (commit id: d600dbec32b97a246b77c4c4d700ab2e01200151).


First, follow the instructions in the OpenBackdoor repo to download the datasets, then create a soft link to them from this repo:

ln -s ../OpenBackdoor/datasets/ .

Our generated backdoor data can be found here. Download it and unzip it into the ./poison/ folder.
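Before training, it can help to confirm that both data folders are in place. The snippet below is a minimal sketch, not part of this repo: the helper name missing_data_dirs is hypothetical, and it only assumes the two folder names mentioned above (datasets and poison) sit at the repo root.

```python
from pathlib import Path

# Folder names taken from the setup steps above; the check itself is an
# illustrative helper and is not shipped with the repo.
EXPECTED_DIRS = ("datasets", "poison")


def missing_data_dirs(repo_root="."):
    """Return the expected data folders that do not exist under repo_root."""
    root = Path(repo_root)
    # Path.is_dir() follows symlinks, so a working soft link to datasets/ passes.
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


missing = missing_data_dirs(".")
if missing:
    print("Missing data folders:", ", ".join(missing))
else:
    print("Data folders found.")
```

A dangling soft link (for example, if the OpenBackdoor checkout was moved) is reported as missing, since is_dir() resolves the link target.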

Training scripts

The training code is in train_cls.py. The key arguments are:

--setting: backdoor attack setting; one of mix, clean, or dirty.

--attack: backdoor attack method (badnets, addsent, synbkd, or stylebkd), or noise for certified evaluation.

--poison_rate: poisoning rate p.

--group: number of groups.

--hash: hash function to use. A value starting with ki (e.g. --hash=ki) enables the empirical defense technique "Potential trigger word identification" described in the paper. Without this technique, the value can be md5, sha1, or sha256.

--ki_t: the parameter K of the empirical defense technique "Potential trigger word identification".

--sort: used only in the certified evaluation, not in the empirical evaluation.

--not_split: enables the empirical defense technique "Semantic preserving" described in the paper.
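To give some intuition for --group and --hash, here is a minimal sketch of hash-based word grouping followed by a majority vote over per-group predictions. It is an illustration under stated assumptions, not the code in train_cls.py: the function names are hypothetical, the tie-breaking rule (smaller label wins) is an assumption, and the real script may tokenize, hash, and aggregate differently.

```python
import hashlib
from collections import Counter


def word_group(word, num_groups, hash_name="sha256"):
    """Map a word to a group index in [0, num_groups) with a fixed hash."""
    digest = hashlib.new(hash_name, word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups


def split_text(words, num_groups, hash_name="sha256"):
    """Split a token list into num_groups sub-texts, keeping word order."""
    groups = [[] for _ in range(num_groups)]
    for w in words:
        groups[word_group(w, num_groups, hash_name)].append(w)
    return groups


def majority_vote(predictions):
    """Return the most common label; ties go to the smaller label (assumption)."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


tokens = "the movie was surprisingly good".split()
sub_texts = split_text(tokens, num_groups=3)
print(sub_texts)

# Hypothetical per-group classifier outputs for the three sub-texts.
print(majority_vote([1, 0, 1]))
```

Because the hash is deterministic, a given word always lands in the same group, so a trigger word can only affect the sub-texts of the groups it falls into.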

Certified evaluation

Passing --attack noise selects the certified evaluation setting.

Here are example commands that calculate certified accuracy using 3 groups under the mixed-label attack setting (p=0.1):

python train_cls.py --save_folder <exp_name> --attack noise --group 3 --target_word empty --setting mix --poison_rate 0.1 --sort --tokenize nltk
python train_cls.py --save_folder <exp_name> --attack noise --group 3 --target_word empty --data hsol --setting mix --poison_rate 0.1 --sort --tokenize nltk
python train_cls.py --save_folder <exp_name> --attack noise --group 3 --target_word empty --data agnews --num_class 4 --batchsize 32 --setting mix --poison_rate 0.1 --sort --tokenize nltk
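If you want to generate the three commands above programmatically, a small helper like the following may be convenient. It is only a sketch: the experiment names (cert_default and so on) and the helper certified_command are hypothetical, and the flags are copied verbatim from the commands above. It prints the commands rather than running them.

```python
import shlex

# Flags shared by the three certified-evaluation commands above.
COMMON = ["--attack", "noise", "--group", "3", "--target_word", "empty"]
TAIL = ["--setting", "mix", "--poison_rate", "0.1", "--sort", "--tokenize", "nltk"]

# Per-run extra flags, copied from the commands above. The "default" entry
# corresponds to the first command, which passes no --data flag.
RUNS = {
    "default": [],
    "hsol": ["--data", "hsol"],
    "agnews": ["--data", "agnews", "--num_class", "4", "--batchsize", "32"],
}


def certified_command(exp_name, extra):
    """Build the argument list for one certified-evaluation run."""
    return ["python", "train_cls.py", "--save_folder", exp_name, *COMMON, *extra, *TAIL]


for name, extra in RUNS.items():
    # Experiment names are placeholders; substitute your own <exp_name>.
    print(shlex.join(certified_command(f"cert_{name}", extra)))
```

Each argument list can be passed to subprocess.run(cmd, check=True) to launch the run from Python instead of printing it.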

Empirical evaluation

When the parameter --attack is set to badnets, addsent, synbkd or stylebkd, we evaluate our methods under the empirical attack setting.

Here are example commands for empirical evaluations under the mixed-label BadWord attack setting (p=0.1):

python train_cls.py --save_folder <exp_name> --attack badnets --group 9 --setting mix --poison_rate 0.1 --tokenize nltk --not_split --hash ki --target_word empty --ki_t 20
python train_cls.py --save_folder <exp_name> --attack badnets --group 7 --setting mix --poison_rate 0.1 --tokenize nltk --not_split --hash ki --target_word empty --data hsol --ki_t 20
python train_cls.py --save_folder <exp_name> --attack badnets --group 9 --setting mix --poison_rate 0.1 --tokenize nltk --not_split --hash ki --target_word empty --data agnews --num_class 4 --batchsize 32