Red Teaming Language Model Detectors with Language Models
In this work, we investigate the robustness and reliability of LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt.
More details can be found in our paper:
Zhouxing Shi*, Yihan Wang*, Fan Yin*, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh. Red Teaming Language Model Detectors with Language Models. To appear in TACL. (*Alphabetical order.)
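To give a flavor of the first strategy, below is a minimal sketch of how an auxiliary LLM might be prompted for a context-aware synonym; the prompt wording and the `query_llm` helper are hypothetical and not part of this codebase.

```python
# Hypothetical sketch of the word-substitution strategy, not the actual implementation.
# `query_llm` stands in for any auxiliary LLM (e.g., ChatGPT or LLaMA).
def propose_replacement(query_llm, sentence, target_word):
    prompt = (
        f'In the sentence: "{sentence}"\n'
        f'suggest a single synonym for the word "{target_word}" that fits the context. '
        "Reply with only the word."
    )
    candidate = query_llm(prompt).strip()
    # Invalid candidates (multi-word replies, the original word, etc.) are filtered out.
    if " " in candidate or candidate.lower() == target_word.lower():
        return target_word
    return candidate
```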
Setup
Install Python dependencies:
pip install -r requirements.txt
If you want to use the LLaMA model in experiments, you need to download the model weights yourself and convert them into the Hugging Face format (see instructions here).
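For reference, one common way to do the conversion is the script that ships with the transformers library (the paths and model size below are placeholders, not paths from this repo):
python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/llama_weights --model_size 65B --output_dir /path/to/hf_models/65B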
Attack with Word Substitutions
Attack against Watermark Detectors
Enter the watermarking directory with `cd lm_watermarking`.
The code is developed on top of the codebase of the original watermarking paper.
python demo_watermark.py --attack_method llama_replacement --num_examples 100 --dataset eli5 --gamma 0.5 --test_ratio 0.15 --max_new_tokens 100 --delta 1.5 --replacement_checkpoint_path /home/data/llama/hf_models/65B/ --replacement_tokenizer_path /home/data/llama/hf_models/65B/ --num_replacement_retry 1 --valid_factor 1.5 --model_name_or_path gpt2-xl
- `attack_method`: `llama_replacement` uses a LLaMA model with watermarking hyper-parameters `gamma` and `delta` to generate word replacement candidates; `GPT_replacement` queries the ChatGPT API to generate word replacement candidates.
- `num_examples`: number of examples in the evaluation.
- `dataset`: dataset used in the evaluation, chosen from ['eli5', 'xsum'].
- `gamma`, `delta`: watermarking hyperparameters controlling the watermarking strength.
- `test_ratio`: approximate final ratio of replaced tokens in the word replacement attack.
- `max_new_tokens`: max number of tokens in generation.
- `replacement_checkpoint_path`, `replacement_tokenizer_path`: path of the model checkpoint used to generate word replacement candidates.
- `num_replacement_retry`: some word replacements generated by the replacement model can be invalid and filtered out, so `num_replacement_retry` can be set to retry the generation when there is randomness in the generation process. In all of our experiments in the paper, we use `num_replacement_retry=1` since we use greedy decoding by default with no randomness.
- `valid_factor`: we pick `test_ratio * valid_factor` tokens to generate their word replacements, as only approximately `1/valid_factor` of the replacements generated by the replacement model are valid. We use `valid_factor=1.5` for our LLaMA-65B model (see the sketch after this list).
- `model_name_or_path`: path (if local) or name (if on the Hugging Face Hub) of the generative model used to generate the watermarked outputs for the given datasets.
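As a rough illustration of how `test_ratio` and `valid_factor` interact (a sketch of the arithmetic only, not the exact code in `demo_watermark.py`):

```python
# Sketch: how many tokens to send to the replacement model so that, after invalid
# candidates are filtered out, roughly test_ratio of the tokens end up replaced.
def num_candidates_to_query(num_tokens, test_ratio, valid_factor):
    # Query test_ratio * valid_factor of the tokens, expecting only ~1/valid_factor
    # of the generated replacements to pass the validity filter.
    return int(num_tokens * test_ratio * valid_factor)

# Example: 100 generated tokens, test_ratio=0.15, valid_factor=1.5
# -> query about 22 tokens, expecting roughly 15 valid replacements.
print(num_candidates_to_query(100, 0.15, 1.5))
```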
Attack against DetectGPT
Enter the DetectGPT directory with `cd DetectGPT`.
Code structure and options
Our attackers are in the file `attackers.py`, where we implement the baseline DIPPER paraphraser as well as the query-free (random) and query-based (genetic) attackers from this paper.
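For intuition, here is a hedged sketch of what a query-based (genetic-style) substitution attack against a detector can look like; `detector_score` and `mutate` are placeholders, and this is not the exact logic in `attackers.py`.

```python
import random

# Hypothetical sketch of a query-based (genetic-style) attack, not the actual implementation.
# `detector_score` maps text to a "machine-generated" score, and `mutate` applies one
# random word substitution proposed by the red-teaming LLM.
def query_based_attack(text, detector_score, mutate, n_iters=50, pop_size=8):
    population = [text]
    for _ in range(n_iters):
        parent = random.choice(population)
        children = [mutate(parent) for _ in range(pop_size)]
        # Keep the variants the detector is least confident about.
        population = sorted(population + children, key=detector_score)[:pop_size]
    return population[0]  # variant with the lowest detector score
```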
To run the attack, turn on the `--attack` argument, and set up the attacker with `--paraphrase` for the baseline, or `--attack_method genetic` / `--attack_method random` for the attackers in this paper.
The red-teaming model can be either ChatGPT or LLaMA, selected with the `--attack_model chatgpt` or `--attack_model llama` argument.
The default model used to generate sampled texts is GPT-2; switch to ChatGPT with `--chatgpt`.
Run the code
See `cross.sh`. Results are written to `results_gpt2` by default.
Attack with Instructional Prompts
The attack with instructional prompts was tested with ChatGPT (gpt-3.5-turbo) as the generative model and the OpenAI AI Text Classifier as the detector. However, the OpenAI AI Text Classifier is no longer accessible as of July 20, 2023.
Search for an instructional prompt
Run:
python prompt_attack.py --output_dir OUTPUT_DIR_XSUM --data xsum
python prompt_attack.py --output_dir OUTPUT_DIR_ELI5 --data eli5
To learn all the available arguments, run `python prompt_attack.py --help` or check `prompt_attack.py`.
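Conceptually, the search scores candidate instructional prompts by how often the resulting generations evade the detector; the sketch below conveys that idea with placeholder `generate_with_prompt` and `detector_flags_as_ai` helpers, and it is not the exact procedure in `prompt_attack.py`.

```python
# Hypothetical sketch of how a candidate instructional prompt could be scored:
# the fraction of generations that the detector fails to flag as AI-written.
# `generate_with_prompt` and `detector_flags_as_ai` are placeholders.
def evasion_rate(instruction, questions, generate_with_prompt, detector_flags_as_ai):
    evaded = 0
    for question in questions:
        answer = generate_with_prompt(instruction, question)  # e.g., ChatGPT with the candidate prompt
        if not detector_flags_as_ai(answer):
            evaded += 1
    return evaded / len(questions)

# The search would keep the candidate instruction with the highest evasion rate.
```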
Inference and evaluation
Run:
python prompt_attack.py --infer --data xsum \
--load OUTPUT_DIR_XSUM --output_dir OUTPUT_DIR_INFER_XSUM
python prompt_attack.py --infer --data eli5 \
--load OUTPUT_DIR_ELI5 --output_dir OUTPUT_DIR_INFER_ELI5
References
- https://github.com/jwkirchenbauer/lm-watermarking
- https://github.com/eric-mitchell/detect-gpt
- https://github.com/uclanlp/ProbeGrammarRobustness
Disclaimer
Our open-source code is intended for academic research only and should not be used for malicious purposes.