Awesome

This repo contains the data and code (minimal demo) for our paper (accepted at ICML 2024): PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Paper: https://arxiv.org/abs/2405.07932

Blogpost:

PARDEN_data/ contains the benign and harmful datasets genearted by different models. The generation of harmful dataset is explained in detail in the paper, using both GCG[1] and prompt injection.

PARDEN_notebook_minimal.ipynb demonstrates how to use PARDEN and tests its performance on the harmful strings generated by llama2-7b.

[1] Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. https://arxiv.org/abs/2307.15043