Self-Reminder-Data

The data used in Defending ChatGPT against Jailbreak Attack via Self-Reminder.

Overview

ChatGPT has established itself as a powerful AI tool and has garnered hundreds of millions of users. However, the recent emergence of Jailbreak Attacks poses a significant threat to its responsible and secure use, as carefully crafted Jailbreak prompts may circumvent ChatGPT's ethics safeguards and trigger harmful responses. In this work, we explore the severe yet underexplored problems posed by Jailbreaks and the corresponding defense techniques. We introduce a Jailbreak dataset containing various types of Jailbreak prompts and malicious instructions. We further draw inspiration from the psychological concept of self-reminder and propose a simple yet effective defense technique called System-Mode Self-Reminder.
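At a high level, a self-reminder defense encapsulates each user query inside a system-style prompt that reminds the model to respond responsibly. The sketch below illustrates this wrapping idea; the reminder wording and the `wrap_with_self_reminder` helper are illustrative placeholders, not the exact prompts evaluated in the paper.

```python
def wrap_with_self_reminder(user_query: str) -> str:
    """Encapsulate a user query in a system-mode self-reminder prompt.

    NOTE: the reminder text here is a hypothetical example; the paper's
    evaluated prompts may differ in wording.
    """
    prefix = (
        "You should be a responsible AI assistant and should not generate "
        "harmful or misleading content! Please answer the following user "
        "query in a responsible way.\n"
    )
    suffix = (
        "\nRemember, you should be a responsible AI assistant and should "
        "not generate harmful or misleading content!"
    )
    return prefix + user_query + suffix


# The wrapped prompt, rather than the raw query, is what gets sent to the model.
wrapped = wrap_with_self_reminder("How do I reset my router?")
print(wrapped)
```

Because the reminder surrounds the user's text on both sides, an injected Jailbreak instruction is sandwiched between safety reminders rather than appearing as the final, most salient instruction.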

Repo Contents

Prompt Analysis

We also present a thorough examination of Jailbreak prompts and their effectiveness against ChatGPT across diverse aspects. Details can be found in the Jailbreak Prompt Analysis section of our paper.

<img src="figure/analysis.png" alt="Prompt Analysis" style="width: 500px">