Awesome

Attack-Bard

News

[2023/10/14] We have updated the results on GPT-4V. The attack success rate is 45%!.

Introduction

Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, due to the unsolved adversarial robustness problem of vision models, MLLMs can have more severe safety and security risks by introducing the vision inputs. In this work, we study the adversarial robustness of Google's Bard, a competitive chatbot to ChatGPT that released its multimodal capability recently, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard to output wrong image descriptions with a 22% success rate based solely on the transferability. We show that the adversarial examples can also attack other MLLMs, e.g., 26% attack success rate against Bing Chat and 86% attack success rate against ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding on the robustness of MLLMs and facilitate future research on defenses.

Getting Started

Installation

The installation of this project is extremely easy. You only need to:

Configurate the environment, vicuna weights, following the instruction in https://github.com/Vision-CAIR/MiniGPT-4

and run the following codes

Image embedding attack against Bard's image description. You can also use this code to attack NSFW detectors by changing the training data.

CUDA_VISIBLE_DEVICES=0,1,2 attack_img_encoder_misdescription.py

Text description attack against Bard's image description.

CUDA_VISIBLE_DEVICES=0 attack_vlm_misclassify.py

We also provide adversarial examples crafted by image embedding attack in ssa-cwa-200. You can try them on other models.

Results

Attack success rate of different methods against Bard's image description.

	Attack Success Rate	Rejection Rate
No Attack	0%	1%
Image Embedding Attack	22%	5%
Text Description Attack	10%	1%

We achieve 36% attack success rate against Bard's toxic detector.
Attack Success Rate against Different Models

	Attack Success Rate
GPT-4	45%
Bing Chat	26%
ERNIE Bot	86%

Demos on GPT-4

Demos on Google's Bard

Demos on Bard's toxic detector

Demos on Bard's face detector

Demos on ERNIE Bot

Demos on Bing Chat

Acknowledgement

If you're using our codes or algorithms in your research or applications, please cite using this BibTeX:

@article{dong2023robust,
  title={How Robust is Google's Bard to Adversarial Image Attacks?},
  author={Dong, Yinpeng and Chen, Huanran and Chen, Jiawei and Fang, Zhengwei and Yang, Xiao and Zhang, Yichi and Tian, Yu and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2309.11751},
  year={2023}
}

Our code is implemented based on MiniGPT4 and AdversarialAttacks. Thanks them for supporting!