Awesome
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
This is the code for Image Hijacks: Adversarial Images can Control Generative Models at Runtime.
Setup
The code can be run under any environment with Python 3.9 and above.
We use poetry for dependency management, which can be installed following the instructions here.
To build a virtual environment with the required packages, simply run
poetry install
Notes
- On some systems you may need to set the environment variable
PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
to avoid keyring-based errors. - This codebase stores large files (e.g. cached models, data) in the
data/
directory; you may wish to symlink this to an appropriate location for storing such files.
Training
The images used in our demo were trained using the config in experiments/exp_results_tables/config.py
(specifically runs #1 llava1_att_leak.pat_full.eps_8.lr_3e-2
and #5 llava1_att_spec.pat_full.eps_8.lr_3e-2
).
To train these images, first download the relevant LLaVA checkpoint:
poetry run python download.py models llava-v1.3-13b-336px
To get the list of jobs (with their job IDs) specified by this config file:
poetry run python experiments/exp_demo_imgs/config.py
To run job ID N
without wandb logging:
poetry run python run.py train \
--config_path experiments/exp_demo_imgs/config.py \
--log_dir experiments/exp_demo_imgs/logs \
--job_id N \
--playground
To run job ID N
with wandb logging to YOUR_WANDB_ENTITY/YOUR_WANDB_PROJECT
:
poetry run python run.py train \
--config_path experiments/exp_results_tables/config.py \
--log_dir experiments/exp_results_tables/logs \
--job_id N \
--wandb_entity YOUR_WANDB_ENTITY \
--wandb_project YOUR_WANDB_PROJECT \
--no-playground
Notes:
- In order to run jailbreak experiments (configurations coming soon), you must store your OpenAI API key in the
OPENAI_API_KEY
environment variable.
Tests
This codebase advocates for expect tests in machine learning, and as such uses @ezyang's expecttest library for unit and regression tests.
To run tests,
poetry run python download.py models blip2-flan-t5-xl
poetry run pytest .
Citation
To cite our work, you can use the following BibTeX entry:
@misc{bailey2023image,
title={Image Hijacks: Adversarial Images can Control Generative Models at Runtime},
author={Luke Bailey and Euan Ong and Stuart Russell and Scott Emmons},
year={2023},
eprint={2309.00236},
archivePrefix={arXiv},
primaryClass={cs.LG}
}