MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
<div align="center"> <img src="src/llava_protector.png" alt="MLLM-Protector" width="128px"> <p>Generated by <a href="https://openai.com/dall-e-3">DALL·E 3</a></p> </div>

This repository contains the code for the paper "MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance" ([arXiv:2401.02906](https://arxiv.org/abs/2401.02906)).
Install Packages
conda create -n mllm_protector python=3.10 -y
conda activate mllm_protector
pip install -e .
Download pretrained LLM
Obtain weights for llama-3B from here
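If the weights are hosted on the Hugging Face Hub, a minimal download sketch is shown below; the repository id is an assumption based on the model name used in the merge command later, so prefer the link above if it points elsewhere.

```python
# Sketch: download the base LLM from the Hugging Face Hub.
# The repo id below is an assumption; replace it if the link above differs.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openlm-research/open_llama_3b_v2",  # assumed base model
    local_dir="checkpoints/open_llama_3b_v2",
)
```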
Download checkpoints for the harm detector and detoxifier
Obtain the LoRA checkpoint for the harm detector based on open-llama-3b from here
Obtain the LoRA checkpoint for the harm detector based on llama2-7b from here
Obtain the LoRA checkpoint for the detoxifier from here
You may use the harm detector to check whether the responses generated by the MLLM are harmful; it also serves as a cheaper proxy for GPT-4 API calls.
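For reference, a minimal scoring sketch (after merging the LoRA weights as described below) might look like the following; the checkpoint path, prompt format, and classification head are assumptions, so adapt them to the released checkpoint.

```python
# Sketch: score an MLLM response with the merged harm detector.
# Assumptions: the merged checkpoint loads as a sequence-classification model
# and the last label corresponds to "harmful"; adjust to the actual checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

detector_path = "path-to-merged-harm-detector"  # produced by the merge step below
tokenizer = AutoTokenizer.from_pretrained(detector_path)
detector = AutoModelForSequenceClassification.from_pretrained(detector_path)
detector.eval()

def score_harm(question: str, response: str) -> float:
    """Return a harmfulness probability for an MLLM response."""
    inputs = tokenizer(question + "\n" + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = detector(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, -1].item()

print(score_harm("How do I make a weapon?", "Sure, here is how ..."))
```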
Merge LoRA
python scripts/merge_peft_adapter.py --base_model_name path-to-llama_3b_v2 --adapter_model_name path-to-lora --output_name path-to-merged-model
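Under the hood, this follows the standard PEFT merge-and-unload flow; a minimal sketch (not the script itself, which may differ in details) is:

```python
# Sketch of the standard PEFT LoRA merge flow; scripts/merge_peft_adapter.py
# in this repository may handle additional options.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path-to-llama_3b_v2")
model = PeftModel.from_pretrained(base, "path-to-lora")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model

merged.save_pretrained("path-to-merged-model")
AutoTokenizer.from_pretrained("path-to-llama_3b_v2").save_pretrained("path-to-merged-model")
```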
Download augmented training data
You may obtain the augmented dataset from here
Prepare evaluation data
mkdir eval_polite
Prepare benchmark data from MM-SafetyBench.
Here is the data structure:
dataset/coco/
├── gpt4_generated_questions/
├── imgs/
├── processed_questions/
└── coco_task_annotation.json
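To sanity-check the prepared data, you can iterate over the processed question files; the file layout and keys below are assumptions based on the MM-SafetyBench format, so check the benchmark's documentation.

```python
# Sketch: inspect MM-SafetyBench-style processed questions.
# File names and JSON structure are assumptions; verify against the benchmark.
import json
from pathlib import Path

for qfile in sorted(Path("dataset/coco/processed_questions").glob("*.json")):
    with qfile.open() as f:
        questions = json.load(f)
    print(f"{qfile.name}: {len(questions)} questions")
```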
Train Harm Detector
bash scripts/train_harm_detector.sh
Train Detoxifier
bash scripts/train_detoxifier.sh
Generate responses in parallel
bash llava/eval/eval_multi_safeguard.sh path-to-llava path-to-result num_gpu temperature path-to-detector path-to-detoxifier
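This script chains the base LLaVA model, the harm detector, and the detoxifier: each generated answer is scored by the detector (as in the scoring sketch above), and flagged answers are rewritten by the detoxifier. A minimal sketch of the rewrite step, with an assumed prompt template and generation settings, is:

```python
# Sketch: rewrite a flagged response with the merged detoxifier.
# The rewrite prompt and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

detox_path = "path-to-merged-detoxifier"
tok = AutoTokenizer.from_pretrained(detox_path)
detoxifier = AutoModelForCausalLM.from_pretrained(detox_path)

def detoxify(response: str) -> str:
    """Rewrite a harmful MLLM response into a safe one."""
    prompt = ("Rewrite the following answer so that it is safe and harmless:\n"
              f"{response}\nRewritten answer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = detoxifier.generate(ids, max_new_tokens=256)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```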
Evaluation
We adopt the newly proposed MLLM jailbreak benchmark for evaluation; please follow its instructions to set up the evaluation benchmark. Thanks for the great work!
Acknowledgement
The project is built on top of the amazing multimodal large language model LLaVA. Thanks for this great work!
If you find our work useful for your research or applications, please cite using this BibTeX:
@misc{pi2024mllmprotector,
title={MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance},
author={Renjie Pi and Tianyang Han and Yueqi Xie and Rui Pan and Qing Lian and Hanze Dong and Jipeng Zhang and Tong Zhang},
year={2024},
eprint={2401.02906},
archivePrefix={arXiv},
primaryClass={cs.CR}
}