MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
<div align="center"> <img src="src/llava_protector.png" alt="MLLM-Protector" width="128px"> <p>Generated by <a href="https://openai.com/dall-e-3">DALL·E 3</a></p> </div>

This repository contains the code for the paper "MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance" ([arXiv:2401.02906](https://arxiv.org/abs/2401.02906)).
Install Packages
conda create -n mllm_protector python=3.10 -y
conda activate mllm_protector
pip install -e .
Download pretrained LLM
Obtain weights for llama-3B from here
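If the weights are hosted on the Hugging Face Hub, a minimal download sketch is shown below; the repository id is an assumption based on the model name used in the merge command later, so prefer the link above if it points elsewhere.

```python
# Sketch: download the base LLM from the Hugging Face Hub.
# The repo id below is an assumption; replace it if the link above differs.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openlm-research/open_llama_3b_v2",  # assumed base model
    local_dir="checkpoints/open_llama_3b_v2",
)
```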
Download checkpoints for the harm detector and detoxifier
Obtain the LoRA checkpoint for the harm detector based on open-llama-3b from here
Obtain the LoRA checkpoint for the harm detector based on llama2-7b from here
Obtain the LoRA checkpoint for the detoxifier from here
You may use the harm detector to check whether the responses generated by the MLLM are harmful; it also serves as a cheaper proxy for GPT-4 API calls.
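For reference, a minimal scoring sketch (after merging the LoRA weights as described below) might look like the following; the checkpoint path, prompt format, and classification head are assumptions, so adapt them to the released checkpoint.

```python
# Sketch: score an MLLM response with the merged harm detector.
# Assumptions: the merged checkpoint loads as a sequence-classification model
# and the last label corresponds to "harmful"; adjust to the actual checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

detector_path = "path-to-merged-harm-detector"  # produced by the merge step below
tokenizer = AutoTokenizer.from_pretrained(detector_path)
detector = AutoModelForSequenceClassification.from_pretrained(detector_path)
detector.eval()

def score_harm(question: str, response: str) -> float:
    """Return a harmfulness probability for an MLLM response."""
    inputs = tokenizer(question + "\n" + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = detector(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, -1].item()

print(score_harm("How do I make a weapon?", "Sure, here is how ..."))
```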
Merge LoRA
python scripts/merge_peft_adapter.py --base_model_name path-to-llama_3b_v2 --adapter_model_name path-to-lora --output_name path-to-merged-model
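Under the hood, this follows the standard PEFT merge-and-unload flow; a minimal sketch (not the script itself, which may differ in details) is:

```python
# Sketch of the standard PEFT LoRA merge flow; scripts/merge_peft_adapter.py
# in this repository may handle additional options.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path-to-llama_3b_v2")
model = PeftModel.from_pretrained(base, "path-to-lora")
merged = model.merge_and_unload()  # fold the LoRA weights into the base model

merged.save_pretrained("path-to-merged-model")
AutoTokenizer.from_pretrained("path-to-llama_3b_v2").save_pretrained("path-to-merged-model")
```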
Download augmented training data
You may obtain the augmented dataset from here
Prepare evaluation data
mkdir eval_polite
Prepare benchmark data from MM-SafetyBench.
Here is the data structure:
dataset/coco/
├── gpt4_generated_questions/
├── imgs/
├── processed_questions/
└── coco_task_annotation.json
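To sanity-check the prepared data, you can iterate over the processed question files; the file layout and keys below are assumptions based on the MM-SafetyBench format, so check the benchmark's documentation.

```python
# Sketch: inspect MM-SafetyBench-style processed questions.
# File names and JSON structure are assumptions; verify against the benchmark.
import json
from pathlib import Path

for qfile in sorted(Path("dataset/coco/processed_questions").glob("*.json")):
    with qfile.open() as f:
        questions = json.load(f)
    print(f"{qfile.name}: {len(questions)} questions")
```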
Train Harm Detector
bash scripts/train_harm_detector.sh
Train Detoxifier
bash scripts/train_detoxifier.sh
Generate responses in parallel
bash llava/eval/eval_multi_safeguard.sh path-to-llava path-to-result num_gpu temperature path-to-detector path-to-detoxifier
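This script chains the base LLaVA model, the harm detector, and the detoxifier: each generated answer is scored by the detector (as in the scoring sketch above), and flagged answers are rewritten by the detoxifier. A minimal sketch of the rewrite step, with an assumed prompt template and generation settings, is:

```python
# Sketch: rewrite a flagged response with the merged detoxifier.
# The rewrite prompt and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

detox_path = "path-to-merged-detoxifier"
tok = AutoTokenizer.from_pretrained(detox_path)
detoxifier = AutoModelForCausalLM.from_pretrained(detox_path)

def detoxify(response: str) -> str:
    """Rewrite a harmful MLLM response into a safe one."""
    prompt = ("Rewrite the following answer so that it is safe and harmless:\n"
              f"{response}\nRewritten answer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = detoxifier.generate(ids, max_new_tokens=256)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```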
Evaluation
We adopt the newly proposed MLLM jailbreak benchmark for evaluation; please follow its instructions to set up the evaluation benchmark. Thanks for the great work!
Acknowledgement
The project is built on top of the amazing multimodal large language model LLaVA. Thanks for this great work!
If you find our work useful for your research or applications, please cite using this BibTeX:
@misc{pi2024mllmprotector,
title={MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance},
author={Renjie Pi and Tianyang Han and Yueqi Xie and Rui Pan and Qing Lian and Hanze Dong and Jipeng Zhang and Tong Zhang},
year={2024},
eprint={2401.02906},
archivePrefix={arXiv},
primaryClass={cs.CR}
}