
<div align="center"> <img src="docs/structure/background.png" alt="background" style="width: 90%;"> </div> <div align="center" style="font-size: 16px;"> 🌐 <a href="https://multi-trust.github.io/">Project Page</a> &nbsp;&nbsp; 📖 <a href="https://arxiv.org/abs/2406.07057">arXiv Paper</a> &nbsp;&nbsp; 📜 <a href="https://thu-ml.github.io/MMTrustEval/">Documentation</a> &nbsp;&nbsp; 📊 <a href="https://docs.google.com/forms/d/e/1FAIpQLSd9ZXKXzqszUoLhRT5fD9ggsSZtbmYNKgFPVekSaseYU69a_Q/viewform?usp=sf_link">Dataset</a> &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/datasets/thu-ml/MultiTrust">Hugging Face</a> &nbsp;&nbsp; 🏆 <a href="https://multi-trust.github.io/#leaderboard">Leaderboard</a> </div> <br> <div align="center"> <img src="https://img.shields.io/badge/Benchmark-Truthfulness-yellow" alt="Truthfulness" /> <img src="https://img.shields.io/badge/Benchmark-Safety-red" alt="Safety" /> <img src="https://img.shields.io/badge/Benchmark-Robustness-blue" alt="Robustness" /> <img src="https://img.shields.io/badge/Benchmark-Fairness-orange" alt="Fairness" /> <img src="https://img.shields.io/badge/Benchmark-Privacy-green" alt="Privacy" /> </div> <br>

MultiTrust is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks to expose new trustworthiness challenges.

<div align="center"> <img src="docs/structure/framework.jpg" alt="framework" style="width: 90%;"> </div>

🚀 News

πŸ› οΈ Installation

The environment in this version has been updated to support more recent models. To replicate the experimental results in the paper more precisely, switch to the v0.1.0 branch.
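
For a fresh environment, a setup along the following lines should work. This is a minimal sketch: the repository URL is inferred from the project's GitHub organization, and the environment name and `env/requirements.txt` path are assumptions (the `env/` directory is where `mkdocs.yml` lives, see the Docs section below).

```bash
# Minimal setup sketch -- repo URL, env name, and requirements path are assumptions.
git clone https://github.com/thu-ml/MMTrustEval.git
cd MMTrustEval
conda create -n multitrust python=3.9 -y
conda activate multitrust
pip install -r env/requirements.txt

# Optional: switch to v0.1.0 to replicate the paper's results more precisely.
git checkout v0.1.0
```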

:envelope: Dataset

License

Data Preparation

Refer here for detailed instructions.

📚 Docs

Our documentation presents interface definitions for the different modules and tutorials on how to extend them. It is hosted online at: https://thu-ml.github.io/MMTrustEval/

Run the following command to serve the docs locally (they will be available at http://0.0.0.0:8000):

```bash
mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000
```

📈 Reproduce Results in Our Paper

The scripts under scripts/run generate the model outputs for specific tasks, along with the corresponding primary evaluation results, in either a global or sample-wise manner.

📌 To Make Inference

```bash
# Description: Each run script requires a model_id to run its inference task.
# Usage: bash scripts/run/*/*.sh <model_id>

scripts/run
├── fairness_scripts
│   ├── f1-stereo-generation.sh
│   ├── f2-stereo-agreement.sh
│   ├── f3-stereo-classification.sh
│   ├── f3-stereo-topic-classification.sh
│   ├── f4-stereo-query.sh
│   ├── f5-vision-preference.sh
│   ├── f6-profession-pred.sh
│   └── f7-subjective-preference.sh
├── privacy_scripts
│   ├── p1-vispriv-recognition.sh
│   ├── p2-vqa-recognition-vispr.sh
│   ├── p3-infoflow.sh
│   ├── p4-pii-query.sh
│   ├── p5-visual-leakage.sh
│   └── p6-pii-leakage-in-conversation.sh
├── robustness_scripts
│   ├── r1-ood-artistic.sh
│   ├── r2-ood-sensor.sh
│   ├── r3-ood-text.sh
│   ├── r4-adversarial-untarget.sh
│   ├── r5-adversarial-target.sh
│   └── r6-adversarial-text.sh
├── safety_scripts
│   ├── s1-nsfw-image-description.sh
│   ├── s2-risk-identification.sh
│   ├── s3-toxic-content-generation.sh
│   ├── s4-typographic-jailbreaking.sh
│   ├── s5-multimodal-jailbreaking.sh
│   └── s6-crossmodal-jailbreaking.sh
└── truthfulness_scripts
    ├── t1-basic.sh
    ├── t2-advanced.sh
    ├── t3-instruction-enhancement.sh
    ├── t4-visual-assistance.sh
    ├── t5-text-misleading.sh
    ├── t6-visual-confusion.sh
    └── t7-visual-misleading.sh
```
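
For example, to run the stereotype-generation fairness task for a single model (the model_id below is a hypothetical placeholder; substitute one supported by the framework):

```bash
# "llava-v1.5-7b" is a placeholder model_id, not necessarily a supported one.
bash scripts/run/fairness_scripts/f1-stereo-generation.sh llava-v1.5-7b
```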

📌 To Evaluate Results

After inference, the scripts under scripts/score compute statistical results from the outputs and report the numbers presented in the paper.

```bash
# Description: Each score script requires a model_id to calculate statistical results.
# Usage: python scripts/score/*/*.py --model_id <model_id>

scripts/score
├── fairness
│   ├── f1-stereo-generation.py
│   ├── f2-stereo-agreement.py
│   ├── f3-stereo-classification.py
│   ├── f3-stereo-topic-classification.py
│   ├── f4-stereo-query.py
│   ├── f5-vision-preference.py
│   ├── f6-profession-pred.py
│   └── f7-subjective-preference.py
├── privacy
│   ├── p1-vispriv-recognition.py
│   ├── p2-vqa-recognition-vispr.py
│   ├── p3-infoflow.py
│   ├── p4-pii-query.py
│   ├── p5-visual-leakage.py
│   └── p6-pii-leakage-in-conversation.py
├── robustness
│   ├── r1-ood_artistic.py
│   ├── r2-ood_sensor.py
│   ├── r3-ood_text.py
│   ├── r4-adversarial_untarget.py
│   ├── r5-adversarial_target.py
│   └── r6-adversarial_text.py
├── safety
│   ├── s1-nsfw-image-description.py
│   ├── s2-risk-identification.py
│   ├── s3-toxic-content-generation.py
│   ├── s4-typographic-jailbreaking.py
│   ├── s5-multimodal-jailbreaking.py
│   └── s6-crossmodal-jailbreaking.py
└── truthfulness
    ├── t1-basic.py
    ├── t2-advanced.py
    ├── t3-instruction-enhancement.py
    ├── t4-visual-assistance.py
    ├── t5-text-misleading.py
    ├── t6-visual-confusion.py
    └── t7-visual-misleading.py
```
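
Continuing the example above, scoring the generated outputs for the same task looks like this (the model_id remains a placeholder):

```bash
# Score the outputs produced by the matching run script.
python scripts/score/fairness/f1-stereo-generation.py --model_id llava-v1.5-7b
```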

📌 Task List

All 32 tasks are listed below. Eval legend: ○ rule-based evaluation (e.g., keyword matching); ● automatic evaluation by GPT-4 or other classifiers; ◐ mixed evaluation.

| ID  | Task Name                                | Metrics                                        | Task Type | Eval |
|-----|------------------------------------------|------------------------------------------------|-----------|------|
| T.1 | Basic World Understanding                | Accuracy ($\uparrow$)                          | Dis.&Gen. | ◐    |
| T.2 | Advanced Cognitive Inference             | Accuracy ($\uparrow$)                          | Dis.      | ○    |
| T.3 | VQA under Instruction Enhancement        | Accuracy ($\uparrow$)                          | Gen.      | ●    |
| T.4 | QA under Visual Assistance               | Accuracy ($\uparrow$)                          | Gen.      | ●    |
| T.5 | Text Misleading VQA                      | Accuracy ($\uparrow$)                          | Gen.      | ●    |
| T.6 | Visual Confusion VQA                     | Accuracy ($\uparrow$)                          | Gen.      | ○    |
| T.7 | Visual Misleading QA                     | Accuracy ($\uparrow$)                          | Dis.      | ●    |
| S.1 | Risk Identification                      | Accuracy ($\uparrow$)                          | Dis.&Gen. | ◐    |
| S.2 | Image Description                        | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen.     | ●    |
| S.3 | Toxicity Content Generation              | Toxicity Score ($\downarrow$), RtA ($\uparrow$) | Gen.     | ◐    |
| S.4 | Plain Typographic Jailbreaking           | ASR ($\downarrow$), RtA ($\uparrow$)           | Gen.      | ◐    |
| S.5 | Optimized Multimodal Jailbreaking        | ASR ($\downarrow$), RtA ($\uparrow$)           | Gen.      | ◐    |
| S.6 | Cross-modal Influence on Jailbreaking    | ASR ($\downarrow$), RtA ($\uparrow$)           | Gen.      | ◐    |
| R.1 | VQA for Artistic Style Images            | Score ($\uparrow$)                             | Gen.      | ◐    |
| R.2 | VQA for Sensor Style Images              | Score ($\uparrow$)                             | Gen.      | ●    |
| R.3 | Sentiment Analysis for OOD Texts         | Accuracy ($\uparrow$)                          | Dis.      | ○    |
| R.4 | Image Captioning under Untargeted Attack | Accuracy ($\uparrow$)                          | Gen.      | ◐    |
| R.5 | Image Captioning under Targeted Attack   | Attack Success Rate ($\downarrow$)             | Gen.      | ◐    |
| R.6 | Textual Adversarial Attack               | Accuracy ($\uparrow$)                          | Dis.      | ○    |
| F.1 | Stereotype Content Detection             | Containing Rate ($\downarrow$)                 | Gen.      | ●    |
| F.2 | Agreement on Stereotypes                 | Agreement Percentage ($\downarrow$)            | Dis.      | ◐    |
| F.3 | Classification of Stereotypes            | Accuracy ($\uparrow$)                          | Dis.      | ○    |
| F.4 | Stereotype Query Test                    | RtA ($\uparrow$)                               | Gen.      | ◐    |
| F.5 | Preference Selection in VQA              | RtA ($\uparrow$)                               | Gen.      | ●    |
| F.6 | Profession Prediction                    | Pearson's Correlation ($\uparrow$)             | Gen.      | ◐    |
| F.7 | Preference Selection in QA               | RtA ($\uparrow$)                               | Gen.      | ●    |
| P.1 | Visual Privacy Recognition               | Accuracy, F1 ($\uparrow$)                      | Dis.      | ○    |
| P.2 | Privacy-sensitive QA Recognition         | Accuracy, F1 ($\uparrow$)                      | Dis.      | ○    |
| P.3 | InfoFlow Expectation                     | Pearson's Correlation ($\uparrow$)             | Gen.      | ○    |
| P.4 | PII Query with Visual Cues               | RtA ($\uparrow$)                               | Gen.      | ◐    |
| P.5 | Privacy Leakage in Vision                | RtA ($\uparrow$), Accuracy ($\uparrow$)        | Gen.      | ◐    |
| P.6 | PII Leakage in Conversations             | RtA ($\uparrow$)                               | Gen.      | ◐    |

βš›οΈ Overall Results

<div align="center"> <img src="docs/structure/overall.png" alt="result" style="width: 90%;"> </div>

:black_nib: Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@misc{zhang2024benchmarking,
      title={Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study},
      author={Yichi Zhang and Yao Huang and Yitong Sun and Chang Liu and Zhe Zhao and Zhengwei Fang and
              Yifan Wang and Huanran Chen and Xiao Yang and Xingxing Wei and Hang Su and Yinpeng Dong and
              Jun Zhu},
      year={2024},
      eprint={2406.07057},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```