<div align="center"> <img src="assets/memvrlogo.png" width="270px"> </div> <h2 align="center"> <a href="https://arxiv.org/abs/2410.03577">Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models</a></h2>
<h5 align="center"> If you like our project, please give us a star β on GitHub for the latest update.</h5> <h5 align=center> </h5>π£ News
- [2024/10/7] ✍️ Paper of MemVR uploaded. Please check out [this link](https://arxiv.org/abs/2410.03577) for details.
- [2024/10/7] 🚀 Code will be released soon. Welcome to watch 👀 this repository for the latest updates.
- [2024/10/23] 🎉 Source code released! We're now working on extending MemVR to more MLLMs.
## 🎯 Overview
We propose Memory-Space Visual Retracing (MemVR), a novel hallucination-mitigation paradigm that requires neither external knowledge retrieval nor additional fine-tuning. MemVR offers two significant advantages:
- First, MemVR significantly mitigates hallucination across various MLLMs and also excels on general benchmarks, underscoring its potential for widespread applicability.
- Second, MemVR is a plug-and-play solution that incurs no added time overhead.
## 🕹️ Usage
### Installation
- We recommend using LLaVA as the working environment. Clone the LLaVA repository and set up the environment by running:
```bash
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA
conda create -n memvr python==3.10
conda activate memvr
pip install --upgrade pip
pip install -e .
```
- After setting up, clone the MemVR repository and move all of its contents (except README.md) into the main LLaVA directory:
```
LLaVA/
├── llava/
│   ├── eval/          # merge here in the next step
│   └── .../
├── eval_scripts/
│   ├── llava/
│   ├── qwen/
│   └── glm/
├── memvr.py
├── inference.py
├── images/
│   └── ...
└── ...
```
Then merge the `eval` folder into the directory `LLaVA/llava/eval/`.
### Downloading Checkpoints
Under the main directory of LLaVA:
- Download the checkpoint of LLaVA v1.5 here.
- Download the checkpoint of Qwen-VL-Chat here. Replace the downloaded `modeling_qwen.py` with the `modeling_qwen.py` provided in this repository to enable MemVR on the Qwen-VL-Chat model.
- Download the checkpoint of glm-4v-9b here. Replace the downloaded `modeling_chatglm.py` with the `modeling_chatglm.py` provided in this repository to enable MemVR on the GLM-4V-9B model.
You can check that your environment is set up correctly by running:
```bash
python inference.py
```
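For a more explicit check, here is a minimal smoke-test sketch using LLaVA's standard loader. The commented `apply_memvr` call is a hypothetical placeholder for whatever hook `memvr.py` actually exposes, and the model path is only an example:

```python
# Minimal environment smoke test (illustrative sketch, not the repo's inference.py).
# Assumes the LLaVA package is installed and a LLaVA v1.5 checkpoint is reachable.
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "liuhaotian/llava-v1.5-7b"  # or a local checkpoint directory
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

# Hypothetical hook -- memvr.py's real entry point may differ:
# apply_memvr(model, retracing_ratio=0.12, entropy_threshold=0.75,
#             starting_layer=5, ending_layer=16)

print(f"Loaded {model_path}; context length = {context_len}")
```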
### Evaluation
Follow Evaluation.md in LLaVA to prepare the benchmark materials. We also recommend GPUs with at least 40 GB of VRAM. Run a benchmark with, for example:
```bash
bash eval_scripts/llava/mme.sh
```
Please note that you may need to fill in your own OpenAI API key for GPT-based evaluations such as LLaVA-Bench or MM-Vet.
Here are some tips on the parameters in the scripts:
```bash
--retracing-ratio 0.12 \
--entropy-threshold 0.75 \
--starting-layer 5 \
--ending-layer 16 \
```
where:
- `--retracing-ratio` is the fraction of visual tokens retraced at a given layer; it has a direct effect on the model's performance.
- `--entropy-threshold` defines the minimum layer-wise entropy that triggers visual information retracing.
- `--starting-layer` and `--ending-layer` set the range of layers in which visual information retracing is allowed, as illustrated in the sketch after this list.
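To make these knobs concrete, here is a minimal sketch of how entropy-triggered retracing could be wired together. It reflects our illustrative reading of the parameters above, not the actual MemVR implementation; in particular, `visual_feats` is assumed to already be projected to the hidden size, and `maybe_retrace` is a made-up helper name:

```python
import torch
import torch.nn.functional as F

def normalized_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy of the next-token distribution from an intermediate
    layer's early-exit logits (shape: [vocab_size]), normalized to [0, 1]."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * (probs + 1e-9).log()).sum()
    return (entropy / torch.log(torch.tensor(float(logits.numel())))).item()

def maybe_retrace(hidden: torch.Tensor,
                  visual_feats: torch.Tensor,
                  layer_idx: int,
                  logits: torch.Tensor,
                  retracing_ratio: float = 0.12,
                  entropy_threshold: float = 0.75,
                  starting_layer: int = 5,
                  ending_layer: int = 16) -> torch.Tensor:
    """Blend visual features back into one layer's hidden states when that
    layer's predictive entropy signals high uncertainty (a "second look")."""
    if starting_layer <= layer_idx <= ending_layer:
        if normalized_entropy(logits) > entropy_threshold:
            # Retrace: re-inject visual evidence, weighted by the ratio.
            hidden = (1.0 - retracing_ratio) * hidden + retracing_ratio * visual_feats
    return hidden
```

In this reading, a higher `--entropy-threshold` makes retracing rarer, while a larger `--retracing-ratio` lets the re-injected visual evidence reweight the hidden states more strongly.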
## 📊 Experiments
Figure 5. Results on MMBench. MemVR enhances comprehensive performance on diverse tasks.
## 🖼️ Examples
Figure 9. Visualization of uncertainty across layers without and with MemVR. MemVR effectively reduces uncertainty after the 8th layer, contributing to hallucination mitigation.
Figure 10. A case study in long text generation. MemVR effectively mitigates hallucinations.
## ✏️ Citation
If you find this paper useful, please consider starring ⭐ this repo and citing 📝 our paper:
```bibtex
@article{zou2024memvr,
  title={Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models},
  author={Xin Zou and Yizhou Wang and Yibo Yan and Sirui Huang and Kening Zheng and Junkai Chen and Chang Tang and Xuming Hu},
  journal={arXiv preprint arXiv:2410.03577},
  year={2024}
}
```
## 🔗 Related Projects
- OPERA: OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
- VCD: VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- DoLa: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
- Contrastive Decoding: Open-ended Text Generation as Optimization
- GLM-4V: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- LLaVA 1.5: Improved Baselines with Visual Instruction Tuning