Understanding Multimodal Hallucination with Parameter-Free Representation Alignment (Pfram)

arXiv: https://arxiv.org/abs/2409.01151

Introduction

The parameter-free representation alignment metric (Pfram) measures the alignment between a neural representation system and a human representation system without requiring any additional training parameters. This alignment reflects how well the neural representation system represents a certain aspect of images.

[Figure: scatter plots of Pfram vs. POPE accuracy]

As shown in the scatter plots above, when human-annotated object labels ($\mathcal{G}_{obj}$) are used as the human representation system to measure the object recognition ability of the visual encoders ($\mathcal{M}$) of multimodal large language models (MLLMs), Pfram shows a strong and robust correlation with MLLM object hallucination (POPE accuracy) across different similarity metrics ($\phi$) and image datasets (OIv7/AMBER).
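To make the idea concrete, here is a minimal illustrative sketch (not the exact implementation from the paper) that scores the alignment between two representation systems as the average overlap of their k-nearest-neighbor sets under cosine similarity; all file names and array shapes below are assumptions:

import numpy as np

def knn_overlap_alignment(feats_a, feats_b, k=25):
    """Illustrative parameter-free alignment score: the mean overlap of each
    image's k-nearest-neighbor sets under the two representation systems."""
    def knn_indices(feats):
        normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sims = normed @ normed.T              # cosine similarity between images
        np.fill_diagonal(sims, -np.inf)       # exclude each image itself
        return np.argsort(-sims, axis=1)[:, :k]

    nn_a, nn_b = knn_indices(feats_a), knn_indices(feats_b)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]))

# Example (hypothetical files): align model image embeddings with multi-hot
# object-label vectors for the same images.
# model_feats = np.load("model_embeddings.npy")       # (num_images, hidden_size)
# object_feats = np.load("object_label_vectors.npy")  # (num_images, num_classes)
# print(knn_overlap_alignment(model_feats, object_feats, k=25))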

For more details and usage of Pfram, please refer to our paper.

Installation

conda create -n pfram python=3.10
conda activate pfram
pip install -r requirements.txt

If you want to reproduce the results of LLaVA, Muffin and RLHF-V, clone their repositories and link their source code to utils/:

git clone https://github.com/haotian-liu/LLaVA.git
ln -s LLaVA/llava utils/

git clone https://github.com/thunlp/Muffin.git
ln -s Muffin/muffin utils/

Data Preparation

Prepare Data for Pfram

Note: "Pfram" in this document refers to Pfram(F, G_obj) in the paper, i.e., using object annotations as the ground-truth representation system.

We provide our preprocessed data for Pfram in output/. If you want to prepare the data by yourself, follow the steps below.

Prepare Data for POPE

Code related to POPE is in the POPE/ folder. See POPE/DATA.md.

Get Image Representations of Models

python -u visual_embed.py --get_embed_per_layer \
--image_base_folder OPENIMAGES_IMAGE_FOLDER \
--image_fnames output/oi/image_fname.json \
--model_name Salesforce/instructblip-vicuna-7b \
--output_folder output/oi/instructblip-vicuna-7b-visual_encoder/

The results (sims-layer_%d.npy, rank-layer_%d.npy) are stored in output/oi/instructblip-vicuna-7b-visual_encoder/. You can change --model_name and --output_folder as you want.

python -u visual_embed.py --add_llm --llm_layer 32 28 24 20 \
--image_base_folder OPENIMAGES_IMAGE_FOLDER \
--image_fnames output/oi/image_fname.json \
--model_name Salesforce/instructblip-vicuna-7b \
--output_folder output/oi/instructblip-vicuna-7b-llm/

The results (sims-layer_%d.npy, rank-layer_%d.npy) are stored in output/oi/instructblip-vicuna-7b-llm/. You can change --model_name and --output_folder as you want.

Note that the hidden size of the LLM is relatively large, so you may not have enough memory to store the image representations of all layers at once. For example, for LLaVA v1.5 13B, each layer's image representations take about 1600 (num images) * 5120 (hidden size) * 576 (patches per image) * 2 bytes (per fp16 number) = ~9 GB of memory. We therefore suggest computing at most 4 layers at a time by passing at most 4 layer numbers after --llm_layer.
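As a quick sanity check before choosing how many layers to pass to --llm_layer, you can estimate the per-layer memory yourself (an illustrative back-of-the-envelope calculation; plug in your own model and dataset sizes):

# Rough per-layer memory estimate for fp16 image representations (illustrative).
num_images = 1600        # images in the dataset
hidden_size = 5120       # LLM hidden size (e.g., LLaVA v1.5 13B)
patches_per_image = 576  # visual tokens per image
bytes_per_value = 2      # fp16

per_layer_gb = num_images * hidden_size * patches_per_image * bytes_per_value / 1024**3
print(f"~{per_layer_gb:.1f} GB per layer")  # ~8.8 GB, so 4 layers need ~35 GB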

Inference on POPE

Code related to POPE is in the POPE/ folder. See POPE/DATA.md.

Analyze the Results

Compute the Pfram metric from the rank and sim matrices

Read and modify stat/stat_result.py, setting METRIC to 'knn' or 'ndcg' to choose the similarity metric you want to use. Then run:

python stat/stat_result.py

Then check the results in stat/stat_result.json. For each model on each dataset, the results should look like this:

{
  "visual_encoder": {
    // "layer": {"k": result, "k": result}
    "4": {"25": 7.34, "50": 10.49, ... (results for other k)},
    "8": {"25": 9.3, "50": 12.91, ...},
    ... (results for other layers)
  },
  "llm": {
    // the same data format as "visual_encoder"
  }
}
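For example, assuming the real stat/stat_result.json follows the layout above with model names as top-level keys (as in the OI_OBJECT dictionary below), it can be loaded and queried like this; the model name, layer, and k are illustrative:

import json

# Load the Pfram scores written by stat/stat_result.py.
with open("stat/stat_result.json") as f:
    results = json.load(f)

# Illustrative access: Pfram of the visual encoder at layer 4 with k=25.
per_model = results["instructblip-vicuna-7b"]
print(per_model["visual_encoder"]["4"]["25"])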

Calculate the correlation between Pfram and POPE accuracy

OI_OBJECT = {   # (copied from `stat/stat_result.json`)
    "instructblip-vicuna-7b": {"visual_encoder": {...}, "llm": {...}},
    "instructblip-vicuna-13b":  {"visual_encoder": {...}, "llm": {...}},
    ... (other models)
}

OI_POPE_ACC = {
    'instructblip-vicuna-7b': [0.849, 0.709, 0.613],   # accuracy on the random, popular, and adversarial splits
    'instructblip-vicuna-13b': [0.759, 0.667, 0.569],
    ... (other models)
}
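A minimal sketch of this step, assuming the two dictionaries above and SciPy installed (the layer and k choices, and the use of Spearman correlation over mean POPE accuracy, are illustrative rather than the exact procedure from the paper):

from scipy.stats import spearmanr

# Illustrative: correlate Pfram (at one chosen layer and k) with mean POPE accuracy.
LAYER, K = "4", "25"  # hypothetical choices; pick the layer/k you want to analyze

models = list(OI_POPE_ACC.keys())
pfram_scores = [OI_OBJECT[m]["visual_encoder"][LAYER][K] for m in models]
pope_mean_acc = [sum(OI_POPE_ACC[m]) / len(OI_POPE_ACC[m]) for m in models]

rho, p_value = spearmanr(pfram_scores, pope_mean_acc)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")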

Citation

If you find this code useful in your research, please cite:

@misc{wang2024pfram,
      title={Understanding Multimodal Hallucination with Parameter-Free Representation Alignment},
      author={Yueqian Wang and Jianxin Liang and Yuxuan Wang and Huishuai Zhang and Dongyan Zhao},
      year={2024},
      eprint={2409.01151},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.01151},
}