Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
This repository is the official implementation of the paper Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning.
Pipeline of MVP
Requirements
To install requirements:
pip install -r requirements.txt
Datasets
Datasets are in MVP/benchmark. Before inference, you need to download the images into the MVP/data folder.
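Before downloading, it can help to inspect a question file to see which images are required. The sketch below assumes the standard POPE JSON Lines format, where each record carries question_id, image, text, and label fields; adjust the field names if your copy of the benchmark differs.

# Minimal sketch: inspect a POPE question file before downloading images.
# Field names (question_id, image, text, label) are assumed from the
# standard POPE format.
import json

with open("MVP/benchmark/POPE/coco/coco_pope_popular.json") as f:
    questions = [json.loads(line) for line in f]

print(f"{len(questions)} questions loaded")
print(questions[0])

# Collect the set of referenced images so you know what to place in MVP/data/coco.
images = {q["image"] for q in questions}
print(f"{len(images)} unique images required")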
Image Caption
In the MVP framework, we first need to caption the images. You can use the following command from caption.sh:
python caption/llava_caption.py \
--model-path liuhaotian/llava-v1.5-7b \
--image-folder MVP/data/coco \
--question-file MVP/benchmark/POPE/coco/coco_pope_popular.json \
--answers-file MVP/output/coco_pope_popular_caption_llava_bottom-up.jsonl \
--perspective bottom-up \
--temperature 0.7 \
--top_p 0.95 \
--max_new_tokens 512 \
--num_beams 1 --seed 336
This will create a file under the output folder that stores all the captions. Note that you need to run this command separately for each perspective (bottom-up, top-down, regular) via the --perspective parameter.

We have already prepared the caption files, so you can use them directly from the output folder.
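If you prefer to regenerate the captions yourself, a small driver loop over the three perspectives saves repeating the command by hand. This is a hypothetical convenience wrapper; the paths and flags simply mirror the command above.

# Minimal sketch: run the captioning command once per perspective.
# Paths and flags mirror the command above; adjust to your setup.
import subprocess

for perspective in ["bottom-up", "top-down", "regular"]:
    subprocess.run([
        "python", "caption/llava_caption.py",
        "--model-path", "liuhaotian/llava-v1.5-7b",
        "--image-folder", "MVP/data/coco",
        "--question-file", "MVP/benchmark/POPE/coco/coco_pope_popular.json",
        "--answers-file", f"MVP/output/coco_pope_popular_caption_llava_{perspective}.jsonl",
        "--perspective", perspective,
        "--temperature", "0.7",
        "--top_p", "0.95",
        "--max_new_tokens", "512",
        "--num_beams", "1",
        "--seed", "336",
    ], check=True)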
MVP
To employ MVP, you can use the following command in MVP_llava.sh:
#!/bin/bash

declare -a files=("MVP_llava.py")
declare -a perspectives=("bottom-up" "top-down" "regular")
declare -a question_files=("coco")
declare -a question_types=("popular")

for file in "${files[@]}"; do
  for perspective in "${perspectives[@]}"; do
    for dataset in "${question_files[@]}"; do
      for type in "${question_types[@]}"; do
        question_file="MVP/benchmark/POPE/${dataset}/${dataset}_pope_${type}.json"
        output_file="MVP/output/$(basename "$file" .py)_${perspective}_${dataset}_${type}_pope.jsonl"
        log_file="MVP/logs/$(basename "$file" .py)_${perspective}_${dataset}_${type}_pope.log"
        # Replace <partition> with your Slurm partition name.
        nohup srun -p <partition> -n1 -N1 --gres=gpu:1 --quotatype=reserved python "MVP/$file" \
          --model-path liuhaotian/llava-v1.5-7b \
          --image-folder "MVP/data/${dataset}" \
          --question-file "$question_file" \
          --perspective "$perspective" \
          --answers-file "$output_file" \
          --temperature 0.7 \
          --top_p 1.0 --topk 3 \
          --max_new_tokens 50 \
          --num_beams 1 --seed 336 \
          1>"$log_file" 2>&1 &
        sleep 3
      done
    done
  done
done
After that, you can obtain the result files in the output folder.
Important arguments
--perspective: the captioning perspective used during inference.
--topk: use the top-k reasoning paths (illustrated in the sketch below).
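To make --topk concrete, the sketch below aggregates k sampled reasoning paths into a single yes/no answer by confidence-weighted voting. This is an illustrative simplification, not the exact aggregation implemented in MVP_llava.py; the (answer, confidence) pairs are assumed inputs from k decoding paths.

# Illustrative sketch only: combine top-k reasoning paths for one yes/no
# question by confidence-weighted voting. The real aggregation in
# MVP_llava.py may differ.
def aggregate_paths(paths: list[tuple[str, float]]) -> str:
    scores = {"yes": 0.0, "no": 0.0}
    for answer, confidence in paths:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Example: three paths disagree; the higher-confidence answers win.
print(aggregate_paths([("yes", 0.9), ("no", 0.4), ("yes", 0.7)]))  # -> "yes"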
Evaluation
To evaluate the performance of MVP, you can use the following command in eval_pope.sh:
python eval/eval_pope.py \
--gt_files MVP/benchmark/POPE/coco/coco_pope_popular.json \
--gen_files_bottom_up MVP/output/MVP_llava_bottom-up_coco_popular_pope.jsonl \
--gen_files_top_down MVP/output/MVP_llava_top-down_coco_popular_pope.jsonl \
--gen_files_regular MVP/output/MVP_llava_regular_coco_popular_pope.jsonl \
--a 0.4 --b 0.4 --c 0.2
Important arguments
--a: the weight of the bottom-up path.
--b: the weight of the top-down path.
--c: the weight of the regular path (the fusion of the three weights is sketched below).
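Conceptually, the three weights fuse the per-perspective answers into one final prediction before scoring against the ground truth. The sketch below shows one plausible weighted vote for a yes/no question; the actual fusion in eval/eval_pope.py may differ.

# Minimal sketch of weighting the three perspective predictions
# (a + b + c should sum to 1); eval/eval_pope.py may fuse differently.
def fuse(bottom_up: str, top_down: str, regular: str,
         a: float = 0.4, b: float = 0.4, c: float = 0.2) -> str:
    score = 0.0
    for answer, weight in [(bottom_up, a), (top_down, b), (regular, c)]:
        score += weight if answer == "yes" else -weight
    return "yes" if score > 0 else "no"

print(fuse("yes", "no", "yes"))  # -> "yes" (weights 0.4 + 0.2 outvote 0.4)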
Experiment Results
MVP's performance on POPE:
MVP's performance on MME:
Case Study
How to cite
If you are interested in or inspired by this work, please cite us:
@misc{qu2024lookcomparedecidealleviating,
title={Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning},
author={Xiaoye Qu and Jiashuo Sun and Wei Wei and Yu Cheng},
year={2024},
eprint={2408.17150},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.17150},
}