
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou*, Chenhang Cui*, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, Huaxiu Yao

<div align="center"> *Equal Contribution </div> <div align="center"> <a href="https://arxiv.org/pdf/2310.00754.pdf"><img src="assets/Paper-Arxiv-orange.svg" ></a> </div>

News

Getting Started

Installation

1. Prepare the code and the environment

Git clone our repository, create a Python environment, and activate it via the following commands:

git clone https://github.com/YiyangZhou/LURE.git
cd LURE
conda env create -f environment.yml
conda activate LURE

2. Prepare the pretrained Vicuna weights

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Download the corresponding LLM weights from the following Hugging Face space by cloning the repository with git-lfs.

Vicuna V0 13B
Download

The final weights should be in a single folder with a structure similar to the following:

vicuna_weights
ā”œā”€ā”€ config.json
ā”œā”€ā”€ generation_config.json
ā”œā”€ā”€ pytorch_model.bin.index.json
ā”œā”€ā”€ pytorch_model-00001-of-00003.bin
...   

Then, set the path to the Vicuna weights in the model config file here at Line 16.
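For reference, in MiniGPT-4-style model configs this is usually the llama_model entry; a minimal sketch (the exact key name and line may differ in this repository):

model:
  arch: mini_gpt4
  # Line 16: point this to the folder containing the Vicuna v0 13B weights
  llama_model: "/path/to/vicuna_weights/"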

3. Prepare the pretrained MiniGPT-4 checkpoint

Download the pretrained checkpoint that matches your Vicuna model from MiniGPT-4. In our paper, the initial parameters we use are from MiniGPT-4's stage 1.

Checkpoint Aligned with Vicuna 13B (stage 1): Download
Checkpoint Aligned with Vicuna 13B (stage 2): Download

Then, set the path to the pretrained checkpoint in the evaluation config file in eval_configs/minigpt4_eval.yaml at Line 11.
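For reference, the corresponding entry in eval_configs/minigpt4_eval.yaml is typically a ckpt key; a minimal sketch (the exact key name may differ):

model:
  arch: mini_gpt4
  # Line 11: path to the pretrained MiniGPT-4 checkpoint downloaded above
  ckpt: "/path/to/pretrained_minigpt4_checkpoint.pth"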

4. How to train your own LURE?

(Step 1) Prepare dataset

You can modify your dataset path here at Line 5. The final dataset should be organized in a single folder, with a structure similar to the following:

dataset_train
ā”œā”€ā”€ filter_cap.json
ā””ā”€ā”€ image
    ā”œā”€ā”€ 2.jpg
    ā”œā”€ā”€ 3.jpg
    ...   

The file 'filter_cap.json' contains our prepared 5,000 LURE training samples. Each sample includes three fields:
- 'image_id': the name of the image in the training data;
- 'caption': the detailed description of the image, taken from LLaVA-Instruct-150K;
- 'h_caption': the hallucinated description we constructed from 'caption' (it may include ambiguous objects and contributing objects).
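For illustration only, a single entry might look like the following (the values are made up, and the exact top-level wrapping, e.g. an 'annotations' list as in MiniGPT-4's alignment data, may differ):

{"image_id": "2", "caption": "A man is riding a bicycle down a city street past a row of parked cars.", "h_caption": "A man is riding a bicycle down a city street past a row of parked cars, a traffic light, and a dog on the sidewalk."}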

The images can be downloaded directly from coco2014 train. As for 'filter_cap.json', we have already prepared a version in which uncertain objects are masked, available here. We have also uploaded a dataset without masks ('hallucination5k_train.jsonl'), which includes several fields:
- 'value': the corresponding 'caption' in 'filter_cap.json';
- 'h_value': the unmasked version of 'h_caption' in 'filter_cap.json';
- 'co_objects': the co-occurring objects extracted by GPT;
- 'uncertain_objects': the uncertain objects extracted by the LVLM during the image description process.
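Again purely for illustration (field values invented), one line of 'hallucination5k_train.jsonl' would then contain something like:

{"value": "A man is riding a bicycle down a city street past a row of parked cars.", "h_value": "A man is riding a bicycle down a city street past a row of parked cars and a traffic light.", "co_objects": ["traffic light"], "uncertain_objects": ["dog"]}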

(Step 2) Training

To launch the second-stage alignment, first specify the path to the initial checkpoint file in train_configs/minigpt4_stage2_finetune.yaml. You can also specify the output path there. Then, run the following command. In our experiments, we use one A100 80GB GPU.

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
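For example, with the single A100 used in our experiments, NUM_GPU is 1:

torchrun --nproc-per-node 1 train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml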

Model Inference

Prepare model captions by running the following command:

python output_LURE.py --mode 'inference' --cfg-path /path/to/config.yaml --gpu-id gpu-id --input_caption /path/to/idk_caption_file --input_image /path/to/image_file --output_file /path/to/output.jsonl

The output format is similar to the following:

{"id": "image_path", "answer": "caption of LLVM",  "p_all": {"word1": [probs, ...], "word2": [probs,...], ...}, "objs": ["obj1", "obj2", ...]}

To extract objects from sentences, you can use natural language processing (NLP) libraries such as NLTK (Natural Language Toolkit) or spaCy for part-of-speech tagging or named entity recognition. To output probabilities, we modify the generation/utils.py file in the Transformers library to generate probabilities for each token, and we store the probability of each word's first token in a dictionary named 'p_all'.
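As a minimal illustrative sketch of the object-extraction step (not the exact code used for the paper), nouns can be pulled out of a caption with NLTK's part-of-speech tagger:

# Illustrative sketch: extract candidate objects (nouns) from a caption with NLTK.
# Requires `pip install nltk` plus the 'punkt' and 'averaged_perceptron_tagger' data.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_objects(caption):
    tokens = nltk.word_tokenize(caption)
    tagged = nltk.pos_tag(tokens)
    # Keep singular and plural common nouns (NN, NNS) as candidate objects.
    return [word.lower() for word, tag in tagged if tag in ("NN", "NNS")]

print(extract_objects("A man riding a bicycle down a city street."))
# Approximate output: ['man', 'bicycle', 'city', 'street']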

To get the masked versions of the prepared captions, run the following command:

python generate_IDK.py   --input_file /path/to/caption_file.jsonl  --output_file /path/to/idk_caption_file.jsonl

Then, run the following command to obtain the rewriting response:

python output_LURE.py --mode 'rewrite' --cfg-path /path/to/config.yaml --gpu-id gpu-id --input_caption /path/to/idk_caption_file --input_image /path/to/image_file --output_file /path/to/output.jsonl

Other

Output probabilities during inference

If you want to output probabilities during inference, please replace 'your_env_environment/lib/python xx.xx/site-packages/transformers/generation/utils.py' with the 'utils.py' file in the 'tool' folder. We made modifications at lines 2559-2620 in the 'utils.py' file.
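For example, in a conda environment the replacement could look like this (adjust the Python version in the path to your setup; back up the original file first):

cp $CONDA_PREFIX/lib/pythonX.X/site-packages/transformers/generation/utils.py utils.py.bak
cp tool/utils.py $CONDA_PREFIX/lib/pythonX.X/site-packages/transformers/generation/utils.py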

Once you have completed the steps above, you can save the probabilities during inference by using the sample inference script 'model_vqa_p.py' provided in the 'tool' folder.

How to calculate CHAIR from the generated descriptions

We calculate the CHAIR metrics based on this GitHub repository. For convenience, we have organized the process into the following steps:

(Step 1) Cloning the repository and preparing annotations

git clone https://github.com/LisaAnne/Hallucination.git
cd Hallucination
mkdir annotations

Download the corresponding COCO annotations (2014 Train/Val annotations) from the website and extract them into the 'annotations' folder.

(Step 2) Prepare your inference results and convert them to a standardized format

Save your inference results in JSONL format as follows (the 'id' and 'answer' fields are required):

{"id": "COCO_train2014_000000157393.jpg", "question": xxx, "answer": xxx, "model": xxx}

Convert the result file to the standard format needed for evaluation using 'to_chair.py' provided in the 'tool' folder. Adjust Line 15 there according to the 'id' field of your JSONL, so that each sample's id in the output JSON looks like the following:

"image_id": 157393

(Step 3) Calculate CHAIR

cd Hallucination/utils/

Set '--annotation_path' in 'chair.py' to the folder where you stored the annotations, and '--cap_file' to the path of the JSON file obtained in the previous step.

python chair.py

Checkpoint release

The checkpoint we trained with MiniGPT-4 13B as the baseline is available on Hugging Face.

Related Projects

Citation

If you find this work useful, please consider giving this repository a star and citing our paper as follows:

@article{zhou2023analyzing,
  title={Analyzing and mitigating object hallucination in large vision-language models},
  author={Zhou, Yiyang and Cui, Chenhang and Yoon, Jaehong and Zhang, Linjun and Deng, Zhun and Finn, Chelsea and Bansal, Mohit and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2310.00754},
  year={2023}
}

@article{cui2023holistic,
  title={Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges},
  author={Cui, Chenhang and Zhou, Yiyang and Yang, Xinyu and Wu, Shirley and Zhang, Linjun and Zou, James and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2311.03287},
  year={2023}
}