BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations

šŸ“¢ [Project Page] [Model] [Paper]

šŸš€šŸš€šŸš€ Official implementation of BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations.

BACON provides a structural representation divided into three parts: an overall description, an object list, and relationships.

<table align="center"> <tr> <td> <img src="assets/image_caption.png"> </td> </tr> </table>

Additionally, BACON can be extended to video captioning.

[Video demo]

Install

Currently, we only provide an installation guide for Linux.

  1. Clone this repository and navigate to the BACON folder:
git clone https://github.com/ztyang23/BACON.git
cd BACON
  2. Install the required packages:
conda create -n bacon python=3.10 -y
conda activate bacon
pip install -r requirements.txt

Model

We strongly recommend downloading all the weights locally. Please create a folder named ckpt under the BACON folder and place all the downloaded weights in it. The file structure should be as follows:

ā”œā”€ā”€ ckpt
ā”‚   ā”œā”€ā”€ captioner (dir)
ā”‚   ā”œā”€ā”€ llava-v1.5-13b (dir)
ā”‚   ā”œā”€ā”€ groundingdino_swint_ogc.pth
ā”‚   ā”œā”€ā”€ GroundingDINO_SwinT_OGC.py
ā”‚   ā”œā”€ā”€ sam_vit_h_4b8939.pth
ā”‚   ā”œā”€ā”€ ViT-B-32.pt
ā”‚   ā”œā”€ā”€ ViT-L-14.pt

First, download the checkpoints for LLaVA and the BACON-Captioner.

Second, download the checkpoint for GroundingDINO used by BACON (only groundingdino_swint_ogc.pth and GroundingDINO_SwinT_OGC.py are needed), as well as the checkpoint for SAM (only sam_vit_h_4b8939.pth is needed).

Finally, download the CLIP checkpoints used for evaluation; both ViT-B-32 and ViT-L-14 are used in our project.
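
For reference, here is a minimal download sketch. It assumes the usual upstream sources (the Hugging Face repo ID, the SAM/GroundingDINO URLs, and the clip package are our assumptions, not links taken from this README), so verify them against the official releases; the BACON-Captioner weights come from the Model link above and go into ckpt/captioner.

# Hedged sketch: fetch the public checkpoints into ckpt/.
# The repo ID and URLs below are the usual upstream locations (assumptions); verify before use.
import os
import urllib.request
import clip  # pip install git+https://github.com/openai/CLIP.git
from huggingface_hub import snapshot_download  # pip install huggingface_hub

os.makedirs("ckpt", exist_ok=True)

# LLaVA v1.5 13B weights (the BACON-Captioner weights from the Model link go into ckpt/captioner)
snapshot_download("liuhaotian/llava-v1.5-13b", local_dir="ckpt/llava-v1.5-13b")

# SAM ViT-H checkpoint, GroundingDINO weights, and the GroundingDINO config file
for url in [
    "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
    "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth",
    "https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py",
]:
    urllib.request.urlretrieve(url, os.path.join("ckpt", url.rsplit("/", 1)[-1]))

# CLIP weights: clip.load() downloads ViT-B-32.pt / ViT-L-14.pt into download_root
clip.load("ViT-B/32", download_root="ckpt")
clip.load("ViT-L/14", download_root="ckpt")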

Dataset

For the training set, we used images from Unsplash and MSCOCO. For convenience, we have renumbered these images, and you can download them here. For the test set, we used the COCO2017 test set; we do not redistribute these images, so please download them from the official COCO website. In our evaluation, we additionally used the COCO2014 validation set, the COCO2015 test set, NLVR2, and Visual Genome. Please download all of these datasets, create a folder named data under the BACON folder, and place the datasets in the data folder. We also provide all annotations; please download them and place them in the data folder as well. The directory structure should be as follows.

ā”œā”€ā”€ data
ā”‚   ā”œā”€ā”€ coco2014_val
ā”‚   ā”‚   ā”œā”€ā”€ COCO_val2014_000000000042.jpg
ā”‚   ā”‚   ā”œā”€ā”€ COCO_val2014_000000000073.jpg
ā”‚   ā”œā”€ā”€ coco2015_test
ā”‚   ā”‚   ā”œā”€ā”€ COCO_test2015_000000000001.jpg
ā”‚   ā”‚   ā”œā”€ā”€ COCO_test2015_000000000014.jpg
ā”‚   ā”œā”€ā”€ coco2017_test
ā”‚   ā”‚   ā”œā”€ā”€ 000000000001.jpg
ā”‚   ā”‚   ā”œā”€ā”€ 000000000016.jpg
ā”‚   ā”œā”€ā”€ coco2017_val
ā”‚   ā”‚   ā”œā”€ā”€ 000000000139.jpg
ā”‚   ā”‚   ā”œā”€ā”€ 000000000285.jpg
ā”‚   ā”œā”€ā”€ nlvr2_test1
ā”‚   ā”‚   ā”œā”€ā”€ test1-0-2-img0.png
ā”‚   ā”‚   ā”œā”€ā”€ test1-0-2-img1.png
ā”‚   ā”œā”€ā”€ training_data
ā”‚   ā”‚   ā”œā”€ā”€ 000000000000.jpg
ā”‚   ā”‚   ā”œā”€ā”€ 000000000001.jpg
ā”‚   ā”œā”€ā”€ visual_genome
ā”‚   ā”‚   ā”œā”€ā”€ 1.jpg
ā”‚   ā”‚   ā”œā”€ā”€ 2.jpg
ā”‚   ā”œā”€ā”€ bacondata_image_ids.txt
ā”‚   ā”œā”€ā”€ coco_image_ids.txt
ā”‚   ā”œā”€ā”€ nlvr2_test1.json
ā”‚   ā”œā”€ā”€ okvqa_mscoco_val2014_annotations.json
ā”‚   ā”œā”€ā”€ okvqa_OpenEnded_mscoco_val2014_questions.json
ā”‚   ā”œā”€ā”€ pointqa_local_test.jsonl
ā”‚   ā”œā”€ā”€ test_dataset.jsonl
ā”‚   ā”œā”€ā”€ test.jpg
ā”‚   ā”œā”€ā”€ train.json
ā”‚   ā”œā”€ā”€ training_dataset.jsonl
ā”‚   ā”œā”€ā”€ v7w_pointing_test.jsonl
ā”‚   ā”œā”€ā”€ vg_attributes.json
ā”‚   ā”œā”€ā”€ vg_object_list.txt
ā”‚   ā”œā”€ā”€ vg_question_answers.json
ā”‚   ā”œā”€ā”€ vg_relationship_list.txt
ā”‚   ā”œā”€ā”€ vg_scene_graphs.json
ā”‚   ā”œā”€ā”€ vg150.json
ā”‚   ā”œā”€ā”€ vqav1_vqa_E_val.jsonl
ā”‚   ā”œā”€ā”€ vqav2_OpenEnded_mscoco_test-dev2015_questions.jsonl
ā”‚   ā”œā”€ā”€ vqav2_OpenEnded_mscoco_test2015_questions.jsonl

Train

We have generally followed LLaVA's training code. To train a BACON-Captioner, you first need to convert the training data into the format required by LLaVA. (We provide the result of running this step, so you do not need to perform it yourself.)

python construct_training_data.py
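
For orientation, LLaVA expects a JSON list of conversation records along the lines of the sketch below; the field values are illustrative placeholders, and construct_training_data.py generates the actual prompts and BACON captions.

# Illustrative LLaVA-style training record (placeholder values, for orientation only)
record = {
    "id": "000000000000",
    "image": "training_data/000000000000.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image as a BACON caption."},
        {"from": "gpt", "value": "Overall description: ... Object list: ... Relationships: ..."},
    ],
}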

Then, simply run the training script.

sh train.sh

Inference

Run the inference script, and the results will be output to a file named result/inference.json.

sh inference.sh
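
The exact record schema depends on inference.sh, but the output file can be inspected with a few lines of standard-library Python:

# Load the inference output written by inference.sh
import json

with open("result/inference.json") as f:
    results = json.load(f)
print(f"{len(results)} entries loaded")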

Evaluation

We provide evaluation code for multiple downstream tasks, including open-vocabulary detection, open-vocabulary scene graph generation, PointQA, PointingQA, VQA, and Plan.

Complete evaluation requires running a large amount of baseline code; therefore, we only provide the code for calculating metrics. For the inference part of the baselines, please refer to each baseline's official code. For convenience, we provide the result files of all baselines we ran here; please download them and place them in the results folder. If you need to run the inference yourself, please format your output according to these files. To calculate metrics, set cfg.task in eval.py to the desired task and then run eval.py.

python eval.py