BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
[Project Page] [Model] [Paper]
Official implementation of BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations.
BACON provides a structural representation divided into three parts: an overall description, an object list, and relationships.
<table align="center"> <tr> <td> <img src="assets/image_caption.png"> </td> </tr> </table>
Additionally, BACON can be extended to video captioning.
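To make the three-part structure concrete, here is a minimal Python sketch of how such a caption could be held in memory. All class and field names (`BaconCaption`, `ObjectEntry`, `Relationship`, and their attributes) are illustrative assumptions and are not taken from the repository; the actual output format of BACON-Captioner may differ. The toy caption at the end is invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectEntry:
    # Hypothetical fields: a short object name plus optional free-text details.
    name: str
    description: str = ""


@dataclass
class Relationship:
    # Hypothetical fields: a (subject, predicate, object) triple over object names.
    subject: str
    predicate: str
    object: str


@dataclass
class BaconCaption:
    # The three parts named in the text: overall description, object list, relationships.
    overall_description: str
    objects: List[ObjectEntry] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def dangling_relationships(self) -> List[Relationship]:
        """Return relationships that mention a name missing from the object list."""
        names = {entry.name for entry in self.objects}
        return [
            rel
            for rel in self.relationships
            if rel.subject not in names or rel.object not in names
        ]


# Toy example (invented data, not produced by the model).
caption = BaconCaption(
    overall_description="A dog chases a ball across a lawn.",
    objects=[ObjectEntry("dog"), ObjectEntry("ball")],
    relationships=[Relationship("dog", "chases", "ball")],
)
```

Keeping the object list separate from the relationships makes simple consistency checks possible, such as confirming that every relationship refers only to listed objects.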
Release
- [2024/7/13] The code of BACON is released!
Install
Currently, we only provide installation guides for Linux.
- Clone this repository and navigate to the BACON folder
git clone https://github.com/ztyang23/BACON.git
cd BACON
- Install Package
conda create -n bacon python=3.10 -y
conda activate bacon
pip install -r requirements.txt
Model
We strongly recommend downloading all the weights locally. Please create a folder named ckpt under the BACON folder and place all the downloaded weights in it. The file structure is as follows:
├── ckpt
│   ├── captioner (dir)
│   ├── llava-v1.5-13b (dir)
│   ├── groundingdino_swint_ogc.pth
│   ├── GroundingDINO_SwinT_OGC.py
│   ├── sam_vit_h_4b8939.pth
│   ├── ViT-B-32.pt
│   ├── ViT-L-14.pt
First, download the checkpoints for LLaVA and BACON-Captioner.
Second, download the checkpoint for GroundingDINO used by BACON (only groundingdino_swint_ogc.pth and GroundingDINO_SwinT_OGC.py are needed), as well as the checkpoint for SAM (only sam_vit_h_4b8939.pth is needed).
Finally, download the checkpoint for CLIP used for evaluation; both CLIP-B-32 and CLIP-L-14 are utilized in our project.
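Before running anything, it can help to confirm that the ckpt folder matches the layout listed above. The following is a minimal sketch and not part of the repository: the function name `missing_checkpoints` is hypothetical, and the expected entries are copied from the listing in this section.

```python
from pathlib import Path

# Entries copied from the ckpt listing above; "(dir)" entries are directories.
EXPECTED_DIRS = ["captioner", "llava-v1.5-13b"]
EXPECTED_FILES = [
    "groundingdino_swint_ogc.pth",
    "GroundingDINO_SwinT_OGC.py",
    "sam_vit_h_4b8939.pth",
    "ViT-B-32.pt",
    "ViT-L-14.pt",
]


def missing_checkpoints(ckpt_root="ckpt"):
    """Return the expected directories and files that are absent under ckpt_root."""
    root = Path(ckpt_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Run from the BACON folder so that "ckpt" resolves correctly.
    missing = missing_checkpoints()
    print("All checkpoints found." if not missing else f"Missing: {missing}")
```

The check only tests for presence; it does not verify that the downloaded weights are complete or uncorrupted.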
Dataset
For the training set, we used images from Unsplash and MSCOCO. For convenience, we have renumbered these images, and you can download them here. For the test set, we used the COCO2017 test set, so we do not provide these images; please download them from the official COCO website. Additionally, in our evaluation, we used the COCO2014 validation set, the COCO2015 test set, NLVR2, and Visual Genome. Please download all these datasets, create a folder named data under the BACON folder, and place all the datasets in the data folder. We also provide all annotations; please download them and place them in the data folder as well. The directory structure should be as follows:
├── data
│   ├── coco2014_val
│   │   ├── COCO_val2014_000000000042.jpg
│   │   ├── COCO_val2014_000000000073.jpg
│   ├── coco2015_test
│   │   ├── COCO_test2015_000000000001.jpg
│   │   ├── COCO_test2015_000000000014.jpg
│   ├── coco2017_test
│   │   ├── 000000000001.jpg
│   │   ├── 000000000016.jpg
│   ├── coco2017_val
│   │   ├── 000000000139.jpg
│   │   ├── 000000000285.jpg
│   ├── nlvr2_test1
│   │   ├── test1-0-2-img0.png
│   │   ├── test1-0-2-img1.png
│   ├── training_data
│   │   ├── 000000000000.jpg
│   │   ├── 000000000001.jpg
│   ├── visual_genome
│   │   ├── 1.jpg
│   │   ├── 2.jpg
│   ├── bacondata_image_ids.txt
│   ├── coco_image_ids.txt
│   ├── nlvr2_test1.json
│   ├── okvqa_mscoco_val2014_annotations.json
│   ├── okvqa_OpenEnded_mscoco_val2014_questions.json
│   ├── pointqa_local_test.jsonl
│   ├── test_dataset.jsonl
│   ├── test.jpg
│   ├── train.json
│   ├── training_dataset.jsonl
│   ├── v7w_pointing_test.jsonl
│   ├── vg_attributes.json
│   ├── vg_object_list.txt
│   ├── vg_question_answers.json
│   ├── vg_relationship_list.txt
│   ├── vg_scene_graphs.json
│   ├── vg150.json
│   ├── vqav1_vqa_E_val.jsonl
│   ├── vqav2_OpenEnded_mscoco_test-dev2015_questions.jsonl
│   ├── vqav2_OpenEnded_mscoco_test2015_questions.jsonl
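Several of the annotation files listed above use the .jsonl extension. Assuming they follow the usual JSON Lines convention (one JSON value per line), the sketch below shows one way to look at the first few records before writing any task-specific code. The helper names `read_jsonl` and `peek` are hypothetical, and no assumption is made about the fields inside each record.

```python
import json
from itertools import islice
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so a broken download is easy to spot.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({err.msg})"
                ) from err


def peek(path, limit=3):
    """Return the first `limit` records so their fields can be inspected."""
    return list(islice(read_jsonl(path), limit))
```

For example, `peek("data/test_dataset.jsonl")` returns up to three records from the test annotations without loading the whole file into memory.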
Train
We have generally followed LLaVA's training code. To train a BACON-Captioner, you first need to convert the training data into the format required by LLaVA. (We have provided the result of running this code, so you don't need to perform this step.)
python construct_training_data.py
Then, simply run the training script:
sh train.sh
Inference
Run the inference script. The results will be written to a file named result/inference.json.
sh inference.sh
Evaluation
We provide evaluation code for multiple downstream tasks, including open-vocabulary detection, open-vocabulary scene graph generation, PointQA, PointingQA, VQA, and Plan.
Complete evaluation requires running a large amount of baseline code; therefore, we only provide the code for calculating metrics. For the inference part of the baselines, please refer to the official code of each respective baseline. For convenience, we have provided the result files of all baselines we ran here; please download them and place them in the results folder. If you need to run the inference yourself, please format the output according to these files. To calculate metrics, set cfg.task in eval.py to the desired task and then run eval.py:
python eval.py