Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez, CVPR 2023
If you have any questions, please email ziyan.yang@rice.edu
:sparkles: We have made a demo for this work! Feel free to try it!
Abstract
We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations that are consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding performance compared to models that rely instead on region-level annotations for explicitly training an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks that focus their attention scores mostly within annotated regions of interest for images that contain such annotations. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.48% when compared to the best previous model. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension and offers the added benefit by design of gradient-based explanations that better align with human annotations.
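For intuition, the sketch below shows one way a margin-based consistency term of this kind can be written in PyTorch, given a gradient-based heatmap and a binary region mask. The function name, margin values, and the exact combination of terms are illustrative assumptions; the actual objective and its weighting are defined in the paper and implemented in Pretrain.py.

```python
import torch

def amc_margin_loss(heatmap, mask, delta_max=0.5, delta_mean=0.1):
    """Sketch of a margin-based attention-mask consistency term.

    heatmap: (B, H, W) non-negative gradient-based explanation (e.g. GradCAM)
    mask:    (B, H, W) binary mask, 1 inside the annotated region
    Returns a scalar that is small when attention concentrates inside the mask.
    Margins and weighting here are placeholders; see the paper / Pretrain.py.
    """
    eps = 1e-6
    inside = heatmap * mask
    outside = heatmap * (1.0 - mask)

    # Max-margin term: the peak attention outside the region should not
    # exceed the peak attention inside the region by more than a margin.
    max_in = inside.flatten(1).max(dim=1).values
    max_out = outside.flatten(1).max(dim=1).values
    loss_max = torch.clamp(max_out - max_in + delta_max, min=0)

    # Mean-margin term: the average attention inside the region should
    # exceed the average attention over the whole map by a margin.
    mean_in = inside.flatten(1).sum(dim=1) / (mask.flatten(1).sum(dim=1) + eps)
    mean_all = heatmap.flatten(1).mean(dim=1)
    loss_mean = torch.clamp(mean_all - mean_in + delta_mean, min=0)

    return (loss_max + loss_mean).mean()
```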
Requirements
- Python 3.8
- PyTorch 1.8.0+cu111
- transformers==4.8.1
- Numpy, scikit-image, opencv-python, pillow, matplotlib, timm
Data
- Visual Genome (VG) images: Please download VG images first.
- Annotations: Please download our pre-processed text annotations for VG images. You may need to modify the image path in each sample so that it points to your local copy of the images (see the sketch below).
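A minimal sketch for rewriting the image paths. The file names and the `image` field are assumptions for illustration; adjust them to match the downloaded annotation files.

```python
import json
import os

ANNOTATION_FILE = "vg_annotations.json"   # hypothetical annotation file name
VG_IMAGE_ROOT = "/path/to/VG_100K"        # local Visual Genome image directory

with open(ANNOTATION_FILE) as f:
    samples = json.load(f)

# Point every sample at the local VG images, keeping only the file name.
for sample in samples:
    sample["image"] = os.path.join(VG_IMAGE_ROOT, os.path.basename(sample["image"]))

with open("vg_annotations_local.json", "w") as f:
    json.dump(samples, f)
```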
Train
After downloading the pre-trained ALBEF-14M model, you can run the following command to train the model:
# Train the model using bounding box annotations from VG
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --use_env Pretrain.py --config configs/Pretrain.yaml --output_dir ALBEF_Grounding --checkpoint ALBEF.pth
Evaluation
To evaluate on Flickr30k, please follow info-ground to process the data.
You can run the following commands to evaluate on the RefCOCO+, RefCLEF, and Flickr30k datasets using all the checkpoints in your ALBEF_Grounding folder:
CUDA_VISIBLE_DEVICES=1 python grounding_eval_singlegpu.py --checkpoint ALBEF_Grounding --output_dir ALBEF_Grounding/refcoco_results --config configs/Grounding_refcoco.yaml
CUDA_VISIBLE_DEVICES=1 python grounding_eval_singlegpu_refclef.py --checkpoint ALBEF_Grounding --output_dir ALBEF_Grounding/refclef_results --config configs/Grounding_refclef.yaml
CUDA_VISIBLE_DEVICES=1 python grounding_eval_singlegpu_flickr.py --checkpoint ALBEF_Grounding --output_dir ALBEF_Grounding/flickr_results --config configs/Grounding_flickr.yaml
You can also download these checkpoints and put them into the corresponding folders to reproduce our results:
CUDA_VISIBLE_DEVICES=1 python grounding_eval_singlegpu.py --checkpoint best_refcoco.pth --output_dir best_refcoco_results --config configs/Grounding_refcoco.yaml
CUDA_VISIBLE_DEVICES=1 python grounding_eval_singlegpu_refclef.py --checkpoint best_refclef.pth --output_dir best_refclef_results --config configs/Grounding_refclef.yaml
CUDA_VISIBLE_DEVICES=1 python grounding_eval_singlegpu_flickr.py --checkpoint best_flickr.pth --output_dir best_flickr_results --config configs/Grounding_flickr.yaml
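For reference, grounding accuracy on these benchmarks is typically measured with a pointing-game criterion: a prediction counts as a hit when the peak of the explanation heatmap falls inside the ground-truth box. The snippet below is a simplified illustration of that check (the function name and box format are assumptions); the scripts above implement the actual evaluation protocol.

```python
import torch

def pointing_game_hit(heatmap, box):
    """heatmap: (H, W) tensor; box: (x1, y1, x2, y2) in heatmap coordinates.
    Returns True when the argmax of the heatmap falls inside the box."""
    h, w = heatmap.shape
    idx = torch.argmax(heatmap).item()
    y, x = divmod(idx, w)
    x1, y1, x2, y2 = box
    return (x1 <= x <= x2) and (y1 <= y <= y2)
```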
Citing
If you find our paper/code useful, please consider citing:
@inproceedings{yang2023improving,
title={Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations},
author={Yang, Ziyan and Kafle, Kushal and Dernoncourt, Franck and Ordonez, Vicente},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={19165--19174},
year={2023}
}
Acknowledgement
The implementation of AMC builds on the code from ALBEF. We would like to thank the authors for open-sourcing their work and making it available to the community.