Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
CVPR 2022, by Jiabo Ye, Junfeng Tian, Ming Yan, Xiaoshan Yang, Xuwu Wang, Ji Zhang, Liang He, and Xin Lin.
Installation
1. Prepare the environment
python==3.8.10
pytorch==1.10.2
transformers==4.18.0
mmdet==2.11.0
mmcv-full==1.3.18
einops==0.4.1
icecream==2.1.2
numpy==1.22.3
scipy==1.8.0
ftfy==6.1.1
The versions above are from a tested environment; other versions of these packages may also work.
We recommend installing mmdet from the source code inside this repository (./models/swin_model).
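As a convenience, here is a minimal setup sketch assuming conda and pip with the default CUDA 10.2 PyPI builds; the torchvision version, the mmcv wheel index, and the editable-install path are assumptions on our part, so adjust them to your CUDA/torch combination:
conda create -n qrnet python=3.8.10 -y
conda activate qrnet
pip install torch==1.10.2 torchvision==0.11.3
# mmcv-full wheels must match your torch/CUDA build (cu102 matches the default PyPI torch)
pip install mmcv-full==1.3.18 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.10.0/index.html
pip install transformers==4.18.0 einops==0.4.1 icecream==2.1.2 numpy==1.22.3 scipy==1.8.0 ftfy==6.1.1
# install the bundled mmdet from source, as recommended above
pip install -v -e ./models/swin_model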
2. Dataset preparation
We follow the data preparation of TransVG, which can be found in GETTING_STARTED.md.
The original download links for ReferItGame are broken, so we have uploaded the data splits and images to Google Drive.
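For orientation only, the resulting layout typically looks like the sketch below; the directory names follow TransVG's conventions and are assumptions here, so treat GETTING_STARTED.md as authoritative:
./data/       # annotation and split files (e.g. unc, unc+, gref, gref_umd, referit, flickr)
./ln_data/    # raw images for each dataset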
3. Checkpoint preparation
mkdir checkpoints
You can set --bert_model to bert-base-uncased to download the BERT checkpoint online, or put bert-base-uncased into checkpoints/ manually.
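For offline runs, here is a small sketch that caches the weights in advance using the standard transformers API; the target path simply follows the note above:
python - <<'EOF'
from transformers import BertModel, BertTokenizer
# fetch bert-base-uncased once and store it under checkpoints/ for offline use
BertModel.from_pretrained('bert-base-uncased').save_pretrained('checkpoints/bert-base-uncased')
BertTokenizer.from_pretrained('bert-base-uncased').save_pretrained('checkpoints/bert-base-uncased')
EOF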
To train our model on the RefCOCO/RefCOCO+/RefCOCOg datasets, you need a backbone checkpoint pretrained on MSCOCO with the images that overlap the test sets excluded. We provide such pretrained checkpoints on Google Drive. For the referit/flickr datasets, you can simply use the pretrained checkpoint from Swin-Transformer.
Training and Evaluation
1. Training
We provide bash scripts for training on referit.
For single-gpu training (not validated)
bash train_referit_single_gpu.sh
For multi-gpu training
bash train_referit_multi_gpu.sh
Training on the other datasets is similar. The differences are that on RefCOCOg we recommend setting --max_query_len 40, and on RefCOCO+ we recommend setting --lr_drop 120.
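A hedged sketch of what such a launch could look like; train.py, --dataset, --output_dir, and the launcher settings follow TransVG-style conventions and are assumptions here, so the provided .sh scripts remain authoritative:
# multi-GPU training on RefCOCOg with the longer query length recommended above
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py \
    --dataset gref_umd --max_query_len 40 \
    --bert_model bert-base-uncased --output_dir outputs/refcocog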
2. Evaluation
For single-gpu evaluation
bash eval_referit_single_gpu.sh
For multi-gpu evaluation
bash eval_referit_multi_gpu.sh
Citation
@article{ye2022shifting,
title={Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding},
author={Ye, Jiabo and Tian, Junfeng and Yan, Ming and Yang, Xiaoshan and Wang, Xuwu and Zhang, Ji and He, Liang and Lin, Xin},
journal={arXiv preprint arXiv:2203.15442},
year={2022}
}
Acknowledgement
This codebase is partially based on TransVG and Swin-Transformer.