# An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
## Dependencies
- Python 3.8
- PyTorch 2.0.1 + cu117
- Check requirements.txt for other dependencies.
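A minimal environment setup sketch, assuming conda and a CUDA 11.7 machine (the environment name is arbitrary, and the wheel index should match your CUDA version):

```bash
# create and activate an isolated Python 3.8 environment (name is arbitrary)
conda create -n grounding python=3.8 -y
conda activate grounding

# install PyTorch 2.0.1 built against CUDA 11.7
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 \
    --index-url https://download.pytorch.org/whl/cu117

# install the remaining dependencies
pip install -r requirements.txt
```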
## Data Preparation
1. You can download the images from the original sources and place them in the ./ln_data folder:
- RefCOCO/RefCOCO+/RefCOCOg
- Flickr30K Entities
- Visual Genome
Finally, the ./ln_data folder will have the following structure (a symlink sketch for reproducing this layout follows the tree):
```
|-- ln_data
    |-- flickr30k
    |-- other/images/mscoco/images/train2014/
    |-- visual-genome
```
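If the images are already downloaded elsewhere on disk, one way to reproduce this layout without copying is to symlink them; the source paths below are placeholders for your own download locations:

```bash
# create the expected directory skeleton
mkdir -p ln_data/other/images/mscoco/images

# link the expected sub-folders to your local copies (paths are placeholders)
ln -s /path/to/flickr30k      ln_data/flickr30k
ln -s /path/to/coco/train2014 ln_data/other/images/mscoco/images/train2014
ln -s /path/to/visual-genome  ln_data/visual-genome
```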
2. Download the data labels here and place them in the ./mask_data folder.
## Pretrained Checkpoints
Download the following checkpoints and place them in the ./checkpoints folder.
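For example, assuming the files were downloaded to ~/Downloads (the path and file names are placeholders):

```bash
mkdir -p checkpoints
mv ~/Downloads/*.pth checkpoints/
```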
## Training and Evaluation
- Training on RefCOCOg.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
    --master_port 12345 \
    --use_env train.py \
    --batch_size 20 \
    --lr 0.000025 \
    --lr_bert 0.000005 \
    --lr_visual 0.00001 \
    --aug_scale --aug_translate --aug_crop \
    --backbone ViTDet \
    --imsize 448 \
    --bert_enc_num 12 \
    --dataset gref_umd \
    --max_query_len 40 \
    --lr_scheduler poly \
    --is_segment \
    --is_eliminate \
    --vl_enc_layers 3 \
    --dim_feedforward 1024 \
    --loss_alpha 0.1 \
    --epochs 150 \
    --output_dir outputs/refcocog_ViTDet >refcocog_ViTDet.txt 2>&1 &
```
Please refer to train.sh for training commands on other datasets.
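If only a single GPU is available, a reduced sanity-check run can reuse the same flags; the batch size and output directory below are illustrative, and the learning rates may need retuning for the smaller effective batch:

```bash
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 \
    --master_port 12345 \
    --use_env train.py \
    --batch_size 8 \
    --lr 0.000025 --lr_bert 0.000005 --lr_visual 0.00001 \
    --backbone ViTDet --imsize 448 --bert_enc_num 12 \
    --dataset gref_umd --max_query_len 40 \
    --lr_scheduler poly --is_segment --is_eliminate \
    --vl_enc_layers 3 --dim_feedforward 1024 \
    --loss_alpha 0.1 --epochs 150 \
    --output_dir outputs/refcocog_ViTDet_1gpu
```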
- Evaluation on RefCOCOg.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
    --master_port 12345 --use_env eval.py \
    --batch_size 20 --num_workers 10 \
    --bert_enc_num 12 \
    --backbone ViTDet --imsize 448 \
    --dataset gref_umd --max_query_len 40 \
    --eval_set test --vl_enc_layers 3 \
    --dim_feedforward 1024 \
    --eval_model ./outputs/refcocog_ViTDet/best_mask_checkpoint.pth \
    --output_dir ./outputs/refcocog_ViTDet \
    --is_segment --is_eliminate
```
Please refer to test.sh for evaluation commands on other splits or datasets.
- For the results with pretraining, first use the following command to pretrain the model on the mixed dataset.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    --master_port 12345 \
    --use_env train.py \
    --batch_size 20 \
    --lr 0.000025 \
    --lr_bert 0.00005 \
    --lr_visual 0.00001 \
    --aug_scale --aug_translate --aug_crop \
    --backbone ViTDet \
    --imsize 448 \
    --bert_enc_num 12 \
    --dataset mixed_pretrain \
    --max_query_len 40 \
    --vl_enc_layers 3 \
    --dim_feedforward 1024 \
    --lr_scheduler poly \
    --loss_alpha 0.5 \
    --epochs 20 \
    --output_dir outputs/mixed_pretrain_decoder >mixed_pretrain_decoder.txt 2>&1 &
```
Then use the following command to fine-tune on the mixed RefCOCO-series datasets.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    --master_port 12345 \
    --use_env train.py \
    --batch_size 20 \
    --is_segment \
    --lr 0.000025 \
    --lr_bert 0.000005 \
    --lr_visual 0.00001 \
    --aug_scale --aug_translate --aug_crop \
    --backbone ViTDet --is_eliminate \
    --imsize 448 \
    --bert_enc_num 12 \
    --dataset mixed_coco \
    --max_query_len 40 \
    --vl_enc_layers 3 \
    --dim_feedforward 1024 \
    --lr_scheduler poly \
    --loss_alpha 0.05 \
    --eliminated_threshold 0.0015 \
    --epochs 150 \
    --pretrain outputs/mixed_pretrain_decoder/checkpoint.pth \
    --output_dir outputs/mixed_coco_decoder >mixed_coco_decoder.txt 2>&1 &
```
## Our checkpoints
Our checkpoints are available on OneDrive.