Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Code for the CVPR 2019 paper "Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing".

Prerequisites

Installation

  1. Clone the CM-Erase repository
git clone --recursive https://github.com/xh-liu/CM-Erase
  2. Prepare the submodules and associated data

Training

  1. Prepare the training and evaluation data by running tools/prepro.py:
python tools/prepro.py --dataset refcoco --splitBy unc
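
To sanity-check this step, you can peek at the preprocessed output; the path and keys below are assumptions based on the command above, not documented locations, so adjust them to wherever tools/prepro.py actually writes its files:

# Sketch: quick sanity check of the preprocessed referring-expression data.
# The output path is an assumption (not documented in this README).
import json

with open('cache/prepro/refcoco_unc/data.json') as f:  # assumed path
    data = json.load(f)

print('keys:', list(data.keys()))
print('num refs:', len(data.get('refs', [])))
print('num sentences:', len(data.get('sentences', [])))
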
  2. Download the GloVe pretrained word embeddings from Google Drive.
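
A minimal sketch of how the downloaded GloVe file can be turned into an embedding matrix for a vocabulary; the file name and the vocabulary source are placeholders, not the exact names used by this repository:

# Sketch: build a (vocab_size, dim) embedding matrix from a GloVe text file.
# Words missing from GloVe keep a zero vector.
import numpy as np

def load_glove(path, vocab, dim=300):
    word2row = {w: i for i, w in enumerate(vocab)}
    emb = np.zeros((len(vocab), dim), dtype=np.float32)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in word2row and len(vec) == dim:
                emb[word2row[word]] = np.asarray(vec, dtype=np.float32)
    return emb

# emb = load_glove('glove.840B.300d.txt', vocab)  # placeholder file name and vocab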

  3. Extract features using Mask R-CNN, where head_feats are used for training the subject module and ann_feats are used for training the relationship module.

CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
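
If you want to verify the extracted features, a minimal sketch for inspecting the HDF5 output; the file path and layout are assumptions, so check the extraction scripts for the actual locations:

# Sketch: list the datasets stored in an extracted-feature HDF5 file.
# The path below is an assumed location, not documented in this README.
import h5py

path = 'cache/feats/refcoco_unc/mrcn/ann_feats.h5'  # assumed path, adjust as needed
with h5py.File(path, 'r') as f:
    for name, dset in f.items():
        print(name, getattr(dset, 'shape', None))
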
  4. Detect objects/masks and extract their features (only needed if you want to evaluate automatic comprehension). We empirically set the confidence threshold of Mask R-CNN to 0.65.
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
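
For illustration, this is roughly what the 0.65 confidence threshold does to the raw detections (a sketch, not the repository's detection code; boxes and scores are placeholder inputs):

# Sketch: keep only detections whose classification score is >= conf_thresh.
import numpy as np

def filter_detections(boxes, scores, conf_thresh=0.65):
    keep = scores >= conf_thresh
    return boxes[keep], scores[keep]

boxes = np.array([[10, 20, 110, 220], [30, 40, 90, 180]], dtype=np.float32)
scores = np.array([0.91, 0.42], dtype=np.float32)
print(filter_detections(boxes, scores))  # only the 0.91 detection survives
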
  5. Pretrain the network (CM-Att) with ground-truth annotation:
./experiments/scripts/train_mattnet.sh GPU_ID
  6. Train the network with cross-modal erasing (CM-Att-Erase):
./experiments/scripts/train_erase.sh GPU_ID
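
As a rough illustration of the erasing idea (not the code path taken by train_erase.sh): during training, the most-attended evidence from one modality is erased so the model has to ground the expression using complementary cues. A minimal word-erasing sketch, where attn is a hypothetical per-word attention vector:

# Conceptual sketch of attention-guided word erasing (illustration only):
# zero out the most-attended word so the model must rely on the remaining words.
import torch

def erase_most_attended_word(word_embs, attn):
    # word_embs: (seq_len, dim) word embeddings; attn: (seq_len,) attention weights
    erased = word_embs.clone()
    top = torch.argmax(attn)   # index of the most-attended word
    erased[top] = 0.0          # erase it from the input expression
    return erased

word_embs = torch.randn(5, 300)
attn = torch.softmax(torch.randn(5), dim=0)
erased = erase_most_attended_word(word_embs, attn)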

Evaluation

Evaluate the network with ground-truth annotation:

./experiments/scripts/eval_easy.sh GPU_ID

Evaluate the network with Mask R-CNN detection results:

./experiments/scripts/eval_dets.sh GPU_ID 

Pre-trained Models

We provide pre-trained models for RefCOCO, RefCOCO+, and RefCOCOg. Download them from Google Drive and put them under the ./output folder.

Citation

If you find our code useful for your research, please consider citing:

@inproceedings{liu2019improving,
  title={Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing},
  author={Liu, Xihui and Wang, Zihao and Shao, Jing and Wang, Xiaogang and Li, Hongsheng},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={1950--1959},
  year={2019}
}
@inproceedings{yu2018mattnet,
  title={Mattnet: Modular attention network for referring expression comprehension},
  author={Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={1307--1315},
  year={2018}
}

Acknowledgement

This project is built on the PyTorch implementation of MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018).