

Referring Expression Object Segmentation with Caption-Aware Consistency

PyTorch implementation of our method for segmenting the object in an image that a natural language description refers to.

Contact: Yi-Wen Chen (chenyiwena at gmail dot com)

<p align="center"> <img src="https://github.com/wenz116/lang2seg/blob/master/figure/overview.png" width="65%"> </p>


Referring Expression Object Segmentation with Caption-Aware Consistency <br /> Yi-Wen Chen, Yi-Hsuan Tsai, Tiantian Wang, Yen-Yu Lin and Ming-Hsuan Yang <br /> British Machine Vision Conference (BMVC), 2019 <br />

Please cite our paper if you find it useful for your research.

  author = {Yi-Wen Chen and Yi-Hsuan Tsai and Tiantian Wang and Yen-Yu Lin and Ming-Hsuan Yang},
  booktitle = {British Machine Vision Conference (BMVC)},
  title = {Referring Expression Object Segmentation with Caption-Aware Consistency},
  year = {2019}



The processed data is provided in cache/prepro/.


  1. Train the baseline segmentation model with only 1 dynamic filter:
./experiments/scripts/train_baseline.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX>
  1. Train the model with spatial dynamic filters:
./experiments/scripts/train_spatial.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX>
  1. Train the model with spatial dynamic filters and caption loss:
./experiments/scripts/train_cycle.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX> att2in2 <CAPTION_LOSS_WEIGHT>

The pretrained Mask R-CNN model should be placed at <DATASET>_<SPLITBY>/output_<OUTPUT_POSTFIX>. If the directory contains multiple models, the one from the latest iteration is loaded.
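To see which model would count as the latest iteration in a directory, a small helper along these lines can be used. This is a minimal sketch, not the repository's loading code: the function name `latest_checkpoint` and the assumption that file names contain `iter_<N>` are illustrative and may need adjusting to the actual file names.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint file names contain the iteration count as
# "iter_<N>". Change the pattern if the files are named differently.
ITER_PATTERN = re.compile(r"iter_(\d+)")


def latest_checkpoint(model_dir: str, suffix: str = ".pth") -> Optional[Path]:
    """Return the checkpoint with the largest iteration number, or None."""
    best_path, best_iter = None, -1
    for path in Path(model_dir).glob(f"*{suffix}"):
        match = ITER_PATTERN.search(path.name)
        if match is None:
            continue  # skip files without an iteration number
        iteration = int(match.group(1))
        # Compare as integers so that 2500 ranks above 900.
        if iteration > best_iter:
            best_path, best_iter = path, iteration
    return best_path
```

Comparing the iteration numbers as integers rather than as strings avoids picking `iter_900` over `iter_2500`.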

The pretrained caption model should be placed in <DATASET>_<SPLITBY>/caption_log_res5_2/, with the files named model-best.pth and infos-best.pkl.

  1. Train the model with spatial dynamic filters and response loss:
./experiments/scripts/train_response.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX>
  1. Train the model with spatial dynamic filters, response loss and caption loss:
./experiments/scripts/train_cycle_response.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX> att2in2 <CAPTION_LOSS_WEIGHT>

The pretrained Mask R-CNN model should be placed at <DATASET>_<SPLITBY>/output_<OUTPUT_POSTFIX>. If the directory contains multiple models, the one from the latest iteration is loaded.

The pretrained caption model should be placed in <DATASET>_<SPLITBY>/caption_log_response/, with the files named model-best.pth and infos-best.pkl.
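Before starting a run with caption loss, it can help to confirm that both caption model files are in place. The sketch below only checks the layout described above. The helper name `missing_caption_files` and the `root` parameter (the directory that contains `<DATASET>_<SPLITBY>`) are assumptions for illustration and are not part of the repository.

```python
from pathlib import Path
from typing import List

# File names the caption model is expected to use.
CAPTION_FILES = ("model-best.pth", "infos-best.pkl")


def missing_caption_files(dataset: str, splitby: str, log_dir: str,
                          root: str = ".") -> List[str]:
    """List the expected caption model files that are not present.

    log_dir is "caption_log_res5_2" or "caption_log_response",
    depending on which training script is used. root is a hypothetical
    parameter naming the directory that holds <DATASET>_<SPLITBY>.
    """
    base = Path(root) / f"{dataset}_{splitby}" / log_dir
    return [str(base / name) for name in CAPTION_FILES
            if not (base / name).is_file()]
```

An empty list means both files were found. Otherwise the returned paths show where the files are expected.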

  1. Train the model with spatial dynamic filters and response loss for VGG16 and Faster R-CNN:

Download the pre-trained Faster R-CNN model (coco_900k-1190k.tar) and put the .pth and .pkl files in pyutils/mask-faster-rcnn/output/vgg16/coco_2014_train+coco_2014_valminusminival/

./experiments/scripts/train_vgg.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX>


  1. Evaluate the baseline segmentation model:
./experiments/scripts/eval_baseline.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX> <MODEL_ITER>

This evaluates the model stored at <DATASET>_<SPLITBY>/output_<OUTPUT_POSTFIX>, using the checkpoint from training iteration <MODEL_ITER>.

Detection and segmentation results are saved to experiments/det_results.txt and experiments/mask_results.txt, respectively.

  1. Evaluate the model with spatial dynamic filters (and caption loss):
./experiments/scripts/eval_spatial.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX> <MODEL_ITER>
  1. Evaluate the model with spatial dynamic filters and response loss (and caption loss):
./experiments/scripts/eval_response.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX> <MODEL_ITER>
  1. Evaluate the model with spatial dynamic filters and response loss for VGG16 and Faster R-CNN:
./experiments/scripts/eval_vgg.sh <GPUID> <DATASET> <SPLITBY> <OUTPUT_POSTFIX> <MODEL_ITER>
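All training and evaluation scripts take the same four leading arguments, followed by script-specific ones (att2in2 and <CAPTION_LOSS_WEIGHT> for the caption-loss training scripts, <MODEL_ITER> for evaluation). When running several configurations, the command lines can be assembled programmatically. The following is a minimal sketch: the helper `build_command` is hypothetical, and the dataset name, split, postfix, loss weight and iteration in the usage lines are placeholder values, not recommended settings.

```python
import shlex
from typing import List, Sequence


def build_command(script: str, gpu_id: int, dataset: str, splitby: str,
                  output_postfix: str, extra: Sequence = ()) -> List[str]:
    """Assemble the argument list for one of the experiment scripts."""
    return [
        f"./experiments/scripts/{script}",
        str(gpu_id),
        dataset,
        splitby,
        output_postfix,
        *[str(arg) for arg in extra],  # script-specific trailing arguments
    ]


# Placeholder values for illustration only.
train_cmd = build_command("train_cycle.sh", 0, "refcoco", "unc", "demo",
                          extra=("att2in2", 0.1))
eval_cmd = build_command("eval_spatial.sh", 0, "refcoco", "unc", "demo",
                         extra=(1250000,))

print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

The resulting lists can be passed to `subprocess.run` from the repository root, or printed and pasted into a shell.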


Thanks to Licheng Yu for the MattNet work. Our code borrows heavily from the MattNet implementation.


The model and code are available for non-commercial research purposes only.