PyTorch Implementation of MAttNet

Extract features for NMTree

  1. Make sure the original MAttNet works well.
  2. `python save_matt_gt_feats.py --dataset [dataset] --split_by [split]`
  3. `python save_matt_det_feats.py --dataset [dataset] --split_by [split]`
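
As a concrete example (assuming the original MAttNet data and feature preparation described below has already been run for RefCOCO with the unc split), the two calls would look like this; the exact output location depends on the scripts' defaults:

python save_matt_gt_feats.py --dataset refcoco --split_by unc
python save_matt_det_feats.py --dataset refcoco --split_by unc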

Introduction

This repository is a PyTorch implementation of MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018). Referring expressions are natural language utterances that indicate particular objects within a scene, e.g., "the woman in red sweater", "the man on the right", etc. For robots or other intelligent agents communicating with people in the world, the ability to accurately comprehend such expressions is a necessary component of natural interaction. In this project, we address referring expression comprehension: localizing an image region described by a natural language expression. Check our paper and online demo for more details. Examples are shown as follows:

<p align="center"> <img src="http://bvisionweb1.cs.unc.edu/licheng/MattNet/mattnet_example.jpg" width="75%"/> </p>

Prerequisites

Installation

  1. Clone the MAttNet repository:
git clone --recursive https://github.com/daqingliu/MAttNet
  2. Prepare the submodules and associated data.
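
A minimal sketch of the setup, including submodule initialization in case the clone was made without --recursive; the remaining data preparation follows each submodule's own instructions:

git clone --recursive https://github.com/daqingliu/MAttNet
cd MAttNet
# fetch submodules if they were not pulled in during the clone
git submodule update --init --recursive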

Training

  1. Prepare the training and evaluation data by running tools/prepro.py:
python tools/prepro.py --dataset refcoco --splitBy unc
  2. Extract features using Mask R-CNN, where the head_feats are used in subject-module training and the ann_feats are used in relationship-module training:
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
  3. Detect objects/masks and extract their features (only needed if you want to evaluate automatic comprehension). We empirically set the confidence threshold of Mask R-CNN to 0.65:
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=gpu_id python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=gpu_id python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
  4. Train MAttNet with ground-truth annotation:
./experiments/scripts/train_mattnet.sh GPU_ID refcoco unc
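
Taken together, a minimal end-to-end sketch of the four steps above for RefCOCO (unc split); the GPU id 0 is illustrative, and step 3 can be skipped if you only evaluate with ground-truth boxes:

# step 1: preprocess the data
python tools/prepro.py --dataset refcoco --splitBy unc
# step 2: extract head/ann features for the subject and relationship modules
CUDA_VISIBLE_DEVICES=0 python tools/extract_mrcn_head_feats.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=0 python tools/extract_mrcn_ann_feats.py --dataset refcoco --splitBy unc
# step 3 (optional, for automatic comprehension): detect objects/masks and extract their features
CUDA_VISIBLE_DEVICES=0 python tools/run_detect.py --dataset refcoco --splitBy unc --conf_thresh 0.65
CUDA_VISIBLE_DEVICES=0 python tools/run_detect_to_mask.py --dataset refcoco --splitBy unc
CUDA_VISIBLE_DEVICES=0 python tools/extract_mrcn_det_feats.py --dataset refcoco --splitBy unc
# step 4: train with ground-truth annotation on GPU 0
./experiments/scripts/train_mattnet.sh 0 refcoco unc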

During training, you may want to use cv/inspect_cv.ipynb to check the training/validation curves and do cross-validation.

Evaluation

Evaluate MAttNet with ground-truth annotation:

./experiments/scripts/eval_easy.sh GPU_ID refcoco unc

If you have already detected/extracted the Mask R-CNN results (Training Step 3 above), you can evaluate the automatic comprehension accuracy using Mask R-CNN detections and segmentations:

./experiments/scripts/eval_dets.sh GPU_ID refcoco unc
./experiments/scripts/eval_masks.sh GPU_ID refcoco unc

Pre-trained Models

In order to reproduce the results in our paper, follow Training Steps 1-3 for data and feature preparation, then run Evaluation Step 1. We provide pre-trained models for RefCOCO, RefCOCO+, and RefCOCOg. Download them and put them under the ./output folder.
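
A sketch of where the downloaded models go; the archive name below is a hypothetical placeholder, so substitute whatever file the download link actually provides (and use tar instead of unzip if the archive format differs):

mkdir -p output
# hypothetical archive name -- replace with the file you downloaded
unzip mattnet_refcoco_pretrained.zip -d output/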

  1. RefCOCO: Pre-trained model (56M)
<table>
<tr><th></th><th>Localization (gt-box)</th><th>Localization (Mask R-CNN)</th><th>Segmentation (Mask R-CNN)</th></tr>
<tr><td>val</td><td>85.57%</td><td>76.65%</td><td>75.16%</td></tr>
<tr><td>test A</td><td>85.95%</td><td>81.14%</td><td>79.55%</td></tr>
<tr><td>test B</td><td>84.36%</td><td>69.99%</td><td>68.87%</td></tr>
</table>
  2. RefCOCO+: Pre-trained model (56M)
<table>
<tr><th></th><th>Localization (gt-box)</th><th>Localization (Mask R-CNN)</th><th>Segmentation (Mask R-CNN)</th></tr>
<tr><td>val</td><td>71.71%</td><td>65.33%</td><td>64.11%</td></tr>
<tr><td>test A</td><td>74.28%</td><td>71.62%</td><td>70.12%</td></tr>
<tr><td>test B</td><td>66.27%</td><td>56.02%</td><td>54.82%</td></tr>
</table>
  3. RefCOCOg: Pre-trained model (58M)
<table>
<tr><th></th><th>Localization (gt-box)</th><th>Localization (Mask R-CNN)</th><th>Segmentation (Mask R-CNN)</th></tr>
<tr><td>val</td><td>78.96%</td><td>66.58%</td><td>64.48%</td></tr>
<tr><td>test</td><td>78.51%</td><td>67.27%</td><td>65.60%</td></tr>
</table>

Pre-computed detections/masks

We provide the detected boxes/masks for those who are interested in automatic comprehension. These were produced using Training Step 3. Note that our Mask R-CNN is trained on COCO's training images, excluding those in the validation and test sets of RefCOCO, RefCOCO+, and RefCOCOg. For that reason, it would be unfair to use other off-the-shelf detectors trained on the whole COCO set for this task.

Demo

Run cv/example_demo.ipynb for a demo example. You can also check our Online Demo.

Citation

@inproceedings{yu2018mattnet,
  title={MAttNet: Modular Attention Network for Referring Expression Comprehension},
  author={Yu, Licheng and Lin, Zhe and Shen, Xiaohui and Yang, Jimei and Lu, Xin and Bansal, Mohit and Berg, Tamara L},
  booktitle={CVPR},
  year={2018}
}

License

MAttNet is released under the MIT License (refer to the LICENSE file for details).

A few notes

I'd like to share several thoughts after working on Referring Expressions for 3 years (since 2015):

Authorship

This project is maintained by Licheng Yu.