
The official implementation of the CVPR2022 paper:

<div align="center"> <h1> <b> Language as Queries for Referring <br> Video Object Segmentation </b> </h1> </div> <p align="center"><img src="docs/network.png" width="800"/></p>

Language as Queries for Referring Video Object Segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo

Abstract

In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels which capture the crucial object-level information, and play the role of convolution filters to generate the segmentation masks from feature maps. The object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline and the end-to-end framework is significantly different from the previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
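The dynamic-kernel idea above can be pictured with a short PyTorch sketch: each language-conditioned object query is projected into the weights of a tiny convolution that is slid over the frame's mask feature map, yielding one mask per query. The sketch below is only an illustration under assumed shapes (the `controller` projection, the 1x1 kernel and the channel sizes are assumptions), not the repository's actual mask head.

```python
# A minimal sketch of the dynamic-kernel mask head described above.
# Layer sizes and the `controller` projection are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskHead(nn.Module):
    def __init__(self, query_dim=256, feat_dim=8):
        super().__init__()
        # Maps each (language-conditioned) object query to the parameters of a
        # tiny per-query conv filter plus a bias term.
        self.controller = nn.Linear(query_dim, feat_dim * 1 * 1 + 1)
        self.feat_dim = feat_dim

    def forward(self, queries, mask_feats):
        # queries:    (num_queries, query_dim)  -- decoder outputs for one frame
        # mask_feats: (feat_dim, H, W)          -- per-frame mask feature map
        params = self.controller(queries)              # (Q, feat_dim + 1)
        weight, bias = params[:, :-1], params[:, -1]   # per-query kernel + bias
        weight = weight.view(-1, self.feat_dim, 1, 1)  # (Q, feat_dim, 1, 1)
        # Each query acts as a 1x1 conv filter over the shared feature map,
        # producing one mask logit map per query.
        masks = F.conv2d(mask_feats.unsqueeze(0), weight, bias)  # (1, Q, H, W)
        return masks.squeeze(0)

# Toy usage: 5 queries over an 8-channel 32x32 feature map.
head = DynamicMaskHead()
masks = head(torch.randn(5, 256), torch.randn(8, 32, 32))
print(masks.shape)  # torch.Size([5, 32, 32])
```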

Update

Demo

<img src="docs/davis_demo1.gif" width="400"/><img src="docs/davis_demo2.gif" width="400"/>

<img src="docs/ytvos_demo1.gif" width="400"/><img src="docs/ytvos_demo2.gif" width="400"/>

Requirements

We test the code in the following environments; other versions may also be compatible:

Installation

Please refer to install.md for installation.

Data Preparation

Please refer to data.md for data preparation.

We provide the pretrained models for different visual backbones. You may download them here and put them in the directory pretrained_weights.

<!-- For the Swin Transformer and Video Swin Transformer backbones, the weights are intialized using the pretrained model provided in the repo [Swin-Transformer](https://github.com/microsoft/Swin-Transformer) and [Video-Swin-Transformer](https://github.com/SwinTransformer/Video-Swin-Transformer). For your convenience, we upload the pretrained model in the google drives [swin_pretrained](https://drive.google.com/drive/u/0/folders/1QWLayukDJYAxTFk7NPwerfso3Lrx35NL) and [video_swin_pretrained](https://drive.google.com/drive/u/0/folders/19qb9VbKSjuwgxsiPI3uv06XzQkB5brYM). -->

After the organization, we expect the directory structure to be the following (a small sanity-check script is sketched after the tree):

```
ReferFormer/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   ├── a2d_sentences/
│   ├── jhmdb_sentences/
├── davis2017/
├── datasets/
├── models/
├── scripts/
├── tools/
├── util/
├── pretrained_weights/
├── eval_davis.py
├── main.py
├── engine.py
├── inference_ytvos.py
├── inference_davis.py
├── opts.py
...
```
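Optionally, a tiny script like the following (purely illustrative, not part of this repo) can sanity-check that the expected data and weight folders are in place before training:

```python
# Hypothetical helper (not part of this repo): verify the expected layout above.
from pathlib import Path

EXPECTED = [
    "data/ref-youtube-vos",
    "data/ref-davis",
    "data/a2d_sentences",
    "data/jhmdb_sentences",
    "pretrained_weights",
]

root = Path(".")  # run from the ReferFormer/ directory
missing = [p for p in EXPECTED if not (root / p).is_dir()]
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("Directory layout looks good.")
```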

Model Zoo

All the models are trained using 8 NVIDIA Tesla V100 GPUs. You may change the --backbone parameter to use different backbones (see here).

Note: If you encounter an OOM error, please add the flag --use_checkpoint (we use this flag for the Swin-L, Video-Swin-S and Video-Swin-B models).
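For context, a flag like --use_checkpoint typically enables activation (gradient) checkpointing inside the heavy backbone blocks, so intermediate activations are recomputed during the backward pass instead of being stored. The generic sketch below illustrates the mechanism with torch.utils.checkpoint; it is not the repository's actual code.

```python
# Generic illustration of activation checkpointing (what a --use_checkpoint
# style flag usually toggles); not the exact code used in this repo.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)

class Backbone(nn.Module):
    def __init__(self, use_checkpoint=False):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(4))
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpoint and self.training:
                # Recompute this block's activations in backward to save memory.
                x = checkpoint(blk, x)
            else:
                x = blk(x)
        return x

x = torch.randn(2, 256, requires_grad=True)
Backbone(use_checkpoint=True).train()(x).sum().backward()
```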

Ref-Youtube-VOS

To evaluate the results, please upload the zip file to the competition server.
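The archive is simply a zip of the predicted per-video mask folders. Assuming your inference run wrote the masks to an output directory (the path below is an assumption; adjust it to your setup), something like this produces the file to upload:

```python
# Hypothetical packaging step: zip the predicted masks for upload to the
# Ref-Youtube-VOS competition server. The directory name is an assumption;
# use whatever path your inference run wrote the per-video PNG masks to.
import shutil

pred_dir = "ytvos_results/Annotations"  # assumed output path
shutil.make_archive("submission", "zip", pred_dir)  # creates submission.zip
print("Wrote submission.zip")
```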

| Backbone | J&F | CFBI J&F | Pretrain | Model | Submission | CFBI Submission |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 55.6 | 59.4 | weight | model | link | link |
| ResNet-101 | 57.3 | 60.3 | weight | model | link | link |
| Swin-T | 58.7 | 61.2 | weight | model | link | link |
| Swin-L | 62.4 | 63.3 | weight | model | link | link |
| Video-Swin-T* | 56.0 | - | - | model | link | - |
| Video-Swin-T | 59.4 | - | weight | model | link | - |
| Video-Swin-S | 60.1 | - | weight | model | link | - |
| Video-Swin-B | 62.9 | - | weight | model | link | - |

* indicates the model is trained from scratch.

Joint training with Ref-COCO/+/g datasets.

| Backbone | J&F | J | F | Model | Submission |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 58.7 | 57.4 | 60.1 | model | link |
| ResNet-101 | 59.3 | 58.1 | 60.4 | model | link |
| Swin-L | 64.2 | 62.3 | 66.2 | model | link |
| Video-Swin-T | 62.6 | 59.9 | 63.3 | model | link |
| Video-Swin-S | 63.3 | 61.4 | 65.2 | model | link |
| Video-Swin-B | 64.9 | 62.8 | 67.0 | model | link |

Ref-DAVIS17

As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without fine-tuning.

| Backbone | J&F | J | F | Model |
| :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 58.5 | 55.8 | 61.3 | model |
| Swin-L | 60.5 | 57.6 | 63.4 | model |
| Video-Swin-B | 61.1 | 58.1 | 64.1 | model |

A2D-Sentences

The pretrained models are the same as those provided for Ref-Youtube-VOS.

| Backbone | Overall IoU | Mean IoU | mAP | Pretrain | Model |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Video-Swin-T* | 72.3 | 64.1 | 48.6 | - | model \| log |
| Video-Swin-T | 77.6 | 69.6 | 52.8 | weight | model \| log |
| Video-Swin-S | 77.7 | 69.8 | 53.9 | weight | model \| log |
| Video-Swin-B | 78.6 | 70.3 | 55.0 | weight | model \| log |

* the model is trained from scratch with --num_frames 6.

JHMDB-Sentences

As described in the paper, we report the results using the model trained on A2D-Sentences without fine-tuning.

| Backbone | Overall IoU | Mean IoU | mAP | Model |
| :---: | :---: | :---: | :---: | :---: |
| Video-Swin-T* | 70.0 | 69.3 | 39.1 | model |
| Video-Swin-T | 71.9 | 71.0 | 42.2 | model |
| Video-Swin-S | 72.8 | 71.5 | 42.4 | model |
| Video-Swin-B | 73.0 | 71.8 | 43.7 | model |

* the model is trained from scratch with --num_frames 6.

RefCOCO/+/g

We also support evaluation on the RefCOCO/+/g validation sets using the pretrained weights (num_frames=1). Specifically, we measure P@0.5 for the REC task and overall IoU (oIoU) for the RIS task.
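Both metrics are standard: P@0.5 is the fraction of expressions whose predicted box overlaps the ground-truth box with IoU above 0.5, and oIoU accumulates intersection and union over the whole split before taking the ratio. A self-contained sketch (not the repository's evaluation code):

```python
# Self-contained sketch of the two metrics (not the repo's evaluation code).
import numpy as np

def box_iou(a, b):
    # a, b: [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def precision_at_05(pred_boxes, gt_boxes):
    # REC: fraction of samples whose predicted box has IoU > 0.5 with the GT box.
    return np.mean([box_iou(p, g) > 0.5 for p, g in zip(pred_boxes, gt_boxes)])

def overall_iou(pred_masks, gt_masks):
    # RIS: cumulative intersection / cumulative union over the whole split.
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return inter / (union + 1e-6)

# Example: one perfect box prediction and one partially overlapping mask.
print(precision_at_05([[0, 0, 10, 10]], [[0, 0, 10, 10]]))           # 1.0
print(overall_iou([np.ones((4, 4), bool)], [np.eye(4, dtype=bool)]))  # ~0.25
```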

REC (referring expression comprehension):

| Backbone | RefCOCO | RefCOCO+ | RefCOCOg | Model |
| :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 85.0 | 79.2 | 79.0 | weight |
| ResNet-101 | 85.4 | 75.8 | 79.9 | weight |
| Swin-T | 86.7 | 77.2 | 80.6 | weight |
| Swin-L | 89.8 | 80.0 | 83.9 | weight |

RIS (referring image segmentation):

| Backbone | RefCOCO | RefCOCO+ | RefCOCOg | Model |
| :---: | :---: | :---: | :---: | :---: |
| ResNet-50 | 71.1 | 64.1 | 64.1 | weight |
| ResNet-101 | 71.8 | 61.1 | 64.9 | weight |
| Swin-T | 72.9 | 62.4 | 66.1 | weight |
| Swin-L | 77.1 | 65.8 | 69.3 | weight |

Get Started

Please see Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences for details.

Acknowledgement

This repo is based on Deformable DETR and VisTR. We also refer to the repositories MDETR and MTTR. Thanks for their wonderful work.

Citation

```
@article{wu2022referformer,
      title={Language as Queries for Referring Video Object Segmentation},
      author={Jiannan Wu and Yi Jiang and Peize Sun and Zehuan Yuan and Ping Luo},
      journal={arXiv preprint arXiv:2201.00487},
      year={2022},
}
```