
<div align="center"> <h1> <b> OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation </b> </h1> </div> <p align="center"><img src="docs/onlinerefer.jpg" width="800"/></p>

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen

Abstract

Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present the offline pattern as necessary for RVOS, yet model only limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.
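
To make the explicit query propagation concrete, here is a minimal PyTorch sketch of the idea; all module and variable names are illustrative, not the repo's actual API:

```python
# Minimal sketch of explicit query propagation across frames.
# Names here (QueryPropagationDecoder, init_queries, ...) are illustrative
# assumptions, not the identifiers used in this repository.
import torch
import torch.nn as nn

class QueryPropagationDecoder(nn.Module):
    """Toy cross-frame decoder: queries from frame t-1 initialize frame t."""

    def __init__(self, dim: int = 256, num_queries: int = 5):
        super().__init__()
        self.init_queries = nn.Embedding(num_queries, dim)  # used on frame 0
        self.update = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, frame_features: torch.Tensor, prev_queries=None):
        # frame_features: (B, HW, C) flattened visual features of one frame
        B = frame_features.size(0)
        if prev_queries is None:  # first frame: learned initial queries
            queries = self.init_queries.weight.unsqueeze(0).expand(B, -1, -1)
        else:                     # later frames: reuse last frame's output
            queries = prev_queries
        # queries cross-attend to the current frame's features
        return self.update(queries, frame_features)

# Online loop: each frame's refined queries seed the next frame.
decoder = QueryPropagationDecoder()
video = torch.randn(4, 2, 64, 256)  # (T, B, HW, C) dummy features
queries = None
for feats in video:
    queries = decoder(feats, queries)
```

The essential point is the control flow: the refined queries from frame t seed frame t+1, so the target's semantics and position prior carry across the video instead of being re-estimated for each clip.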

Setup

The main setup of our code follows ReferFormer.

Please refer to install.md for installation.

Please refer to data.md for data preparation.

Training and Evaluation

To train and evaluate our online model on Ref-Youtube-VOS with a ResNet-50 backbone, run:

```sh
sh ./scripts/online_ytvos_r50.sh
```

To train and evaluate our online model on Ref-Youtube-VOS with a Swin-L backbone, run:

```sh
sh ./scripts/online_ytvos_swinl.sh
```

To run inference on your own video sequences, run:

```sh
python inference_long_videos.py
```
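
If your video is a single file rather than a folder of frames, you may first need to dump it to per-frame images. A hedged helper follows; the exact input layout that inference_long_videos.py expects should be checked in the script itself:

```python
# Hedged helper: dump a video into per-frame JPEGs. Whether
# inference_long_videos.py expects a frame folder or a raw video file is an
# assumption to verify in the script; the naming scheme below is illustrative.
import os
import cv2

def extract_frames(video_path: str, out_dir: str) -> int:
    """Write every frame of video_path to out_dir; return the frame count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

print(extract_frames("my_video.mp4", "my_video_frames"))
```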

Note: The models with ResNet-50 are trained on 8 NVIDIA GeForce RTX 2080 Ti GPUs, and the models with Swin-L are trained on 8 NVIDIA Tesla V100 GPUs.

Model Zoo

Ref-Youtube-VOS

To get the evaluation scores, please upload the zip file of predicted masks to the competition server; a small packaging sketch follows the table.

| Backbone | J&F | J | F | Pretrain | Model | Submission |
|:--------:|:---:|:--:|:--:|:--------:|:-----:|:----------:|
| ResNet-50 | 57.3 | 55.6 | 58.9 | weight | model | link |
| Swin-L | 63.5 | 61.6 | 65.5 | weight | model | link |
| Video Swin-B | 62.9 | 61.0 | 64.7 | - | - | link |
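
For reference, the submission zip can be assembled with a few lines of Python. The layout below (an Annotations/ tree of per-frame PNG masks under a results directory) is an assumption to verify against the competition server's instructions:

```python
# Hedged helper to bundle predicted masks for the Ref-Youtube-VOS server.
# Assumes inference wrote PNG masks under results/Annotations/...; the exact
# layout the server expects should be double-checked on the competition page.
import os
import zipfile

def make_submission(results_dir: str = "results/Annotations",
                    out_zip: str = "submission.zip") -> None:
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(results_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the parent of results_dir so the
                # archive keeps the Annotations/... prefix.
                arcname = os.path.relpath(path, os.path.dirname(results_dir))
                zf.write(path, arcname)

if __name__ == "__main__":
    make_submission()
```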

Ref-DAVIS17

As described in the paper, we report results using the model trained on Ref-Youtube-VOS without fine-tuning.

| Backbone | J&F | J | F | Model |
|:--------:|:---:|:--:|:--:|:-----:|
| ResNet-50 | 59.3 | 55.7 | 62.9 | model |
| Swin-L | 64.8 | 61.6 | 67.7 | model |
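
For context, J&F in the tables above is the mean of the DAVIS-style region similarity J (mask IoU) and contour accuracy F. Below is a minimal sketch of the J term only; F requires boundary matching, as implemented in the official DAVIS evaluation toolkit:

```python
# Sketch of the J term of J&F: the Jaccard index (mask IoU) between a
# predicted mask and the ground truth. F (contour accuracy) is omitted here.
import numpy as np

def region_similarity_j(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mask IoU between two boolean masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: define J as 1
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

pred = np.zeros((4, 4), bool); pred[:2] = True  # toy prediction: rows 0-1
gt = np.zeros((4, 4), bool); gt[1:3] = True     # toy ground truth: rows 1-2
print(region_similarity_j(pred, gt))            # 4/12 = 0.333...
```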

Citation

If you find OnlineRefer useful in your research, please consider citing:

```bibtex
@inproceedings{wu2023onlinerefer,
  title={OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation},
  author={Wu, Dongming and Wang, Tiancai and Zhang, Yuang and Zhang, Xiangyu and Shen, Jianbing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={2761--2770},
  year={2023}
}
```

Acknowledgement