**News**: See HTR for a stronger model and a new metric for measuring temporal consistency.
The official implementation of the ICCV 2023 paper:

<div align="center"> <h1> <b> Spectrum-guided Multi-granularity Referring Video Object Segmentation </b> </h1> </div>

<p align="center"><img src="docs/framework.png" width="800"/></p>

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Ajmal Mian

ICCV 2023
## Introduction
We propose a Spectrum-guided Multi-granularity (SgMg) approach that follows a <em>segment-and-optimize</em> pipeline to tackle the feature drift issue found in previous decode-and-segment approaches. Extensive experiments show that SgMg achieves state-of-the-art overall performance on multiple benchmark datasets, outperforming the closest competitor by 2.8 points on Ref-YouTube-VOS with faster inference.
## Setup
The main setup of our code follows ReferFormer.

- Please refer to install.md for installation.
- Please refer to data.md for data preparation.
## Training and Evaluation
All models are trained on two RTX 3090 GPUs. If you encounter an out-of-memory (OOM) error, add the `--use_checkpoint` flag.
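For example, the flag can be appended to the training command below (whether the launcher script forwards extra arguments to the underlying Python entry point depends on the script itself; if it does not, add the flag inside the script):

```shell
# --use_checkpoint enables activation checkpointing, which recomputes
# intermediate activations in the backward pass to reduce GPU memory
# at the cost of extra compute.
sh dist_train_ytvos_videoswinb.sh --use_checkpoint
```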
The training and evaluation scripts are included in the `scripts` folder. To train or evaluate SgMg, run the following commands:

```shell
sh dist_train_ytvos_videoswinb.sh
sh dist_test_ytvos_videoswinb.sh
```
Note: you can modify `--backbone` and `--backbone_pretrained` to specify a backbone.
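As a sketch, switching backbones might look like the following (the flag values and weight path here are illustrative assumptions, not verified names; check the repository's argument parser for the accepted values):

```shell
# Hypothetical example: train with a Video-Swin-T backbone instead of
# Video-Swin-B. The backbone identifier and checkpoint path are
# placeholders and must match what the code actually expects.
sh dist_train_ytvos_videoswinb.sh \
    --backbone video_swin_tiny \
    --backbone_pretrained checkpoints/video_swin_tiny_pretrained.pth
```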
## Model Zoo
We provide pretrained models for different visual backbones and checkpoints for SgMg (see below). Put the models in the `checkpoints` folder to start training or inference.
### Results (Ref-YouTube-VOS & Ref-DAVIS)
To evaluate the Ref-YouTube-VOS results, upload the zipped predictions to the official competition server.
| Backbone | Ref-YouTube-VOS J&F | Ref-DAVIS J&F | Model | Submission |
|---|---|---|---|---|
| Video-Swin-T | 62.0 | 61.9 | model | link |
| Video-Swin-B | 65.7 | 63.3 | model | link |
### Results (A2D-Sentences & JHMDB-Sentences)
| Backbone | A2D mAP | A2D Mean IoU | A2D Overall IoU | JHMDB mAP | JHMDB Mean IoU | JHMDB Overall IoU | Model |
|---|---|---|---|---|---|---|---|
| Video-Swin-T | 56.1 | 78.0 | 70.4 | 44.4 | 72.8 | 71.7 | model |
| Video-Swin-B | 58.5 | 79.9 | 72.0 | 45.0 | 73.7 | 72.5 | model |
### Results (RefCOCO/+/g)
Overall IoU is used as the metric; the model is obtained from the pre-training stage described in the paper.
| Backbone | RefCOCO | RefCOCO+ | RefCOCOg | Model |
|---|---|---|---|---|
| Video-Swin-B | 76.3 | 66.4 | 70.0 | model |
## Acknowledgements
## Citation

```
@InProceedings{Miao_2023_ICCV,
    author    = {Miao, Bo and Bennamoun, Mohammed and Gao, Yongsheng and Mian, Ajmal},
    title     = {Spectrum-guided Multi-granularity Referring Video Object Segmentation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {920-930}
}
```
## Contact
If you have any questions about this project, please feel free to contact bomiaobbb@gmail.com.