

SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Zhuoyan Luo*, Yicheng Xiao*, Yong Liu*, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang

Tsinghua University Intelligent Interaction Group

<a href='https://arxiv.org/abs/2305.17011'><img src='https://img.shields.io/badge/ArXiv-2305.17011-red'></a>

📢 Updates

📖 Abstract

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.

📗 Framework

<img src="asserts/framework.png" width="95%"/>

Visualization Result

<img src="asserts/visualization.png" width="95%"/>

Text expressions with temporal variations

(a) and (b) are the segmentation results of our SOC and ReferFormer, respectively. For more details, please refer to the <a href="https://arxiv.org/pdf/2305.17011.pdf">paper</a>.

<img src="asserts/temporal_comparison.png" width="95%">

🛠️ Environment Setup

Data Preparation

The overall data layout is as follows. We put rvosdata under the path /mnt/data_16TB/lzy23/rvosdata; please change it to xxx/rvosdata according to your own path.

└── a2d_sentences/
    ├── Release/
    │   ├── videoset.csv  (videos metadata file)
    │   └── CLIPS320/
    │       └── *.mp4     (video files)
    └── text_annotations/
        ├── a2d_annotation.txt  (actual text annotations)
        ├── a2d_missed_videos.txt
        └── a2d_annotation_with_instances/
            └── */ (video folders)
                └── *.h5 (annotations files)
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   └── */ (video folders)
    │   │       └── *.jpg (frame image files)
    │   └── Annotations/
    │       └── */ (video folders)
    │           └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       └── */ (video folders)
    │           └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json  (text annotations)
        └── valid/
            └── meta_expressions.json  (text annotations)
└── coco/
    ├── train2014/
    ├── refcoco/
    │   ├── instances_refcoco_train.json
    │   └── instances_refcoco_val.json
    ├── refcoco+/
    │   ├── instances_refcoco+_train.json
    │   └── instances_refcoco+_val.json
    └── refcocog/
        ├── instances_refcocog_train.json
        └── instances_refcocog_val.json
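
Before launching training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repository: the names `EXPECTED` and `missing_entries` are hypothetical, and the list only covers the fixed files and folders shown in the tree (per-video folders and frames are not checked).

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
# Per-video folders (*/) and individual frames are intentionally not listed.
EXPECTED = [
    "a2d_sentences/Release/videoset.csv",
    "a2d_sentences/Release/CLIPS320",
    "a2d_sentences/text_annotations/a2d_annotation.txt",
    "a2d_sentences/text_annotations/a2d_missed_videos.txt",
    "a2d_sentences/text_annotations/a2d_annotation_with_instances",
    "refer_youtube_vos/train/JPEGImages",
    "refer_youtube_vos/train/Annotations",
    "refer_youtube_vos/valid/JPEGImages",
    "refer_youtube_vos/meta_expressions/train/meta_expressions.json",
    "refer_youtube_vos/meta_expressions/valid/meta_expressions.json",
    "coco/train2014",
    "coco/refcoco/instances_refcoco_train.json",
    "coco/refcoco/instances_refcoco_val.json",
    "coco/refcoco+/instances_refcoco+_train.json",
    "coco/refcoco+/instances_refcoco+_val.json",
    "coco/refcocog/instances_refcocog_train.json",
    "coco/refcocog/instances_refcocog_val.json",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries("xxx/rvosdata")` with your own data root returns an empty list when every listed file and folder is present, and otherwise the relative paths that still need to be downloaded or moved.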

Pretrained Model

We create a folder for storing all pretrained models and put it at /mnt/data_16TB/lzy23/pretrained; please change it to xxx/pretrained according to your own path.

└── pretrained_swin_transformer
└── pretrained_roberta

Model Zoo

The checkpoints are as follows:


Output Dir

We put all outputs under a single directory. Specifically, we set /mnt/data_16TB/lzy23/SOC as the output directory, so please change it to xxx/SOC.
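
The data, pretrained-model, and output locations above all share the prefix /mnt/data_16TB/lzy23. One way to switch them to your own location is a small search-and-replace over the repository, sketched below. This is an illustration only: the function `rewrite_prefix` is hypothetical, and the file suffixes it scans (`.py`, `.sh`, `.yaml`) are an assumption about where the paths are hard-coded, so check the dry-run output before writing anything.

```python
from pathlib import Path

# Prefix used in this README for rvosdata, pretrained and SOC.
OLD_PREFIX = "/mnt/data_16TB/lzy23"


def rewrite_prefix(repo_dir, new_prefix, suffixes=(".py", ".sh", ".yaml"), dry_run=True):
    """Replace OLD_PREFIX with new_prefix in text files under repo_dir.

    The suffix list is an assumption, not taken from the repository.
    Returns the files that contain OLD_PREFIX; they are rewritten only
    when dry_run is False.
    """
    new_prefix = new_prefix.rstrip("/")
    matched = []
    for path in sorted(Path(repo_dir).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip files that are not valid UTF-8 text
        if OLD_PREFIX in text:
            matched.append(path)
            if not dry_run:
                path.write_text(text.replace(OLD_PREFIX, new_prefix), encoding="utf-8")
    return matched
```

Run it once with the default `dry_run=True` to list the affected files, then again with `dry_run=False` to apply the change.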

🚀 Training

From scratch

When training from scratch, we use only Video-Swin-Tiny as the backbone for training and evaluation.

With Pretrain

We pretrain and finetune on the A2D-Sentences and Ref-YouTube-VOS datasets using Video-Swin-Tiny and Video-Swin-Base. Following previous work, we first pretrain on the RefCOCO dataset and then finetune.

Joint training

We perform joint training only on the Ref-YouTube-VOS dataset, with Video-Swin-Tiny and Video-Swin-Base.



We provide an interface for inference:

bash ./scripts/demo_video.sh


Code in this repository is built upon several public repositories. Thanks to the wonderful work of ReferFormer and MTTR.


If you find this work useful for your research, please cite:

  author       = {Zhuoyan Luo and
                  Yicheng Xiao and
                  Yong Liu and
                  Shuyan Li and
                  Yitong Wang and
                  Yansong Tang and
                  Xiu Li and
                  Yujiu Yang},
  title        = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
                  Segmentation},
  booktitle    = {NeurIPS},
  year         = {2023},