
SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Zhuoyan Luo*, Yicheng Xiao*, Yong Liu*, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang

Tsinghua University Intelligent Interaction Group

<a href='https://arxiv.org/abs/2305.17011'><img src='https://img.shields.io/badge/ArXiv-2305.17011-red'></a>

šŸ“¢ Updates

šŸ“– Abstract

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.

šŸ“— Framework

<img src="asserts/framework.png" width="95%"/>

Visualization Results

<img src="asserts/visualization.png" width="95%"/>

Text expressions with temporal variations

(a) and (b) are the segmentation results of our SOC and ReferFormer, respectively. For more details, please refer to the <a href="https://arxiv.org/pdf/2305.17011.pdf">paper</a>.

<img src="asserts/temporal_comparison.png" width="95%">

šŸ› ļø Environment Setup

Data Preparation

The overall data layout is shown below; a setup sketch follows the directory tree. We put rvosdata under the path /mnt/data_16TB/lzy23/rvosdata; please change it to xxx/rvosdata according to your own path.

rvosdata
ā”œā”€ā”€ a2d_sentences/
ā”‚   ā”œā”€ā”€ Release/
ā”‚   ā”‚   ā”œā”€ā”€ videoset.csv  (videos metadata file)
ā”‚   ā”‚   ā””ā”€ā”€ CLIPS320/
ā”‚   ā”‚       ā””ā”€ā”€ *.mp4     (video files)
ā”‚   ā””ā”€ā”€ text_annotations/
ā”‚       ā”œā”€ā”€ a2d_annotation.txt  (actual text annotations)
ā”‚       ā”œā”€ā”€ a2d_missed_videos.txt
ā”‚       ā””ā”€ā”€ a2d_annotation_with_instances/
ā”‚           ā””ā”€ā”€ */ (video folders)
ā”‚               ā””ā”€ā”€ *.h5 (annotation files)
ā”œā”€ā”€ refer_youtube_vos/
ā”‚   ā”œā”€ā”€ train/
ā”‚   ā”‚   ā”œā”€ā”€ JPEGImages/
ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ */ (video folders)
ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ *.jpg (frame image files)
ā”‚   ā”‚   ā””ā”€ā”€ Annotations/
ā”‚   ā”‚       ā””ā”€ā”€ */ (video folders)
ā”‚   ā”‚           ā””ā”€ā”€ *.png (mask annotation files)
ā”‚   ā”œā”€ā”€ valid/
ā”‚   ā”‚   ā””ā”€ā”€ JPEGImages/
ā”‚   ā”‚       ā””ā”€ā”€ */ (video folders)
ā”‚   ā”‚           ā””ā”€ā”€ *.jpg (frame image files)
ā”‚   ā””ā”€ā”€ meta_expressions/
ā”‚       ā”œā”€ā”€ train/
ā”‚       ā”‚   ā””ā”€ā”€ meta_expressions.json  (text annotations)
ā”‚       ā””ā”€ā”€ valid/
ā”‚           ā””ā”€ā”€ meta_expressions.json  (text annotations)
ā””ā”€ā”€ coco/
    ā”œā”€ā”€ train2014/
    ā”œā”€ā”€ refcoco/
    ā”‚   ā”œā”€ā”€ instances_refcoco_train.json
    ā”‚   ā””ā”€ā”€ instances_refcoco_val.json
    ā”œā”€ā”€ refcoco+/
    ā”‚   ā”œā”€ā”€ instances_refcoco+_train.json
    ā”‚   ā””ā”€ā”€ instances_refcoco+_val.json
    ā””ā”€ā”€ refcocog/
        ā”œā”€ā”€ instances_refcocog_train.json
        ā””ā”€ā”€ instances_refcocog_val.json
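As a minimal setup sketch (the target root xxx/rvosdata and all source paths below are placeholders, not taken from the repository), existing dataset copies can be symlinked into place instead of copied:

```bash
# Minimal sketch: assemble the rvosdata layout by symlinking existing dataset copies.
# All paths are placeholders; adjust them to your own machine.
RVOS_ROOT=xxx/rvosdata
mkdir -p "$RVOS_ROOT"
ln -s /path/to/A2D-Sentences     "$RVOS_ROOT/a2d_sentences"
ln -s /path/to/Refer-YouTube-VOS "$RVOS_ROOT/refer_youtube_vos"
ln -s /path/to/coco              "$RVOS_ROOT/coco"
```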

Pretrained Model

We create a folder to store all pretrained models at /mnt/data_16TB/lzy23/pretrained; please change it to xxx/pretrained according to your own path. A placement sketch follows the layout below.

pretrained
ā”œā”€ā”€ pretrained_swin_transformer
ā””ā”€ā”€ pretrained_roberta
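A minimal placement sketch (the checkpoint file names in the comments are assumptions; use the Video Swin Transformer and RoBERTa weights required by your chosen backbone):

```bash
# Minimal sketch: place backbone and text-encoder weights under xxx/pretrained.
# File names are assumptions; substitute the checkpoints you actually downloaded.
PRETRAINED_ROOT=xxx/pretrained
mkdir -p "$PRETRAINED_ROOT/pretrained_swin_transformer" "$PRETRAINED_ROOT/pretrained_roberta"
# Video Swin Transformer weights (e.g. swin_tiny / swin_base *.pth) go under:
#   $PRETRAINED_ROOT/pretrained_swin_transformer/
# RoBERTa weights and tokenizer files go under:
#   $PRETRAINED_ROOT/pretrained_roberta/
```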

Model Zoo

The checkpoints are as follows:

| Setting | Backbone | Checkpoint |
| --- | --- | --- |
| a2d_from_scratch | Video-Swin-T | Model |
| a2d_with_pretrain | Video-Swin-T | Model |
| a2d_with_pretrain | Video-Swin-B | Model |
| ytb_from_scratch | Video-Swin-T | Model |
| ytb_with_pretrain | Video-Swin-T | Model |
| ytb_with_pretrain | Video-Swin-B | Model |
| ytb_joint_train | Video-Swin-T | Model |
| ytb_joint_train | Video-Swin-B | Model |

Output Dir

We put all outputs under a single directory. Specifically, we set /mnt/data_16TB/lzy23/SOC as the output directory; please change it to xxx/SOC according to your own path.
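A minimal sketch, using the placeholder path from above:

```bash
# Create the output directory; all training and inference outputs go here.
mkdir -p xxx/SOC
```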

šŸš€ Training

From scratch

In this setting, we use only Video-Swin-T as the backbone for both training and evaluation.
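A hypothetical launch, assuming from-scratch training scripts live under ./scripts/ next to the demo script shown below (the script names are assumptions, not taken from the repository):

```bash
# Hypothetical script names; check ./scripts/ for the actual entry points.
bash ./scripts/train_a2d_from_scratch.sh   # A2D-Sentences, Video-Swin-T
bash ./scripts/train_ytb_from_scratch.sh   # Ref-YouTube-VOS, Video-Swin-T
```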

With Pretrain

We pretrain and then finetune on the A2D-Sentences and Ref-YouTube-VOS datasets using Video-Swin-Tiny and Video-Swin-Base backbones. Following previous work, we first pretrain on RefCOCO and then finetune on the target dataset.
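A hypothetical two-stage run matching the pretrain-then-finetune description above (script names are assumptions, not taken from the repository):

```bash
# Hypothetical script names; check ./scripts/ for the actual entry points.
bash ./scripts/pretrain_refcoco.sh   # stage 1: pretrain on RefCOCO
bash ./scripts/finetune_a2d.sh       # stage 2: finetune on A2D-Sentences
bash ./scripts/finetune_ytb.sh       # stage 2: finetune on Ref-YouTube-VOS
```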

Joint training

We perform joint training only for the Ref-YouTube-VOS dataset, with Video-Swin-Tiny and Video-Swin-Base backbones.
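A hypothetical launch for this setting (the script name is an assumption, not taken from the repository):

```bash
# Hypothetical script name; check ./scripts/ for the actual entry point.
bash ./scripts/train_ytb_joint.sh   # joint-training setting for Ref-YouTube-VOS
```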

Evaluation

Inference

We provide an interface for inference:

bash ./scripts/demo_video.sh

Acknowledgement

Code in this repository is built upon several public repositories. Thanks to ReferFormer and MTTR for their wonderful work.

Citations

If you find this work useful for your research, please cite:

@inproceedings{SOC,
  author       = {Zhuoyan Luo and
                  Yicheng Xiao and
                  Yong Liu and
                  Shuyan Li and
                  Yitong Wang and
                  Yansong Tang and
                  Xiu Li and
                  Yujiu Yang},
  title        = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
                  Segmentation},
  booktitle    = {NeurIPS},
  year         = {2023},
}