SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
Zhuoyan Luo*, Yicheng Xiao*, Yong Liu*, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
Tsinghua University Intelligent Interaction Group
<a href='https://arxiv.org/abs/2305.17011'><img src='https://img.shields.io/badge/ArXiv-2305.17011-red'></a>
Updates
- Jan. 1, 2024: We release the code for the ICCV 2023 workshop The 5th Large-scale Video Object Segmentation Challenge.
- Oct. 29, 2023: Code is released now.
- Sep. 22, 2023: Our paper is accepted by NeurIPS 2023!
Abstract
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct a well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.
Framework
<img src="asserts/framework.png" width="95%"/>Visualization Result
<img src="asserts/visualization.png" width="95%"/>Text expressions with temporal variations
(a) and (b) are segmentation results of our SOC and ReferFormer. For more details, please refer to <a href="https://arxiv.org/pdf/2305.17011.pdf">paper</a> <img src="asserts/temporal_comparison.png" width="95%">
Environment Setup
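The repository does not pin a Python version, so you may want to start from a clean environment first. The sketch below is optional; the Python version is an assumption rather than a requirement stated elsewhere in this README.

```bash
# Optional sketch: create an isolated conda environment first.
# Python 3.8 is an assumption; any version compatible with PyTorch 1.11 should work.
conda create -n soc python=3.8 -y
conda activate soc
```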
- Install PyTorch:

```bash
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

- Install other dependencies:

```bash
pip install h5py opencv-python protobuf av einops ruamel.yaml timm joblib pandas matplotlib cython scipy
```

- Install transformers and numpy:

```bash
pip install transformers==4.24.0
pip install numpy==1.23.5
```

- Install pycocotools:

```bash
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
```

- Build MultiScaleDeformableAttention (an import check is sketched after this list):

```bash
cd ./models/ops
python setup.py build install
```
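After the build finishes, you can quickly check that the compiled CUDA extension is importable. This is a minimal sketch assuming the standard Deformable-DETR ops package name suggested by the heading above; adjust the module name if your build differs.

```bash
# Sketch: verify the compiled extension can be imported (the module name is an
# assumption based on the standard Deformable-DETR ops build).
python -c "import MultiScaleDeformableAttention as MSDA; print('MultiScaleDeformableAttention imported OK')"
```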
Data Preparation
The overall data layout is shown below. We put rvosdata under /mnt/data_16TB/lzy23/rvosdata; please change it to xxx/rvosdata according to your own path. A sketch for linking existing dataset copies into this layout follows the directory tree.
```
rvosdata
├── a2d_sentences/
│   ├── Release/
│   │   ├── videoset.csv          (videos metadata file)
│   │   └── CLIPS320/
│   │       └── *.mp4             (video files)
│   └── text_annotations/
│       ├── a2d_annotation.txt    (actual text annotations)
│       ├── a2d_missed_videos.txt
│       └── a2d_annotation_with_instances/
│           └── */                (video folders)
│               └── *.h5          (annotation files)
├── refer_youtube_vos/
│   ├── train/
│   │   ├── JPEGImages/
│   │   │   └── */                (video folders)
│   │   │       └── *.jpg         (frame image files)
│   │   └── Annotations/
│   │       └── */                (video folders)
│   │           └── *.png         (mask annotation files)
│   ├── valid/
│   │   └── JPEGImages/
│   │       └── */                (video folders)
│   │           └── *.jpg         (frame image files)
│   └── meta_expressions/
│       ├── train/
│       │   └── meta_expressions.json   (text annotations)
│       └── valid/
│           └── meta_expressions.json   (text annotations)
└── coco/
    ├── train2014/
    ├── refcoco/
    │   ├── instances_refcoco_train.json
    │   └── instances_refcoco_val.json
    ├── refcoco+/
    │   ├── instances_refcoco+_train.json
    │   └── instances_refcoco+_val.json
    └── refcocog/
        ├── instances_refcocog_train.json
        └── instances_refcocog_val.json
```
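If the datasets already exist elsewhere on disk, one way to reproduce the layout above without copying data is to symlink the existing folders into xxx/rvosdata. This is only a sketch; the /path/to/... locations are placeholders for wherever your data actually lives.

```bash
# Sketch: assemble the expected rvosdata layout via symlinks.
# Replace xxx and the /path/to/... placeholders with your own locations.
RVOS_ROOT=xxx/rvosdata
mkdir -p "$RVOS_ROOT"
ln -s /path/to/a2d_sentences     "$RVOS_ROOT/a2d_sentences"
ln -s /path/to/refer_youtube_vos "$RVOS_ROOT/refer_youtube_vos"
ln -s /path/to/coco              "$RVOS_ROOT/coco"
```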
Pretrained Model
We create a folder to store all pretrained models and place it at /mnt/data_16TB/lzy23/pretrained; please change it to xxx/pretrained according to your own path.
```
pretrained
├── pretrained_swin_transformer
└── pretrained_roberta
```
- For the pretrained_swin_transformer folder, download Video-Swin-Base.
- For the pretrained_roberta folder, download config.json, pytorch_model.bin, tokenizer.json, and vocab.json from Hugging Face (roberta-base); a download sketch follows.
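For the RoBERTa files, one option is to fetch them directly from the Hugging Face hub. This sketch assumes the standard resolve/main URL pattern of the roberta-base repository; downloading the files from the model page in a browser works just as well.

```bash
# Sketch: download the roberta-base files listed above into the pretrained folder.
ROBERTA_DIR=xxx/pretrained/pretrained_roberta   # replace xxx with your own path
mkdir -p "$ROBERTA_DIR"
for f in config.json pytorch_model.bin tokenizer.json vocab.json; do
  wget -P "$ROBERTA_DIR" "https://huggingface.co/roberta-base/resolve/main/$f"
done
```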
Model Zoo
The checkpoints are as follows:
| Setting | Backbone | Checkpoint |
| --- | --- | --- |
| a2d_from_scratch | Video-Swin-T | Model |
| a2d_with_pretrain | Video-Swin-T | Model |
| a2d_with_pretrain | Video-Swin-B | Model |
| ytb_from_scratch | Video-Swin-T | Model |
| ytb_with_pretrain | Video-Swin-T | Model |
| ytb_with_pretrain | Video-Swin-B | Model |
| ytb_joint_train | Video-Swin-T | Model |
| ytb_joint_train | Video-Swin-B | Model |
Output Dir
We put all outputs under one directory. Specifically, we set /mnt/data_16TB/lzy23/SOC as the output directory; please change it to xxx/SOC. A sketch for replacing the hard-coded paths in one pass is shown below.
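Because this absolute path appears in several configs and scripts, one shortcut is to rewrite every occurrence at once and review the diff afterwards. This is only a sketch: it assumes the path string appears literally in those files and that GNU sed is available.

```bash
# Sketch: replace the hard-coded path with your own across configs, scripts, and dataset code.
grep -rl "/mnt/data_16TB/lzy23" ./configs ./scripts ./datasets 2>/dev/null \
  | xargs -r sed -i "s#/mnt/data_16TB/lzy23#/your/own/path#g"
```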
Training
From scratch
For training from scratch, we only use Video-Swin-T as the backbone to train and evaluate on each dataset.
- A2D-Sentences

  Run the script ./scripts/train_a2d.sh and make sure to change the path /mnt/data_16TB/lzy23 to your own path (the same applies to the following steps).

  ```bash
  bash ./scripts/train_a2d.sh
  ```

  The key parameters are as follows; change ./configs/a2d_sentences.yaml accordingly:

  | lr | backbone_lr | bs | GPU_num | Epoch | lr_drop |
  | --- | --- | --- | --- | --- | --- |
  | 5e-5 | 5e-6 | 2 | 2 | 40 | 15 (0.2) |

- Ref-Youtube-VOS

  Run the script ./scripts/train_ytb.sh.

  ```bash
  bash ./scripts/train_ytb.sh
  ```

  The main parameters are as follows; change ./configs/refer_youtube_vos.yaml according to this setting:

  | lr | backbone_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
  | --- | --- | --- | --- | --- | --- | --- | --- |
  | 1e-4 | 1e-5 | 1 | 65 | 8 | true | 20 (0.1) | 30 |

  Also change the dataset_path in ./datasets/refer_youtube_vos/refer_youtube_vos_dataset.py according to your own path.

A short sketch for selecting which GPUs the scripts use follows this list.
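GPU_num in the tables above is the number of GPUs used for training. If your machine exposes more GPUs than a setting needs, a common approach (a sketch, not part of the provided scripts) is to restrict visibility before launching:

```bash
# Sketch: limit training to the first two GPUs (GPU_num = 2 in the A2D setting above).
export CUDA_VISIBLE_DEVICES=0,1
bash ./scripts/train_a2d.sh
```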
With Pretrain
We pretrain and then finetune on the A2D-Sentences and Ref-Youtube-VOS datasets using Video-Swin-Tiny and Video-Swin-Base. Following previous work, we first pretrain on the RefCOCO dataset and then finetune.

- Pretrain

  The following are the key parameters for pretraining. When pretraining, please specify the corresponding backbone (Video-Swin-T or Video-Swin-B).

  | lr | backbone_lr | text_encoder_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | 1e-4 | 1e-5 | 5e-6 | 8 | 1 | 8 | False | 15, 20 (0.1) | 30 |

- Ref-Youtube-VOS

  We finetune the pretrained weights with the following key parameters:

  | lr | backbone_lr | text_encoder_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | 1e-4 | 1e-5 | 5e-6 | 8 | 1 | 8 | False | 10 (0.1) | 25 |

- A2D-Sentences

  We finetune the pretrained weights on A2D-Sentences with the following key parameters:

  | lr | backbone_lr | text_encoder_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | 3e-5 | 3e-6 | 1e-6 | 1 | 1 | 8 | true | - | 20 |
Joint training
We only perform joint training on the Ref-Youtube-VOS dataset with Video-Swin-Tiny and Video-Swin-Base.
- Ref-Youtube-VOS

  Run the script ./scripts/train_joint.sh. Remember to change the path and the backbone name before running.

  The main parameters (Tiny and Base) are as follows:

  | lr | backbone_lr | bs | num_class | GPU_num | freeze_text_encoder | lr_drop | Epoch |
  | --- | --- | --- | --- | --- | --- | --- | --- |
  | 1e-4 | 1e-5 | 1 | 1 | 8 | true | 20 (0.1) | 30 |
Evaluation
- A2D-Sentences

  Run the script ./scripts/eval_a2d.sh and remember to specify the checkpoint_path in the config file (a worked sketch follows this list).

- JHMDB-Sentences

  Please refer to Link to prepare the dataset and specify the checkpoint path in the yaml file. Following the previous setting, we directly use the checkpoint trained on A2D-Sentences for testing.

- Ref-Youtube-VOS

  ```bash
  bash ./scripts/infer_ref_ytb.sh
  ```

  Remember to specify the checkpoint_path and the video backbone name.

- Ref-DAVIS2017

  Please refer to Link to prepare the DAVIS dataset. We provide infer_davis.sh for evaluation. Remember to specify the checkpoint_path and the video backbone name.
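As a worked example of the A2D-Sentences evaluation step, the sketch below shows how a downloaded Model Zoo checkpoint might be wired in. The checkpoint filename and the exact config file are assumptions; checkpoint_path is the entry this README tells you to set.

```bash
# Hypothetical evaluation workflow; the .pth filename is a placeholder for a
# checkpoint downloaded from the Model Zoo table above.
# 1) Edit the eval config (e.g. ./configs/a2d_sentences.yaml) and set:
#      checkpoint_path: /your/own/path/SOC/a2d_with_pretrain_swin_b.pth
# 2) Run the evaluation script:
bash ./scripts/eval_a2d.sh
```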
Inference
We provide an interface for inference:

```bash
bash ./scripts/demo_video.sh
```
Acknowledgement
Code in this repository is built upon several public repositories. Thanks to ReferFormer and MTTR for their wonderful work.
Citations
If you find this work useful for your research, please cite:
```bibtex
@inproceedings{SOC,
  author    = {Zhuoyan Luo and
               Yicheng Xiao and
               Yong Liu and
               Shuyan Li and
               Yitong Wang and
               Yansong Tang and
               Xiu Li and
               Yujiu Yang},
  title     = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object Segmentation},
  booktitle = {NeurIPS},
  year      = {2023},
}
```