ICCV2023: The 5th Large-scale Video Object Segmentation Challenge

1st place solution for Track 3: Referring Video Object Segmentation.

Zhuoyan Luo<sup>*1</sup>, Yicheng Xiao<sup>*1</sup>, Yong Liu<sup>*12</sup>, Yitong Wang<sup>2</sup>, Yansong Tang<sup>1</sup>, Xiu Li<sup>1</sup>, Yujiu Yang<sup>1</sup>

<sup>1</sup> Tsinghua Shenzhen International Graduate School, Tsinghua University <sup>2</sup> ByteDance Inc.

<sup>*</sup> Equal Contribution

šŸ˜ŠšŸ˜ŠšŸ˜Š Paper

šŸ“¢ Updates:

šŸ“– Abstract

Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on their framework design as well as training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence via an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).

šŸ“— Framework

<p align="center"> <img src="assets/Framework.png" width="95%"/> </p>

šŸ› ļø Environment Setup

As we use different RVOS models, we need to set up two versions of the environment.

First Environment (for SOC, MUTR, Referformer, AOT, DEAOT)

Second Environment (for UNINEXT)
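A minimal sketch of keeping the two code bases isolated, assuming conda is used (the environment names and Python version below are illustrative, not mandated by the repositories):

```bash
# Illustrative only: one environment per code base, requirements installed per repo.
conda create -n rvos_main python=3.8 -y   # for SOC / MUTR / Referformer / AOT / DEAOT
conda create -n uninext python=3.8 -y     # for UNINEXT
conda activate rvos_main
pip install -r requirements.txt           # run inside the corresponding repository
```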

Data Preparation

The overall data layout is shown below. We put rvosdata under /mnt/data_16TB/lzy23/rvosdata; please change this to xxx/rvosdata.

rvosdata
ā””ā”€ā”€ refer_youtube_vos/ 
    ā”œā”€ā”€ train/
    ā”‚   ā”œā”€ā”€ JPEGImages/
    ā”‚   ā”‚   ā””ā”€ā”€ */ (video folders)
    ā”‚   ā”‚       ā””ā”€ā”€ *.jpg (frame image files) 
    ā”‚   ā””ā”€ā”€ Annotations/
    ā”‚       ā””ā”€ā”€ */ (video folders)
    ā”‚           ā””ā”€ā”€ *.png (mask annotation files) 
    ā”œā”€ā”€ valid/
    ā”‚   ā””ā”€ā”€ JPEGImages/
    ā”‚       ā””ā”€ā”€ */ (video folders)
    ā”‚           ā””ā”€ā”€ *.jpg (frame image files)
    ā”œā”€ā”€ test/
    ā”‚   ā””ā”€ā”€ JPEGImages/
    ā”‚       ā””ā”€ā”€ */ (video folders)
    ā”‚           ā””ā”€ā”€ *.jpg (frame image files) 
    ā””ā”€ā”€ meta_expressions/
        ā”œā”€ā”€ train/
        ā”‚   ā””ā”€ā”€ meta_expressions.json  (text annotations)
        ā””ā”€ā”€ valid/
            ā””ā”€ā”€ meta_expressions.json  (text annotations)
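If the dataset is already stored elsewhere on disk, one option (the source path below is a placeholder) is to symlink it into the expected layout:

```bash
# Placeholder source path: point it at your actual Ref-YouTube-VOS download.
mkdir -p xxx/rvosdata
ln -s /path/to/refer_youtube_vos xxx/rvosdata/refer_youtube_vos
```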

UNINEXT needs the extra valid.json and test.json files for inference; please refer to DATA.md/Ref-Youtube-VOS to generate them.

Pretrained Model Preparation

We create a folder for storing all pretrained models under /mnt/data_16TB/lzy23/pretrained; please change this to xxx/pretrained.

pretrained
ā”œā”€ā”€ pretrained_swin_transformer
ā”œā”€ā”€ pretrained_roberta
ā””ā”€ā”€ bert-base-uncased

For bert-base-uncased, download the following files:

wget -c https://huggingface.co/bert-base-uncased/resolve/main/config.json
wget -c https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
wget -c https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin
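Concretely, a small sketch that places those files under the xxx/pretrained root shown above:

```bash
# Fetch the bert-base-uncased files into the pretrained folder.
mkdir -p xxx/pretrained/bert-base-uncased
cd xxx/pretrained/bert-base-uncased
wget -c https://huggingface.co/bert-base-uncased/resolve/main/config.json
wget -c https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
wget -c https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin
```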

Alternatively, download them directly from the Hugging Face model hub.

Model_Zoo

The checkpoints we use are listed below; it is best to organize them so that each model (backbone) corresponds to a folder.

| Model | Backbone | Checkpoint |
| --- | --- | --- |
| SOC | Video-Swin-Base | Model |
| MUTR | Video-Swin-Base | Model |
| Referformer_ft | Video-Swin-Base | Model |
| UNINEXT | VIT-H | Model |
| UNINEXT | Convnext | Model |
| AOT | Swin-L | Model |
| DEAOT | Swin-L | Model |
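One possible on-disk arrangement following the one-folder-per-model suggestion above (the folder names here are assumptions for illustration, not required by the scripts):

```bash
# Illustrative folder names only; adjust them to whatever your config files reference.
mkdir -p xxx/checkpoints/{soc,mutr,referformer_ft,uninext_vith,uninext_convnext,aot,deaot}
```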

šŸš€ Training

We jointly train the SOC model.

Output_dir

Generally, we put all outputs under one directory. Specifically, we set /mnt/data_16TB/lzy23 as the output dir, so please change it to xxx/.

To jointly train SOC, run the script ./soc_test/train_joint.sh. Before that, you need to change the paths in the script to match your setup; the launch itself is shown below.
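After updating the paths, joint training is started with the script invocation:

```bash
# Joint training of SOC; edit the data/output paths inside the script first.
bash ./soc_test/train_joint.sh
```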

šŸš€ Testing

First, we need to run inference with the checkpoints mentioned above to obtain the Annotations.

SOC

Change the text_encoder path in ./soc_test/configs/refer_youtube_vos.yaml (line 77).
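Since the same /mnt/data_16TB/lzy23/ -> xxx/ substitution recurs across the configs and scripts, a hedged helper (assuming GNU sed; double-check the file afterwards) is:

```bash
# Replace the hard-coded prefix with your own root in the SOC config.
sed -i 's#/mnt/data_16TB/lzy23/#xxx/#g' ./soc_test/configs/refer_youtube_vos.yaml
```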

MUTR

Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./MUTR/models/mutr.py (line 127), then run:

python3 ./MUTR/ptf.py

Referformer

Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./Referformer/models/referformer.py (line 127), then run:

python3 ./Referformer/ptf.py

UNINEXT

For UNINEXT we adopt two different backbones as our RVOS models, so follow the steps below to get the Annotations and mask.pth. First, change the text encoder path (/mnt/data_16TB/lzy23/ -> xxx/).

  1. VIT-H
python3 ./UNINEXT/vit_ptf.py
  2. Convnext
python3 ./UNINEXT/convnext_ptf.py

After generating all Annotations, the results should be organized in the following format:

test
ā”œā”€ā”€ soc/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā””ā”€ā”€ key_frame.json
ā”œā”€ā”€ mutr/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā””ā”€ā”€ key_frame.json
ā”œā”€ā”€ referformer_ft/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā””ā”€ā”€ key_frame.json
ā”œā”€ā”€ vit-huge/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā””ā”€ā”€ key_frame.json
ā”œā”€ā”€ convnext/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā””ā”€ā”€ key_frame.json
ā””ā”€ā”€ model_pth/
    ā”œā”€ā”€ soc.pth
    ā”œā”€ā”€ mutr.pth
    ā”œā”€ā”€ referformer_ft.pth
    ā”œā”€ā”€ vit_huge.pth
    ā””ā”€ā”€ convnext.pth

Since the .pth files are quite large and hard to load into memory at once, run the following command to split them (change the paths on lines 5 and 6 of split_pth.py):

python3 split_pth.py

Post-Processing

We adopt AOT and DEAOT to post-process the mask results.

  1. AOT

First, change the model_pth path in the following scripts, then run them:

cd ./soc_test/AOT
bash eval_soc.sh
bash eval_mutr.sh
bash eval_referformer_ft.sh

If you have more GPU resources, you can change the variable gpunum in the .sh files.

  2. DEAOT

Change the model_pth path in the following scripts, then run them:

cd ./soc_test/DEAOT
bash eval_vith.sh
bash eval_convnext.sh
bash eval_referformer_ft.sh

First Round Ensemble

We first fuse three models. Make sure all the Annotations mentioned above have been generated, then run the commands below.

Remember to change the paths on lines 2 and 3 of test_swap_1.sh and test_swap_2.sh.

cd ./soc_test/scripts
bash test_swap_1.sh
bash test_swap_2.sh

Afterwards, we post-process with AOT and DEAOT respectively: run the script ./soc_test/AOT/eval_soc_mutr_referft.sh and the script ./soc_test/DEAOT/eval_vit_convext_soc.sh, as shown below.
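For reference, the two post-processing runs as commands, following the same bash invocation pattern used above:

```bash
bash ./soc_test/AOT/eval_soc_mutr_referft.sh     # AOT post-processing of the fused masks
bash ./soc_test/DEAOT/eval_vit_convext_soc.sh    # DEAOT post-processing of the fused masks
```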

Second Round Ensemble

Before doing the second ensemble, make sure the folder layout looks like the following:

test
ā”œā”€ā”€ soc/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā””ā”€ā”€ Annotations_AOT_class_index
ā”œā”€ā”€ mutr/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā””ā”€ā”€ Annotations_AOT_class_index
ā”œā”€ā”€ referformer_ft/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā”œā”€ā”€ Annotations_AOT_class_index
ā”‚   ā””ā”€ā”€ Annotations_DEAOT_class_index
ā”œā”€ā”€ vit-huge/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā””ā”€ā”€ Annotations_DEAOT_class_index
ā”œā”€ā”€ convnext/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā””ā”€ā”€ Annotations_DEAOT_class_index
ā”œā”€ā”€ soc_mutr_referft/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā””ā”€ā”€ Annotations_AOT_class_index
ā”œā”€ā”€ vit_convnext_soc/
ā”‚   ā”œā”€ā”€ Annotations
ā”‚   ā”œā”€ā”€ key_frame.json
ā”‚   ā””ā”€ā”€ Annotations_DEAOT_class_index
ā”œā”€ā”€ model_pth/
ā”‚   ā”œā”€ā”€ soc.pth
ā”‚   ā”œā”€ā”€ mutr.pth
ā”‚   ā”œā”€ā”€ referformer_ft.pth
ā”‚   ā”œā”€ā”€ vit_huge.pth
ā”‚   ā””ā”€ā”€ convnext.pth
ā””ā”€ā”€ model_split/
    ā”œā”€ā”€ soc/
    ā”‚   ā”œā”€ā”€ soc0.pth
    ā”‚   ā””ā”€ā”€ xxx
    ā”œā”€ā”€ mutr/
    ā”‚   ā”œā”€ā”€ mutr0.pth
    ā”‚   ā””ā”€ā”€ xxx
    ā”œā”€ā”€ referformer_ft/
    ā”‚   ā””ā”€ā”€ referformer_ft0.pth
    ā”œā”€ā”€ vit_huge/
    ā”‚   ā””ā”€ā”€ vit_huge0.pth
    ā””ā”€ā”€ convnext/
        ā””ā”€ā”€ convnext0.pth

We conduct two rounds of ensembling:

  1. Run the script ./soc_test/scripts/test_ensemble_1.sh (change the paths on lines 1-3 of the .sh file) to get the en2 Annotations.

  2. Run the script ./soc_test/scripts/test_ensemble_2.sh (also change the paths on lines 1-3) to get the final Annotations; the commands are shown below.
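As commands (script paths from the steps above; edit lines 1-3 of each .sh file first):

```bash
bash ./soc_test/scripts/test_ensemble_1.sh   # first round  -> en2 Annotations
bash ./soc_test/scripts/test_ensemble_2.sh   # second round -> final Annotations
```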

Finally, the Annotations in the second_ensemble folder named vit_convnext_soc_deaot_vitdeaot_en2_referftdeaot are the final submission.

The following table summarizes the Annotations mentioned above:

| Model | Annotations |
| --- | --- |
| SOC | Origin, AOT |
| MUTR | Origin, AOT |
| Referformer | Origin, AOT, DEAOT |
| Vit-Huge | Origin, DEAOT |
| Convnext | Origin, DEAOT |
| soc_mutr_referft | Origin, AOT |
| vit_convnext_soc | Origin, DEAOT |
| en2 | Annotations |
| Final | Annotations |

Acknowledgement

The code in this repository is built upon several public repositories. Thanks for their wonderful works.

If you find this work useful for your research, please cite:

@article{SOC,
  author       = {Zhuoyan Luo and
                  Yicheng Xiao and
                  Yong Liu and
                  Shuyan Li and
                  Yitong Wang and
                  Yansong Tang and
                  Xiu Li and
                  Yujiu Yang},
  title        = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
                  Segmentation},
  journal      = {CoRR},
  volume       = {abs/2305.17011},
  year         = {2023},
}