ICCV2023: The 5th Large-scale Video Object Segmentation Challenge
1st place solution for track three: Referring Video Object Segmentation Challenge.
Zhuoyan Luo<sup>*1</sup>, Yicheng Xiao<sup>*1</sup>, Yong Liu<sup>*12</sup>, Yitong Wang<sup>2</sup>, Yansong Tang<sup>1</sup>, Xiu Li<sup>1</sup>, Yujiu Yang<sup>1</sup>
<sup>1</sup> Tsinghua Shenzhen International Graduate School, Tsinghua University <sup>2</sup> ByteDance Inc.
<sup>*</sup> Equal Contribution
📄 Paper
📢 Updates:
- We release the code for the 5th Large-scale Video Object Segmentation Challenge.
📑 Abstract
Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR-style framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on framework design as well as training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence via an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
🖼️ Framework
<p align="center"> <img src="assets/Framework.png" width="95%"/> </p>

🛠️ Environment Setup
As we use different RVOS models, we need to set up two separate environments.
First Environment (for SOC, MUTR, Referformer, AOT, DEAOT)
- install pytorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
- install other dependencies
pip install h5py opencv-python protobuf av einops ruamel.yaml timm joblib pandas matplotlib cython scipy
- install transformers
pip install transformers
- install pycocotools
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
- install Pytorch Correlation (recommended to build from source instead of installing via pip)
- build up MultiScaleDeformableAttention
cd soc_test/models/ops
python setup.py build install
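After building, you can optionally sanity-check that both compiled extensions import correctly. The sketch below is not part of the repo; the module names are assumptions based on the upstream Deformable-DETR ops and Pytorch-Correlation-extension packages, so adjust them if your build differs.

```python
# Optional sanity check that the two compiled extensions import correctly.
# Module names are assumptions (upstream Deformable-DETR ops and
# Pytorch-Correlation-extension); adjust if your build differs.
import torch

try:
    import MultiScaleDeformableAttention  # built by soc_test/models/ops/setup.py
    print("MultiScaleDeformableAttention: OK")
except ImportError as err:
    print("MultiScaleDeformableAttention missing:", err)

try:
    from spatial_correlation_sampler import SpatialCorrelationSampler  # Pytorch Correlation
    print("spatial_correlation_sampler: OK, CUDA available:", torch.cuda.is_available())
except ImportError as err:
    print("spatial_correlation_sampler missing:", err)
```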
Second Environment (for UNINEXT)
- For the environment, please refer to INSTALL.md for more details
- Follow each step to build up the environment
Data Preparation
The overall data layout is shown below. We put rvosdata under the path /mnt/data_16TB/lzy23/rvosdata; please change it to xxx/rvosdata.
rvosdata
└── refer_youtube_vos/
    ├── train/
    │   ├── JPEGImages/
    │   │   ├── */ (video folders)
    │   │   └── *.jpg (frame image files)
    │   └── Annotations/
    │       ├── */ (video folders)
    │       └── *.png (mask annotation files)
    ├── valid/
    │   └── JPEGImages/
    │       ├── */ (video folders)
    │       └── *.jpg (frame image files)
    ├── test/
    │   └── JPEGImages/
    │       ├── */ (video folders)
    │       └── *.jpg (frame image files)
    └── meta_expressions/
        ├── train/
        │   └── meta_expressions.json (text annotations)
        └── valid/
            └── meta_expressions.json (text annotations)
UNINEXT additionally needs valid.json and test.json for inference; please refer to DATA.md (Ref-Youtube-VOS section) for how to generate them.
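Before moving on, it can help to verify the layout. Below is an optional, minimal sketch (not part of the repo) that checks the expected folders and counts the videos listed in each meta_expressions.json; DATA_ROOT is a placeholder for your xxx/rvosdata path, and the top-level "videos" key follows the standard Ref-Youtube-VOS meta format.

```python
# Optional layout check (not part of the repo). DATA_ROOT is a placeholder.
import json
from pathlib import Path

DATA_ROOT = Path("xxx/rvosdata/refer_youtube_vos")

for split in ["train", "valid", "test"]:
    assert (DATA_ROOT / split / "JPEGImages").is_dir(), f"missing {split}/JPEGImages"
assert (DATA_ROOT / "train" / "Annotations").is_dir(), "missing train/Annotations"

for split in ["train", "valid"]:
    meta = DATA_ROOT / "meta_expressions" / split / "meta_expressions.json"
    with open(meta) as f:
        videos = json.load(f)["videos"]  # standard Ref-Youtube-VOS meta format
    print(split, "videos with expressions:", len(videos))
```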
Pretrained Model Preparation
We create a folder for storing all pretrained models and put them under /mnt/data_16TB/lzy23/pretrained; please change it to xxx/pretrained.
pretrained
├── pretrained_swin_transformer
├── pretrained_roberta
└── bert-base-uncased
- for the pretrained_swin_transformer folder, download Video-Swin-Base
- for the pretrained_roberta folder, download config.json, pytorch_model.bin, tokenizer.json, and vocab.json from Hugging Face (roberta-base)
- for the bert-base-uncased folder, run:
wget -c https://huggingface.co/bert-base-uncased/resolve/main/config.json
wget -c https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
wget -c https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin
or download the files from Hugging Face manually; a small loading sanity check is sketched below.
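As a quick check that the local folders are complete, you can try loading them with transformers, which is roughly how the RVOS models consume them when text_encoder_type points at these directories. The paths below are placeholders, and this snippet is illustrative rather than code from the repo.

```python
# Illustrative check (placeholder paths): load the local text-encoder folders
# with transformers to confirm the downloaded files are complete.
from transformers import BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast

roberta_dir = "xxx/pretrained/pretrained_roberta"
bert_dir = "xxx/pretrained/bert-base-uncased"

roberta_tok = RobertaTokenizerFast.from_pretrained(roberta_dir)
roberta = RobertaModel.from_pretrained(roberta_dir)
bert_tok = BertTokenizer.from_pretrained(bert_dir)
bert = BertModel.from_pretrained(bert_dir)

tokens = roberta_tok("a person riding a surfboard", return_tensors="pt")
print(roberta(**tokens).last_hidden_state.shape)  # (1, seq_len, 768) for roberta-base
```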
Model Zoo
The checkpoints we use are listed below; it is best to organize them so that each model (backbone) corresponds to its own folder.
Model | Backbone | Checkpoint |
---|---|---|
SOC | Video-Swin-Base | Model |
MUTR | Video-Swin-Base | Model |
Referformer_ft | Video-Swin-Base | Model |
UNINEXT | VIT-H | Model |
UNINEXT | Convnext | Model |
AOT | Swin-L | Model |
DEAOT | Swin-L | Model |
🚀 Training
We jointly train the SOC model.
Output_dir
Generally, we put all outputs under one directory. Specifically, we set /mnt/data_16TB/lzy23 as the output dir, so please change it to xxx/.
If you want to jointly train SOC, run the script ./soc_test/train_joint.sh. Before that, you need to change the following paths according to your setup (a small path-rewriting sketch is shown after this list):
- ./soc_test/configs/refer_youtube.yaml (file)
- text_encoder_type (change /mnt/data_16TB/lzy23 to xxx); the same applies to the paths below
- ./soc_test/datasets/refer_youtube_vos/
- dataset_path (variable name)
- line 164
- ./soc_test/utils.py
- line 23
- ./soc_test/train_joint.sh
- line 3
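As a convenience, here is a hedged sketch (not part of the repo) that bulk-rewrites the old /mnt/data_16TB/lzy23 prefix in a YAML config using ruamel.yaml, which is already in the dependency list above. It only touches string values that start with the old prefix; review the result before training.

```python
# Hedged convenience sketch (not part of the repo): bulk-replace the old output
# prefix in a config with ruamel.yaml (already in the dependency list above).
from ruamel.yaml import YAML

OLD_PREFIX = "/mnt/data_16TB/lzy23"
NEW_PREFIX = "xxx"  # your own root
cfg_path = "./soc_test/configs/refer_youtube.yaml"

yaml = YAML()
yaml.preserve_quotes = True
with open(cfg_path) as f:
    cfg = yaml.load(f)

def replace_prefix(node):
    """Recursively rewrite any string value that starts with the old prefix."""
    if isinstance(node, dict):
        return {k: replace_prefix(v) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_prefix(v) for v in node]
    if isinstance(node, str) and node.startswith(OLD_PREFIX):
        return NEW_PREFIX + node[len(OLD_PREFIX):]
    return node

with open(cfg_path, "w") as f:
    yaml.dump(replace_prefix(cfg), f)
```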
🧪 Testing
First, we need to run inference with the checkpoints mentioned above to get the Annotations.
SOC
Change the text_encoder path in ./soc_test/configs/refer_youtube_vos.yaml line 77.
- run the script ./soc_test/scripts/infer_refytb.sh to get the Annotations and key_frame.json; the following paths need to be changed.
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./soc_test/infer_refytb.py
- line 56 68
- ./soc_test/scripts/infer_refytb.sh
- line 3 4
- run the script ./soc_test/scripts/infer_ensemble_test.sh to get masks.pth for the following ensemble
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./soc_test/infer_refyrb_ensemble.py
- line 46 54
- ./soc_test/scripts/infer_ensemble_test.sh
- line 2 3
MUTR
Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./MUTR/models/mutr.py line 127.
- run the script ./MUTR/inference_ytvos.sh to obtain the Annotations
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./MUTR/inference_ytvos.sh
- line 4 5 6
- run the script ./MUTR/infer_ytvos_ensemble.sh to obtain mask.pth
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./MUTR/infer_ytvos_ensemble.sh
- line 4 5 6
Then run the command below to generate key_frame.json (change the paths in ptf.py lines 7, 9, and 10); an illustrative key-frame selection sketch follows the command.
python3 ./MUTR/ptf.py
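For reference, key_frame.json presumably records which frame serves as the reference frame for the later AOT/DEAOT propagation. The exact selection logic lives in ptf.py; the snippet below is only an illustrative sketch under the assumption that the key frame is the frame with the largest predicted foreground area and that Annotations is laid out as Annotations/&lt;video&gt;/&lt;expression&gt;/&lt;frame&gt;.png. The output format here is likewise an assumption.

```python
# Illustrative sketch only (NOT the repo's ptf.py): for every (video, expression)
# mask folder, pick the frame whose binary mask has the largest foreground area
# and store it as the key frame. Paths and output format are assumptions.
import json
from pathlib import Path

import numpy as np
from PIL import Image

ANN_ROOT = Path("xxx/test/mutr/Annotations")  # placeholder path
key_frames = {}

for video_dir in sorted(p for p in ANN_ROOT.iterdir() if p.is_dir()):
    for exp_dir in sorted(p for p in video_dir.iterdir() if p.is_dir()):
        areas = {
            png.stem: int((np.array(Image.open(png)) > 0).sum())
            for png in sorted(exp_dir.glob("*.png"))
        }
        if areas:
            key_frames.setdefault(video_dir.name, {})[exp_dir.name] = max(areas, key=areas.get)

with open("key_frame.json", "w") as f:
    json.dump(key_frames, f)
```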
Referformer
Before starting, change the text_encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in ./Referformer/models/referformer.py line 127.
- run the script ./Referformer/infer_ytvos.sh to obtain the Annotations
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./Referformer/inference_ytvos.py
- line 59
- ./Referformer/infer_ytvos.sh
- line 3 4
- run the script ./Referformer/scripts/ensemble_for_test.sh
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./Referformer/ensemble_for_test.sh
- line 5 9 10
Then run the command below to generate key_frame.json (change the paths in ptf.py lines 7, 9, and 10):
python3 ./Referformer/ptf.py
UNINEXT
We adopt two different backbones as our RVOS models, so follow the steps below to get the Annotations and mask.pth. First, change the text encoder path (/mnt/data_16TB/lzy23/ -> xxx/) in:
- ./UNINEXT/projects/UNINEXT/uninext/models/deformable_detr/bert_model.py line 17 19
- ./UNINEXT/projects/UNINEXT/uninext/data/dataset_mapper_ytbvis.py line 172
- ./UNINEXT/projects/UNINEXT/uninext/uninext_vid.py line 151
Second, change the image_root and annotations_path in ./UNINEXT/projects/UNINEXT/uninext/data/datasets/ytvis.py lines 382 383.
- VIT-H
- run the script ./UNINEXT/assets/infer_huge_rvos.sh
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./UNINEXT/projects/UNINEXT/configs/video_joint_vit_huge.yaml
- line 4 51
- ./UNINEXT/detectron2/evaluation/evaluator.py
- line 209 (save_path)
Then run the command below to generate key_frame.json (change the paths in vit_ptf.py lines 7, 9, and 10):
python3 ./UNINEXT/vit_ptf.py
- Convnext
- run the script ./UNINEXT/assets/infer_huge_rvos.sh
- Path Changed (/mnt/data_16TB/lzy23/ -> xxx/)
- ./UNINEXT/projects/UNINEXT/configs/video_joint_convnext_large.yaml
- line 4 51
- ./UNINEXT/detectron2/evaluation/evaluator.py
- make sure that you change /mnt/data_16TB/lzy23/test/model_pth/vit_huge.pth to xxx/test/model_pth/convnext.pth
Then run the command below to generate key_frame.json (change the paths in convnext_ptf.py lines 7, 9, and 10):
python3 ./UNINEXT/convnext_ptf.py
After generating all Annotations, the results should be organized in the following format:
test
├── soc/
│   ├── Annotations
│   └── key_frame.json
├── mutr/
│   ├── Annotations
│   └── key_frame.json
├── referformer_ft/
│   ├── Annotations
│   └── key_frame.json
├── vit-huge/
│   ├── Annotations
│   └── key_frame.json
├── convnext/
│   ├── Annotations
│   └── key_frame.json
└── model_pth/
    ├── soc.pth
    ├── mutr.pth
    ├── referformer_ft.pth
    ├── vit_huge.pth
    └── convnext.pth
Then, since the .pth files are quite large and hard to load into memory all at once, run the following command to generate the split .pth files (change the paths in lines 5 and 6); a conceptual sketch follows the command.
python3 split_pth.py
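The actual splitting is done by split_pth.py; conceptually it amounts to something like the hedged sketch below, which assumes each model's .pth stores a plain dict (e.g. keyed by video/expression) and saves it in fixed-size shards. File names and chunk size here are placeholders.

```python
# Hedged sketch of splitting a large mask .pth into shards; the real logic
# lives in split_pth.py. Assumes the file stores a plain dict.
import torch

SRC = "xxx/test/model_pth/soc.pth"                  # placeholder input
DST_PATTERN = "xxx/test/model_split/soc/soc{}.pth"  # placeholder output pattern
CHUNK = 50  # entries per shard

data = torch.load(SRC, map_location="cpu")
keys = list(data.keys())
for shard, start in enumerate(range(0, len(keys), CHUNK)):
    part = {k: data[k] for k in keys[start:start + CHUNK]}
    torch.save(part, DST_PATTERN.format(shard))
print("wrote", (len(keys) + CHUNK - 1) // CHUNK, "shards")
```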
Post-Processing
We adopt AOT and DEAOT to post-process the mask results.
1. AOT
First, change the model_pth path in
- ./rvos_competition/soc_test/AOT/configs/default.py lines 88, 112, 128, 129
Then run the following commands:
cd ./soc_test/AOT
bash eval_soc.sh
bash eval_mutr.sh
bash eval_referformer_ft.sh
If you have more GPU resources, you can change the variable gpunum in the sh files.
2. DEAOT
Change the model_pth path in
- ./rvos_competition/soc_test/DEAOT/configs/default.py lines 88, 112, 128, 129
Then run the following commands:
cd ./soc_test/DEAOT
bash eval_vith.sh
bash eval_convnext.sh
bash eval_referformer_ft.sh
First Round Ensemble
We first fuse three models. Remember to generate all the Annotations mentioned above, then run the commands below.
Remember to change the paths in the sh files test_swap_1.sh and test_swap_2.sh (lines 2 and 3).
cd ./soc_test/scripts
bash test_swap_1.sh
bash test_swap_2.sh
Afterwards, we use AOT and DEAOT to post-process the fused results respectively: run the script ./soc_test/AOT/eval_soc_mutr_referft.sh and the script ./soc_test/DEAOT/eval_vit_convext_soc.sh.
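The fusion weights and exact merging rules are implemented in the ensemble scripts above; as a rough illustration of the underlying idea, a per-pixel weighted vote over aligned mask probabilities from several models could look like the sketch below (shapes, weights, and threshold are assumptions, not the repo's settings).

```python
# Rough illustration of multi-model mask fusion (not the repo's exact scheme):
# take a weighted average of aligned per-frame mask probabilities and threshold.
import torch

def fuse_masks(prob_maps, weights=None, threshold=0.5):
    """prob_maps: list of float tensors of shape (T, H, W) in [0, 1],
    one per model, already aligned to the same frames and resolution."""
    stacked = torch.stack(prob_maps)                 # (M, T, H, W)
    if weights is None:
        weights = torch.ones(len(prob_maps))
    weights = weights.view(-1, 1, 1, 1) / weights.sum()
    fused = (stacked * weights).sum(dim=0)           # weighted per-pixel average
    return (fused > threshold).to(torch.uint8)       # binary mask sequence

# toy usage with random "probabilities" from three models
masks = fuse_masks([torch.rand(4, 64, 64) for _ in range(3)])
print(masks.shape, masks.dtype)  # torch.Size([4, 64, 64]) torch.uint8
```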
Second Round Ensemble
Before doing the second-round ensemble, first make sure the directory layout looks like this:
test
├── soc/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── mutr/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── referformer_ft/
│   ├── Annotations
│   ├── key_frame.json
│   ├── Annotations_AOT_class_index
│   └── Annotations_DEAOT_class_index
├── vit-huge/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── convnext/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── soc_mutr_referft/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_AOT_class_index
├── vit_convnext_soc/
│   ├── Annotations
│   ├── key_frame.json
│   └── Annotations_DEAOT_class_index
├── model_pth/
│   ├── soc.pth
│   ├── mutr.pth
│   ├── referformer_ft.pth
│   ├── vit_huge.pth
│   └── convnext.pth
└── model_split/
    ├── soc
    │   ├── soc0.pth
    │   └── xxx
    ├── mutr
    │   ├── mutr0.pth
    │   └── xxx
    ├── referformer_ft.pth
    │   └── referformer_ft0.pth
    ├── vit_huge.pth
    │   └── vit_huge0.pth
    └── convnext.pth
        └── convnext0.pth
We conduct two rounds of ensemble:
- run the script ./soc_test/scripts/test_ensemble_1.sh, changing the paths in the sh file (lines 1, 2, 3), to get the en2 Annotations.
- run the script ./soc_test/scripts/test_ensemble_2.sh, also changing the paths in the sh file (lines 1, 2, 3), to get the final Annotations.
Finally, the Annotations in the second_ensemble folder named vit_convnext_soc_deaot_vitdeaot_en2_referftdeaot are the submission.
The following table lists the Annotations mentioned above:
Model | Annotations |
---|---|
SOC | Origin, AOT |
MUTR | Origin, AOT |
Referformer | Origin, AOT, DEAOT |
Vit-Huge | Origin, DEAOT |
Convnext | Origin, DEAOT |
soc_mutr_referft | Origin, AOT |
vit_convnext_soc | Origin, DEAOT |
en2 | Annotations |
Final | Annotations |
Acknowledgement
Code in this repository is built upon several public repositories. Thanks for their wonderful work.
If you find this work useful for your research, please cite:
@article{SOC,
author = {Zhuoyan Luo and
Yicheng Xiao and
Yong Liu and
Shuyan Li and
Yitong Wang and
Yansong Tang and
Xiu Li and
Yujiu Yang},
title = {{SOC:} Semantic-Assisted Object Cluster for Referring Video Object
Segmentation},
journal = {CoRR},
volume = {abs/2305.17011},
year = {2023},
}