Introduction

Recurrent Dynamic Embedding for Video Object Segmentation [CVPR 2022]

Install

If you just want to run our method, refer to Requirements.
If you want to evaluate our method on the DAVIS 2017 validation set, refer to Requirements.

Model zoo

You can download the pretrained models from Google Drive.

The predictions of our method can be downloaded from Google Drive.

Dataset

Following STCN, we train the network in three stages. First, we train the network on a static-image dataset, which can be downloaded with download_datasets.py. Then we fine-tune the network with SAM on the BL30K dataset, which can be downloaded with download_bl30k.py. Note that BL30K is a large dataset introduced by MiVOS and is 700 GB in total. Finally, we fine-tune the network with SAM on the mixed dataset (DAVIS 2017 and YouTube-VOS 2019).

This layout may not look straightforward, but you can just download DAVIS 2017 and get a quick start right away.

├── BL30K
├── DAVIS
│   ├── 2016
│   │   ├── Annotations
│   │   └── ...
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── static
│   ├── BIG_small
│   └── ...
├── YouTube
│   ├── all_frames
│   │   └── valid_all_frames
│   ├── train
│   ├── train_480p
│   └── valid
└── YouTube2018
    ├── all_frames
    │   └── valid_all_frames
    └── valid
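As a quick sanity check before training, a small helper (hypothetical, not part of this repo) can confirm the expected layout; for a quick start only the DAVIS 2017 trainval folders are strictly needed (JPEGImages is the standard DAVIS image folder, elided as "..." in the tree above):

```python
import os

# Sub-directories a quick start needs under the dataset root.
REQUIRED = [
    "DAVIS/2017/trainval/Annotations",
    "DAVIS/2017/trainval/JPEGImages",
]

def check_layout(root):
    """Return the required sub-directories missing under `root`."""
    return [p for p in REQUIRED if not os.path.isdir(os.path.join(root, p))]

# Example: report anything missing under an illustrative root path.
missing = check_layout("/path/to/datasets")
if missing:
    print("Missing:", ", ".join(missing))
```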

Quick start

Take inference on the DAVIS 2017 validation set as an example. The inference command is as follows:

python eval_davis.py --output ... --davis_path ... --model ... --mode two-frames-compress --mem_every ... --top ... --amp

For example, you can use the following settings:

python eval_davis.py --output prediction/s012 --davis_path ... --model pretrain/model_s012_final.pth --mode two-frames-compress --mem_every 3 --amp
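For intuition, --mem_every controls how often a frame is written into the memory bank during inference. A minimal sketch of that sampling (illustrative only, not the repo's actual memory logic):

```python
def memory_frames(num_frames, mem_every):
    """Indices of frames written to the memory bank, assuming a new
    memory entry every `mem_every` frames; frame 0, the annotated
    first frame, is always kept."""
    return [t for t in range(num_frames) if t % mem_every == 0]

memory_frames(10, 3)  # with --mem_every 3: frames 0, 3, 6, 9
```

A smaller --mem_every stores memory more densely (more accurate, more compute); a larger value is cheaper but keeps fewer references.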

Quick evaluation

Take the evaluation on the DAVIS 2017 validation set as an example.

We modified this repo to evaluate our method.

python evaluation/2017/evaluation_ours.py --results_path ... --davis_path ...
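The DAVIS benchmark reports J (region similarity), F (boundary accuracy), and their mean J&F. Assuming per-sequence means are already computed, the headline number reduces to a simple average:

```python
def jf_mean(j_scores, f_scores):
    """DAVIS-style summary: mean J, mean F, and their average J&F."""
    j = sum(j_scores) / len(j_scores)
    f = sum(f_scores) / len(f_scores)
    return j, f, (j + f) / 2

# Illustrative per-sequence J and F means.
j, f, jf = jf_mean([0.808], [0.875])
```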

Results

Without BL30K

| Dataset | Split | J&F | J | F | FPS |
| :--- | :--- | :--- | :--- | :--- | :--- |
| DAVIS 2016 | validation | 91.1 | 89.7 | 92.5 | 35.0 |
| DAVIS 2017 | validation | 84.2 | 80.8 | 87.5 | 27.0 |
| DAVIS 2017 | test-dev | 77.4 | 73.6 | 81.2 | - |

| Dataset | Split | G | J-Seen | F-Seen | J-Unseen | F-Unseen |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| YouTube 2019 | validation | 81.9 | 81.1 | 85.5 | 76.2 | 84.8 |

With BL30K

| Dataset | Split | J&F | J | F | FPS |
| :--- | :--- | :--- | :--- | :--- | :--- |
| DAVIS 2016 | validation | 91.6 | 90.0 | 93.2 | 35.0 |
| DAVIS 2017 | validation | 86.1 | 82.1 | 90.0 | 27.0 |
| DAVIS 2017 | test-dev | 78.9 | 74.9 | 82.9 | - |

| Dataset | Split | G | J-Seen | F-Seen | J-Unseen | F-Unseen |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| YouTube 2019 | validation | 83.3 | 81.9 | 86.3 | 78.0 | 86.9 |

Inference

With one GPU, you can run inference on these datasets as follows:

python eval_davis.py --output prediction/DAVIS-2017-val --davis_path ... --model pretrain/model_s012_final.pth --mode two-frames-compress --mem_every 3 --amp
python eval_davis.py --output prediction/DAVIS-2017-test --davis_path ... --model pretrain/model_s012_final.pth --mode two-frames-compress --mem_every 3 --top 40 --split testdev --amp
python eval_davis_2016.py --output prediction/DAVIS-2016-val --davis_path ... --model pretrain/model_s012_final.pth --mode two-frames-compress --mem_every 3 --top 40 --amp
python eval_youtube.py --output prediction/YV-19-val --yv_path ... --model pretrain/model_s012_final_yv.pth --mode two-frames-compress --mem_every 4 --top 20 --amp

Training

First, configure the dataset paths in util/hyper_para.py, i.e. --static_root, --bl_root, --yv_root, and --davis_root.
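These four options behave like ordinary argparse flags. A minimal sketch of how they might be declared (the default values shown here are assumptions; check util/hyper_para.py for the actual ones):

```python
import argparse

# Illustrative sketch of the dataset-path options in util/hyper_para.py.
parser = argparse.ArgumentParser()
parser.add_argument('--static_root', default='static', help='static image dataset root')
parser.add_argument('--bl_root', default='BL30K', help='BL30K dataset root')
parser.add_argument('--yv_root', default='YouTube', help='YouTube-VOS 2019 root')
parser.add_argument('--davis_root', default='DAVIS', help='DAVIS 2016/2017 root')

# Paths can also be overridden on the command line instead of edited in place.
args = parser.parse_args(['--davis_root', '/data/DAVIS'])
```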

stage 0

cd rootdir &&\
OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9843 \
--nproc_per_node=4 \
train.py --id s0 \
--stage 0 \
--perturb_max 1 \
--perturb_min 0.85 \
--save_interval 10000 \
--klloss_weight 10 \
--start_warm 5000 \
--end_warm 17500 \
--batch_size 16 \
--lr 2e-05 \
--steps 37500 \
--iterations 75000 \
--repeat 0
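--start_warm and --end_warm control the warm-up of the bootstrapped cross-entropy loss used in STCN-style training: before start_warm every pixel contributes to the loss, and the kept fraction of hardest pixels then anneals linearly down to a floor by end_warm. A sketch of that schedule (the 15% floor is an assumption borrowed from STCN, not a value confirmed for this repo):

```python
def top_p(it, start_warm=5000, end_warm=17500, floor=0.15):
    """Fraction of hardest pixels kept by bootstrapped CE at iteration `it`:
    1.0 before start_warm, linearly annealed to `floor` at end_warm."""
    if it < start_warm:
        return 1.0
    if it > end_warm:
        return floor
    return 1.0 - (1.0 - floor) * (it - start_warm) / (end_warm - start_warm)
```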

stage 0 -> 3 (w/o BL30K)

cd rootdir &&\
OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9844 \
--nproc_per_node=2 \
train.py --id s03 \
--stage 3 \
--load_network pretrain/s0/model_75000.pth \
--perturb_max 1 \
--perturb_min 0.85 \
--save_interval 10000 \
--klloss_weight 10 \
--batch_size 4 \
--lr 2e-05 \
--steps 125000 \
--iterations 150000 \
--repeat 0

stage 1

cd rootdir &&\
OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9843 \
--nproc_per_node=2 \
train.py --id s1 \
--stage 1 \
--load_network pretrain/s0/model_75000.pth \
--perturb_max 1 \
--perturb_min 0.85 \
--save_interval 10000 \
--klloss_weight 10 \
--start_warm 20000 \
--end_warm 70000 \
--batch_size 4 \
--lr 1e-05 \
--steps 400000 \
--iterations 500000 \
--repeat 0
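--steps here lists the milestone(s) of a step learning-rate schedule applied on top of --lr. A sketch of that decay, assuming (as in STCN) the learning rate is multiplied by 0.1 at each milestone; the decay factor is an assumption, not a value confirmed for this repo:

```python
def lr_at(it, base_lr=1e-5, steps=(400000,), gamma=0.1):
    """Illustrative step schedule: multiply the learning rate by `gamma`
    once the iteration count passes each milestone in `steps`."""
    lr = base_lr
    for s in steps:
        if it >= s:
            lr *= gamma
    return lr
```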

stage 2

cd rootdir &&\
OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9843 \
--nproc_per_node=2 \
train.py --id s2 \
--stage 2 \
--load_network pretrain/s1/model_500000.pth \
--perturb_max 1 \
--perturb_min 0.85 \
--save_interval 10000 \
--klloss_weight 5 \
--decoder_f2_weight 5 \
--decoder_f4_weight 5 \
--start_warm 5000 \
--end_warm 17500 \
--batch_size 8 \
--lr 2e-05 \
--steps 62500 \
--iterations 75000 \
--repeat 0

Note: since I was temporarily laid off during my internship at Alibaba, there is some uncertainty about the installation environment and the exact code version used for the paper. I tried to reproduce the previous parameters with this version and obtained 0.857 on DAVIS 2017 val (0.861 in the original paper) and 0.792 on DAVIS 2017 test-dev (0.789 in the original paper).

The parameters changed relative to the paper:

klloss_weight = 10 (paper) -> 5 (now)
decoder_f2_weight = 10 (paper) -> 5 (now)
decoder_f4_weight = 10 (paper) -> 5 (now)

Acknowledgement

This project is built upon numerous previous projects. We'd like to thank the contributors of STCN and MiVOS.

To do