
Full-Duplex Strategy for Video Object Segmentation (ICCV 2021)

<img align="right" src="./assets/quantitative.png" width="350px" />

Authors: Ge-Peng Ji, Keren Fu, Zhe Wu, Deng-Ping Fan*, Jianbing Shen, & Ling Shao

1. News

2. Introduction

Why?

Appearance and motion are two important sources of information in video object segmentation (VOS). Previous methods mainly rely on simplex (one-way) solutions, which lowers the upper bound of feature collaboration between these two cues.

<p align="center"> <img src="./assets/motivation.jpg" width="450px"/> <br /> <em> Figure 1: Visual comparison between the simplex (i.e., (a) appearance-refined motion and (b) motion-refined appearance) and our full-duplex strategy. In contrast, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of the full-duplex strategy, thus providing more accurate structural details and alleviating the short-term feature-drifting issue. </em> </p>

What?

In this paper, we propose a novel framework, termed FSNet (Full-duplex Strategy Network), which introduces a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding subspaces. Furthermore, a bidirectional purification module (BPM) is introduced to update inconsistent features between the spatial and temporal embeddings, effectively improving the model's robustness.

<p align="center"> <img src="./assets/framework.jpg" /> <br /> <em> Figure 2: The pipeline of our FSNet. The Relational Cross-Attention Module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then four Bidirectional Purification Modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction. </em> </p>
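For intuition only, here is a minimal PyTorch sketch of a bidirectional (full-duplex) cross-modal attention step. The gating form, channel width, and class name below are assumptions for illustration, not the official RCAM implementation.

```python
import torch
import torch.nn as nn

class RelationalCrossAttentionSketch(nn.Module):
    """Hedged sketch of full-duplex cross-modal attention (not the official RCAM).

    Each modality simultaneously transmits a gate to, and receives a gate from,
    the other modality, so messages flow in both directions at once.
    """
    def __init__(self, channels: int = 64):
        super().__init__()
        self.app_gate = nn.Conv2d(channels, channels, kernel_size=1)  # appearance -> motion
        self.mot_gate = nn.Conv2d(channels, channels, kernel_size=1)  # motion -> appearance

    def forward(self, app: torch.Tensor, mot: torch.Tensor):
        app_out = app + app * torch.sigmoid(self.mot_gate(mot))  # appearance receives from motion
        mot_out = mot + mot * torch.sigmoid(self.app_gate(app))  # motion receives from appearance
        return app_out, mot_out

# Toy usage: two (B, C, H, W) feature maps from the appearance and motion streams.
app, mot = torch.randn(2, 64, 44, 44), torch.randn(2, 64, 44, 44)
app_out, mot_out = RelationalCrossAttentionSketch(64)(app, mot)
```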

How?

By considering the mutual restraint within the full-duplex strategy, our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage, making it robust to various challenging scenarios (e.g., motion blur, occlusion) in VOS. Extensive experiments on five popular benchmarks (i.e., DAVIS16, FBMS, MCL, SegTrack-V2, and DAVSOD19) show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
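Reading Figure 2 literally, the forward pass orders one RCAM, a stack of four BPMs, and a decoder. The sketch below shows only that ordering; all module internals are placeholders (everything except the RCAM -> 4x BPM -> decoder sequence is assumed).

```python
import torch
import torch.nn as nn

class BPMSketch(nn.Module):
    """Placeholder bidirectional purification step (internals are assumed)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, app, mot):
        shared = self.fuse(torch.cat([app, mot], dim=1))  # joint spatial-temporal view
        return app + shared, mot + shared                 # residual re-calibration of both streams

class PipelineSketch(nn.Module):
    """Only the RCAM -> 4x BPM -> decoder ordering is taken from Figure 2."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.rcam = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)  # stand-in for RCAM
        self.bpms = nn.ModuleList(BPMSketch(channels) for _ in range(4))
        self.decoder = nn.Conv2d(2 * channels, 1, kernel_size=1)          # stand-in for decoder

    def forward(self, app, mot):
        app, mot = torch.chunk(self.rcam(torch.cat([app, mot], dim=1)), 2, dim=1)
        for bpm in self.bpms:  # four stacked purification modules
            app, mot = bpm(app, mot)
        return self.decoder(torch.cat([app, mot], dim=1))  # single-channel mask logits

pred = PipelineSketch()(torch.randn(1, 64, 44, 44), torch.randn(1, 64, 44, 44))
```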

<p align="center"> <img src="./assets/qualitative.png" /> <br /> <em> Figure 3: Qualitative results on five datasets, including DAVIS16, MCL, FBMS, SegTrack-V2, and DAVSOD19. </em> </p>

3. Usage

How to run inference?
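The repository's actual entry point is not reproduced here. As a placeholder only, a per-frame inference helper could look like the Python sketch below; the two-stream call signature, the 352x352 input size, and every name in it are assumptions.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Hypothetical import -- replace with the real model definition from this repo.
# from lib.fsnet import FSNet

transform = T.Compose([T.Resize((352, 352)), T.ToTensor()])  # input size is an assumption

@torch.no_grad()
def infer_frame(model: torch.nn.Module, rgb_path: str, flow_path: str) -> torch.Tensor:
    """Push one appearance frame plus its optical-flow image through the model."""
    rgb = transform(Image.open(rgb_path).convert("RGB")).unsqueeze(0)    # (1, 3, H, W)
    flow = transform(Image.open(flow_path).convert("RGB")).unsqueeze(0)  # (1, 3, H, W)
    logits = model(rgb, flow)     # assumed two-stream forward signature
    return torch.sigmoid(logits)  # probability map in [0, 1]
```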

How to train our model from scratch?

Download the training dataset from Baidu Drive (password: u01t) or Google Drive (VOS-TrainSet_StaticAndVideo.zip or VOS-TrainSet_Video.zip) and save it to ./dataset/ (one possible layout is sketched below). Our training pipeline consists of three steps:
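As a hypothetical illustration only (none of these directory names are taken from the archives), the unzipped data under ./dataset/ might be organized like:

```
./dataset/
├── TrainSet_StaticAndVideo/   # hypothetical: contents of VOS-TrainSet_StaticAndVideo.zip
│   ├── Imgs/                  # RGB frames (appearance cue)
│   ├── Flow/                  # optical-flow maps (motion cue)
│   └── GT/                    # ground-truth masks
└── TrainSet_Video/            # hypothetical: contents of VOS-TrainSet_Video.zip
```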

4. Benchmark

Unsupervised/Zero-shot Video Object Segmentation (U/Z-VOS) task

NOTE: For the U-VOS task, all prediction results are strictly binary. We adopt only the naive binarization algorithm (i.e., threshold = 0.5) in our experiments.
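Concretely, the naive binarization is just a fixed 0.5 threshold. A small self-contained sketch (assuming predictions are stored as 8-bit grayscale images; the file paths are illustrative):

```python
import numpy as np
from PIL import Image

def binarize(pred_path: str, out_path: str, thresh: float = 0.5) -> None:
    """Naive binarization used for the U-VOS results: fixed threshold of 0.5."""
    pred = np.asarray(Image.open(pred_path).convert("L"), dtype=np.float32) / 255.0
    mask = (pred > thresh).astype(np.uint8) * 255  # strictly binary {0, 255} mask
    Image.fromarray(mask).save(out_path)

# Illustrative usage: binarize("pred/blackswan/00000.png", "bin/blackswan/00000.png")
```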

Video Salient Object Detection (V-SOD) task

NOTE: For the V-SOD task, all prediction results are non-binary.

5. Citation

@article{ji2022fsnet-CVMJ,
  title={Full-Duplex Strategy for Video Object Segmentation},
  author={Ji, Ge-Peng and Fan, Deng-Ping and Fu, Keren and Wu, Zhe and Shen, Jianbing and Shao, Ling},
  journal={Computational Visual Media},
  pages={155--175},
  volume={8},
  number={1},
  year={2022},
  publisher={Springer}
}

@inproceedings{ji2021full,
  title={Full-Duplex Strategy for Video Object Segmentation},
  author={Ji, Ge-Peng and Fu, Keren and Wu, Zhe and Fan, Deng-Ping and Shen, Jianbing and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={4922--4933},
  year={2021}
}

6. Acknowledgements

Many thanks to my collaborator, Dr. Zhe Wu, who provided the excellent SCRN work and design inspiration.