

Towards Robust Video Object Segmentation with Adaptive Object Calibration (ACM Multimedia 2022)

A preprint of this work is available on arXiv.

The conference poster is available in this GitHub repo.

The long-paper presentation video is available on Google Drive and YouTube.

Qualitative results and comparisons with previous state-of-the-art methods are available on YouTube.

- 2022.11.16: All code has been cleaned up and released~
- 2022.10.21: Add the robustness evaluation dataloader for other models, e.g., AOT~
- 2022.10.1: Add the code for the key implementations of this work~
- 2022.9.25: Add the poster for this work~
- 2022.8.27: Add presentation video and PPT for this work~
- 2022.7.10: Add future works towards robust VOS!
- 2022.7.5: Our ArXiv-version paper is available.
- 2022.7.1: Repo init. Please stay tuned~

Motivation for Robust Video Object Segmentation

<img width="527" alt="Screenshot 2022-07-11 13 01 56" src="https://user-images.githubusercontent.com/65257938/178192237-96e9d50e-460c-4209-bb80-2d0dcf610160.png">


<img width="1101" alt="Screenshot 2022-07-11 13 00 17" src="https://user-images.githubusercontent.com/65257938/178192065-6bd4cfad-184b-436c-850a-708177acd0eb.png">

Adaptive Object Proxy Representation (Component1)

<img width="600" alt="Screenshot 2022-07-11 13 05 05" src="https://user-images.githubusercontent.com/65257938/178192541-3d3c8a66-95c4-45e0-8a11-35337b28a412.png">

Object Mask Calibration (Component2)

<img width="564" alt="Screenshot 2022-07-11 13 04 23" src="https://user-images.githubusercontent.com/65257938/178192475-476835a3-40f8-47f4-a0f9-2da103a96a63.png">


In the booming video era, video segmentation attracts increasing research attention in the multimedia community.

Semi-supervised video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames. Most existing methods build pixel-wise reference-target correlations and then perform pixel-wise tracking to obtain target masks. Because they neglect object-level cues, pixel-level approaches are vulnerable to perturbations and can even fail to discriminate among similar objects.
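To make the pixel-wise matching idea concrete, here is a minimal sketch in plain Python. It is not the code of this repo or of any specific prior method: the function names, the toy 2-D features, and the use of cosine similarity with nearest-neighbor label transfer are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pixelwise_match(ref_feats, ref_labels, tgt_feats):
    """Give each target pixel the label of its most similar reference pixel."""
    out = []
    for t in tgt_feats:
        best = max(range(len(ref_feats)), key=lambda i: cosine(t, ref_feats[i]))
        out.append(ref_labels[best])
    return out

# Toy data (hypothetical): two look-alike objects (labels 1 and 2)
# and background (label 0), one reference pixel each.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_labels = [1, 2, 0]

# Clean target pixels are matched to the right objects.
pred = pixelwise_match(ref_feats, ref_labels, [[1.0, 0.02], [0.88, 0.12], [0.05, 1.0]])

# A small perturbation of the object-2 pixel is enough to flip it to object 1.
flipped = pixelwise_match(ref_feats, ref_labels, [[0.97, 0.03]])
```

In this toy setting, `pred` is `[1, 2, 0]` while `flipped` is `[1]`: with no object-level cue, a slightly perturbed pixel of one object is absorbed by its look-alike.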

Towards robust VOS, the key insight is to calibrate the representation and mask of each specific object to be expressive and discriminative. Accordingly, we propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.

First, we construct the object representations with an adaptive object proxy (AOP) aggregation method, in which proxies obtained by multi-level clustering on the reference represent arbitrarily shaped segments.
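The proxy idea can be sketched as clustering the features of pixels inside an object mask and keeping the cluster centers as proxies. The snippet below is only an illustration under stated assumptions: it uses plain k-means at a single level as a stand-in for the clustering actually used in AOP, and `kmeans_proxies`, `object_proxies`, and the toy data are hypothetical names and values.

```python
def kmeans_proxies(feats, k, iters=10):
    """Cluster feature vectors into k proxy vectors with plain k-means."""
    # Deterministic init: evenly spaced samples serve as initial centers.
    step = max(len(feats) // k, 1)
    centers = [list(feats[min(i * step, len(feats) - 1)]) for i in range(k)]
    assign = [0] * len(feats)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for n, f in enumerate(feats):
            assign[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centers[c])),
            )
        # Update step: each proxy becomes the mean of its assigned features.
        for c in range(k):
            members = [feats[n] for n in range(len(feats)) if assign[n] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

def object_proxies(pixel_feats, mask, k):
    """Aggregate proxies only over the pixels inside the object mask."""
    inside = [f for f, m in zip(pixel_feats, mask) if m]
    if not inside:
        return [], []
    return kmeans_proxies(inside, min(k, len(inside)))

# Toy data (hypothetical): four object pixels forming two parts, one background pixel.
pixel_feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]]
mask = [1, 1, 1, 1, 0]
proxies, assign = object_proxies(pixel_feats, mask, k=2)
```

Each proxy summarizes one group of object pixels regardless of its shape, so an object is described by a handful of vectors rather than by every pixel. Running the same routine with several values of `k` would be one simple way to imitate multiple levels.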

Then, prototype masks are initially generated from the reference-target correlations based on AOP. Afterwards, such proto-masks are further calibrated through network modulation, conditioning on the object proxy representations. We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks evolve to be discriminative iteratively.
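The progressive calibration loop can be illustrated with a toy sketch: a proxy is computed from the current soft mask, the proxy conditions a channel-wise modulation of the pixel features, and the modulated features are re-scored into a new mask. Everything here is an assumption made for illustration. In particular, the fixed rule that derives `gamma` and `beta` from the proxy replaces the learned layers a real network would use, and `sharpness` is a hypothetical parameter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_proxy(pixel_feats, mask):
    """Object proxy: mask-weighted mean of the pixel features."""
    total = sum(mask)
    dims = len(pixel_feats[0])
    if total == 0:
        return [0.0] * dims
    return [sum(m * f[c] for f, m in zip(pixel_feats, mask)) / total
            for c in range(dims)]

def modulate(feat, gamma, beta):
    """Channel-wise affine modulation: gamma[c] * feat[c] + beta[c]."""
    return [g * x + b for x, g, b in zip(feat, gamma, beta)]

def calibrate(pixel_feats, proto_mask, steps=3, sharpness=10.0):
    """Alternate between updating the proxy and re-scoring the soft mask."""
    mask = list(proto_mask)
    for _ in range(steps):
        proxy = weighted_proxy(pixel_feats, mask)
        centre = sum(proxy) / len(proxy)
        # Hypothetical conditioning: a real network would predict gamma and
        # beta from the proxy with learned layers.
        gamma = [p - centre for p in proxy]
        beta = [0.0] * len(proxy)
        mask = [sigmoid(sharpness * sum(modulate(f, gamma, beta)))
                for f in pixel_feats]
    return mask

# Toy data (hypothetical): two object-like pixels, two background-like pixels,
# and an uncertain proto-mask.
pixel_feats = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
proto_mask = [0.7, 0.6, 0.4, 0.3]
calibrated = calibrate(pixel_feats, proto_mask)
```

On this toy input, each iteration makes the proxy more object-specific and the mask more confident, which is the qualitative behavior the progressive scheme aims for.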

Extensive experiments are conducted on the standard VOS benchmarks, YouTube-VOS-18/19 and DAVIS-17. Our model achieves state-of-the-art performance among existing published works and also exhibits significantly superior robustness against perturbations.


You can also use the docker image below to set up your environment directly. However, this docker image may contain some redundant packages.

docker image: xxiaoh/vos:10.1-cudnn7-torch1.4_v3

A more lightweight version can be created by modifying the provided Dockerfile.



The key implementation of matching with adaptive-proxy-based representation is provided in THIS FILE. For other implementation and training/evaluation details, refer to RPCMVOS or CFBI.

The key implementation of the preliminary robust VOS benchmark evaluation is provided in THIS FILE.
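As a rough picture of what a robustness evaluation can look like, the sketch below scores a model on clean frames and on perturbed copies of the same frames, then compares the region similarity J (intersection over union). It is not the benchmark code referenced above: the brightness perturbation, the threshold "model", and all function names are assumptions chosen so the example stays self-contained.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

def brighten(frame, delta=0.2):
    """Hypothetical perturbation: add a constant and clip to [0, 1]."""
    return [min(max(v + delta, 0.0), 1.0) for v in frame]

def threshold_model(frame):
    """Stand-in segmenter: foreground where intensity exceeds 0.5."""
    return [1 if v > 0.5 else 0 for v in frame]

def evaluate(model, frames, gts, perturb=None):
    """Mean J over frames, optionally perturbing each frame before inference."""
    scores = []
    for frame, gt in zip(frames, gts):
        inp = perturb(frame) if perturb else frame
        scores.append(jaccard(model(inp), gt))
    return sum(scores) / len(scores)

# Toy data (hypothetical): one flattened frame with its ground-truth mask.
frames = [[0.9, 0.8, 0.7, 0.4, 0.25, 0.1]]
gts = [[1, 1, 1, 0, 0, 0]]

clean_j = evaluate(threshold_model, frames, gts)
perturbed_j = evaluate(threshold_model, frames, gts, perturb=brighten)
print(clean_j, perturbed_j, clean_j - perturbed_j)
```

The gap between `clean_j` and `perturbed_j` is the quantity of interest: a robust model keeps it small. Wrapping the perturbation around the input, rather than inside the model, is what lets the same evaluation be reused for other models.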

The whole project code is provided in THIS FOLDER.

Feel free to contact me if you have any problems with the implementation~

Limitations & Directions for Further Exploration towards Robust VOS!

(to be continued...)


If you find this work useful for your research, please consider citing:

   title={Towards Robust Video Object Segmentation with Adaptive Object Calibration},
   author={Xu, Xiaohao and Wang, Jinglu and Ming, Xiang and Lu, Yan},
   booktitle={Proceedings of the 30th ACM International Conference on Multimedia},


CFBI: https://github.com/z-x-yang/CFBI

Deeplab: https://github.com/VainF/DeepLabV3Plus-Pytorch

GCT: https://github.com/z-x-yang/GCT

Acknowledgement ❤️

This work is heavily built upon CFBI and RPCMVOS. Thanks to the author of CFBI for releasing such a wonderful code repo for further work to build upon!

Comments and discussions are welcome!

Xiaohao Xu: xxh11102019@outlook.com


This project is released under the MIT license. See LICENSE for additional details.