<div align="center"> <h3> Depth-aware Test-Time Training for Zero-shot Video Object Segmentation </h3> <br/> <a href='https://arxiv.org/abs/2403.04258'><img src='https://img.shields.io/badge/ArXiv-2403.04258-red' /></a> <a href='https://nifangbaage.github.io/DATTT/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <br/> <br/> <div> <a target='_blank'>Weihuang Liu <sup> 1</sup> </a>  <a href='https://xishen0220.github.io/' target='_blank'>Xi Shen <sup> 2</sup></a>  <a target='_blank'>Haolun Li <sup> 1</sup> </a>  <a target='_blank'>Xiuli Bi <sup> 3</sup> </a>  <a target='_blank'>Bo Liu <sup> 3</sup> </a>  <a href='https://www.cis.um.edu.mo/~cmpun/' target='_blank'>Chi-Man Pun <sup>*,1</sup></a>  <a href='https://vinthony.github.io/' target='_blank'>Xiaodong Cun <sup>*,4</sup></a>  </div> <br> <div> <sup>1</sup> University of Macau <sup>2</sup> Intellindust <br> <sup>3</sup> Chongqing University of Posts and Telecommunications <sup>4</sup> Tencent AI Lab </div> <br> <i><strong><a href='https://arxiv.org/abs/2403.04258' target='_blank'>CVPR 2024</a></strong></i> <br> <br> </div>Overview
<p align="center"> <img width="80%" alt="teaser" src="docs/static/images/teaser.jpg"> </p> Mainstream solutions mainly focus on learning a single model on large-scale video datasets, which struggle to generalize to unseen videos. We introduce Depth-aware test-time training (DATTT) to address the problem. Our key insight is to enforce the model to predict consistent depth during the TTT process. During the test-time training, the model is required to predict consistent depth maps for the same video frame under different data augmentation. The model is progressively updated and provides more precise mask prediction.Pipeline
<p align="center"> <img width="80%" alt="framework" src="docs/static/images/framework.jpg"> </p> We add a depth decoder to commonly used two-stream ZSVOS architecture to learn 3D knowledge. The model is first trained on large-scale datasets for object segmentation and depth estimation. Then, for each test video, we employ photometric distortion-based data augmentation to the frames. The error between the predicted depth maps is backward to update the image encoder. Finally, the new model is applied to infer the object.Environment
## Environment

This code was implemented with Python 3.6 and PyTorch 1.10.0. You can install all the requirements via:
```bash
pip install -r requirements.txt
```
## Quick Start
- Download the YouTube-VOS, DAVIS-16, FBMS, Long-Videos, MCL, and SegTrackV2 datasets. You can use the processed data provided by HFAN. The depth maps are obtained with MonoDepth2; we also provide the processed data here.
- Download the pre-trained MiT-b1 or Swin-Tiny backbone.
- Training:
```bash
python train.py --config ./configs/train_sample.yaml
```
- Evaluation:
```bash
python ttt_demo.py --config configs/test_sample.yaml --model model.pth --eval_type base
```
- Test-time training:
```bash
python ttt_demo.py --config configs/test_sample.yaml --model model.pth --eval_type TTT-MWI
```
We provide our checkpoints here.
## Citation
If you find this useful in your research, please consider citing:
```bibtex
@inproceedings{liu2024depth,
  title={Depth-aware Test-Time Training for Zero-shot Video Object Segmentation},
  author={Weihuang Liu and Xi Shen and Haolun Li and Xiuli Bi and Bo Liu and Chi-Man Pun and Xiaodong Cun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```
## Acknowledgements
Our code borrows heavily from EVP, Swin, and SegFormer. We thank the authors for sharing their wonderful code.