TCE-RVOS

The official implementation of "Temporal Context Enhanced Referring Video Object Segmentation", accepted at WACV 2024.

Structure

Temporal Context Enhanced Referring Video Object Segmentation<br> Xiao Hu, Basavaraj Hampiholi, Heiko Neumann, and Jochen Lang

Abstract

The goal of Referring Video Object Segmentation is to extract an object from a video clip based on a given expression. While previous methods have utilized the transformer's multi-modal learning capabilities to aggregate information from different modalities, they have mainly focused on spatial information and paid less attention to temporal information. To enhance the learning of temporal information, we propose TCE-RVOS with a novel frame token fusion (FTF) structure and a novel instance query transformer (IQT). Our technical innovations maximize the potential information gain of videos over single images. Our contributions also include a new classification of two widely used validation datasets for investigation of challenging cases.

Update

(2023/11/19) Code released.💥
(2023/10/24) TCE-RVOS was accepted to WACV 2024.🏄

Demo

Videos

Coming Soon

Image frames

The rows are ordered as follows: 1. MTTR, 2. ReferFormer, 3. TCE-RVOS.

"a white and red parachute blowing in the wind", shown with blue masks.
"the white toilet is between the white tub and green cabinet", shown with purple masks.

Installation & Data Preparation

Please refer to the ReferFormer repository.

Model Zoo

Coming Soon

Acknowledgement

This repo is based on ReferFormer. We also refer to MTTR. Thanks to the authors of both for their great work.