π₯π₯π₯Update 2023.02.19π₯π₯π₯
2022CVPR-Modeling-Motion-with-Multi-Modal-Features-for-Text-Based-Video-Segmentation
This is the code for the CVPR 2022 paper "Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation".
Framework
Usage
- Download A2D-Sentences and JHMDB-Sentences, then convert the raw video data into image frames.
- Use RAFT to generate the optical flow map (visualized in RGB format) from frame t to frame t+1. Since only a few frames are annotated in A2D and JHMDB, optical flow maps are needed only for these frames.
- Organize the data as follows:
your dataset dir/
├── A2D/
│   ├── allframes/
│   ├── allframes_flow/
│   ├── Annotations_visualize/
│   └── a2d_txt/
│       ├── train.txt
│       └── test.txt
└── J-HMDB/
    ├── allframes/
    ├── allframes_flow/
    ├── Annotations_visualize/
    └── jhmdb_txt/
        ├── train.txt
        └── test.txt
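The flow maps in `allframes_flow/` are flow fields rendered as RGB images. As a reference for what "visualize in RGB format" means, here is a minimal sketch of the standard HSV-based encoding (hue = flow direction, brightness = magnitude); this is the common convention, not necessarily RAFT's exact color wheel.

```python
import numpy as np

def flow_to_rgb(flow):
    """Encode an (H, W, 2) optical-flow field as an RGB uint8 image.

    Standard HSV visualization: hue encodes flow direction, value
    encodes magnitude (normalized by the per-image maximum), and
    saturation is fixed at 1. A sketch, not RAFT's exact color wheel.
    """
    u, v = flow[..., 0], flow[..., 1]
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = np.arctan2(v, u)                  # direction in [-pi, pi]
    h = (ang + np.pi) / (2 * np.pi)         # hue in [0, 1]
    val = mag / (mag.max() + 1e-8)          # value in [0, 1]

    # Manual HSV -> RGB conversion with saturation = 1.
    i = np.floor(h * 6).astype(int) % 6
    f = h * 6 - np.floor(h * 6)
    p = np.zeros_like(val)
    q = val * (1 - f)
    t = val * f
    rgb = np.zeros(flow.shape[:2] + (3,))
    sectors = [(val, t, p), (q, val, p), (p, val, t),
               (p, q, val), (t, p, val), (val, p, q)]
    for k, (r, g, b) in enumerate(sectors):
        m = i == k
        rgb[m] = np.stack([r[m], g[m], b[m]], axis=-1)
    return (rgb * 255).astype(np.uint8)
```

Zero-motion pixels come out black, and stronger motion appears brighter, which matches the usual flow visualizations.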
"Annotations_visualize" contains the GT masks for each target object. For convenience, we have uploaded them to BaiduPan (code: lo50).
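To catch path mistakes before training, the expected layout above can be created and verified with a small helper. This is a sketch under the assumption that the dataset root and sub-directory names match the tree shown; `DATASET_ROOT` is a placeholder for your own path.

```python
import os

# Placeholder: replace with your actual dataset root directory.
DATASET_ROOT = "your_dataset_dir"

# Sub-directories expected by the layout shown above.
EXPECTED = [
    "A2D/allframes",
    "A2D/allframes_flow",
    "A2D/Annotations_visualize",
    "A2D/a2d_txt",
    "J-HMDB/allframes",
    "J-HMDB/allframes_flow",
    "J-HMDB/Annotations_visualize",
    "J-HMDB/jhmdb_txt",
]

def check_layout(root):
    """Return the expected sub-directories that are missing under `root`."""
    return [d for d in EXPECTED if not os.path.isdir(os.path.join(root, d))]

def make_layout(root):
    """Create the empty directory skeleton (run before copying data in)."""
    for d in EXPECTED:
        os.makedirs(os.path.join(root, d), exist_ok=True)
```

Running `check_layout(DATASET_ROOT)` after preprocessing should return an empty list; any entries it returns are the directories still missing.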
Citation
Please consider citing our work in your publications if you are interested in our research:
@inproceedings{zhao2022modeling,
  title={Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation},
  author={Zhao, Wangbo and Wang, Kai and Chu, Xiangxiang and Xue, Fuzhao and Wang, Xinchao and You, Yang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={11737--11746},
  year={2022}
}