
πŸ”₯πŸ”₯πŸ”₯Update 2023.02.19πŸ”₯πŸ”₯πŸ”₯

2022CVPR-Modeling-Motion-with-Multi-Modal-Features-for-Text-Based-Video-Segmentation

This is the code for CVPR2022 paper "Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation"

Framework

(Framework overview figure)

Usage

  1. Download A2D-Sentences and JHMDB-Sentences. Then, please convert the raw videos into image frames.

  2. Please use RAFT to generate the optical flow map (visualized in RGB format) from frame t to frame t+1. Since only a few frames are annotated in A2D and JHMDB, we only need to generate optical flow maps for these frames.

  3. Put them as follows:

your dataset dir/
├── A2D/
│   ├── allframes/
│   ├── allframes_flow/
│   ├── Annotations_visualize/
│   └── a2d_txt/
│       ├── train.txt
│       └── test.txt
└── J-HMDB/
    ├── allframes/
    ├── allframes_flow/
    ├── Annotations_visualize/
    └── jhmdb_txt/
        ├── train.txt
        └── test.txt

"Annotations_visualize" contains the GT masks for each target object. We have upload them to BaiduPan(lo50) for convenience.

  4. Download the pretrained ResNet-101 and BERT weights.
  5. We provide the pretrained checkpoint B+M+T+L+A on BaiduPan (extraction code: u5hx).
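Once the datasets are prepared, the layout from step 3 can be sanity-checked before launching training. The helper below is hypothetical (not part of the released code); it simply reports which expected sub-directories are missing under your dataset root.

```python
from pathlib import Path

# Hypothetical helper: verify the directory layout described above.
EXPECTED = {
    "A2D": ("allframes", "allframes_flow", "Annotations_visualize", "a2d_txt"),
    "J-HMDB": ("allframes", "allframes_flow", "Annotations_visualize", "jhmdb_txt"),
}

def missing_dirs(root):
    """Return the expected sub-directories that are absent under root."""
    root = Path(root)
    return [str(root / ds / sub)
            for ds, subs in EXPECTED.items()
            for sub in subs
            if not (root / ds / sub).is_dir()]
```

`missing_dirs("your dataset dir")` returns an empty list when the layout matches the tree above.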

Citation

Please consider citing our work in your publications if you are interested in our research:

@inproceedings{zhao2022modeling,
  title={Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation},
  author={Zhao, Wangbo and Wang, Kai and Chu, Xiangxiang and Xue, Fuzhao and Wang, Xinchao and You, Yang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={11737--11746},
  year={2022}
}