<div align="center">

🚀 SimVTP: Simple Video Text Pre-training with Masked Autoencoders

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a>

</div>

🍃 Abstract

SimVTP is a simple video-text pretraining framework based on masked autoencoders. We randomly mask out spatial-temporal tubes of the input video and word tokens of the input text, then feed both into a unified autoencoder to reconstruct the missing pixels and words.

Our SimVTP has several properties:

teaser

🔥 Main Results on Downstream Tasks

Text-to-video Retrieval on MSR-VTT

| Method | Vis Enc. Init | Pre-trained Data | #pairs | R@1 | R@5 | R@10 | MdR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HERO | ImageNet, Kinetics | HowTo100M | 136M | 16.8 | 43.4 | 57.7 | - |
| AVLnet | ImageNet, Kinetics | HowTo100M | 136M | 27.1 | 55.6 | 66.6 | 4 |
| Frozen | ImageNet | WebVid2M+CC3M | 5.5M | 31.0 | 59.5 | 70.5 | 3 |
| OATrans | ImageNet | WebVid2M+CC3M | 5.5M | 35.8 | 63.4 | 76.5 | 3 |
| RegionLearner | ImageNet | WebVid2M+CC3M | 5.5M | 36.3 | 63.9 | 72.5 | 3 |
| LocVTP | ImageNet | WebVid2M+CC3M | 5.5M | 36.5 | 64.3 | 76.8 | 3 |
| BFormer | ImageNet | WebVid2M+CC3M | 5.5M | 37.6 | 64.8 | 75.1 | 3 |
| SimVTP (ours) | Kinetics | WebVid2M | 2.5M | 53.6 | 82.8 | 90.8 | 1 |
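The R@K and MdR columns can be computed from a text-to-video similarity matrix. A minimal sketch (not the repository's evaluation code; it assumes query i matches video i on the diagonal):

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j]: similarity of text query i to video j.

    Returns Recall@K (percent of queries whose matched video ranks in
    the top K) and MdR (median rank of the matched video, 1 = best).
    """
    order = np.argsort(-sim, axis=1)  # videos sorted by descending score
    # Position of the correct (diagonal) video in each sorted row, 1-indexed.
    ranks = np.where(order == np.arange(len(sim))[:, None])[1] + 1
    return {
        "R@1": float(np.mean(ranks <= 1) * 100),
        "R@5": float(np.mean(ranks <= 5) * 100),
        "R@10": float(np.mean(ranks <= 10) * 100),
        "MdR": float(np.median(ranks)),
    }
```

Higher R@K and lower MdR are better, which is why SimVTP's 53.6 / 82.8 / 90.8 / 1 row dominates the table.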

🔨 Dependencies and Installation

⛺ Installation

  1. Clone repo

     ```shell
     git clone git@github.com:mayuelala/SimVTP.git
     cd SimVTP
     ```

  2. Install dependent packages

     ```shell
     pip install -r requirements.txt
     ```

🔅 Data Preparation

Please refer to DATA.md for pre-training and downstream evaluation datasets.

🌿 Pre-training

We pretrain SimVTP on the WebVid-2M video dataset with 64 V100 GPUs (8 nodes x 8 GPUs each). Our implementation supports multi-node distributed training, and the launch scripts are provided in the scripts folder.

```shell
bash scripts/pretrain_webvid.sh
```

Alternatively, you can launch the nodes individually: set `--master_addr` to the IP of node 0, and set `--node_rank` to a value from 0 to 7 (one per node).
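A per-node launch might look like the following. This is a sketch under assumptions: the actual entry point and arguments live in scripts/pretrain_webvid.sh, and `run_pretrain.py` is a placeholder name.

```shell
# Run on each node k (k = 0..7).
# MASTER_ADDR: IP of node 0; NODE_RANK: k on node k.
python -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=8 \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} --master_port=29500 \
    run_pretrain.py ...
```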

🍄 Fine-tuning on MSRVTT

We fine-tune SimVTP on MSRVTT with 8 V100 GPUs. The script is provided in the scripts folder.

```shell
bash scripts/finetune_msrvtt.sh
```

You can also pass `--only_test` to evaluate our fine-tuned model.

🐧 Model Weight

We provide the pre-trained weights and the weights fine-tuned on MSRVTT on Google Drive.

| Method | Backbone | Epoch | Pre-train | Fine-tune | R@1 |
| --- | --- | --- | --- | --- | --- |
| SimVTP | ViT-B | 200 | script/log/checkpoint | script/log/checkpoint | 53.6 |

👀 Visualization

We provide a script for visualization in vis.sh. Though not exactly the same as the original texts, the reconstructed texts are plausible and consistent with the video content. Sometimes they are even more accurate than the original texts, e.g., the white cat and the little boy in the second and third columns.

teaser

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file.

👏 Acknowledgement

This project is built upon MAE-pytorch and VideoMAE. Thanks to the contributors of these great codebases.

✏️ Citation

```bibtex
@article{ma2022simvtp,
  title={SimVTP: Simple Video Text Pre-training with Masked Autoencoders},
  author={Ma, Yue and Yang, Tianyu and Shan, Yin and Li, Xiu},
  journal={arXiv preprint arXiv:2212.03490},
  year={2022}
}
```