[2023/03/09 Update] VIOLETv2
We have released VIOLETv2, our empirical study of masked visual modeling for video-language (VidL) learning.
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
A PyTorch implementation of VIOLET
<img src='_imgs/intro.png' width='50%' />

Overview
VIOLET is an implementation of <br> "VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling" <br> by Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu
<img src='_imgs/violet.png' width='80%' />

VIOLET contains 3 components: Video Swin Transformer (VT) computes video features; Language Embedder (LE) extracts word embeddings; Cross-modal Transformer (CT) performs cross-modal fusion. To benefit from large-scale data, we incorporate 3 pretraining tasks: Masked Language Modeling (MLM) predicts the masked word tokens; Masked Visual-token Modeling (MVM) recovers the masked video patches; Visual-Text Matching (VTM) learns the alignment between the video and text modalities.
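As a rough guide to how these pieces fit together, here is a minimal, hypothetical sketch of the forward pass and the three pretraining heads (class and head names are illustrative and not the repo's actual modules; the hidden size, vocabulary size, and VQ-token count are assumptions):

```python
import torch
import torch.nn as nn

class VioletSketch(nn.Module):
    # VT, LE, CT are passed in as generic nn.Modules producing [B, N, D] features.
    def __init__(self, vt, le, ct, hidden_dim=768, vocab_size=30522, num_vq_tokens=8192):
        super().__init__()
        self.vt, self.le, self.ct = vt, le, ct
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)      # predict masked word tokens (MLM)
        self.mvm_head = nn.Linear(hidden_dim, num_vq_tokens)   # recover masked visual tokens (MVM)
        self.vtm_head = nn.Linear(hidden_dim, 2)               # video-text matched vs. unmatched (VTM)

    def forward(self, video, text_ids):
        v = self.vt(video)                          # [B, Nv, D] video features from Video Swin Transformer
        w = self.le(text_ids)                       # [B, Nt, D] word embeddings
        h = self.ct(torch.cat([v, w], dim=1))       # [B, Nv+Nt, D] cross-modal fusion
        h_v, h_w = h[:, : v.size(1)], h[:, v.size(1):]
        return {
            "mvm": self.mvm_head(h_v),              # per-patch logits over the visual-token vocabulary
            "mlm": self.mlm_head(h_w),              # per-word logits over the text vocabulary
            "vtm": self.vtm_head(h[:, 0]),          # pooled (first-token) matching score
        }
```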
Requirements
This code is implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.
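A quick sanity check that the environment matches these versions (just a convenience snippet, not part of the repo):

```python
import torch, torchvision

# Expect roughly 1.7.x / 0.8.x and a working CUDA setup for multi-GPU training.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```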
Usage
Data preprocessing
Since the datasets come from external sources and cannot be redistributed by us, we provide preprocessing tools to extract sparsely sampled video frames into our compressed format.
cd _tools
# We use 4 frames during pretraining and 5 frames for downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl
# We use DALL-E to extract VQ tokens for MVM pretraining (see the VQ-token sketch below)
wget https://cdn.openai.com/dall-e/encoder.pkl # download trained dall-e encoder
python extract_vq.py --path=msrvtt --frame=224 # output: msrvtt_vq.pkl
# We use file.seek() instead of loading the entire file into memory, which reduces the memory cost during distributed pretraining (see the reader sketch below)
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
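The .lineidx file stores the byte offset of each line in the .tsv, so a dataset can seek() directly to one example instead of loading everything into memory. A minimal reader sketch under that assumption (class name and column handling are illustrative, not the repo's exact loader):

```python
class SeekableTSV:
    """Random access into a .tsv using byte offsets stored in a .lineidx file."""
    def __init__(self, tsv_path, lineidx_path):
        with open(lineidx_path) as f:
            self.offsets = [int(line) for line in f if line.strip()]
        self.tsv = open(tsv_path, "r")

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        self.tsv.seek(self.offsets[idx])                       # jump straight to the idx-th example
        return self.tsv.readline().rstrip("\n").split("\t")

# e.g., rows = SeekableTSV("msrvtt.tsv", "msrvtt.lineidx"); first_row = rows[0]
```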
We provide partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to illustrate how to formulate the input data.
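For reference, the VQ tokens used as MVM targets come from the DALL-E encoder downloaded above; extracting them per frame looks roughly like this, using OpenAI's released dall_e package (load_model and map_pixels are that package's API; the resizing and token shapes are assumptions about the pipeline, not the repo's exact extract_vq.py code):

```python
import torch
import torch.nn.functional as F
from dall_e import load_model, map_pixels   # from the openai/DALL-E package

dev = "cuda" if torch.cuda.is_available() else "cpu"
enc = load_model("encoder.pkl", dev)        # the trained DALL-E encoder downloaded above

def frame_to_vq_tokens(frame):
    # frame: [3, H, W] float tensor with values in [0, 1]
    x = F.interpolate(frame.unsqueeze(0).to(dev), size=(224, 224),
                      mode="bilinear", align_corners=False)
    z_logits = enc(map_pixels(x))                     # [1, 8192, 28, 28] logits over the VQ vocabulary
    return torch.argmax(z_logits, dim=1).flatten(1)   # [1, 784] discrete visual-token ids per frame
```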
Pretraining
Place the pretrained VT checkpoint under ./_snapshot. The following script pretrains on both video (WebVid2.5M) and image (CC3M) data via single-node multi-GPU distributed training.
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=7122 main_pretrain.py
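torch.distributed.launch spawns one process per GPU and passes each a --local_rank argument, which main_pretrain.py is expected to consume along the lines of the standard PyTorch 1.7 recipe below (a generic sketch with a stand-in model, not the repo's actual training code):

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")                     # rank/world size come from the launcher's env vars

model = torch.nn.Linear(768, 768).cuda(args.local_rank)     # stand-in for the actual VIOLET model
model = DDP(model, device_ids=[args.local_rank])            # gradients are synchronized across the 4 GPUs
```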
We also provide the datasets used for pretraining and the best pretrained checkpoint (YT180M+WebVid2.5M+CC3M).
Downstream
- Multiple-Choice Question Answering (TGIF-Action, TGIF-Transition, MSRVTT-MC, and LSMDC-MC)
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qamc.py _data/args_tgif-action.json
- Open-Ended Question Answering (TGIF-Frame, MSRVTT-QA, LSMDC-FiB, and MSVD-QA)
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_qaoe.py _data/args_msvd-qa.json
- Text-to-Video Retrieval (e.g., MSRVTT-Retrieval)
CUDA_VISIBLE_DEVICES='0,1,2,3' python main_retrieval.py _data/args_msrvtt-retrieval.json
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval.py _data/args_msrvtt-retrieval.json
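Retrieval is typically scored by ranking all candidate videos for each text query and reporting recall@K; a generic computation over a text-to-video similarity matrix looks like this (an illustrative sketch, not eval_retrieval.py's actual metric code):

```python
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim: [N, N] similarity matrix with sim[i, j] = score(text_i, video_j);
    # the ground-truth video for text_i is assumed to sit at index i.
    ranks = sim.argsort(dim=1, descending=True)               # videos sorted by score for each query
    target = torch.arange(sim.size(0)).unsqueeze(1)
    position = (ranks == target).float().argmax(dim=1)        # rank of the correct video per query
    return {f"R@{k}": (position < k).float().mean().item() for k in ks}

# e.g., recall_at_k(torch.randn(1000, 1000))
```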
We also provide all downstream datasets and trained checkpoints.
Citation
@inproceedings{fu2023empirical-mvm,
author = {Tsu-Jui Fu* and Linjie Li* and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu},
title = {{An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling}},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}
@inproceedings{fu2021violet,
author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu},
title = {{VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}},
booktitle = {arXiv:2111.12681},
year = {2021}
}