Awesome
[CVPR 2023] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Paper | Slide | Poster | Video
<img src='_imgs/intro.png' width='100%' />This repo is the offcial implementation of CVPR 2023 paper <br> "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling" <br> Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu and Lijuan Wang
We explore a unified video-language framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with much more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate LAVENDER can
- Seamlessly support all downstream tasks with just a single set of parameter values when multi-task finetuned
- Generalize to various downstream tasks with limited training samples
- Enable zero-shot evaluation on video question answering tasks
Table of contents
Requirements
This code is largely based on the official pytorch implementation of VIOLET, implemented under Python 3.8, PyTorch 1.7, and Torchvision 0.8.
Data preprocessing
Copied from VIOLET
As using outer datasets (cannot be shared by us), we provide preprocessing tools to extract sparse-sampled video frames into our compressed format.
cd _tools
# We use 5 frames for both pre-training and downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl
# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
There are partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help formulate the input data.
Pretraining
- Visit Video Swin Transformer to download pre-trained weights models. Place
swin_base_patch244_window877_kinetics*_22k.pth
under${REPO_DIR}/_models/video_swin_transformer
directory. The data structure should follow the hierarchy below.${REPO_DIR} |-- _models | |-- video_swin_transformer | | |-- swin_base_patch244_window877_kinetics600_22k.pth | | |-- swin_base_patch244_window877_kinetics400_22k.pth |-- _args |-- _datasets |-- _imgs |-- ... |-- ...
- Download pretraining datasets (WebVid2.5M & CC3M) provided by VIOLET to
./_datasets
. The data structure should follow the hierarchy below.${REPO_DIR} |-- _models |-- _args |-- _datasets | |-- txt_webvid2.5.json | |-- webvid2.5_val.tsv | |-- webvid2.5_val.lineidx | |-- webvid2.5_train_1.tsv | |-- webvid2.5_train_1.lineidx | |-- ... | |-- webvid2.5_train_9.tsv | |-- webvid2.5_train_9.lineidx | |-- txt_cc3m.json | |-- cc3m_val.tsv | |-- cc3m_val.lineidx | |-- cc3m_train_1.tsv | |-- cc3m_train_1.lineidx | |-- ... | |-- cc3m_train_9.tsv | |-- cc3m_train_9.lineidx |-- _imgs |-- ... |-- ...
- Pretrain via single-node multi-gpu distributed training.
Task-specific Baseline: Pre-training with Video-Text Matching (VTM) + MLM
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot
- Pretrained checkpoint on WebVid2.5M+CC3M: link
LAVENDER: Unified Pre-training with VTM as MLM + MLM
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot
- Pretrained checkpoint on WebVid2.5M+CC3M: link
- Scale-up pre-trained checkpoint with 14M videos + 16M images: link
Downstream
Download downstream datasets to ./_datasets
.
Multiple-Choice Question Answering
- TGIF-Action
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-action.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- TGIF-Transition
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-transition.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- MSRVTT-MC
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_msrvtt-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- LSMDC-MC
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_lsmdc-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
For task-specific baseline, update the main script to main_qamc_task_specific.py
or main_retmc_task_specific.py
, and point --path_ckpt
to pre-trained task-specific baseline.
Open-Ended Question Answering
- TGIF-Frame
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_tgif-frame.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- MSRVTT-QA
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msrvtt-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- MSVD-QA
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msvd-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- LSMDC-FiB
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm_lsmdc_fib.py --config _args/args_lsmdc-fib.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
For task-specific baseline, update the main script to main_qaoe_task_specific.py
, and point --path_ckpt
to pre-trained task-specific baseline.
Text-to-Video Retrieval
- MSRVTT
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights> CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_ckpt <path to the finetuned msrvtt-retrieval model ckpt>
- DiDeMo
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights> CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_ckpt <path to the finetuned lsmdc-retrieval model ckpt>
- MSVD
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights> CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_ckpt <path to the finetuned msvd-retrieval model ckpt>
- LSMDC
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights> CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_ckpt <path to the finetuned lsmdc-retrieval model ckpt>
For task-specific baseline, update the main script to main_retrieval_task_specific.py
or eval_retrieval_task_specific.py
, and point --path_ckpt
to task-specific checkpoints.
Video Captioning (MSRVTT, MSVD)
Finetuning on video captioning requires additional enviroment and dataset setup. We closely follow the instructions from SwinBERT. Please check their repo for more details.
Note that the data folder should have the following structure:
${REPO_DIR}
|-- _datasets
| |-- MSRVTT-v2
| | |-- *.yaml
| | |-- *.tsv
| |-- MSVD
| | |-- *.yaml
| | |-- *.tsv
|-- ...
|-- ...
Once the docker enviroment and the dataset has been setup correctly, run the following command for training.
- MSRVTT
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msrvtt-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
- MSVD
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msvd-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
For task-specific baseline, simply update --path_ckpt
to task-specific pre-trained weights.
Multi-task Training
Data Filtering
As mentioned in our paper, the testing splits of all above tasks may overlap. We perform a data filtering step first to remove the testing data of a task from the training data of other tasks.
python _tools/multi_task_vid_filter.py --dataset lsmdc
python _tools/multi_task_vid_filter.py --dataset msrvtt
python _tools/multi_task_vid_filter.py --dataset msvd
python _tools/multi_task_vid_filter.py --dataset tgif
Training
CUDA_VISIBLE_DEVICES='0,1,2,3,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_multi_task_mlm.py --config _args/args_multi-task_all.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
For task-specific baseline, update the main script to main_multi_task_multi_head.py
and point --path_ckpt
to task-specific pre-trained weights.
Citation
If you find this code useful, please consider citing the following papers:
@inproceedings{li2023lavender,
author = {Linjie Li and Zhe Gan and Kevin Lin and Chung-Ching Lin and Ce Liu and Zicheng Liu and Lijuan Wang},
title = {LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}
@inproceedings{fu2021violet,
author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu},
title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling},
booktitle = {arXiv:2111.1268},
year = {2021}
}
License
Our research code is released under MIT license.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.