
[CVPR 2023] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

This repo is the official implementation of the CVPR 2023 paper <br> "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling" <br> Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu and Lijuan Wang

We explore a unified video-language framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate that LAVENDER can:

Seamlessly support all downstream tasks with just a single set of parameter values when multi-task finetuned
Generalize to various downstream tasks with limited training samples
Enable zero-shot evaluation on video question answering tasks

Requirements
Data preprocessing
Pretraining
Downstream
Multi-task Training
Citation
License

Requirements

This code is largely based on the official PyTorch implementation of VIOLET and was developed with Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Data preprocessing

Copied from VIOLET

Because the datasets are external and we cannot redistribute them, we provide preprocessing tools that extract sparsely sampled video frames into our compressed format.

cd _tools

# We use 5 frames for both pre-training and downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We adopt file.seek() instead of loading entire data to reduce the memory cost during distributed pretraining
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx

There are partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help formulate the input data.
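The `file.seek()` idea mentioned in the comment above can be sketched as follows. This is a minimal illustration, not the code in `extract_tsv.py`: it assumes the `.lineidx` file stores one byte offset per TSV row, and the helper names `build_lineidx`, `load_offsets`, and `read_row` are hypothetical.

```python
def build_lineidx(tsv_path, lineidx_path):
    """Write the byte offset of every TSV row, one offset per line (assumed format)."""
    with open(tsv_path, "rb") as tsv, open(lineidx_path, "w") as idx:
        offset = tsv.tell()
        line = tsv.readline()
        while line:
            idx.write(f"{offset}\n")
            offset = tsv.tell()
            line = tsv.readline()


def load_offsets(lineidx_path):
    """Load the small offset table; the large TSV itself stays on disk."""
    with open(lineidx_path) as idx:
        return [int(line) for line in idx if line.strip()]


def read_row(tsv_path, offsets, row):
    """Jump straight to one row with file.seek() and split its columns."""
    with open(tsv_path, "rb") as tsv:
        tsv.seek(offsets[row])
        return tsv.readline().decode("utf-8").rstrip("\n").split("\t")
```

Only the offset table is held in memory, so each distributed worker can fetch an arbitrary row without loading the whole TSV.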

Pretraining

Visit Video Swin Transformer to download the pre-trained model weights. Place swin_base_patch244_window877_kinetics*_22k.pth under the ${REPO_DIR}/_models/video_swin_transformer directory. The data structure should follow the hierarchy below.

${REPO_DIR}  
|-- _models  
|   |-- video_swin_transformer
|   |   |-- swin_base_patch244_window877_kinetics600_22k.pth
|   |   |-- swin_base_patch244_window877_kinetics400_22k.pth
|-- _args 
|-- _datasets
|-- _imgs 
|-- ... 
|-- ...

Download pretraining datasets (WebVid2.5M & CC3M) provided by VIOLET to ./_datasets. The data structure should follow the hierarchy below.

${REPO_DIR}  
|-- _models 
|-- _args 
|-- _datasets
|   |-- txt_webvid2.5.json
|   |-- webvid2.5_val.tsv
|   |-- webvid2.5_val.lineidx
|   |-- webvid2.5_train_1.tsv
|   |-- webvid2.5_train_1.lineidx
|   |-- ...
|   |-- webvid2.5_train_9.tsv
|   |-- webvid2.5_train_9.lineidx
|   |-- txt_cc3m.json
|   |-- cc3m_val.tsv
|   |-- cc3m_val.lineidx
|   |-- cc3m_train_1.tsv
|   |-- cc3m_train_1.lineidx
|   |-- ...
|   |-- cc3m_train_9.tsv
|   |-- cc3m_train_9.lineidx
|-- _imgs 
|-- ... 
|-- ...
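Before launching a long pretraining run, it can help to confirm that the files in the hierarchy above are in place. The sketch below is not part of the repo. The helper names are hypothetical, and it assumes that the elided entries between `train_1` and `train_9` are shards 2 through 8; adjust `num_shards` if your download differs.

```python
from pathlib import Path


def expected_pretrain_files(num_shards=9):
    """File names from the hierarchy above, relative to _datasets.

    Assumption: training shards are numbered 1..num_shards without gaps.
    """
    names = []
    for prefix in ("webvid2.5", "cc3m"):
        names.append(f"txt_{prefix}.json")
        for ext in ("tsv", "lineidx"):
            names.append(f"{prefix}_val.{ext}")
            for shard in range(1, num_shards + 1):
                names.append(f"{prefix}_train_{shard}.{ext}")
    return names


def missing_pretrain_files(repo_dir, num_shards=9):
    """Return the expected files that are not present under repo_dir/_datasets."""
    data_dir = Path(repo_dir) / "_datasets"
    return [
        name
        for name in expected_pretrain_files(num_shards)
        if not (data_dir / name).is_file()
    ]


if __name__ == "__main__":
    for name in missing_pretrain_files("."):
        print("missing:", name)
```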

Pretrain via single-node multi-gpu distributed training.

Task-specific Baseline: Pre-training with Video-Text Matching (VTM) + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot

Pretrained checkpoint on WebVid2.5M+CC3M: link

LAVENDER: Unified Pre-training with VTM as MLM + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot

Pretrained checkpoint on WebVid2.5M+CC3M: link
Scale-up pre-trained checkpoint with 14M videos + 16M images: link
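To give a rough idea of what "VTM as MLM" means, the sketch below builds one training example in which the matching decision is expressed as a masked token to be filled with "true" or "false". It follows our reading of the paper's description rather than the code in `main_pretrain_mlm.py`. The function name, the position of `[MASK]`, and the negative-sampling scheme are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder string; the real tokenizer's mask token may differ


def vtm_as_mlm_example(caption, all_captions, rng, neg_prob=0.5):
    """Return (text_input, target_word) for a video whose true caption is `caption`.

    With probability `neg_prob` the caption is swapped for a mismatched one,
    and the word to predict at the [MASK] position becomes "false".
    """
    negatives = [c for c in all_captions if c != caption]
    if negatives and rng.random() < neg_prob:
        text, target = rng.choice(negatives), "false"
    else:
        text, target = caption, "true"
    # Hypothetical layout: the mask is appended after the caption text.
    return f"{text} {MASK}", target


if __name__ == "__main__":
    captions = ["a dog runs on the beach", "a man plays guitar"]
    print(vtm_as_mlm_example(captions[0], captions, random.Random(0)))
```

Because the target is an ordinary vocabulary word, the same MLM head can score it, which is the reason no separate binary matching head is needed.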

Downstream

Download downstream datasets to ./_datasets.

Multiple-Choice Question Answering

TGIF-Action

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-action.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

TGIF-Transition

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qamc_mlm_gen_ans_idx.py --config _args/args_tgif-transition.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

MSRVTT-MC

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_msrvtt-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

LSMDC-MC

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retmc_mlm_head.py --config _args/args_lsmdc-mc.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, change the main script to main_qamc_task_specific.py or main_retmc_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.

Open-Ended Question Answering

TGIF-Frame

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_tgif-frame.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

MSRVTT-QA

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msrvtt-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

MSVD-QA

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm.py --config _args/args_msvd-qa.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

LSMDC-FiB

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_qaoe_mlm_lsmdc_fib.py --config _args/args_lsmdc-fib.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, change the main script to main_qaoe_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.

Text-to-Video Retrieval

MSRVTT

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msrvtt-retrieval.json --path_ckpt <path to the finetuned msrvtt-retrieval model ckpt>

DiDeMo

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_didemo-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_didemo-retrieval.json --path_ckpt <path to the finetuned didemo-retrieval model ckpt>

MSVD

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_msvd-retrieval.json --path_ckpt <path to the finetuned msvd-retrieval model ckpt>

LSMDC

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>
CUDA_VISIBLE_DEVICES='0,1,2,3' python eval_retrieval_tsv.py --config _args/args_lsmdc-retrieval.json --path_ckpt <path to the finetuned lsmdc-retrieval model ckpt>

For the task-specific baseline, change the main scripts to main_retrieval_task_specific.py and eval_retrieval_task_specific.py, and point --path_ckpt to the task-specific checkpoints.
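This README does not say which metrics the retrieval evaluation script reports. Text-to-video retrieval is commonly scored with Recall@K, so the sketch below shows that computation on a small score matrix. It assumes that query `i` is paired with video `i`; the function name is hypothetical and the code is independent of `eval_retrieval_tsv.py`.

```python
def recall_at_k(sim, k):
    """Fraction of text queries whose paired video ranks in the top k.

    sim[i][j] is the score between text query i and video j, and the
    ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the ground truth: videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ]
    for k in (1, 2, 3):
        print(f"R@{k} = {recall_at_k(scores, k):.3f}")
```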

Video Captioning (MSRVTT, MSVD)

Finetuning on video captioning requires additional environment and dataset setup. We closely follow the instructions from SwinBERT. Please check their repo for more details.

Note that the data folder should have the following structure:

${REPO_DIR}  
    |-- _datasets  
    |   |-- MSRVTT-v2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |-- MSVD  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |-- ... 
    |-- ...

Once the Docker environment and the datasets have been set up correctly, run the following commands for training.

MSRVTT

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msrvtt-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

MSVD

CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch --nproc_per_node=4 --master_port=5566 main_caption.py --config _args/args_msvd-cap.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, simply point --path_ckpt to the task-specific pre-trained weights.
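The finetuning commands in this section all follow one template and differ only in the main script and config file. The sketch below assembles such a command from a small lookup table. It is a convenience illustration, not a tool shipped with the repo: `TASKS` and `launch_command` are hypothetical names, and the table copies only a few of the script and config pairs listed above.

```python
import shlex

# (main script, config file) pairs copied from a few of the commands above.
TASKS = {
    "tgif-action": ("main_qamc_mlm_gen_ans_idx.py", "_args/args_tgif-action.json"),
    "msrvtt-mc": ("main_retmc_mlm_head.py", "_args/args_msrvtt-mc.json"),
    "msvd-qa": ("main_qaoe_mlm.py", "_args/args_msvd-qa.json"),
    "lsmdc-fib": ("main_qaoe_mlm_lsmdc_fib.py", "_args/args_lsmdc-fib.json"),
    "msrvtt-retrieval": ("main_retrieval_tsv.py", "_args/args_msrvtt-retrieval.json"),
    "msvd-cap": ("main_caption.py", "_args/args_msvd-cap.json"),
}


def launch_command(task, path_ckpt, gpus=(0, 1, 2, 3), master_port=5566,
                   path_output="_snapshot"):
    """Build the single-node launch command for one downstream task."""
    script, config = TASKS[task]
    gpus = list(gpus)
    devices = ",".join(str(g) for g in gpus)
    parts = [
        f"CUDA_VISIBLE_DEVICES='{devices}'",
        "python -m torch.distributed.launch",
        # Derived from the device list so the two values cannot disagree.
        f"--nproc_per_node={len(gpus)}",
        f"--master_port={master_port}",
        script,
        f"--config {config}",
        f"--path_output {shlex.quote(path_output)}",
        f"--path_ckpt {shlex.quote(path_ckpt)}",
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(launch_command("msvd-qa", "ckpt.pt"))
```

Deriving `--nproc_per_node` from the GPU list avoids launching more processes than there are visible devices.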

Multi-task Training

Data Filtering

As mentioned in our paper, the testing split of one task may overlap with the training splits of other tasks. We therefore first run a data filtering step that removes the testing data of each task from the training data of the other tasks.

python _tools/multi_task_vid_filter.py --dataset lsmdc

python _tools/multi_task_vid_filter.py --dataset msrvtt

python _tools/multi_task_vid_filter.py --dataset msvd 

python _tools/multi_task_vid_filter.py --dataset tgif
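The filtering rule described above can be sketched in a few lines. This is only an illustration of the idea and not the logic of `_tools/multi_task_vid_filter.py`: the function name, the dictionary layout, and the video ids are hypothetical.

```python
def filter_training_videos(train_ids_by_task, test_ids_by_task):
    """Remove from each task's training set any video that another task tests on."""
    filtered = {}
    for task, train_ids in train_ids_by_task.items():
        blocked = set()
        for other_task, test_ids in test_ids_by_task.items():
            if other_task != task:
                blocked.update(test_ids)
        filtered[task] = [vid for vid in train_ids if vid not in blocked]
    return filtered


if __name__ == "__main__":
    # Made-up ids, for illustration only.
    train = {"msrvtt-qa": ["v1", "v2", "v3"], "msrvtt-retrieval": ["v2", "v4"]}
    test = {"msrvtt-qa": ["v4"], "msrvtt-retrieval": ["v1", "v9"]}
    print(filter_training_videos(train, test))
```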

Training

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_multi_task_mlm.py --config _args/args_multi-task_all.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, change the main script to main_multi_task_multi_head.py and point --path_ckpt to the task-specific pre-trained weights.

Citation

If you find this code useful, please consider citing the following papers:

@inproceedings{li2023lavender, 
  author = {Linjie Li and Zhe Gan and Kevin Lin and Chung-Ching Lin and Ce Liu and Zicheng Liu and Lijuan Wang}, 
  title = {LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}

@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}

License

Our research code is released under the MIT license.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.