
[CVPR 2023] LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Paper | Slide | Poster | Video

<img src='_imgs/intro.png' width='100%' />

This repo is the official implementation of the CVPR 2023 paper <br> "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling" <br> Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu and Lijuan Wang

We explore a unified video-language framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning. Extensive analyses further demonstrate that LAVENDER can (i) seamlessly support all downstream tasks with a single set of parameter values when multi-task finetuned; (ii) generalize to various downstream tasks with limited training samples; and (iii) enable zero-shot evaluation on video question answering tasks.
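The VTM-as-MLM idea can be sketched in a few lines of PyTorch: the multimodal encoder's output is fed to a single linear MLM head, and video-text matching is read off as the score of the literal tokens "true" vs. "false" at a [MASK] position. This is only an illustrative sketch; the vocabulary size and all token ids below are placeholders, not the actual tokenizer's.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 30522   # BERT-base vocab size (assumption)
HIDDEN = 768
MASK_ID = 103        # [MASK] id in a BERT-style tokenizer (assumption)

class MLMHead(nn.Module):
    """A lightweight MLM head: one linear layer over the vocabulary,
    shared by pre-training and all downstream tasks."""
    def __init__(self, hidden=HIDDEN, vocab=VOCAB_SIZE):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, hidden_states):
        return self.proj(hidden_states)  # (B, L, vocab)

# Toy multimodal-encoder output: batch of 2, sequence of 8 tokens.
hidden_states = torch.randn(2, 8, HIDDEN)
input_ids = torch.full((2, 8), 0)
input_ids[:, 0] = MASK_ID  # a [MASK] inserted for video-text matching

head = MLMHead()
logits = head(hidden_states)

# VTM as MLM: read the prediction at each [MASK] position and compare
# only the scores of the literal tokens "true" and "false".
mask_pos = (input_ids == MASK_ID).nonzero(as_tuple=True)
mask_logits = logits[mask_pos]          # (num_masks, vocab)
true_id, false_id = 2995, 6270          # hypothetical tokenizer ids
match = mask_logits[:, true_id] > mask_logits[:, false_id]
print(match.shape)  # torch.Size([2])
```

The same head and the same cross-entropy loss then serve pre-training, QA, retrieval, and captioning, which is what removes the need for task-specific decoders.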

Requirements

This code is largely based on the official PyTorch implementation of VIOLET, and was developed with Python 3.8, PyTorch 1.7, and Torchvision 0.8.

Data preprocessing

Copied from VIOLET

Since we use external datasets that we cannot redistribute, we provide preprocessing tools to extract sparsely sampled video frames into our compressed format.

cd _tools

# We use 5 frames for both pre-training and downstream tasks
python extract_video-frame.py --path=msrvtt --sample=5 # output: msrvtt.pkl

# We use file.seek() instead of loading the entire file to reduce memory cost during distributed pre-training
python extract_tsv.py --path=msrvtt # output: msrvtt.tsv, msrvtt.lineidx
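The .lineidx trick above can be sketched as follows: record the byte offset of every TSV row once, then seek() to a row on demand instead of holding the whole file in memory. The file names and contents below are made up for the demo.

```python
import os
import tempfile

def build_lineidx(tsv_path, lineidx_path):
    """Record the byte offset at which each TSV row starts."""
    offsets = []
    with open(tsv_path, "rb") as f:
        pos = f.tell()
        while f.readline():
            offsets.append(pos)
            pos = f.tell()
    with open(lineidx_path, "w") as out:
        out.writelines(f"{o}\n" for o in offsets)
    return offsets

def read_row(tsv_path, offsets, i):
    """Random access: jump straight to row i without reading earlier rows."""
    with open(tsv_path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode().rstrip("\n").split("\t")

# Demo with a tiny made-up TSV.
tmp = tempfile.mkdtemp()
tsv = os.path.join(tmp, "demo.tsv")
with open(tsv, "w") as f:
    f.write("vid001\tframes_a\nvid002\tframes_b\nvid003\tframes_c\n")
offs = build_lineidx(tsv, tsv.replace(".tsv", ".lineidx"))
print(read_row(tsv, offs, 2))  # ['vid003', 'frames_c']
```

Each distributed worker can then open the TSV read-only and fetch only the rows it needs, which keeps per-process memory flat regardless of dataset size.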

There are partial examples (WebVid2.5M, CC3M, TGIF-Action, MSVD-QA, and MSRVTT-Retrieval) to help formulate the input data.
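For reference, sampling 5 sparse frames can be sketched as picking evenly spaced indices. The exact strategy in extract_video-frame.py may differ (e.g. segment centers vs. endpoints), so treat this as an assumption:

```python
def sparse_sample(num_frames_in_video, num_samples=5):
    # Take the centre frame of each of num_samples equal segments
    # (an assumed strategy; the repo's script may choose differently).
    seg = num_frames_in_video / num_samples
    return [int(seg * (i + 0.5)) for i in range(num_samples)]

# A 300-frame clip yields one frame from the middle of each fifth.
print(sparse_sample(300))  # [30, 90, 150, 210, 270]
```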

Pretraining

Task-specific Baseline: Pre-training with Video-Text Matching (VTM) + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot

LAVENDER: Unified Pre-training with VTM as MLM + MLM

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_pretrain_mlm.py --config _args/args_pretrain.json --path_output _snapshot

Downstream

Download downstream datasets to ./_datasets.

Multiple-Choice Question Answering

For the task-specific baseline, update the main script to main_qamc_task_specific.py or main_retmc_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.

Open-Ended Question Answering

For the task-specific baseline, update the main script to main_qaoe_task_specific.py, and point --path_ckpt to the pre-trained task-specific baseline checkpoint.
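Under the MLM interface, open-ended QA reduces to predicting the answer as a text token at a [MASK] position appended to the question. A toy sketch of that decoding step, with made-up token ids:

```python
import torch

# Hypothetical answer vocabulary mapped into the text vocabulary.
answer2tok = {"cat": 4937, "dog": 3899, "run": 2448}
tok2answer = {v: k for k, v in answer2tok.items()}

def decode_answer(mask_logits):
    """Pick the highest-scoring answer token at the [MASK] position,
    restricted to tokens that are valid answers."""
    ids = torch.tensor(list(answer2tok.values()))
    best = ids[mask_logits[ids].argmax()].item()
    return tok2answer[best]

# Pretend the MLM head produced these vocabulary logits at [MASK].
vocab_logits = torch.zeros(30522)
vocab_logits[3899] = 5.0   # the head favours "dog"
print(decode_answer(vocab_logits))  # dog
```

Because answers live in the text vocabulary, the same head also supports zero-shot QA: no new classifier layer has to be trained per answer set.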

Text-to-Video Retrieval

For task-specific baseline, update the main script to main_retrieval_task_specific.py or eval_retrieval_task_specific.py, and point --path_ckpt to task-specific checkpoints.

Video Captioning (MSRVTT, MSVD)

Finetuning on video captioning requires additional environment and dataset setup. We closely follow the instructions from SwinBERT; please check their repo for more details.
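For intuition, captioning under the MLM interface generates one token at a time: append a [MASK], fill it in, and stop at the separator. A stub predictor stands in for the multimodal encoder + MLM head here; the real generation logic lives in the training scripts.

```python
SEP = "[SEP]"

def dummy_predict(tokens):
    """Stand-in for the MLM head's prediction at the appended [MASK].
    A fixed caption replaces the real model for this sketch."""
    caption = ["a", "man", "is", "cooking", SEP]
    return caption[len(tokens)]

def generate(max_len=20):
    tokens = []
    while len(tokens) < max_len:
        nxt = dummy_predict(tokens)   # fill the appended [MASK]
        if nxt == SEP:                # separator ends the caption
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate())  # a man is cooking
```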

Note that the data folder should have the following structure:

${REPO_DIR}  
    |-- _datasets  
    |   |-- MSRVTT-v2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |-- MSVD  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |-- ... 
    |-- ... 

Once the Docker environment and the dataset have been set up correctly, run the following command for training.

For the task-specific baseline, simply update --path_ckpt to the task-specific pre-trained weights.

Multi-task Training

Data Filtering

As mentioned in our paper, the test splits of the tasks above may overlap with the training data of other tasks. We therefore first perform a data filtering step to remove the test data of each task from the training data of the other tasks.

python _tools/multi_task_vid_filter.py --dataset lsmdc

python _tools/multi_task_vid_filter.py --dataset msrvtt

python _tools/multi_task_vid_filter.py --dataset msvd 

python _tools/multi_task_vid_filter.py --dataset tgif
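The filtering idea can be sketched as a simple set difference over video ids; the per-dataset details live in multi_task_vid_filter.py, and the ids and splits below are invented for illustration.

```python
def filter_train(train_ids, test_ids_per_task):
    """Drop any training clip whose video id appears in any task's
    test split, so multi-task training never sees test videos."""
    banned = set().union(*test_ids_per_task.values())
    return [v for v in train_ids if v not in banned]

# Made-up splits: "vid3" is a QA test video, "vid5" a retrieval test video.
test_splits = {"qa": {"vid3"}, "retrieval": {"vid5"}}
train = ["vid1", "vid2", "vid3", "vid4", "vid5"]
print(filter_train(train, test_splits))  # ['vid1', 'vid2', 'vid4']
```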

Training

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m torch.distributed.launch --nproc_per_node=8 --master_port=5566 main_multi_task_mlm.py --config _args/args_multi-task_all.json --path_output _snapshot --path_ckpt <path to downloaded pre-trained weights>

For the task-specific baseline, update the main script to main_multi_task_multi_head.py and point --path_ckpt to the task-specific pre-trained weights.

Citation

If you find this code useful, please consider citing the following papers:

@inproceedings{li2023lavender, 
  author = {Linjie Li and Zhe Gan and Kevin Lin and Chung-Ching Lin and Ce Liu and Zicheng Liu and Lijuan Wang}, 
  title = {LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling}, 
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2023} 
}
@inproceedings{fu2021violet, 
  author = {Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang Wang and Lijuan Wang and Zicheng Liu}, 
  title = {VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling}, 
  booktitle = {arXiv:2111.12681}, 
  year = {2021} 
}

License

Our research code is released under the MIT license.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.