VindLU <img src="./imgs/vindlu.png" style="width: 40px">
VindLU <img src="./imgs/vindlu.png" style="width: 20px">: A Recipe for Effective Video-and-Language Pretraining [arXiv] [project page]
Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
Official PyTorch code for VindLU, a recipe for effective Video-and-Language (VidL) Pretraining.
News:
- 2022-12-07: Our annotation files and trained checkpoints are available on Google Drive.
Highlights:
- Revealed the importance of each component in VidL pretraining (see our paper for details).
- Cheap to train: 82 V100 GPU days to train on the joint 10M video and 15M image datasets; 15 V100 GPU days on the 5M datasets.
- State-of-the-art performance on the video retrieval and VidQA tasks. Specifically, our model achieves 61.2% (+7.8%) R@1 on DiDeMo and 55.0% (+6.1%) on ActivityNet-Captions.
Results
Text-to-Video Retrieval (R@1 accuracy).
Pretrained Data | MSR-VTT | DiDeMo | ANet | SSv2-Label | SSv2-Template | Checkpoints |
---|---|---|---|---|---|---|
5M | 43.8 | 54.6 | 51.1 | 51.2 | 82.2 | model |
17M | 45.3 | 59.2 | 54.4 | 53.0 | 86.2 | model |
25M | 46.5 | 61.2 | 55.0 | 53.1 | 83.3 | model |
Video Question Answering (Top-1 accuracy).
Pretrained Data | ANet-QA | MSRVTT-QA | MSRVTT-MC | TVQA | Checkpoints |
---|---|---|---|---|---|
5M | 44.2 | 43.6 | 95.2 | 79.0 | model |
17M | 44.6 | 43.8 | 96.7 | 78.8 | model |
25M | 44.7 | 44.6 | 97.1 | 79.0 | model |
Setup
The specific packages used in our experiments are listed in vl.yml; you can easily create a conda env containing these packages:
# create
conda env create -f vl.yml
# activate
conda activate vl
In your ~/.bashrc file, set the environment variables:
export VL_EXP_DIR="/path/to/ckpts_and_logs"
export VL_DATA_DIR="/path/to/data"
The datasets are stored under $VL_DATA_DIR and experiment outputs are stored under $VL_EXP_DIR.
These variables are accessed by the config files in the configs/ directory.
[Optional] Our codebase supports using wandb to monitor training. If you want to use wandb, set it up following this very short instruction, and also set wandb.enable in the configs to True.
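As an illustration, the corresponding entry in a config such as configs/pretrain.py might look roughly like the sketch below. Only the wandb.enable option is documented in this README, so the dict layout is an assumption; check the actual config files.

```python
# Illustrative sketch only: this README documents a wandb.enable option;
# the exact layout of the wandb section in the config files is an assumption.
wandb = dict(
    enable=True,  # set to True to log training metrics to Weights & Biases
)
```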
Data
Organize your data in the following structure:
$VL_DATA_DIR
|-- anno_pretrain
|-- webvid_train.sqlite.db
|-- ...
|-- anno_downstream
|-- didemo_ret_train.json
|-- ...
|-- videos_images
|-- webvid_2fps_224
|-- 1053400385.mp4
|-- ...
|-- ...
Our prepared annotations are available on Google Drive.
Refer to DATA.md for how to prepare the image/video datasets.
The annotation files are in json format and can be loaded as a list of dictionaries. Each dictionary is {'image': path_to_image, 'caption': image_caption} for image-text datasets, and {'image': path_to_video, 'caption': video_caption} for video-text datasets. Note that we use the same key image for both image-text and video-text datasets for simplicity.
We store the pretraining annotation files using the file-based database SQLite. SQLite allows us to load the captions on demand and thus save a lot of CPU memory. With the json format, the dataloaders would use more than 200GB of CPU memory for 8 GPUs with 3 workers per GPU process, because each worker keeps its own copy of the json files in memory and the files are large (~5GB, and even larger when loaded as python objects).
You can use create_sqlite_db.py to convert the json annotation files into SQLite files.
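For intuition, a minimal sketch of such a conversion and of on-demand loading is shown below. This is not the exact logic or schema of create_sqlite_db.py; the table and column names are assumptions, so use the provided script to produce the real files.

```python
import json
import sqlite3

# Hypothetical sketch of a json -> SQLite conversion. The real create_sqlite_db.py
# may use a different schema; the table/column names here are assumptions.
def json_to_sqlite(json_path: str, db_path: str) -> None:
    with open(json_path) as f:
        annos = json.load(f)  # list of {'image': ..., 'caption': ...} dicts
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annos (id INTEGER PRIMARY KEY, image TEXT, caption TEXT)"
    )
    conn.executemany(
        "INSERT INTO annos (id, image, caption) VALUES (?, ?, ?)",
        [(i, a["image"], a["caption"]) for i, a in enumerate(annos)],
    )
    conn.commit()
    conn.close()

def load_anno(db_path: str, index: int) -> dict:
    # Each dataloader worker can fetch a single row on demand instead of
    # holding the whole json list in memory.
    conn = sqlite3.connect(db_path)
    image, caption = conn.execute(
        "SELECT image, caption FROM annos WHERE id = ?", (index,)
    ).fetchone()
    conn.close()
    return {"image": image, "caption": caption}
```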
Training and Inference
All the tasks can be launched via the python script tools/run.py.
- Supports slurm and running locally: slurm is used if the sbatch command exists, and you can force a local run by adding the argument --no_slurm. If there is no slurm, you need to submit the training script to each node yourself.
Usage:
python tools/run.py --slurm_args SLURM_ARGS --jobname JOBNAME \
--dep_jobname DEP_JOBNAME \
--nnodes NNODES --ngpus NGPUS --task TASK \
--config CONFIG_FILE --model_args MODEL_ARGS
- SLURM_ARGS: the additional arguments for slurm. You can set the default arguments via DEFAULT_SLURM_ARGS in tools/run.py; SLURM_ARGS will override the defaults.
- JOBNAME: the experiment name and the slurm job name. All the outputs (checkpoints and logs) will be written to $VL_EXP_DIR/JOBNAME.
- DEP_JOBNAME: the dependency job. This job will start only when DEP_JOBNAME has finished. You can use this feature to submit your pretraining, finetuning, and evaluation jobs at the same time. Only valid when slurm is available.
- NNODES: the number of nodes to use.
- NGPUS: the number of GPUs to use on each node.
- TASK: this job will run the script tasks/TASK.py. Supported tasks:
  - "pretrain": pretraining.
  - "retrieval": the text-to-video retrieval task.
  - "retrieval_mc": multiple-choice VidQA on the MSRVTT-MC dataset.
  - "vqa": the open-ended V(id)QA task.
- CONFIG_FILE: the path to the config file, e.g., configs/pretrain.py for pretraining and configs/ret_didemo.py for the video retrieval task on the DiDeMo dataset.
- MODEL_ARGS: the arguments that override the predefined arguments in CONFIG_FILE. Format: "key1 value1 key2 value2 ...". A value of the form "eval(SOME_CODE)" will be evaluated using python's eval function.
Pre-Training
Example for pretraining on webvid_cc3m (5M):
corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
--jobname $pt_name \
--config configs/pretrain.py \
--model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0"
You can use this script 1) with slurm, or 2) without slurm when only 1 node is used. If using slurm, remember to add --slurm_args SLURM_ARGS according to your cluster's settings; the same applies to the following examples.
You can change corpus to "webvid_14m" for the 17M corpus or "webvid10m_14m" for the 25M corpus. See the variable available_corpus in configs/data.py for all the supported pretraining corpora. You can add your own datasets by adding them to available_corpus, as sketched below.
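For example, registering a new video-text dataset might look roughly like the snippet below, added to configs/data.py. The exact entry format expected by available_corpus (field names, nesting) is an assumption here, so copy the structure of an existing entry in that file.

```python
# Illustrative sketch only: the real entry format of available_corpus in
# configs/data.py should be copied from the existing datasets; the field
# names and paths below are assumptions.
available_corpus["my_video_dataset"] = [
    dict(
        anno_path="/path/to/anno_pretrain/my_video_dataset_train.sqlite.db",  # annotation file
        data_root="/path/to/videos_images/my_video_dataset_2fps_224",         # video directory
        media_type="video",                                                    # "video" or "image"
    )
]
```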
Multi-node pretrain without slurm
The following example pretrains on 2 nodes with 4 GPUs per node, without slurm.
When running locally without slurm, you need to:
- specify MASTER_ADDR and MASTER_PORT explicitly to make sure all the nodes use the same endpoint.
- run the script on each node. The logs will only be displayed on the master node.
export MASTER_ADDR="ip address of master node" # change to your real ip.
export MASTER_PORT=40041 # some unused port.
corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
--jobname $pt_name \
--config configs/pretrain.py \
--model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0" \
--no_slurm
Finetuning and Evaluation
The following examples are based on the model pretrained in the section above.
Text-to-video retrieval
Supported datasets: msrvtt, msrvtt-9k, didemo, anet.
Example for the msrvtt dataset:
dataset=msrvtt
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_${dataset}
if [[ "$dataset" == *"msrvtt"* ]]; then ngpus=4; else ngpus=1; fi
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi
# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
--jobname ${ft_name} --dep_jobname ${pt_name} \
--config configs/ret_${dataset}.py \
--model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"
# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
--jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
--config configs/ret_${dataset}.py \
--model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
evaluate True test_types 'eval([\"test\"])' num_frames_test ${nfrm_test}"
Video Question Answering
- Open-ended QA:
dataset=msrvtt # supported: msrvtt, anet
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-qa_${dataset}
ngpus=1
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi
# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
--jobname ${ft_name} --dep_jobname ${pt_name} \
--config configs/qa_${dataset}.py \
--model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"
# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
--jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
--config configs/qa_${dataset}.py \
--model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
evaluate True test_types 'eval([\"test\"])' num_frames_test ${nfrm_test}"
- MSRVTT-MC (multiple-choice). We directly evaluate using the finetuned retrieval model.
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_msrvtt
# evaluation
python tools/run.py --nnodes 1 --ngpus 1 --task retrieval_mc \
--jobname ${ft_name}/eval_12frm-mc --dep_jobname ${ft_name} \
--config configs/ret_msrvtt_mc.py \
--model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
evaluate True test_types 'eval([\"test\"])' num_frames_test 12"
Acknowledgement
This code used resources from Singularity, transformers, ALBEF, ClipBERT, frozen. The code is implemented using PyTorch. We thank the authors for open-sourcing their awesome projects.
Citation
If you find this project useful for your research, please use the following BibTeX entry.
@article{cheng2022vindlu,
title={VindLU: A Recipe for Effective Video-and-Language Pretraining},
author={Cheng, Feng and Wang, Xizi and Lei, Jie and Crandall, David and Bansal, Mohit and Bertasius, Gedas},
journal={arXiv preprint arXiv:2212.05051},
year={2022}
}