
VindLU <img src="./imgs/vindlu.png" style="width: 40px">

VindLU <img src="./imgs/vindlu.png" style="width: 20px">: A Recipe for Effective Video-and-Language Pretraining [arXiv] [project page]

Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Official PyTorch code for VindLU, a recipe for effective Video-and-Language (VidL) Pretraining.

News:

2022-12-07: Our annotation files and trained checkpoints are available on Google Drive.

Highlights:

Revealed the importance of each component in VidL pretraining (see our paper for details).
Cheap to train: 82 V100 GPU days to train on the joint 10M video and 15M image datasets; 15 V100 GPU days on the 5M datasets.
State-of-the-art performance on video retrieval and VidQA tasks. Specifically, our model achieves 61.2% (+7.8%) R@1 on DiDeMo and 55.0% (+6.1%) R@1 on ActivityNet-Captions.

Results

Text-to-Video Retrieval (R@1 accuracy).

Pretrained Data	MSR-VTT	DiDeMo	ANet	SSv2-Label	SSv2-Template	Checkpoints
5M	43.8	54.6	51.1	51.2	82.2	model
17M	45.3	59.2	54.4	53.0	86.2	model
25M	46.5	61.2	55.0	53.1	83.3	model

Video Question Answering (Top-1 accuracy).

Pretrained Data	ANet-QA	MSRVTT-QA	MSRVTT-MC	TVQA	Checkpoints
5M	44.2	43.6	95.2	79.0	model
17M	44.6	43.8	96.7	78.8	model
25M	44.7	44.6	97.1	79.0	model

Setup

The specific packages used in our experiment are listed in vl.yml. You can create a conda env containing these packages as follows:

# create 
conda env create -f vl.yml
# activate
conda activate vl

In your ~/.bashrc file, set the environment variables:

export VL_EXP_DIR="/path/to/ckpts_and_logs"
export VL_DATA_DIR="/path/to/data"

The datasets are stored under $VL_DATA_DIR and experiment outputs are stored under $VL_EXP_DIR. These variables are accessed by the config files in the configs/ directory.
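As a minimal sketch of how a config file might resolve these two variables: the helper name `resolve_dirs` is hypothetical and is not taken from the repository, whose actual code in configs/ may differ.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (data_dir, exp_dir) read from VL_DATA_DIR and VL_EXP_DIR.

    Hypothetical helper for illustration; fails early with a clear
    message when either variable has not been exported.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("VL_DATA_DIR", "VL_EXP_DIR") if not env.get(k)]
    if missing:
        raise RuntimeError("Please export: " + ", ".join(missing))
    return Path(env["VL_DATA_DIR"]), Path(env["VL_EXP_DIR"])


# Usage with an explicit mapping, so the sketch runs without touching ~/.bashrc.
data_dir, exp_dir = resolve_dirs(
    {"VL_DATA_DIR": "/path/to/data", "VL_EXP_DIR": "/path/to/ckpts_and_logs"}
)
print(data_dir / "anno_downstream" / "didemo_ret_train.json")
```

Failing early on a missing variable tends to be easier to debug than a path error that surfaces later inside a data loader.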

[Optional] Our codebase supports using wandb to monitor training. If you want to use wandb, you will need to set it up following this very short instruction, and also set wandb.enable to True in the configs.

Data

Organize your data according to the following structure:

$VL_DATA_DIR
    |-- anno_pretrain     
        |-- webvid_train.sqlite.db
        |-- ...
    |-- anno_downstream
        |-- didemo_ret_train.json
        |-- ...
    |-- videos_images
        |-- webvid_2fps_224
            |-- 1053400385.mp4
            |-- ...
        |-- ...

Our prepared annotations are available on Google Drive.

Refer to DATA.md for how to prepare the image/video datasets.

The annotation file is in json format and can be loaded as a list of dictionaries. Each dictionary is {'image': path_to_image, 'caption': image_caption} for an image-text dataset, and {'image': path_to_video, 'caption': video_caption} for a video-text dataset. Note that we use the same key, image, for both image-text and video-text datasets for simplicity.
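The format described above can be illustrated with a small sketch. The function name `load_annotations`, the captions, the image path and the file name `toy_train.json` are made up for illustration; only the two keys come from this README.

```python
import json
import os
import tempfile


def load_annotations(path):
    """Load an annotation file and check each entry has 'image' and 'caption'."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    for i, entry in enumerate(data):
        if not {"image", "caption"} <= set(entry):
            raise ValueError(f"entry {i} is missing 'image' or 'caption'")
    return data


# Illustrative entries only: a video and an image share the same 'image' key.
example = [
    {"image": "webvid_2fps_224/1053400385.mp4", "caption": "a made-up video caption"},
    {"image": "some_image_dir/0001.jpg", "caption": "a made-up image caption"},
]

with tempfile.TemporaryDirectory() as tmp:
    anno_path = os.path.join(tmp, "toy_train.json")
    with open(anno_path, "w", encoding="utf-8") as f:
        json.dump(example, f)
    loaded = load_annotations(anno_path)
    print(len(loaded), loaded[0]["caption"])
```

Whether the stored paths are absolute or relative to a directory under $VL_DATA_DIR is not specified here, so the relative paths in the sketch are an assumption.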

We store the pretraining annotation files in SQLite, a file-based database. SQLite allows us to load captions on demand and thus saves a lot of CPU memory. With the json format, the Dataloader would use more than 200GB of CPU memory for 8 GPUs with 3 workers per GPU process, because each worker needs to keep a copy of the json files in memory and the json files are large (~5GB, and even larger once loaded as Python objects).

You can use create_sqlite_db.py to convert the json annotation files into SQLite files.
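To illustrate why an on-demand store helps, here is a minimal sketch using Python's standard sqlite3 module. The table name `annos`, its two columns, and the function names are assumptions made for this example; they are not necessarily the schema that create_sqlite_db.py produces.

```python
import json
import os
import sqlite3
import tempfile


def build_db(entries, db_path):
    """Write each annotation dict as one JSON-encoded row keyed by its index."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back on error
        con.execute("CREATE TABLE IF NOT EXISTS annos (id INTEGER PRIMARY KEY, data TEXT)")
        con.executemany(
            "INSERT INTO annos (id, data) VALUES (?, ?)",
            ((i, json.dumps(e)) for i, e in enumerate(entries)),
        )
    con.close()


def fetch(db_path, idx):
    """Read a single annotation from disk without loading the whole file."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute("SELECT data FROM annos WHERE id = ?", (idx,)).fetchone()
    finally:
        con.close()
    if row is None:
        raise IndexError(idx)
    return json.loads(row[0])


# Toy data; the captions are invented.
toy = [{"image": f"vid_{i}.mp4", "caption": f"caption {i}"} for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    db_file = os.path.join(tmp, "toy.sqlite.db")
    build_db(toy, db_file)
    second = fetch(db_file, 1)
    print(second["caption"])
```

Each worker process only holds the rows it has just fetched, so memory use no longer scales with the size of the annotation file times the number of workers.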

Training and Inference

All the tasks can be launched via the python script tools/run.py.

The script supports both slurm and local runs.

Without slurm, you need to launch the training script on each node yourself.

The script uses slurm if the sbatch command exists. You can force a local run by adding the argument --no_slurm.
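The decision described above can be sketched as follows. The function name `use_slurm` and its parameters are hypothetical; the actual logic in tools/run.py may be implemented differently.

```python
import shutil


def use_slurm(no_slurm=False, which=shutil.which):
    """Return True when jobs should be submitted through sbatch.

    `which` is injectable so the check can be exercised on machines
    that do not have slurm installed.
    """
    return (not no_slurm) and which("sbatch") is not None


# On a machine without slurm this prints False; --no_slurm always forces False.
print(use_slurm())
print(use_slurm(no_slurm=True))
```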

Usage:

python tools/run.py --slurm_args SLURM_ARGS --jobname JOBNAME \
    --dep_jobname DEP_JOBNAME \
    --nnodes NNODES --ngpus NGPUS --task TASK \
    --config CONFIG_FILE --model_args MODEL_ARGS

SLURM_ARGS: the additional arguments for slurm. You can set the default arguments (DEFAULT_SLURM_ARGS in tools/run.py). SLURM_ARGS will override the default arguments.
JOBNAME: The experiment name and job_name in slurm. All the outputs (checkpoints and logs) will be written to $VL_EXP_DIR/JOBNAME.
DEP_JOBNAME: The dependent job. This job will start only when DEP_JOBNAME is finished. You can use this feature to submit your pretraining, finetuning and evaluation jobs at the same time. Only valid when slurm is available.
NNODES: The number of nodes to use.
NGPUS: How many GPUs to use in each node.
TASK: This job will run the script tasks/TASK.py. Supported tasks:
- "pretrain": for pretraining.
- "retrieval": for text-to-video retrieval task.
- "retrieval_mc": for multi-choice VidQA on MSRVTT-MC dataset.
- "vqa": for open-ended V(id)QA task.
CONFIG_FILE: The path to the config file. For example, configs/pretrain.py for pretraining and configs/ret_didemo.py for the video retrieval task on the DiDeMo dataset.
MODEL_ARGS: The arguments to override the predefined arguments in CONFIG_FILE. Format: "key1 value1 key2 value2 ...". A value of the form "eval(SOME_CODE)" will be evaluated using Python's eval function.
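The MODEL_ARGS format can be illustrated with a small parser. This is a sketch under assumptions, not the repository's implementation: the function name `parse_model_args` is hypothetical, dotted keys are assumed to address nested config entries, and plain values are left as strings rather than converted to the type found in the config.

```python
import shlex


def parse_model_args(arg_string):
    """Turn "key1 value1 key2 value2 ..." into a nested dict of overrides."""
    tokens = shlex.split(arg_string)  # honours quotes around eval(...) values
    if len(tokens) % 2 != 0:
        raise ValueError("expected pairs: key1 value1 key2 value2 ...")
    overrides = {}
    for key, raw in zip(tokens[0::2], tokens[1::2]):
        if raw.startswith("eval(") and raw.endswith(")"):
            value = eval(raw[5:-1])  # only safe for trusted command lines
        else:
            value = raw  # assumption: type conversion happens elsewhere
        node = overrides
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return overrides


print(parse_model_args("train_corpus webvid_cc3m criterion.loss_weight.vtc 1.0"))
print(parse_model_args("evaluate True test_types 'eval([\"test\"])' num_frames_test 12"))
```

Because eval executes arbitrary Python, values of this form should only come from command lines you wrote yourself.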

Pre-Training

Example for pretraining on webvid_cc3m (5M):

corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0"

You can use this script if 1) slurm is available, or 2) slurm is not available but only one node is used.

If using slurm, remember to add --slurm_args SLURM_ARGS according to your cluster's settings. The same applies to the following examples.

You can change corpus to "webvid_14m" for the 17M corpus and "webvid10m_14m" for the 25M corpus.

See the variable available_corpus in configs/data.py for all the supported pretraining corpora. You can add your own datasets by adding them to available_corpus.

Multi-node pretraining without slurm

The following example pretrains on 2 nodes with 4 GPUs per node without slurm.

When running locally without slurm, you need to:

Specify MASTER_ADDR and MASTER_PORT explicitly so that all the nodes use the same endpoint.
Run the script on each node. The logs will only be displayed on the master node.

export MASTER_ADDR="ip address of master node" # change to your real ip.
export MASTER_PORT=40041 # some unused port.
corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0" \
    --no_slurm
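Since every node must agree on the same endpoint, it can help to validate the two variables before launching. The check below is a hypothetical pre-flight helper, not part of tools/run.py; the function name `master_endpoint` is made up for this sketch.

```python
import os


def master_endpoint(env=None):
    """Return (addr, port) from MASTER_ADDR / MASTER_PORT, or raise if unusable."""
    env = os.environ if env is None else env
    addr = env.get("MASTER_ADDR", "").strip()
    port_text = env.get("MASTER_PORT", "").strip()
    if not addr:
        raise ValueError("MASTER_ADDR is not set")
    if not port_text.isdigit():
        raise ValueError("MASTER_PORT must be an integer")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError("MASTER_PORT must be between 1 and 65535")
    return addr, port


# Usage with an explicit mapping; the address is a placeholder.
print(master_endpoint({"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "40041"}))
```

Running the same check on each node before starting the job makes a mistyped address or port visible immediately instead of as a hang during process-group setup.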

Finetuning and Evaluation

The following examples are based on the pretrained model from the section above.

Text-to-video retrieval

Supported datasets: msrvtt, msrvtt-9k, didemo, anet. Example for msrvtt dataset:

dataset=msrvtt
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_${dataset}

if [[ "$dataset" == *"msrvtt"* ]]; then ngpus=4; else ngpus=1; fi
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"

Video Question Answering

Open-ended QA:

dataset=msrvtt # supported: msrvtt, anet
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-qa_${dataset}

ngpus=1
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"

MSRVTT-MC (multiple-choice). We directly evaluate using the finetuned retrieval model.

pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_msrvtt
nfrm_test=12 # number of test frames; used in the job name and in num_frames_test

# evaluation
python tools/run.py --nnodes 1 --ngpus 1 --task retrieval_mc \
    --jobname ${ft_name}/eval_${nfrm_test}frm-mc --dep_jobname ${ft_name} \
    --config configs/ret_msrvtt_mc.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"

Acknowledgement

This code uses resources from Singularity, transformers, ALBEF, ClipBERT, frozen. The code is implemented using PyTorch. We thank the authors for open-sourcing their awesome projects.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{cheng2022vindlu,
  title={VindLU: A Recipe for Effective Video-and-Language Pretraining},
  author={Cheng, Feng and Wang, Xizi and Lei, Jie and Crandall, David and Bansal, Mohit and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2212.05051},
  year={2022}
}