VindLU <img src="./imgs/vindlu.png" style="width: 40px">

VindLU <img src="./imgs/vindlu.png" style="width: 20px">: A Recipe for Effective Video-and-Language Pretraining [arXiv] [project page]

Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Official PyTorch code for VindLU, a recipe for effective Video-and-Language (VidL) Pretraining.

News:

Highlights:

<p align="center"> <img src="./imgs/teaser.jpg" style="width: 95%"> </p>

Results

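Text-to-Video Retrieval (R@1 accuracy).

| Pretrained Data | MSR-VTT | DiDeMo | ANet | SSv2-Label | SSv2-Template | Checkpoints |
|---|---|---|---|---|---|---|
| 5M | 43.8 | 54.6 | 51.1 | 51.2 | 82.2 | model |
| 17M | 45.3 | 59.2 | 54.4 | 53.0 | 86.2 | model |
| 25M | 46.5 | 61.2 | 55.0 | 53.1 | 83.3 | model |

Video Question Answering (Top-1 accuracy).

| Pretrained Data | ANet-QA | MSRVTT-QA | MSRVTT-MC | TVQA | Checkpoints |
|---|---|---|---|---|---|
| 5M | 44.2 | 43.6 | 95.2 | 79.0 | model |
| 17M | 44.6 | 43.8 | 96.7 | 78.8 | model |
| 25M | 44.7 | 44.6 | 97.1 | 79.0 | model |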

Setup

The packages used in our experiments are listed in vl.yml. You can create a conda env containing them as follows:

# create 
conda env create -f vl.yml
# activate
conda activate vl

In your ~/.bashrc file, set the environment variables:

export VL_EXP_DIR="/path/to/ckpts_and_logs"
export VL_DATA_DIR="/path/to/data"

The datasets are stored under $VL_DATA_DIR and experiment outputs are stored under $VL_EXP_DIR. These variables are accessed by the config files in the configs/ directory.
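
As a rough illustration of how a config file might pick up these two variables, here is a minimal sketch. The helper name `resolve_dirs` is hypothetical and not part of the codebase; the actual files in configs/ may read the variables differently.

```python
import os
from pathlib import Path


def resolve_dirs(env=os.environ):
    """Return (data_dir, exp_dir) read from VL_DATA_DIR and VL_EXP_DIR.

    Hypothetical helper: fails early with a readable message if a
    variable is missing, instead of producing a broken path later.
    """
    missing = [k for k in ("VL_DATA_DIR", "VL_EXP_DIR") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Please export {', '.join(missing)} (e.g. in ~/.bashrc)")
    return Path(env["VL_DATA_DIR"]), Path(env["VL_EXP_DIR"])


# Example with explicit values, so the sketch runs without touching ~/.bashrc.
data_dir, exp_dir = resolve_dirs(
    {"VL_DATA_DIR": "/path/to/data", "VL_EXP_DIR": "/path/to/ckpts_and_logs"}
)
anno_dir = data_dir / "anno_downstream"
```

Failing early is a design choice for the sketch: an unset variable otherwise tends to surface much later as a confusing "file not found" error.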

[Optional] Our codebase supports monitoring training with wandb. To use wandb, set it up following this very short instruction, and set wandb.enable to True in the configs.

Data

Organize your data in the following structure:

$VL_DATA_DIR
    |-- anno_pretrain     
        |-- webvid_train.sqlite.db
        |-- ...
    |-- anno_downstream
        |-- didemo_ret_train.json
        |-- ...
    |-- videos_images
        |-- webvid_2fps_224
            |-- 1053400385.mp4
            |-- ...
        |-- ...

Our prepared annotations are available on Google Drive.

Refer to DATA.md for how to prepare the image/video datasets.

Each annotation file is in json format and can be loaded as a list of dictionaries. Each dictionary is {'image': path_to_image, 'caption': image_caption} for an image-text dataset, and {'image': path_to_video, 'caption': video_caption} for a video-text dataset. Note that, for simplicity, we use the same key image for both image-text and video-text datasets.
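
To make the format concrete, the sketch below writes and reloads a toy annotation file. The captions, the image path, and the helper `load_annotations` are made up for illustration; only the `image` and `caption` keys come from the description above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical entries: one video-text pair and one image-text pair.
# Both use the key "image" for the media path.
annotations = [
    {"image": "webvid_2fps_224/1053400385.mp4", "caption": "a made-up video caption"},
    {"image": "some_image_dir/example.jpg", "caption": "a made-up image caption"},
]


def load_annotations(path):
    """Load an annotation file and check that each entry has both keys."""
    with open(path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    for i, entry in enumerate(entries):
        if not {"image", "caption"} <= set(entry):
            raise ValueError(f"entry {i} is missing 'image' or 'caption'")
    return entries


with tempfile.TemporaryDirectory() as tmp:
    anno_path = Path(tmp) / "toy_train.json"
    anno_path.write_text(json.dumps(annotations), encoding="utf-8")
    loaded = load_annotations(anno_path)
```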

We store the pretraining annotation files in SQLite, a file-based database. SQLite lets us load captions on demand, which saves a lot of CPU memory. With the json format, the Dataloader would need more than 200GB of CPU memory for 8 GPUs with 3 workers per GPU process: each of the 8 × 3 = 24 workers keeps its own copy of the json files in memory, and the files are large (~5GB on disk, and even larger once loaded as Python objects).

You can use create_sqlite_db.py to convert the json annotation files into SQLite files.
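
The idea behind the conversion can be sketched with Python's built-in `sqlite3` module. This is not the code of create_sqlite_db.py: the table name `annos`, its columns, and the helpers `json_to_sqlite` and `fetch` are assumptions made for illustration, and the real script may use a different schema.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def json_to_sqlite(json_path, db_path):
    """Copy a list of {'image', 'caption'} dicts into one SQLite table."""
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    con = sqlite3.connect(str(db_path))
    with con:  # commits the transaction on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS annos "
            "(id INTEGER PRIMARY KEY, image TEXT, caption TEXT)"
        )
        con.executemany(
            "INSERT INTO annos (id, image, caption) VALUES (?, ?, ?)",
            [(i, e["image"], e["caption"]) for i, e in enumerate(entries)],
        )
    con.close()
    return len(entries)


def fetch(db_path, index):
    """Read a single annotation on demand; only this row is loaded."""
    con = sqlite3.connect(str(db_path))
    try:
        row = con.execute(
            "SELECT image, caption FROM annos WHERE id = ?", (index,)
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise IndexError(index)
    return {"image": row[0], "caption": row[1]}


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "toy.json"
    db_path = Path(tmp) / "toy.sqlite.db"
    toy = [{"image": "a.mp4", "caption": "first"}, {"image": "b.mp4", "caption": "second"}]
    json_path.write_text(json.dumps(toy), encoding="utf-8")
    n_rows = json_to_sqlite(json_path, db_path)
    second = fetch(db_path, 1)
```

Because each worker only opens the database file and queries single rows, the annotations stay on disk rather than being duplicated in every worker's memory.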

Training and Inference

All tasks can be launched via the Python script tools/run.py.

The script uses slurm if the command sbatch exists. You can force a local run by adding the argument --no_slurm.

Without slurm, you need to submit the training script to each node yourself.
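
The slurm-or-local decision described here can be sketched as follows. This is an illustration of the behaviour, not the actual logic in tools/run.py; the function `use_slurm` and its parameters are hypothetical.

```python
import shutil


def use_slurm(no_slurm=False, which=shutil.which):
    """Return True if jobs should be submitted with sbatch, False to run locally.

    `which` is injectable so the decision can be tested without slurm installed.
    """
    if no_slurm:
        return False  # --no_slurm always forces a local run
    return which("sbatch") is not None  # slurm is used only if sbatch is on PATH


# Local run forced, regardless of what is installed.
forced_local = use_slurm(no_slurm=True)
```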

Usage:

python tools/run.py --slurm_args SLURM_ARGS --jobname JOBNAME \
    --dep_jobname DEP_JOBNAME \
    --nnodes NNODES --ngpus NGPUS --task TASK \
    --config CONFIG_FILE --model_args MODEL_ARGS

Pre-Training

Example for pretraining on webvid_cc3m (5M):

corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0"

You can use this script as-is if 1) you have slurm, or 2) you have no slurm but use only 1 node.

If using slurm, remember to add --slurm_args SLURM_ARGS according to your cluster's settings. The same applies to the following examples.

You can change corpus to "webvid_14m" for the 17M corpus and "webvid10m_14m" for the 25M corpus.

See the variable available_corpus in configs/data.py for all supported pretraining corpora. You can add your own datasets by adding them to available_corpus.
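
The --model_args string in the pretraining example is a flat sequence of key/value pairs, where dots address nested config entries (e.g. criterion.loss_weight.vtc). The sketch below shows one way such overrides could be applied to a nested dictionary. It is an assumption about the general mechanism, not the parser used by the codebase; `apply_overrides` and the default config values are hypothetical.

```python
import ast
import shlex


def apply_overrides(config, model_args):
    """Apply 'key value key value ...' overrides to a nested dict in place."""
    tokens = shlex.split(model_args)  # respects quoted values
    if len(tokens) % 2 != 0:
        raise ValueError("model_args must consist of key/value pairs")
    for key, raw in zip(tokens[0::2], tokens[1::2]):
        try:
            value = ast.literal_eval(raw)  # numbers, True/False, lists, ...
        except (ValueError, SyntaxError):
            value = raw  # plain strings such as corpus names stay strings
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # walk or create nested dicts
        node[leaf] = value
    return config


# Hypothetical defaults, for illustration only.
config = {"train_corpus": "webvid", "criterion": {"loss_weight": {"vtc": 0.5, "mlm": 1.0}}}
apply_overrides(config, "train_corpus webvid_cc3m criterion.loss_weight.vtc 1.0")
```

In this sketch, values that are not Python literals are kept as strings, so an expression such as the `eval([...])` value used in the evaluation commands would need extra handling.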

Multi-node pretrain without slurm

The following example pretrains on 2 nodes with 4 GPUs per node, without slurm.

When running without slurm, set MASTER_ADDR and MASTER_PORT first, then launch the script on each node:

export MASTER_ADDR="ip address of master node" # change to your real ip.
export MASTER_PORT=40041 # some unused port.
corpus="webvid_cc3m"
pt_name=pt_${corpus}_8x64
python tools/run.py --nnodes 2 --ngpus 4 --task pretrain \
    --jobname $pt_name \
    --config configs/pretrain.py \
    --model_args "train_corpus ${corpus} criterion.loss_weight.vtc 1.0" \
    --no_slurm
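
If you are unsure which port is unused, one way to find a candidate on the master node is to let the operating system pick one. This is a small sketch using only the standard library; `find_free_port` is a hypothetical helper and not part of the codebase.

```python
import socket


def find_free_port(host=""):
    """Ask the OS for a TCP port that is unused at the time of the call."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


port = find_free_port()
print(f"export MASTER_PORT={port}")
```

The port is only known to be free at the moment of the check, and every node must export the same value, so choose it once on the master node and reuse it everywhere.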

Finetuning and Evaluation

The following examples use the pretrained model from the section above.

Text-to-video retrieval

Supported datasets: msrvtt, msrvtt-9k, didemo, anet. Example for msrvtt dataset:

dataset=msrvtt
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_${dataset}

if [[ "$dataset" == *"msrvtt"* ]]; then ngpus=4; else ngpus=1; fi
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task retrieval \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/ret_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}" 

Video Question Answering

dataset=msrvtt # supported: msrvtt, anet
pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-qa_${dataset}

ngpus=1
if [[ "$dataset" == *"anet"* ]]; then nfrm_test=32; else nfrm_test=12; fi

# finetune
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name} --dep_jobname ${pt_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${pt_name}/ckpt_09.pth"

# evaluation
python tools/run.py --nnodes 1 --ngpus ${ngpus} --task vqa \
    --jobname ${ft_name}/eval_${nfrm_test}frm --dep_jobname ${ft_name} \
    --config configs/qa_${dataset}.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}" 

For MSRVTT-MC, the model finetuned for msrvtt retrieval is evaluated directly:

pt_name=pt_webvid_cc3m_8x64
ft_name=ft_12frm-${pt_name}-ret_msrvtt
nfrm_test=12  # defined here so the block also works when run on its own

# evaluation
python tools/run.py --nnodes 1 --ngpus 1 --task retrieval_mc \
    --jobname ${ft_name}/eval_${nfrm_test}frm-mc --dep_jobname ${ft_name} \
    --config configs/ret_msrvtt_mc.py \
    --model_args "pretrained_path $VL_EXP_DIR/${ft_name}/ckpt_best.pth \
        evaluate True test_types 'eval([\"test\"])'  num_frames_test ${nfrm_test}"

Acknowledgement

This code uses resources from Singularity, transformers, ALBEF, ClipBERT, and frozen, and is implemented in PyTorch. We thank the authors for open-sourcing their awesome projects.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{cheng2022vindlu,
  title={VindLU: A Recipe for Effective Video-and-Language Pretraining},
  author={Cheng, Feng and Wang, Xizi and Lei, Jie and Crandall, David and Bansal, Mohit and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2212.05051},
  year={2022}
}