VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

<div align=center><img src=img/img_radar.png/ width="75%" height="75%"></div>

<div align=center><img src=img/img_model.png/></div>

Building Environment

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
sh preinstall.sh

Download Checkpoints

| Model | Pretrained Ckpt | Finetuned Ckpt on MSRVTT-Retrieval | Finetuned Ckpt on MSRVTT-Caption |
|---|---|---|---|
| VALOR-B | VALOR-base | VALOR_base_msr_ret.pt | VALOR_base_msr_cap.pt |
| VALOR-L | VALOR-large | VALOR_large_msr_ret.pt | VALOR_large_msr_cap.pt |

Place VALOR-base and VALOR-large under the output directory (VALOR/output/VALOR-base, VALOR/output/VALOR-large).

Prepare Datasets

VALOR is pretrained and tested on multiple vision-language, audio-language and audiovisual-language datasets, for example:

PRETRAIN: VALOR-1M, WebVid-2.5M, CC-3M (VALOR-base)
TEST: VALOR-32K, MSRVTT, MSVD, DiDeMo, LSMDC, ActivityNet, VATEX, AudioCaps, ClothoV1, TGIF-Frame, MSCOCO, VQAV2...

We take MSRVTT as an example to show the data processing procedure; other datasets are processed in a similar way.

The processed dataset path should be as follows:

   ├── datasets
   │   ├── msrvtt
   │   │   ├── raw_videos
   │   │   │    ├── video0.mp4
   │   │   │    └── video1.mp4
   │   │   ├── frames_fps4
   │   │   │    ├── video0
   │   │   │    │   ├──img_0001.jpg
   │   │   │    │   └──img_0002.jpg
   │   │   │    └── video1
   │   │   │    │   ├──img_0001.jpg
   │   │   │    │   └──img_0002.jpg
   │   │   ├── audio_22050hz
   │   │   │    ├── video1.wav
   │   │   │    └── video3.wav
   │   │   ├── standardsplit_train_id.json
   │   │   ├── standardsplit_test_id.json
   │   │   ├── 1KAsplit_train_id.json
   │   │   ├── 1KAsplit_test_id.json
   │   │   ├── txt_mapper.json
   │   │   ├── txt_mapper_1kAsplit_test.json    
   │   │   ├── txt_mapper_vqa.json    
   │   │   └── caption_annotation.json    
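
Before training, it can help to confirm that a dataset directory actually matches this layout. The check below is a minimal sketch and is not part of the repository: the function name `missing_entries` is hypothetical, and the expected names are simply copied from the MSRVTT tree above.

```python
from pathlib import Path

# Entries expected under datasets/msrvtt, copied from the tree above.
EXPECTED_DIRS = ["raw_videos", "frames_fps4", "audio_22050hz"]
EXPECTED_FILES = [
    "standardsplit_train_id.json",
    "standardsplit_test_id.json",
    "1KAsplit_train_id.json",
    "1KAsplit_test_id.json",
    "txt_mapper.json",
    "txt_mapper_1kAsplit_test.json",
    "txt_mapper_vqa.json",
    "caption_annotation.json",
]


def missing_entries(dataset_root):
    """Return the expected sub-directories and files that are absent."""
    root = Path(dataset_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    import sys

    # Hypothetical default path; pass your own dataset root as argument 1.
    root = sys.argv[1] if len(sys.argv) > 1 else "datasets/msrvtt"
    for name in missing_entries(root):
        print("missing:", name)
```

An empty result means every directory and JSON file named in the tree is present; it does not validate the contents of those files.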

We provide processed JSON files for most finetuning datasets here, so you only need to download and extract the raw videos of each dataset.
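
The directory names `frames_fps4` and `audio_22050hz` suggest frames sampled at 4 fps and audio resampled to 22050 Hz. Under that assumption, and assuming `ffmpeg` is installed, the extraction step could be sketched as below. The function names are hypothetical and the exact ffmpeg options used by the authors are not stated here, so treat this as an illustration only.

```python
import subprocess
from pathlib import Path


def frame_cmd(video, frames_root, fps=4):
    """ffmpeg argv writing <frames_root>/<video stem>/img_0001.jpg, img_0002.jpg, ..."""
    out_dir = Path(frames_root) / Path(video).stem
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}",
            str(out_dir / "img_%04d.jpg")]


def audio_cmd(video, audio_root, sample_rate=22050):
    """ffmpeg argv writing <audio_root>/<video stem>.wav at the given sample rate."""
    out = Path(audio_root) / (Path(video).stem + ".wav")
    # -vn drops the video stream; -ar sets the output audio sample rate.
    return ["ffmpeg", "-i", str(video), "-vn", "-ar", str(sample_rate), str(out)]


def extract(video, frames_root, audio_root):
    """Run both commands for one video, creating output directories first."""
    (Path(frames_root) / Path(video).stem).mkdir(parents=True, exist_ok=True)
    Path(audio_root).mkdir(parents=True, exist_ok=True)
    subprocess.run(frame_cmd(video, frames_root), check=True)
    # A video without an audio track makes ffmpeg fail here, so the
    # return code is deliberately not enforced (assumption: such clips
    # are simply left without a .wav file).
    subprocess.run(audio_cmd(video, audio_root), check=False)
```

For example, `extract("datasets/msrvtt/raw_videos/video0.mp4", "datasets/msrvtt/frames_fps4", "datasets/msrvtt/audio_22050hz")` would populate `frames_fps4/video0/` and, if the clip has audio, `audio_22050hz/video0.wav`.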

Finetune Model

sh scripts/finetune_ret.sh $pretrain_path   # e.g. output/VALOR_base
sh scripts/finetune_cap.sh $pretrain_path   # e.g. output/VALOR_base
sh scripts/finetune_qa.sh $pretrain_path    # e.g. output/VALOR_base

The finetuning output path will be a subdirectory of $pretrain_path.

Test Model

For example, the command for finetuning the retrieval model in scripts/finetune_ret.sh is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8   --master_port 32711 ./train.py \
--pretrain_dir $basedir \
--config ./config/fast-retrieval-msrvtt.json \
--output_dir $basedir'/ret-msrvtt-lr2e-5-bs64-epoch5'   \
--learning_rate 2e-5  \
--train_video_sample_num 4 \
--test_video_sample_num 8  \
--save_best true \

To test a model, add the following two lines to the command:

--zero_shot \
--checkpoint $checkpoint_save_path(.pt)
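
The relationship between the finetuning and test invocations can be sketched as a Python argument-list builder. The flags are taken from the command above; the function name `build_cmd` and its parameters are hypothetical, and `CUDA_VISIBLE_DEVICES` would still need to be set in the environment separately.

```python
def build_cmd(basedir, checkpoint=None, nproc=8, master_port=32711):
    """Return the argv for the retrieval run; add test flags if a checkpoint is given."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}",
        "--master_port", str(master_port),
        "./train.py",
        "--pretrain_dir", basedir,
        "--config", "./config/fast-retrieval-msrvtt.json",
        "--output_dir", basedir + "/ret-msrvtt-lr2e-5-bs64-epoch5",
        "--learning_rate", "2e-5",
        "--train_video_sample_num", "4",
        "--test_video_sample_num", "8",
        "--save_best", "true",
    ]
    if checkpoint is not None:
        # The two extra lines that switch the command to testing.
        cmd += ["--zero_shot", "--checkpoint", checkpoint]
    return cmd
```

Calling `build_cmd(basedir)` reproduces the finetuning arguments, while `build_cmd(basedir, checkpoint=path_to_pt)` yields the same list with the two test flags appended. The result can be passed to `subprocess.run`.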

Pretrain Model

sh scripts/pretrain.sh

Inference

For QA task

python inference.py --video_path $VIDEOPATH --task 'qa%tva' --model_dir $MODELDIR --question 'what is in the video'

For caption task

python inference.py --video_path $VIDEOPATH --task 'cap%tva' --model_dir $MODELDIR 
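
To call these two commands from a script, one option is a thin wrapper around `subprocess`. This is a sketch under assumptions: the helper names `qa_cmd`, `caption_cmd` and `run` are hypothetical, only the flags shown above are used, and the wrapper assumes `inference.py` writes its result to standard output.

```python
import subprocess


def qa_cmd(video_path, model_dir, question):
    """argv for the QA task, mirroring the command shown above."""
    return ["python", "inference.py", "--video_path", video_path,
            "--task", "qa%tva", "--model_dir", model_dir,
            "--question", question]


def caption_cmd(video_path, model_dir):
    """argv for the caption task, mirroring the command shown above."""
    return ["python", "inference.py", "--video_path", video_path,
            "--task", "cap%tva", "--model_dir", model_dir]


def run(cmd):
    """Run a command and return whatever it printed to stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with the `%` in the task name and with spaces in the question.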

Customize

VALOR's framework is easy to extend to new tasks/datasets. What you need to do is:

  1. prepare dataset as illustrated above
  2. write config file (copy a config file and change 'data_cfg')
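
Step 2 can be sketched as follows, assuming the config is a JSON file (as with `./config/fast-retrieval-msrvtt.json`) and that `data_cfg` is a top-level key. The helper names are hypothetical, and the contents of `data_cfg` are not documented here, so copy its structure from an existing config rather than from this example.

```python
import copy
import json


def with_data_cfg(cfg, data_cfg):
    """Return a copy of cfg whose 'data_cfg' entry is replaced."""
    new_cfg = copy.deepcopy(cfg)
    new_cfg["data_cfg"] = data_cfg
    return new_cfg


def derive_config(src_path, dst_path, data_cfg):
    """Copy an existing config file, swapping in a new 'data_cfg'."""
    with open(src_path) as f:
        cfg = json.load(f)
    new_cfg = with_data_cfg(cfg, data_cfg)
    with open(dst_path, "w") as f:
        json.dump(new_cfg, f, indent=2)
    return new_cfg
```

All other settings in the source config are carried over unchanged, so the new file differs from the original only in its dataset description.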

Citation

If you find this code useful for your research, please consider citing:

@article{chen2023valor,
  title={VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset},
  author={Chen, Sihan and He, Xingjian and Guo, Longteng and Zhu, Xinxin and Wang, Weining and Tang, Jinhui and Liu, Jing},
  journal={arXiv preprint arXiv:2304.08345},
  year={2023}
}

License

MIT