Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts

Project page | arXiv

Model

ViTiS consists of a frozen video encoder, a visual mapping network, a frozen text embedding layer, a frozen language model, and a frozen classifier head. Given input video frames and text, the video encoder extracts frame features, and the visual mapping network maps them to the same space as the text embeddings obtained by the text embedding layer. The language model then takes the video and text embeddings as input and predicts the masked input tokens.
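The overall flow can be sketched as follows; the callables below are illustrative placeholders, not the modules defined in this repository:

import torch

# Illustrative sketch of the ViTiS forward pass; the callables passed in are
# hypothetical placeholders, not the classes used in this repository.
def vitis_forward(video_frames, input_ids, video_encoder, mapping_network,
                  text_embedding, language_model, classifier_head):
    with torch.no_grad():                              # video encoder is frozen
        frame_features = video_encoder(video_frames)   # (B, T, D_vis)
    video_embeds = mapping_network(frame_features)     # project into the text embedding space
    with torch.no_grad():                              # text embedding layer is frozen
        text_embeds = text_embedding(input_ids)        # (B, L, D_txt)
    fused = torch.cat([video_embeds, text_embeds], dim=1)
    hidden = language_model(fused)                     # frozen language model
    return classifier_head(hidden)                     # frozen head scores the masked tokens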

The language model incorporates learnable text prompts in the keys and values of multi-head attention, as well as adapter layers after each self-attention and feed-forward layer, before LayerNorm.
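A minimal PyTorch sketch of this idea, prepending learnable prompts to the attention keys and values and adding a bottleneck adapter with a residual connection (names, dimensions, and initialization are assumptions, not the repository's actual implementation):

import torch
import torch.nn as nn

class PromptedSelfAttention(nn.Module):
    """Self-attention with learnable prompts prepended to keys and values only."""
    def __init__(self, dim=768, num_heads=12, num_prompts=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.key_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.value_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, x):                                # x: (B, L, dim)
        b = x.size(0)
        k = torch.cat([self.key_prompts.expand(b, -1, -1), x], dim=1)
        v = torch.cat([self.value_prompts.expand(b, -1, -1), x], dim=1)
        out, _ = self.attn(x, k, v)                      # queries are the input tokens
        return out

class Adapter(nn.Module):
    """Bottleneck adapter applied after attention / feed-forward, before LayerNorm."""
    def __init__(self, dim=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))     # residual connection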

Our visual mapping network consists of a number of layers, each performing cross-attention between learnable visual prompts and video frame features followed by self-attention.
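A single mapping layer could look roughly like the sketch below, assuming the visual prompts act as queries over the frame features (names and dimensions are illustrative):

import torch
import torch.nn as nn

class MappingLayer(nn.Module):
    """One visual mapping network layer: cross-attention from learnable visual
    prompts to frame features, followed by self-attention."""
    def __init__(self, dim=768, num_heads=12, num_visual_prompts=32):
        super().__init__()
        self.visual_prompts = nn.Parameter(torch.randn(num_visual_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_features):                   # frame_features: (B, T, dim)
        b = frame_features.size(0)
        q = self.visual_prompts.expand(b, -1, -1)        # (B, P, dim)
        x, _ = self.cross_attn(q, frame_features, frame_features)
        x, _ = self.self_attn(x, x, x)
        return x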

Setup

To set up a conda environment:

conda env create -f vitis.yml 
conda activate vitis
pip install git+https://github.com/openai/CLIP.git
conda update ffmpeg
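To verify the environment, a quick check that PyTorch and the CLIP package (used below to extract ViT-L/14 features) are importable can look like this:

import torch
import clip

# Load the CLIP ViT-L/14 backbone used for feature extraction; falls back to CPU
# if no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
print("PyTorch", torch.__version__, "- CLIP ViT-L/14 loaded on", device)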

Data Preparation

This repository contains both ready-to-use data and guidelines for processing raw data.

Processed Data
Raw Data Processing Guidelines
<details> <summary>Click for more details.</summary>
Feature Extraction for downstream datasets
python extract/prepare_video_ids_for_all_datasets.py
python extract/extract_video_features.py --dataset_name <dataset_name> \
--feature_extraction_csv data/<DATASET_PATH>/video_id_list.csv \
--feature_extraction_video_main_path data/<DATASET_PATH>/videos \
--feature_extraction_features_main_path data/<DATASET_PATH>/features
python extract/merge_features.py --dataset <dataset_name> \
--folder data/<DATASET_PATH>/features \
--output_path data/<DATASET_PATH>/features/clipvitl14.pth
python extract/create_hdf5.py
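After merging, the feature file can be sanity-checked with a few lines such as the following; treating clipvitl14.pth as a dictionary keyed by video id is an assumption about its layout:

import torch

# Replace <DATASET_PATH> with the actual dataset directory.
features = torch.load("data/<DATASET_PATH>/features/clipvitl14.pth", map_location="cpu")
video_id, feats = next(iter(features.items()))   # assumes a dict of per-video tensors
print(video_id, feats.shape)                     # e.g., (num_frames, 768) for ViT-L/14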
</details>

Pre-training

python -m torch.distributed.launch --nproc_per_node 8 --use_env main.py \
--combine_datasets webvid --combine_datasets_val webvid --save_dir=output_webvid --lr=2e-5 --different_lr_embedding_layers \
--batch_size=16 --batch_size_val=16 --epochs=10 --amp \
--mapping_network_feedforward --text_prompt_projection_layer

The other parameters are set to their default values; see our paper for details. Note that pre-training is performed on 8 Tesla V100 GPUs (32 GB).

Zero-shot evaluation

python -m torch.distributed.launch --nproc_per_node 1 --use_env videoqa.py --combine_datasets <dataset_name> --combine_datasets_val <dataset_name> \
--batch_size_val=32 --amp --mapping_network_feedforward --text_prompt_projection_layer \
--<dataset_name>_vocab_path=data/<DATASET_PATH>/vocab1000.json --load checkpoints/vitis_pretraining_zero_shot.pth --eval --test

Few-shot fine-tuning

All trainable model parameters fine-tuned

python -m torch.distributed.launch --nproc_per_node 4 --use_env videoqa.py --combine_datasets <dataset_name> --combine_datasets_val <dataset_name> \
--save_dir=output_few_shot --lr=1e-5 --different_lr_embedding_layers \
--amp --mapping_network_feedforward --text_prompt_projection_layer \
--batch_size=8 --batch_size_val=32 --epochs=20 --<dataset_name>_vocab_path=data/<DATASET_PATH>/vocab1000.json \
--load checkpoints/vitis_pretraining_few_shot.pth

Only prompts fine-tuned

python -m torch.distributed.launch --nproc_per_node 4 --use_env videoqa.py --combine_datasets <dataset_name> --combine_datasets_val <dataset_name> \
--save_dir=output_few_shot --lr=1e-2 --amp --mapping_network_feedforward --batch_size=8 --batch_size_val=32 --epochs=20 \
--<dataset_name>_vocab_path=data/<DATASET_PATH>/vocab1000.json \
--load checkpoints/vitis_pretraining_few_shot.pth --loaded_prompts text --only_finetune_loaded_prompts visual_text
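Conceptually, this setting freezes everything except the prompt parameters. A minimal sketch of such selective freezing, assuming prompt parameters can be identified by name (the repository's actual mechanism is the --only_finetune_loaded_prompts flag above):

# Freeze all parameters except those whose names mark them as prompts.
# Matching on the substring "prompt" is an assumption for illustration.
def freeze_all_but_prompts(model, keywords=("prompt",)):
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)
        if param.requires_grad:
            trainable.append(name)
    return trainable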

License

This code is released under the Apache License 2.0.

Acknowledgments

The code builds on <a href="https://github.com/antoyang/FrozenBiLM" target="_blank">FrozenBiLM</a>.
The prompt learning code is inspired by <a href="https://github.com/THUDM/P-tuning-v2/" target="_blank">P-tuning-v2</a>.

Citation

If you find this code helpful, please cite the following:

@inproceedings{engin_2023_ICCV,
    title={Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts},
    author={Engin, Deniz and Avrithis, Yannis},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    year={2023}
}