LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning
Code for the LOVEU@CVPR2023 Workshop Generic Event Boundary Captioning (GEBC) Challenge. Our proposed method achieved a score of 76.14 on the test set and won 1st place in the challenge. The technical report can be found here.
Introduction
We propose an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) We utilize a pretrained LLM to generate high-quality, human-like captions. (2) To adapt the model to the GEBC task, we take the video Q-former as an adapter and train it while keeping the visual feature extractors and the LLM frozen.
<p align="center" width="100%"> <a target="_blank"><img src="figs/model.png" alt="LLMVA-GEBC" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p>
Environment Preparation
First, you should create a conda environment:
conda env create -f environment.yml
conda activate llmvagebc
Prerequisite Checkpoints
Before using the repository, make sure you have obtained the following checkpoints:
Remember to change the checkpoint path ckpt in the config file.
Data
Download the Kinetic-GEBC dataset from https://sites.google.com/view/loveucvpr23/track2.
For primary visual features:
Use BLIP-2 to extract the primary visual features. We use feature_extraction.py to do so. Remember to change video_dir and save_dir in train_configs/blip2_feature_extract.yaml before you run:
python feature_extraction.py --cfg-path train_configs/blip2_feature_extract.yaml
For other visual features: use CLIP to extract frame-level features and Omnivore to extract clip-level features. We use this pipeline to extract them.
Then, put the extracted features under these three folders:
data/features/eva_vit_g_q_former_tokens_12
data/features/clip_fps_15_stride_1_rename
data/features/omnivore_fps_15_len_16_stride_1_rename
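Before training, it can help to confirm that the three feature folders are in place. The following is a minimal sketch, not part of the repository: the helper name missing_feature_dirs and the idea of treating an empty folder as missing are assumptions made for illustration. Only the folder names come from the list above.

```python
from pathlib import Path

# Folder names come from the list above; everything else here is illustrative.
EXPECTED_FEATURE_DIRS = [
    "data/features/eva_vit_g_q_former_tokens_12",
    "data/features/clip_fps_15_stride_1_rename",
    "data/features/omnivore_fps_15_len_16_stride_1_rename",
]


def missing_feature_dirs(repo_root="."):
    """Return the expected feature folders that are missing or empty.

    Hypothetical helper: the repository does not ship this check.
    """
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_FEATURE_DIRS:
        folder = root / rel
        # A folder counts as usable only if it exists and holds at least one entry.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_feature_dirs():
        print(f"missing or empty: {rel}")
```

Run it from the repository root. If it prints nothing, all three folders exist and contain at least one file.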
You can also directly download the officially provided features here. If you do, remember to change q_former_feature_folder, other_feat_total_size, other_feature_names, and other_feature_folders in the config file.
Use VinVL to extract region-level features. The region features of a video are saved to multiple .npy files, where each file contains the region features of one sampled frame. Merge the feature file paths into video_to_frame_index.json in the following format:
{
"video_id": [
"frame_1_feat.npy",
"frame_2_feat.npy",
...
],
...
}
Then put this file under data/features/.
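One possible way to build video_to_frame_index.json is sketched below. It is not the authors' script and rests on assumptions: the region features are laid out as <region_dir>/<video_id>/<frame>.npy, paths are stored relative to region_dir, and frames are ordered by the numbers in their file names. The function names are hypothetical. Adjust the layout and path style to whatever your loader expects.

```python
import json
import re
from pathlib import Path


def natural_key(name):
    """Sort key that orders frame_2 before frame_10."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def build_frame_index(region_dir):
    """Map each video id to its ordered list of per-frame .npy paths.

    Assumes the hypothetical layout <region_dir>/<video_id>/<frame>.npy.
    """
    root = Path(region_dir)
    index = {}
    for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        frames = sorted(video_dir.glob("*.npy"), key=lambda p: natural_key(p.name))
        if frames:  # skip videos without any extracted frame features
            index[video_dir.name] = [f.relative_to(root).as_posix() for f in frames]
    return index


def write_frame_index(region_dir, out_path="data/features/video_to_frame_index.json"):
    """Build the index and write it as JSON; returns the index for inspection."""
    index = build_frame_index(region_dir)
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(index, indent=2))
    return index
```

Calling write_frame_index("path/to/region_features") writes the file to data/features/ by default. The natural sort matters because a plain string sort would place frame_10 before frame_2.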
Training and Validation
First, set the configs in train_configs/${NAME_OF_YOUR_CONFIG_FILE}.yaml.
Then run the script:
CUDA_VISIBLE_DEVICES=${YOUR_GPU_ID} python train.py \
--cfg-path train_configs/${NAME_OF_YOUR_CONFIG_FILE}.yaml
The results can be found in video_llama/output/.
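If you prefer to launch training from Python, for example to loop over several configs, the shell command above can be wrapped as in the following sketch. The helper names build_train_command and run_training are hypothetical. Only train.py, --cfg-path, the train_configs/ folder, and CUDA_VISIBLE_DEVICES come from the command above.

```python
import os
import subprocess
import sys


def build_train_command(config_name, python=sys.executable):
    """Return the argv list equivalent to the shell command above."""
    return [python, "train.py", "--cfg-path", f"train_configs/{config_name}.yaml"]


def run_training(config_name, gpu_id):
    """Run train.py on the given GPU; raises CalledProcessError on failure."""
    # Copy the current environment and restrict the visible GPUs for the child process.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.run(build_train_command(config_name), env=env, check=True)
```

For example, run_training("my_config", 0) would start training with train_configs/my_config.yaml on GPU 0, where my_config is a placeholder for your own config file name.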
Acknowledgement
We are grateful to the following awesome projects, from which LLMVA-GEBC arose:
Citation
If you find our code useful, please cite the repo as follows:
@article{tang2023llmva,
title={LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning},
author={Tang, Yunlong and Zhang, Jinrui and Wang, Xiangchen and Wang, Teng and Zheng, Feng},
journal={arXiv preprint arXiv:2306.10354},
year={2023}
}