<p align="center"> <img src="demo/assets/profile.png" width="150" style="margin-bottom: 0.2;"/> </p> <div align="center">

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> <a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a> <a href="https://huggingface.co/docs/transformers/index/"><img alt="Transformers" src="https://img.shields.io/badge/-Transformers-ffd21e?logo=huggingface&logoColor=white"></a> <a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a><br>

Paper Conference

</div>

Updates

Overview

This is a chat agent based on our work Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge for Long Video Understanding. The model is fine-tuned on video-instruction and image-instruction datasets.

We have meticulously chosen two distinct architectural paradigms for our study: the encoder-decoder architecture, exemplified by BLIP2-Flan-T5-xl (original version), and the decoder-only architecture, represented by InstructBLIP-Vicuna-7B (original version). For further exploration, we also provide the code to tune the LLM with LoRA.

<img src='demo/assets/framework.png'>

Installation

# clone project
git clone https://github.com/bigai-nlco/VideoTGB
cd VideoTGB

# create conda environment
conda create -n VideoTGB
conda activate VideoTGB

# install requirements
pip install -r requirements.txt
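
Depending on your CUDA driver you may need a matching PyTorch build; a quick sanity check before training (assuming a CUDA-capable GPU) is:

# optional: confirm that PyTorch was installed correctly and can see the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"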

Data Preparation

You can download all of the instruction and evaluation data from Video-LLaVA/DATA and arrange it as follows:

inputs/ivinstruct
├── llava_image_tune
└── videochatgpt_tune
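
A minimal sketch of arranging the downloaded data into that layout (the source paths below are placeholders; use wherever you unpacked the Video-LLaVA data):

# illustrative only: move the unpacked instruction data into the expected folders
mkdir -p inputs/ivinstruct
# mv /path/to/llava_image_tune inputs/ivinstruct/
# mv /path/to/videochatgpt_tune inputs/ivinstruct/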

How to run

Our training framework offers tailored scripts to meet the diverse needs of researchers.

Train model

# run on local
python src/train.py experiment=LSTP_SF_blip2flant5xl_videoinstruct # blip2-flan-t5-xl + video-instruct
python src/train.py experiment=LSTP_SF_instructblipvicuna7b_videoinstruct # instructblip-vicuna-7b + video-instruct

# run on cluster
sbatch scripts/videoinstruct_train.slurm # blip2-flan-t5-xl + video-instruct
sbatch scripts/videoinstruct_vicuna_train.slurm # instructblip-vicuna-7b + video-instruct
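
To pin a local run to specific GPUs, the usual CUDA environment variable can be combined with these entry points (a general CUDA mechanism, not a project-specific option):

# example: restrict a local run to GPU 0 via the standard CUDA env var
CUDA_VISIBLE_DEVICES=0 python src/train.py experiment=LSTP_SF_blip2flant5xl_videoinstruct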

For those with limited GPU resources, we also provide a pipeline that shortens the training procedure:

# step 1: generate the pseudo labels from the base-model, and extract the optical flow in advance

# step 2: train the temporal sampler
python src/train.py experiment=LSTP_TG_blip2flant5xl_videoinstruct

# step 3: train VideoTGB with fixed temporal sampler
python src/train.py experiment=LSTP_blip2flant5xl_ivinstruct # blip2-flan-t5-xl + video-instruct + image-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivinstruct # instructblip-vicuna-7b + video-instruct + image-instruct
python src/train.py experiment=LSTP_blip2flant5xl_ivtinstruct # blip2-flan-t5-xl (LoRA) + video-instruct + image-instruct + text-instruct
python src/train.py experiment=LSTP_instructblipvicuna7b_ivtinstruct # instructblip-vicuna-7b (LoRA) + video-instruct + image-instruct + text-instruct
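
Because training is driven by Hydra configs (see the experiment= syntax above), individual settings can usually be overridden directly on the command line. The sketch below assumes the key paths listed under Configuration and uses illustrative values for a small single-GPU run:

# illustrative Hydra-style overrides; key paths assumed from the Configuration section, values are examples
python src/train.py experiment=LSTP_blip2flant5xl_ivinstruct data.batch_size=2 data.nframe=4 trainer.devices=1 trainer.precision=16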

Evaluate model

# run inference for VideoTGB-Vicuna-7B
bash eval/scripts/run_qa_msvd_vicuna.sh
bash eval/scripts/run_qa_msrvtt_vicuna.sh
bash eval/scripts/run_qa_activitynet_vicuna.sh

# run inference for VideoTGB-Flan-T5-xl
bash eval/scripts/run_qa_msvd.sh
bash eval/scripts/run_qa_msrvtt.sh
bash eval/scripts/run_qa_activitynet.sh

# run evaluation
bash eval/scripts/eval_qa_msvd.sh
bash eval/scripts/eval_qa_msrvtt.sh
bash eval/scripts/eval_qa_activitynet.sh
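
The Accuracy/Score metrics reported below follow the GPT-assisted QA evaluation protocol popularized by Video-ChatGPT; if the evaluation scripts rely on it (an assumption, not documented here), an OpenAI key has to be exported before running them:

# assumption: GPT-assisted evaluation typically reads an OpenAI key from the environment
export OPENAI_API_KEY=<your-key>
bash eval/scripts/eval_qa_msvd.sh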

Configuration

data:
  - text_dir
  - video_dir
  - processor_name
  - sampler_processor_name
  - nframe # final sampled frames
  - target_size # image size
  - batch_size
model:
  - model_name_or_path
  - sampler_name_or_path
  - of_extractor_name_or_path
  - optimizer
  - scheduler
  - generate_configs
path:
  - data_dir
  - video_dir
  - text_dir
  - output_dir
trainer: 
  - strategy
  - accelerator
  - devices
  - num_nodes
  - precision
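
To inspect how these groups compose for a given experiment without launching a run, Hydra's standard --cfg flag can be used (a Hydra feature rather than a project-specific one):

# print the composed job config for an experiment and exit (standard Hydra flag)
python src/train.py experiment=LSTP_blip2flant5xl_ivinstruct --cfg job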

Evaluation Results

Metrics: Accuracy/Score

| Methods | LLM size | MSVD-QA | MSRVTT-QA | ActivityNet-QA |
| --- | --- | --- | --- | --- |
| FrozenBiLM | 1B | 32.2/- | 16.8/- | 24.7/- |
| VideoChat | 7B | 56.4/2.8 | 45.0/2.5 | -/2.2 |
| LLaMA-Adapter | 7B | 54.9/3.1 | 43.8/2.7 | 34.2/2.7 |
| Video-LLaMA | 7B | 51.6/2.5 | 29.6/1.8 | 12.4/1.1 |
| Video-ChatGPT | 7B | 64.9/3.3 | 49.3/2.8 | 35.2/2.7 |
| Video-LLaVA | 7B | 70.7/3.9 | 59.2/3.5 | 45.3/3.3 |
| VideoTGB-7B | 7B | 71.3/3.9 | 57.3/3.3 | 43.9/3.3 |

Demo

We provide a chat demo built with Gradio. We also provide several checkpoints; download one and put it in ckpts/VideoTGB-Chat/.

Model Zoo

| Model | Base Model | Training Data | Strategy for LLM | Download Link |
| --- | --- | --- | --- | --- |
| LSTP-7B | InstructBlip-Vicuna-7B | Video-ChatGPT, LLaVA | fixed | Huggingface |
| LSTP-FlanT5xl | FlanT5-xl | Video-ChatGPT, LLaVA | fixed | Huggingface |
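
The download links in the table point to Hugging Face; one way to fetch a checkpoint into the expected directory is the standard huggingface-cli (the repo id below is a placeholder, substitute the link from the table):

# placeholder repo id: replace with the actual Hugging Face repo from the Model Zoo table
huggingface-cli download <hf-repo-id> --local-dir ckpts/VideoTGB-Chat

Then launch the Gradio demo: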
python -m demo.demo
<img src='demo/assets/demo.png'>

Acknowledgement

Citation

If you find our work helpful, please consider giving this repository a ⭐️ and citing our work:

@article{wang2024videotgb,
    title={Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge},
    author={Wang, Yuxuan and Wang, Yueqian and Wu, Pengfei and Liang, Jianxin and Zhao, Dongyan and Liu, Yang and Zheng, Zilong},
    year={2024},
    journal = {arXiv preprint arXiv:2402.16050}
}