
<h2 align="center"> <a href="https://arxiv.org/abs/2405.13382">VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding</a></h2> <h5 align="center"> If our project helps you, please give us a star ⭐ and cite our <a href="#bibliography">paper</a>!</h5>


News

Overview

We introduce VTG-LLM, a model that integrates timestamp knowledge into video LLMs to enhance video temporal grounding.

<div align="center"> <img src="figures/vtg-lm-overview.png" alt="Overview of VTG-LLM" width="700"/> <br/> <figcaption>Overview of VTG-LLM.</figcaption> </div>
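A concrete piece of the timestamp knowledge above is how times are rendered: the demo outputs later in this README use a fixed-width, zero-padded `SSSS.S` format. A minimal sketch of that convention (our reading of the demo outputs, not code from the repo):

```python
def format_timestamp(seconds: float) -> str:
    """Render a time in the fixed-width, zero-padded 'SSSS.S' style
    used in the demo outputs (e.g. 10.0 -> '0010.0')."""
    return f"{seconds:06.1f}"

# A grounded event span in the demo's notation.
span = f"{format_timestamp(10.0)} - {format_timestamp(20.0)} seconds"
```

A fixed-width format keeps every timestamp the same token length, which is the kind of regularity timestamp-aware tokenization relies on.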

Environments

We recommend using an NPU environment for training, evaluation, and fine-tuning. The environment we use is specified in environment-npu.yaml. In most cases, running the script below is sufficient:

bash install_requirements.sh

If an NPU is not available, a V100 can also be used for training and evaluation; however, it cannot be used to fine-tune checkpoints trained on an NPU. The necessary environment is listed in requirements-v100.txt.

Model Checkpoints

The model checkpoint (without fine-tuning) is available on Hugging Face:

git lfs install

git clone https://huggingface.co/Yongxin-Guo/VTG-LLM

Data

See DATA.md for details. The data annotations are available on Hugging Face:

git lfs install

git clone https://huggingface.co/datasets/Yongxin-Guo/VTG-IT

Requirements

Please download the following model checkpoints:

Script

Tuning

Configure the checkpoint and dataset paths in pretrain-slot-sample-fmt-96.yaml, and configure the BERT checkpoint paths in blip2.py and vtgllm.py.

torchrun --nproc_per_node=16 train.py --cfg-path train_configs/videollama/pretrain-slot-sample-fmt-96.yaml

Evaluation

Configure the checkpoint and dataset paths in videollama-slot-96.yaml.

Configure the downstream task in eval.sh.

bash eval.sh

Results

| YouCook2 | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| t=1.0 (paper) | 5.0 | 1.9 | 1.5 | 17.5 |
| t=0.1 | 5.4 | 1.8 | 1.6 | 18.4 |

| Charades-STA | R@0.3 | R@0.5 | R@0.7 |
|---|---|---|---|
| t=1.0 (paper) | 52.0 | 33.8 | 15.7 |
| t=0.1 | 53.9 | 36.3 | 16.6 |

| QVHighlights | mAP | Hit@1 |
|---|---|---|
| t=1.0 (paper) | 16.5 | 33.5 |
| t=0.1 | 16.2 | 30.7 |

| ActivityNet | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| t=1.0 (paper) | 18.2 | 5.7 | 4.7 | 34.0 |
| t=0.1 | 20.7 | 5.9 | 5.1 | 34.8 |

Demo

# cat_and_chicken.mp4

# Describe this video

A cute little kitten is sleeping on a couch. A little chicken is sitting on the cat's chest and looking at the camera. The cat is purring and the chicken is moving its head.

# Please locate a series of events in the video, output the start and end timestamps of each event, and describe each event in sentences.

0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch.
0010.0 - 0020.0 seconds, A yellow bird lands on the couch and gently touches the kitten's head.
0020.0 - 0030.0 seconds, The bird picks up the kitten and starts to play with it.
0030.0 - 0040.0 seconds, The kitten tries to push the bird away, but the bird continues to play with it.
0040.0 - 0050.0 seconds, The kitten falls asleep on the couch.
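The grounding response above follows a regular `start - end seconds, description.` pattern, so it can be turned back into structured events for downstream use. A hypothetical post-processing sketch (the function name and regex are our assumptions, not part of the repo):

```python
import re

def parse_events(text: str):
    """Split a grounding response into (start, end, description) triples.
    Assumes the 'SSSS.S - EEEE.E seconds, description.' format shown in
    the demo output; descriptions are taken up to the next period."""
    pattern = re.compile(r"(\d+\.\d)\s*-\s*(\d+\.\d)\s*seconds,\s*([^.]+\.)")
    return [(float(s), float(e), d.strip()) for s, e, d in pattern.findall(text)]

# Example input mirroring the demo response above.
response = (
    "0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. "
    "0010.0 - 0020.0 seconds, A yellow bird lands on the couch."
)
events = parse_events(response)
```

Note the simple `[^.]+\.` description capture would truncate at any period inside a sentence, so a real parser may need a more careful delimiter.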

Gradio Demo

First, change the video and model checkpoint paths to your own.

python gradio_demo.py

Recommended GPUs

Acknowledgement

We are grateful for the following awesome projects:

Bibliography

If you find this repository helpful for your project, please consider citing:

@article{guo2024vtg,
  title={VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding},
  author={Guo, Yongxin and Liu, Jingyu and Li, Mingda and Tang, Xiaoying and Chen, Xi and Zhao, Bo},
  journal={arXiv preprint arXiv:2405.13382},
  year={2024}
}