<h2 align="center"> <a href="https://arxiv.org/abs/2410.05643">TRACE: Temporal Grounding Video LLM via Causal Event Modeling</a></h2>
<h5 align="center"> If our project helps you, please give us a star ⭐ and cite our <a href="#bibliography">paper</a>!</h5>

## News
- 01/11/2024, 🔥 We are excited to announce the release of trace-uni, which is enhanced by incorporating additional general video understanding data from a subset of LLaVA-Video-178k. Our results indicate that (1) the TRACE architecture is still capable of handling general video understanding tasks (53.8 on MVBench and 49.6 on VideoMME); and (2) despite not adding more VTG data, trace-uni outperforms trace on both VTG tasks and general video understanding tasks.
- 31/10/2024, 🔥 We evaluated the TRACE model on the VideoMME benchmark and updated the evaluation code.
- 25/10/2024, 🔥 We evaluated the TRACE model on the MVBench benchmark and updated the evaluation code accordingly. Our findings indicate that, despite not being trained on extensive multi-task datasets, TRACE is still capable of effectively handling general QA tasks.
- 19/10/2024, 🔥 We release trace-retrieval, which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
- 10/10/2024, 🔥 Annotation files of training data are released!
- 10/10/2024, 🔥 Our model checkpoints and code are released!
## TODO
- Release the model checkpoints
- Release the inference and evaluation code
- Release the training and fine-tuning code
- Release the training data
- Release TRACE-Retrieval, which outputs timestamps of input frames instead of predicting unseen timestamps.
- Train TRACE models on more tasks.
## Overview

In this work:
- We model videos as a series of events, and propose a causal event modeling framework to capture videos' inherent structure.
- We present a novel task-interleaved video LLM, TRACE, tailored to implement the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions.
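
To make this concrete, below is a minimal conceptual sketch of causal event modeling: the video is treated as an ordered sequence of events, and each event's timestamps, salient score, and caption are decoded conditioned on the visual input, the text instruction, and all previously decoded events. The `decode_*` and `predicts_end_of_events` methods are hypothetical placeholders for illustration only, not the actual TRACE interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    """One video event: its time span, salient score, and caption."""
    start: float    # event start time (seconds)
    end: float      # event end time (seconds)
    score: float    # salient score of the event
    caption: str    # textual description of the event

def decode_events(model, visual_feats, instruction, max_events: int = 16) -> List[Event]:
    """Causal event modeling, sketched: events are decoded one by one, each
    conditioned on the visual input, the instruction, and all earlier events;
    within an event, timestamps, score, and caption are decoded sequentially."""
    events: List[Event] = []
    for _ in range(max_events):
        if model.predicts_end_of_events(visual_feats, instruction, events):
            break  # the model signals that no further events remain
        start, end = model.decode_timestamps(visual_feats, instruction, events)
        score = model.decode_score(visual_feats, instruction, events, (start, end))
        caption = model.decode_caption(visual_feats, instruction, events, (start, end), score)
        events.append(Event(start, end, score, caption))
    return events
```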
## Environments

We use NPU environments for training and fine-tuning, and V100 GPUs for evaluation. The environments we use can be found in npu-requirements and gpu-requirements.
## Model Zoo
Checkpoints | Description | URL |
---|---|---|
Initialization | Weights initialized from VideoLLaMA2 | trace-init |
Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
Stage-2 | Model checkpoints trained after stage-2 | trace |
FT-Charades | Fine-tuned on Charades-STA dataset | trace-ft-charades |
FT-Youcook2 | Fine-tuned on Youcook2 dataset | trace-ft-youcook2 |
FT-QVHighlights | Fine-tuned on QVHighlights dataset | trace-ft-qvhighlights |
TRACE-retrieval | Forces the predicted timestamps to align with the input frame timestamps | trace-retrieval |
TRACE-uni | Incorporating additional general video understanding data from a subset of LLaVA-Video-178k. | trace-uni |
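
To illustrate the idea behind TRACE-retrieval (an assumption about the mechanism, not code from this repository): instead of emitting arbitrary timestamps, each predicted timestamp is snapped to the nearest input frame timestamp. The hypothetical helper below shows one way to do this.

```python
# Minimal sketch: snap predicted timestamps to the nearest input frame timestamps.
# Illustrative only; not the implementation used by trace-retrieval.
from bisect import bisect_left
from typing import List

def snap_to_frames(predicted: List[float], frame_times: List[float]) -> List[float]:
    """Replace each predicted timestamp with the closest frame timestamp."""
    frame_times = sorted(frame_times)
    snapped = []
    for t in predicted:
        i = bisect_left(frame_times, t)
        neighbours = frame_times[max(i - 1, 0): i + 1]  # candidates around t
        snapped.append(min(neighbours, key=lambda f: abs(f - t)))
    return snapped

# Example: frames sampled every 2 seconds
print(snap_to_frames([3.1, 7.9], [0.0, 2.0, 4.0, 6.0, 8.0]))  # -> [4.0, 8.0]
```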
## Inference and Evaluation
Please make sure the model and video paths are correct before running the code.
- Inference code is provided in inference.py.
- Evaluation code is provided in trace/eval:
  - Evaluation of VTG tasks: provided in eval/eval.sh.
  - Evaluation of MVBench: provided in eval/mvbench/eval.sh.
  - Evaluation of VideoMME: provided in eval/videomme/eval.sh.
## Data

We have provided the annotation files, and the raw videos can be prepared by the following projects:
## Training

### Stage-1 training

```bash
bash TRACE/scripts/train/pretrain-128.sh
```

### Stage-2 training

```bash
bash TRACE/scripts/train/sft-128.sh
```

### Fine-tune on a downstream task

```bash
bash TRACE/scripts/train/sft-youcook2.sh
```

Please configure the data and model paths before running the scripts.
## Results
Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
---|---|---|---|---|
TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 |
Charades-STA (Zero-Shot) | R@0.3 | R@0.5 | R@0.7 | mIoU |
---|---|---|---|---|
TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |
TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 |
QVHighlights (Zero-Shot) | mAP | Hit@1 |
---|---|---|
TRACE | 26.8 | 42.7 |
TRACE-retrieval | 27.9 | 44.3 |
TRACE-uni | 27.5 | 43.9 |
ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
---|---|---|---|---|
TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |
TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 |
ActivityNet-MR | R@0.3 | R@0.5 | R@0.7 | mIoU |
---|---|---|---|---|
TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 |
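
For reference, the moment-retrieval columns above (R@0.3/0.5/0.7 and mIoU) report recall at temporal IoU thresholds and the mean temporal IoU between predicted and ground-truth spans. The sketch below shows how these metrics are commonly computed; it is illustrative and not the repository's evaluation code.

```python
# Minimal sketch of moment-retrieval metrics: recall at temporal IoU thresholds
# and mean IoU over (prediction, ground-truth) pairs. Illustrative only.
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds

def temporal_iou(pred: Span, gt: Span) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(preds: List[Span], gts: List[Span], thr: float) -> float:
    """Fraction of queries whose prediction reaches IoU >= thr with the ground truth."""
    return sum(temporal_iou(p, g) >= thr for p, g in zip(preds, gts)) / len(gts)

def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```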
MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 |
TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 |
VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
---|---|---|---|---|
TRACE | 49.5 | 42.5 | 39.3 | 43.8 |
TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 |
## Demo
<div align="center"> <img src="assets/trace-demo.png" alt="Demo of TRACE" width="700"/> <br/> <figcaption>Demo of TRACE.</figcaption> </div>

## Acknowledgement

We are grateful for the following awesome projects:
## Bibliography
If you find this repository helpful for your project, please consider citing:
```bibtex
@misc{guo2024tracetemporalgroundingvideo,
      title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling},
      author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
      year={2024},
      eprint={2410.05643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05643},
}
```