
<h2 align="center"><a href="https://arxiv.org/abs/2410.05643">TRACE: Temporal Grounding Video LLM via Causal Event Modeling</a></h2>
<h5 align="center">If our project helps you, please give us a star ⭐ and cite our <a href="#bibliography">paper</a>!</h5>


## News

## TODO

## Overview

In this work, we propose TRACE, a video LLM for video temporal grounding (VTG) tasks. TRACE formulates VTG as causal event modeling: a video is represented as a sequence of events, and each event's timestamps, salient scores, and caption are predicted autoregressively, conditioned on the video input and the previously decoded events.

<div align="center"> <img src="assets/trace-overview.png" alt="Overview of TRACE" width="700"/> <br/> <figcaption>Overview of TRACE.</figcaption> </div>
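To make the causal event formulation concrete, here is a minimal, illustrative sketch (not the repository's implementation): each event carries timestamps, a salient score, and a caption, and events are decoded one at a time conditioned on the history. The `Event` fields and `predict_next` interface are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Event:
    # One structured event: start/end timestamps (seconds),
    # a salient score, and a textual caption.
    start: float
    end: float
    score: float
    caption: str

def decode_events(predict_next, max_events: int = 8) -> List[Event]:
    """Greedy causal decoding loop: each new event is conditioned on
    all events decoded so far. `predict_next` stands in for the model;
    it returns the next Event, or None to stop."""
    events: List[Event] = []
    for _ in range(max_events):
        nxt: Optional[Event] = predict_next(events)
        if nxt is None:
            break
        events.append(nxt)
    return events

# Toy stand-in "model": emits two fixed events in order, then stops.
def toy_model(history):
    script = [Event(0.0, 4.5, 0.9, "person opens the fridge"),
              Event(4.5, 9.0, 0.7, "person pours a drink")]
    return script[len(history)] if len(history) < len(script) else None

events = decode_events(toy_model)
```

The point of the loop is the causal dependency: the model sees the full event history at every step, so later events can stay temporally consistent with earlier ones.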

## Environments

We use NPU environments for training and fine-tuning, and V100 GPUs for evaluation. The environments we used can be found in npu-requirements and gpu-requirements.

## Model Zoo

| Checkpoints | Description | URL |
| --- | --- | --- |
| Initialization | Weights initialized from VideoLLaMA2 | trace-init |
| Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
| Stage-2 | Model checkpoints trained after stage-2 | trace |
| FT-Charades | Fine-tuned on the Charades-STA dataset | trace-ft-charades |
| FT-Youcook2 | Fine-tuned on the Youcook2 dataset | trace-ft-youcook2 |
| FT-QVHighlights | Fine-tuned on the QVHighlights dataset | trace-ft-qvhighlights |
| TRACE-retrieval | Forces predicted timestamps to be aligned with the input timestamps | trace-retrieval |
| TRACE-uni | Incorporates additional general video understanding data from a subset of LLaVA-Video-178K | trace-uni |
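For the TRACE-retrieval variant above, "aligning predicted timestamps with input timestamps" can be pictured as snapping each predicted time to the nearest sampled-frame timestamp. This is only an illustrative sketch of the idea, not the repository's implementation; the function name is hypothetical.

```python
def snap_to_input_timestamps(predicted, input_timestamps):
    """Snap each predicted timestamp (seconds) to the closest
    timestamp among the sampled input frames."""
    return [min(input_timestamps, key=lambda t: abs(t - p))
            for p in predicted]

# Timestamps of the sampled input frames (toy example).
frames = [0.0, 2.0, 4.0, 6.0, 8.0]
print(snap_to_input_timestamps([1.2, 5.1], frames))  # -> [2.0, 6.0]
```

Constraining outputs to the input grid this way trades a little temporal precision for guaranteed consistency with the frames the model actually saw.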

## Inference and Evaluation

Please make sure the model and video paths are correct before running the code.

## Data

We have provided the annotation files; the raw videos can be prepared via the following projects:

## Training

### Stage 1 training

```shell
bash TRACE/scripts/train/pretrain-128.sh
```

### Stage 2 training

```shell
bash TRACE/scripts/train/sft-128.sh
```

### Fine-tune on downstream tasks

```shell
bash TRACE/scripts/train/sft-youcook2.sh
```

Please configure the data and model paths before running the scripts.

## Results

| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
| --- | --- | --- | --- | --- |
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
| TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 |
| Charades-STA (Zero-Shot) | 0.3 | 0.5 | 0.7 | mIoU |
| --- | --- | --- | --- | --- |
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |
| TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 |
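For the moment-retrieval tables, the 0.3/0.5/0.7 columns report recall at the given temporal IoU thresholds, and mIoU is the mean IoU over all queries. As a small reference sketch (not the repository's evaluation code), temporal IoU and threshold recall can be computed like this:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thr):
    """Fraction of queries whose prediction reaches IoU >= thr."""
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)

# Spans (0, 10) and (5, 15): intersection 5s, union 15s -> IoU = 1/3.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```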
| QVHighlights (Zero-Shot) | mAP | Hit@1 |
| --- | --- | --- |
| TRACE | 26.8 | 42.7 |
| TRACE-retrieval | 27.9 | 44.3 |
| TRACE-uni | 27.5 | 43.9 |
| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
| --- | --- | --- | --- | --- |
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |
| TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 |
| ActivityNet-MR | 0.3 | 0.5 | 0.7 | mIoU |
| --- | --- | --- | --- | --- |
| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
| TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 |
| MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 |
| TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 |
| VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
| --- | --- | --- | --- | --- |
| TRACE | 49.5 | 42.5 | 39.3 | 43.8 |
| TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 |

## Demo

<div align="center"> <img src="assets/trace-demo.png" alt="Demo of TRACE" width="700"/> <br/> <figcaption>Demo of TRACE.</figcaption> </div>

## Acknowledgement

We are grateful to the following awesome projects:

## Bibliography

If you find this repository helpful for your project, please consider citing:

```bibtex
@misc{guo2024tracetemporalgroundingvideo,
      title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling},
      author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
      year={2024},
      eprint={2410.05643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05643},
}
```