<div align="center"> <h1>TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment <p> <a href="https://www.arxiv.org/pdf/2405.13911" target="_blank">NeurIPS 2024 Spotlight</a> </h1> </div> <div align="center"> <img src="pics/topa_framework.jpg" width="900px" /> </div>

Data Preparation:
Prepare the data as follows.
TextVID: Download the dataset alone at TextVid only, or download TextVID together with the preprocessed features at TextVid and features.
NExT-QA, STAR and TVQA: The preprocessed features are available here.
EgoSchema: Download the raw videos from EgoSchema. We provide the preprocessed features here.
MVBench: Download raw videos from Hugging Face.
MSRVTT: Download raw videos from MSRVTT.
```
./data
├── nextqa
│   ├── train.csv
│   ├── val.csv
│   └── clipvitl14.pth
├── star
│   :
├── tvqa
│   :
└── egos
    :
```
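To verify a download before training, a quick sanity check like the one below can help. This is a minimal sketch, not repo code; it assumes the `.pth` files were written with `torch.save` and hold a mapping from video ids to per-frame feature tensors, which may differ from the actual layout.

```python
# Sanity-check the preprocessed features (a sketch, not part of the repo).
# Assumes clipvitl14.pth maps video ids to per-frame CLIP ViT-L/14 features.
import torch

features = torch.load("./data/nextqa/clipvitl14.pth", map_location="cpu")
print(type(features), len(features))

if isinstance(features, dict):
    vid, feat = next(iter(features.items()))
    # CLIP ViT-L/14 embeddings are typically 768-dimensional per frame.
    print(vid, tuple(feat.shape))
```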
Model Preparation:
Prepare the model as follows.
LLMs: Download the pretrained Llama models from Llama2 and Llama3.
TOPA Checkpoints: Download our pretrained models.
```
./pretrained
├── llama2
│   ├── 7B
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   ├── 13B
│   │   :
│   └── tokenizer.model
└── llama3
    ├── 8B
    │   ├── consolidated.00.pth
    │   └── params.json
    └── tokenizer.model
```
```
./vqa_checkpoint
└── checkpoint_pretrain
    ├── llama2_7b
    ├── llama2_13b
    └── llama3_8b
```
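Before launching training or evaluation, a quick layout check can catch missing files early. The sketch below is not part of the repo; it simply mirrors the directory trees above.

```python
# Verify the expected checkpoint layout (a sketch, not repo code).
import json
import pathlib

root = pathlib.Path("./pretrained/llama2/7B")
assert (root / "consolidated.00.pth").exists(), "missing Llama 2 7B weights"
assert pathlib.Path("./pretrained/llama2/tokenizer.model").exists(), "missing tokenizer"

# params.json stores the model hyperparameters (e.g. dim, n_layers, n_heads).
params = json.loads((root / "params.json").read_text())
print(params)

assert pathlib.Path("./vqa_checkpoint/checkpoint_pretrain/llama2_7b").exists()
```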
Training & Evaluation
Text-only Pre-alignment
```bash
./scripts/pretrain/llama2_7b.sh
```
Zero-shot inference
```bash
./scripts/eval/zeroshot_eval_egos.sh
./scripts/eval/zeroshot_eval_nextqa.sh
./scripts/eval/zeroshot_eval_star.sh
./scripts/eval/zeroshot_eval_tvqa.sh
```
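All four benchmarks above are multiple-choice video QA, so zero-shot performance is reported as answer accuracy. The snippet below is a generic, hypothetical scoring sketch; the repo's own evaluation scripts may compute this differently.

```python
# Generic multiple-choice accuracy (a hypothetical sketch, not the repo's evaluator).
def accuracy(pred_indices, gold_indices):
    """Fraction of questions where the predicted option matches the ground truth."""
    assert len(pred_indices) == len(gold_indices)
    return sum(p == g for p, g in zip(pred_indices, gold_indices)) / len(gold_indices)

print(accuracy([0, 2, 1, 3], [0, 2, 2, 3]))  # 0.75
```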
Evaluate on MVBench
Evaluate on video captioning benchmarks
Acknowledgements
This repo is built upon Flipped-VQA and benefits from LLaMA-Adapter, DeCap, MVBench, Llama2 and Llama3.
Citations
```bibtex
@article{li2024topa,
  title={TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment},
  author={Li, Wei and Fan, Hehe and Wong, Yongkang and Kankanhalli, Mohan and Yang, Yi},
  journal={arXiv preprint arXiv:2405.13911},
  year={2024}
}
```