# <img src="assets/hawk.png" alt="logo" width="50" height="50"> HawkEye: Training Video-Text LLMs for Grounding Text in Videos
[Paper] [Checkpoint] [Dataset]
## Updates
- 2024/04/29: Updated the model loading process and merged the trained parameters of VideoChat2 into `hawkeye.pth`. Now only the checkpoints of vicuna-7b-v0 and `hawkeye.pth` are needed to load HawkEye.
## Introduction
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations about simple videos. However, they perform close to random at grounding text queries in long and complicated videos, showing little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images.

We propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data applicable to temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives for video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than alternatives.
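For intuition, a coarse-grained segment representation replaces exact timestamps with a small set of textual spans that an LLM can reliably emit and that are easy to parse. The sketch below only illustrates the general idea; the labels and thresholds are assumptions for illustration, not necessarily the exact scheme or prompt wording used by HawkEye:

```python
# Illustrative sketch: map a (start, end) segment to a coarse textual span.
# The labels and thresholds below are assumptions for illustration, not
# necessarily the exact scheme HawkEye is trained with.

def coarse_span(start: float, end: float, duration: float) -> str:
    """Describe a video segment with a coarse, text-friendly label."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    coverage = (end - start) / duration      # fraction of the video covered
    center = (start + end) / 2 / duration    # relative position of the segment center

    if coverage > 0.75:
        return "throughout the video"
    if center < 1 / 3:
        return "in the beginning of the video"
    if center < 2 / 3:
        return "in the middle of the video"
    return "at the end of the video"


if __name__ == "__main__":
    # A segment from 10s to 30s in a 120s video:
    print(coarse_span(10, 30, 120))  # -> "in the beginning of the video"
```

Because the model only has to produce one of a few phrases instead of exact timestamps, its outputs are easier to parse and less sensitive to small localization errors; finer boundaries can then be recovered iteratively, which is the intuition behind the recursive grounding procedure used at test time (see the Testing section).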
## Datasets and Models
We release our HawkEye and our implementation of VideoChat2 model checkpoints, as well as the InternVid-G dataset, at 🤗HuggingFace.
## Demo
Live demo: in progress.

You can use `demo.ipynb` to test HawkEye on your own data.
## Training
### Download model checkpoints
- Create a directory `model/` for model checkpoints: `mkdir model/`
- Follow here to prepare vicuna-7b-v0
- Download the HawkEye checkpoint
- (Optional) If you want to reproduce the instruction tuning process, download `umt_l16_qformer.pth` and `videochat2_7b_stage2.pth` from VideoChat2
After downloading all model checkpoints, the `model/` folder should look like this:
```
├── hawkeye.pth
├── vicuna-7b-v0/
└── VideoChat2/ (optional)
    ├── umt_l16_qformer.pth
    └── videochat2_7b_stage2.pth
```
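As an optional sanity check before training, you can verify this layout programmatically. The snippet below is just a convenience sketch based on the tree above, not a script shipped with this repo:

```python
# Optional sanity check for the model/ layout described above (not part of the repo).
from pathlib import Path

required = [
    "model/hawkeye.pth",
    "model/vicuna-7b-v0",                        # directory with the vicuna-7b-v0 weights
]
optional = [                                     # only needed to reproduce instruction tuning
    "model/VideoChat2/umt_l16_qformer.pth",
    "model/VideoChat2/videochat2_7b_stage2.pth",
]

for path in required:
    assert Path(path).exists(), f"missing required checkpoint: {path}"
for path in optional:
    if not Path(path).exists():
        print(f"note: optional checkpoint not found: {path}")
print("model/ layout looks complete")
```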
### Data preparation
Download from the Dataset Homepage at 🤗HuggingFace and save it in the `data/HawkEye-IT/` folder. We also provide data processing code in `data_preparing/`, which you can use for reference.

Note that you also need to download the videos of each dataset from their original links, which is further explained on the dataset homepage (this may take quite a while). Use soft links to link the video folders under `data/videos/`, for example as sketched below.
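The snippet below is a sketch of creating those soft links; the source paths are placeholders for wherever you stored the downloaded videos, and the folder names follow the tree shown next:

```python
# Sketch: soft-link downloaded video folders under data/videos/ (not part of the repo).
import os
from pathlib import Path

# Placeholder source paths: point these at wherever you downloaded each dataset's videos.
video_sources = {
    "internvid-g": "/path/to/downloaded/internvid-g",
    "charades": "/path/to/downloaded/charades",
    "activitynet": "/path/to/downloaded/activitynet",
    # ... add the remaining datasets listed in the tree below
}

dst_root = Path("data/videos")
dst_root.mkdir(parents=True, exist_ok=True)

for name, src in video_sources.items():
    dst = dst_root / name
    if not dst.exists():
        os.symlink(os.path.abspath(src), dst, target_is_directory=True)
        print(f"linked {dst} -> {src}")
```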
After data preparation, the `data/` folder should look like this:
```
├── HawkEye-IT/
│   ├── image/  # inherited from VideoChat2-IT, but not used in training HawkEye
│   └── video/
│       ├── temporal/
│       │   ├── internvid_grounding/, charades_sta_grounding/, anetc_grounding/
│       │   │   └── instructions.json, questions.json, train.json
│       │   └── internvid_caption/
│       │       └── instructions.json, train.json
│       └── caption/, classification/, conversation/, vqa/, reasoning/
└── videos/
    ├── internvid-g/, clevrer/, webvid/, activitynet/, tgif/,
    └── nextqa/, textvr/, youcook2/, kinetics/, ssv2/, charades/
```
Note that the `image/`, `caption/`, `classification/`, `conversation/`, `vqa/`, and `reasoning/` folders of HawkEye-IT are identical to those in VideoChat2-IT.
### Run the instruction tuning process
```bash
bash ./scripts/train/run_7b_stage3.sh OUTPUT_PATH
```
The instruction-tuned HawkEye checkpoint will be saved at `OUTPUT_PATH/ckpt_${ckpt}.pth`, where `${ckpt}` is the number of training iterations.

Check the script to ensure the hyperparameters fit your computing device.
### Run the finetuning process
We also provide scripts to finetune on Charades-STA and ActivityNet-Captions:
```bash
# IT_CKPT: the instruction-tuned HawkEye checkpoint
bash ./scripts/train/charades_sta.sh OUTPUT_PATH IT_CKPT
bash ./scripts/train/anetc.sh OUTPUT_PATH IT_CKPT
```
Check the scripts to ensure the hyperparameters fit your computing device.
## Testing
### Data preparation
- Download MVBench and save it in the `data/MVBench/` folder.
- Download the annotations of the other benchmarks from Google Drive and unzip them to `data/test-anno/`. We also provide data processing code in `data_preparing/`, which you can use for reference.
- Download the TVQA videos and link them at `data/videos/tvqa`.
After downloading all benchmarks, the `data/` folder should look like this:
```
├── HawkEye-IT/  # instruction tuning datasets
├── MVBench/
├── test-anno/
│   ├── charades_sta-recursive_grounding.json, anetc-recursive_grounding.json
│   ├── nextgqa-recursive_grounding.json
│   └── nextqa-test.json, tvqa-test.json, star-test.json
└── videos/
    └── nextqa/, tvqa/, charades/, activitynet/, ...
```
### Test on video QA benchmarks
```bash
bash ./scripts/test/videoqa.sh
```
Refer to `data_preparing/videoqa.py` to convert the model outputs to the format required by STAR evaluation and TVQA evaluation w/ ts.
### Test on temporal video grounding benchmarks with recursive grounding
```bash
bash ./scripts/test/recursive_grounding.sh
```
To analyze the results of each recursive grounding step, refer to `data_preparing/check_grounding_results.ipynb`.
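For intuition, recursive grounding repeatedly asks the model which coarse part of the current window contains the query and then zooms into that part, so a few coarse answers narrow down to a precise segment. The sketch below only illustrates this control flow; `ask_hawkeye_for_coarse_span` is a hypothetical placeholder for the model call made in the test script, and the span-to-window mapping is an assumption for illustration:

```python
# Illustration of the recursive grounding control flow (not the repo's actual code).

def ask_hawkeye_for_coarse_span(video, query, start, end):
    """Hypothetical placeholder: query HawkEye about the clip between `start` and
    `end` (seconds) and parse its coarse answer into one of
    'beginning', 'middle', 'end', or 'whole'."""
    raise NotImplementedError

def recursive_grounding(video, query, duration, max_steps=3):
    start, end = 0.0, duration
    for _ in range(max_steps):
        answer = ask_hawkeye_for_coarse_span(video, query, start, end)
        if answer == "whole":  # the query spans the whole current window; stop zooming
            break
        # Zoom into the half of the current window indicated by the answer
        # (the exact mapping here is an assumption for illustration).
        length = end - start
        if answer == "beginning":
            end = start + length / 2
        elif answer == "end":
            start = end - length / 2
        else:  # "middle"
            start, end = start + length / 4, end - length / 4
    return start, end
```

Each step roughly halves the candidate window, so only a few steps are needed; `data_preparing/check_grounding_results.ipynb` shows how the prediction evolves across these steps.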
## Citation
If you find this code useful in your research, please consider citing:
```bibtex
@misc{wang2024hawkeye,
      title={HawkEye: Training Video-Text LLMs for Grounding Text in Videos},
      author={Yueqian Wang and Xiaojun Meng and Jianxin Liang and Yuxuan Wang and Qun Liu and Dongyan Zhao},
      year={2024},
      eprint={2403.10228},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
## Acknowledgments
This project is based on VideoChat and VideoChat2. Thanks for their great work!