VTimeLLM [Paper]
Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
:loudspeaker: Latest Updates
- Jan-2: Thanks to Xiao Xia, Shengbo Tong and Beining Wang, we have refactored the code to support both the LLaMA and ChatGLM3 architectures. We also translated the training data into Chinese and fine-tuned a Chinese version based on ChatGLM3-6b.
- Dec-14: Released the training code and data. All the resources including models, datasets and extracted features are available here. :fire::fire:
- Dec-4: Released the VTimeLLM demo.
VTimeLLM Overview :bulb:
VTimeLLM is a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries.
VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents.
Contributions :trophy:
- We propose VTimeLLM, the first boundary-aware Video LLM, to the best of our knowledge.
- We propose a boundary-aware three-stage training strategy, which consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with temporal single-turn and multi-turn QA to enhance awareness of time boundaries, and iii) instruction tuning on a high-quality dialog dataset for better temporal reasoning ability.
- We conduct extensive experiments to demonstrate that the proposed VTimeLLM significantly outperforms existing Video LLMs in various fine-grained temporal-related video tasks, showing its superior ability for video understanding and reasoning.
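As a rough illustration of the three-stage strategy described above, the stages can be written down as an ordered schedule. This is only a sketch for orientation: the `Stage` class, `STAGES` tuple, and `describe_schedule` function are hypothetical names, not part of the VTimeLLM codebase, and the actual training entry points are documented in train.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, not the repository's API)."""
    index: int
    data: str
    goal: str


# Stage descriptions paraphrase the README text; they are not real config values.
STAGES = (
    Stage(1, "large-scale image-text pairs", "feature alignment"),
    Stage(2, "multi-event video-text data with single-turn and multi-turn temporal QA",
          "time-boundary awareness"),
    Stage(3, "high-quality dialog dataset (instruction tuning)",
          "temporal reasoning and alignment with human intent"),
)


def describe_schedule(stages=STAGES):
    """Return one summary line per stage, in the order the stages are run."""
    ordered = sorted(stages, key=lambda s: s.index)
    return [f"Stage {s.index}: {s.data} -> {s.goal}" for s in ordered]


if __name__ == "__main__":
    for line in describe_schedule():
        print(line)
```

The stages run consecutively, so each one starts from the result of the previous one.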
Installation :wrench:
We recommend setting up a conda environment for the project:
conda create --name=vtimellm python=3.10
conda activate vtimellm
git clone https://github.com/huangb23/VTimeLLM.git
cd VTimeLLM
pip install -r requirements.txt
For training, install these additional packages:
pip install ninja
pip install flash-attn --no-build-isolation
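To confirm that the optional training packages are visible in the active environment, a small check along the following lines may help. It is a minimal sketch, not part of the repository: `check_training_packages` is a hypothetical helper, and the import names `ninja` and `flash_attn` are assumed to correspond to the pip packages `ninja` and `flash-attn`.

```python
import importlib.util

# Maps pip package name -> assumed import name (assumption, not taken from the repo).
TRAINING_MODULES = {"ninja": "ninja", "flash-attn": "flash_attn"}


def check_training_packages(modules=TRAINING_MODULES):
    """Return {pip_name: True/False} depending on whether each module can be located."""
    return {
        pip_name: importlib.util.find_spec(import_name) is not None
        for pip_name, import_name in modules.items()
    }


if __name__ == "__main__":
    for name, found in check_training_packages().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

The check only locates the modules without importing them, so it does not verify that `flash-attn` was built against a compatible CUDA toolchain.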
Running Demo Offline :cd:
To run the demo offline, please refer to the instructions in offline_demo.md.
Training :train:
For training instructions, check out train.md.
Qualitative Analysis :mag:
A Comprehensive Evaluation of VTimeLLM's Performance across Multiple Tasks.
Video Understanding and Conversational Tasks :speech_balloon:
Creative Tasks :paintbrush:
Fine-grained Understanding Tasks :globe_with_meridians:
Video Reasoning Tasks :question:
Acknowledgements :pray:
We are grateful to the following awesome projects, which VTimeLLM builds on:
- LLaVA: Large Language and Vision Assistant
- FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
- LLaMA: Open and Efficient Foundation Language Models
- Vid2seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- InternVid: A Large-scale Video-Text dataset
If you're using VTimeLLM in your research or applications, please cite using this BibTeX:
@inproceedings{huang2024vtimellm,
title={Vtimellm: Empower llm to grasp video moments},
author={Huang, Bin and Wang, Xin and Chen, Hong and Song, Zihan and Zhu, Wenwu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14271--14280},
year={2024}
}
License :scroll:
<a rel="license" href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/80x15.png" /></a>
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License</a>.
Looking forward to your feedback, contributions, and stars! :star2: