<p align="center" width="100%"> <a target="_blank"><img src="example/material/stllm_logo.png" alt="ST-LLM" style="width: 50%; min-width: 150px; display: block; margin: auto;"></a> </p>

<h2 align="center"> <a href="https://arxiv.org/abs/2404.00308">ST-LLM: Large Language Models Are Effective Temporal Learners</a></h2>

## News :loudspeaker:
- [2024/3/28] All code and weights are now available! Watch this repository for the latest updates.
## Introduction :bulb:
- ST-LLM is a temporal-sensitive video large language model. Our model incorporates three key architectural designs:
- (1) Joint spatial-temporal modeling within large language models for effective video understanding.
- (2) A dynamic masking strategy with masked video modeling for efficiency and robustness.
- (3) A global-local input module for long video understanding (a conceptual sketch of (2) and (3) follows below).
- ST-LLM establishes new state-of-the-art results on MVBench, VideoChatGPT Bench, and VideoQA Bench.
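The sketch below illustrates designs (2) and (3) at a conceptual level only. It is not the authors' implementation: the function names, the `mask_ratio` and `num_local_frames` parameters, and the tensor shapes are all illustrative assumptions.

```python
# Conceptual sketch (not the ST-LLM source code): dynamic token masking and a
# global-local input for long videos, assuming per-frame tokens from a visual encoder.
import torch

def dynamic_mask(video_tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a (1 - mask_ratio) subset of flattened video tokens [L, D]."""
    L = video_tokens.shape[0]
    num_keep = max(1, int(L * (1.0 - mask_ratio)))
    # Sample a random subset, then re-sort indices to preserve temporal order.
    keep_idx = torch.randperm(L)[:num_keep].sort().values
    return video_tokens[keep_idx]

def global_local_input(frame_tokens: torch.Tensor, num_local_frames: int = 16) -> torch.Tensor:
    """Combine a pooled global view with uniformly sampled local frames.

    frame_tokens: [T, N, D] tokens for a long video (T frames, N tokens per frame).
    """
    T, N, D = frame_tokens.shape
    global_view = frame_tokens.mean(dim=0)            # [N, D]: average over all frames
    stride = max(1, T // num_local_frames)
    local_view = frame_tokens[::stride].reshape(-1, D)  # sampled frames, flattened
    return torch.cat([global_view, local_view], dim=0)  # concatenated visual input

# Example: 64 frames, 32 tokens per frame, 768-dim features.
tokens = torch.randn(64, 32, 768)
llm_visual_input = dynamic_mask(global_local_input(tokens), mask_ratio=0.5)
```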
## Demo 🤗
Please download the conversation weights from here and follow the instructions in Installation first. Then, run the Gradio demo:
```bash
CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight
```
We have also prepared a local script that is easy to modify: demo.py.
<div align=center> <img src="example/material/Mabaoguo.gif" width="70%" /> </div>
<div align=center> <img src="example/material/Driving.gif" width="70%" /> </div>

## Examples 👀
- Video Description: for difficult videos with complex scene changes, ST-LLM can accurately describe all of the content.
- Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.
- Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.

<p align="center"> <img src="example/BaoguoMa.gif" width="26%" style="display:inline-block" /> <img src="example/baoguoma.jpg" width="66%" style="display:inline-block" /> </p>
## Installation 🛠️
Git clone our repository, create a Python environment, and activate it via the following commands:
```bash
git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
```
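As an optional sanity check (illustrative; it assumes PyTorch is pulled in by requirement.txt), confirm that the environment can see your GPU:

```bash
# Print the installed PyTorch version and whether CUDA is available
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```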
## Training & Validation :bar_chart:
Instructions for data preparation, training, and evaluation can be found in trainval.md.
## Acknowledgement 👍
- Video-ChatGPT and MVBench: great work contributing video LLM benchmarks.
- InstructBLIP and MiniGPT-4: the codebase and the base image LLMs we built upon.
## Citation ✏️
If you find the code and paper useful for your research, please consider starring this repo and citing our paper:
```bibtex
@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}

@article{liu2024stllm,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}
```