<p align="center" width="100%"> <a target="_blank"><img src="example/material/stllm_logo.png" alt="ST-LLM" style="width: 50%; min-width: 150px; display: block; margin: auto;"></a> </p> <h2 align="center"> <a href="https://arxiv.org/abs/2404.00308">ST-LLM: Large Language Models Are Effective Temporal Learners</a></h2> <h5 align=center>

hf arXiv License

</h5>


News :loudspeaker:

Introduction :bulb:

The table below compares ST-LLM with prior video LLMs on MVBench, VcgBench, and VideoQABench:

<div align="center"> <table border="1" width="100%"> <tr align="center"> <th rowspan="2">Method</th><th rowspan="2">MVBench</th><th colspan="6">VcgBench</th><th colspan="3">VideoQABench</th> </tr> <tr align="center"> <th>Avg</th><th>Correct</th><th>Detail</th><th>Context</th><th>Temporal</th><th>Consist</th><th>MSVD</th><th>MSRVTT</th><th>ANet</th> </tr> <tr align="center"> <td>VideoLLaMA</td><td>34.1</td><td>1.96</td><td>2.18</td><td>2.16</td><td>1.82</td><td>1.79</td><td>1.98</td><td>51.6</td><td>29.6</td><td>12.4</td> </tr> <tr align="center"> <td>LLaMA-Adapter</td><td>31.7</td><td>2.03</td><td>2.32</td><td>2.30</td><td>1.98</td><td>2.15</td><td>2.16</td><td>54.9</td><td>43.8</td><td>34.2</td> </tr> <tr align="center"> <td>VideoChat</td><td>35.5</td><td>2.23</td><td>2.50</td><td>2.53</td><td>1.94</td><td>2.24</td><td>2.29</td><td>56.3</td><td>45.0</td><td>26.5</td> </tr> <tr align="center"> <td>VideoChatGPT</td><td>32.7</td><td>2.38</td><td>2.40</td><td>2.52</td><td>2.62</td><td>1.98</td><td>2.37</td><td>64.9</td><td>49.3</td><td>35.7</td> </tr> <tr align="center"> <td>MovieChat</td><td>-</td><td>2.76</td><td>2.93</td><td>3.01</td><td>2.24</td><td>2.42</td><td>2.67</td><td>74.2</td><td>52.7</td><td>45.7</td> </tr> <tr align="center"> <td>Vista-LLaMA</td><td>-</td><td>2.44</td><td>2.64</td><td>3.18</td><td>2.26</td><td>2.31</td><td>2.57</td><td>65.3</td><td>60.5</td><td>48.3</td> </tr> <tr align="center"> <td>LLaMA-VID</td><td>-</td><td>2.89</td><td>2.96</td><td>3.00</td><td>3.53</td><td>2.46</td><td>2.51</td><td>69.7</td><td>57.7</td><td>47.4</td> </tr> <tr align="center"> <td>Chat-UniVi</td><td>-</td><td>2.99</td><td>2.89</td><td>2.91</td><td>3.46</td><td>2.89</td><td>2.81</td><td>65.0</td><td>54.6</td><td>45.8</td> </tr> <tr align="center"> <td>VideoChat2</td><td>51.1</td><td>2.98</td><td>3.02</td><td>2.88</td><td>3.51</td><td>2.66</td><td>2.81</td><td>70.0</td><td>54.1</td><td>49.1</td> </tr> <tr align="center"> <td>ST-LLM</td><td><b>54.9</b></td><td><b>3.15</b></td><td><b>3.23</b></td><td><b>3.05</b></td><td><b>3.74</b></td><td><b>2.93</b></td><td><b>2.81</b></td><td><b>74.6</b></td><td><b>63.2</b></td><td><b>50.9</b></td> </tr> </table> </div>

Demo 🤗

Please download the conversation weights from here and follow the instructions in the Installation section first. Then, run the Gradio demo:

CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight
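Once the server starts, Gradio serves the chat interface locally (by default at http://127.0.0.1:7860, unless demo_gradio.py configures a different host or port).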

We have also prepared a local script that is easy to modify: demo.py
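For reference, a local run might look like the following. This is only a sketch that assumes demo.py accepts the same --ckpt-path argument as demo_gradio.py; check the script for its actual arguments:

# hypothetical invocation; verify the argument names in demo.py
CUDA_VISIBLE_DEVICES=0 python3 demo.py --ckpt-path /path/to/STLLM_conversation_weight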

<div align=center> <img src="example/material/Mabaoguo.gif" width="70%" /> </div> <div align=center> <img src="example/material/Driving.gif" width="70%" /> </div>

Examples 👀

<p align="center"> <img src="example/driving.gif" width="25%" style="display:inline-block" /> <img src="example/driving.jpg" width="65%" style="display:inline-block" /> </p> <p align="center"> <img src="example/cooking.gif" width="21%" style="display:inline-block" /> <img src="example/cooking.jpg" width="68%" style="display:inline-block" /> </p> <p align="center"> <img src="example/TVshow.gif" width="21%" style="display:inline-block" /> <img src="example/TVshow.jpg" width="68%" style="display:inline-block" /> </p> <p align="center"> <img src="example/monkey.gif" width="21%" style="display:inline-block" /> <img src="example/monkey.jpg" width="68%" style="display:inline-block" /> </p> </p>

Installation 🛠️

Clone our repository, create a Python environment, and activate it with the following commands:

git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
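Optionally, run a quick sanity check after installation. This assumes PyTorch is installed by requirement.txt (the demo and training scripts depend on it); if it is not, install it first following the PyTorch documentation:

# optional check: confirm the stllm environment sees PyTorch and a CUDA device
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"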

Training & Validation :bar_chart:

Instructions for data preparation, training, and evaluation can be found in trainval.md.

Acknowledgement 👍

Citation ✏️

If you find the code and paper useful for your research, please consider starring this repo and citing our paper:

@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}
@article{liu2024stllm,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}