Awesome-LLMs-for-Video-Understanding
🔥🔥🔥 Video Understanding with Large Language Models: A Survey
Yunlong Tang<sup>1,*</sup>, Jing Bi<sup>1,*</sup>, Siting Xu<sup>2,*</sup>, Luchuan Song<sup>1</sup>, Susan Liang<sup>1</sup>, Teng Wang<sup>2,3</sup>, Daoan Zhang<sup>1</sup>, Jie An<sup>1</sup>, Jingyang Lin<sup>1</sup>, Rongyi Zhu<sup>1</sup>, Ali Vosoughi<sup>1</sup>, Chao Huang<sup>1</sup>, Zeliang Zhang<sup>1</sup>, Pinxin Liu<sup>1</sup>, Mingqian Feng<sup>1</sup>, Feng Zheng<sup>2</sup>, Jianguo Zhang<sup>2</sup>, Ping Luo<sup>3</sup>, Jiebo Luo<sup>1</sup>, Chenliang Xu<sup>1,†</sup>. (*Core Contributors, †Corresponding Authors)
<sup>1</sup>University of Rochester, <sup>2</sup>Southern University of Science and Technology, <sup>3</sup>The University of Hong Kong
📢 News
[07/23/2024]
📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!
✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.
What's New in This Update:
- Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
- Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
- Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
- Added a new Training Strategies chapter, removing adapters as a factor for model classification.
- Redesigned all figures and tables.
Multiple minor updates will follow this major update, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️
<font size=5><center><b> Table of Contents </b> </center></font>
- Awesome-LLMs-for-Video-Understanding
  - Why do we need Vid-LLMs?
  - Vid-LLMs: Models
Citation
If you find our survey useful for your research, please cite the following paper:
@article{vidllmsurvey,
title={Video Understanding with Large Language Models: A Survey},
author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
journal={arXiv preprint arXiv:2312.17432},
year={2023},
}
Taxonomy 1
Video Analyzer × LLM
LLM as Summarizer
LLM as Manager
Video Embedder × LLM
LLM as Text Decoder
LLM as Regressor
LLM as Hidden Layer
<!-- | [**title**](link) | model | date | [code](link) | venue | -->
Title | Model | Date | Code | Venue |
---|---|---|---|---|
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | VTG-LLM | 05/2024 | code | arXiv |
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | VITRON | 04/2024 | project page | NeurIPS |
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT | VTG-GPT | 03/2024 | code | arXiv |
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | Momentor | 02/2024 | code | ICML |
Detours for Navigating Instructional Videos | VidDetours | 01/2024 | code | CVPR |
OneLLM: One Framework to Align All Modalities with Language | OneLLM | 12/2023 | code | arXiv |
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | code | ACMMM |
(Analyzer + Embedder) × LLM
LLM as Manager
Title | Model | Date | Code | Venue |
---|---|---|---|---|
MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
LLM as Summarizer
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | SUM-shot | 12/2023 | code | arXiv |
LLM as Regressor
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Vript: A Video Is Worth Thousands of Words | Vriptor | 06/2024 | code | NeurIPS |
Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | project page | ECCV |
VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code | arXiv |
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
LLM as Text Decoder
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Contextual AD Narration with Interleaved Multimodal Sequence | Uni-AD | 03/2024 | code | arXiv |
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning | MM-Narrator | 11/2023 | project page | arXiv |
Vamos: Versatile Action Models for Video Understanding | Vamos | 11/2023 | project page | ECCV |
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | project page | ICCV |
LLM as Hidden Layer
Title | Model | Date | Code | Venue |
---|---|---|---|---|
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
Taxonomy 2
LLM-based Video Agents
Vid-LLM Pretraining
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |
Vid-LLM Instruction Tuning
Fine-tuning with Connective Adapters
Fine-tuning with Insertive Adapters
Title | Model | Date | Code | Venue |
---|---|---|---|---|
Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |
Fine-tuning with Hybrid Adapters
Title | Model | Date | Code | Venue |
---|---|---|---|---|
VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |
Hybrid Methods
Title | Model | Date | Code | Venue |
---|---|---|---|---|
VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code / demo | arXiv |
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | CVPR |
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |
Training-free Methods
Title | Model | Date | Code | Venue |
---|---|---|---|---|
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | SlowFast-LLaVA | 07/2024 | - | arXiv |
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | TS-LLaVA | 11/2024 | code | arXiv |
Tasks, Datasets, and Benchmarks
Recognition and Anticipation
Name | Paper | Date | Link | Venue |
---|---|---|---|---|
Charades | Hollywood in homes: Crowdsourcing data collection for activity understanding | 2016 | Link | ECCV |
YouTube-8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
Captioning and Description
Grounding and Retrieval
Name | Paper | Date | Link | Venue |
---|---|---|---|---|
Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |
Question Answering
Name | Paper | Date | Link | Venue |
---|---|---|---|---|
MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |
MAD-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |
Ego-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |
Video Instruction Tuning
Pretraining Dataset
Name | Paper | Date | Link | Venue |
---|---|---|---|---|
VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |
Fine-tuning Dataset
Name | Paper | Date | Link | Venue |
---|---|---|---|---|
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |
TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023 | Link | CVPR |
Video-based Large Language Models Benchmark
Contributing
We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors you may find. Please make sure your pull requests follow the `Title | Model | Date | Code | Venue` table format (an example row is shown below). Thank you for your valuable contributions!
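For reference, a new row could look like the sketch below; the paper title, arXiv link, model name, date, and repository URL are placeholders rather than a real entry:

```markdown
<!-- Placeholder row: replace the title, links, model name, date, and venue with the real paper's details. -->
| [**Example-LLM: A Video Understanding Model**](https://arxiv.org/abs/0000.00000) | Example-LLM | 01/2025 | [code](https://github.com/username/example-llm) | arXiv |
```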
Star History
♥️ Contributors
Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.
Yunlong Tang @ University of Rochester
Jing Bi @ University of Rochester
Siting Xu @ Southern University of Science and Technology
Luchuan Song @ University of Rochester
Susan Liang @ University of Rochester
Teng Wang @ The University of Hong Kong
Daoan Zhang @ University of Rochester
Jie An @ University of Rochester
Jingyang Lin @ University of Rochester
Rongyi Zhu @ University of Rochester
Ali Vosoughi @ University of Rochester
Chao Huang @ University of Rochester
Zeliang Zhang @ University of Rochester
Pinxin Liu @ University of Rochester
Mingqian Feng @ University of Rochester
Feng Zheng @ Southern University of Science and Technology
Jianguo Zhang @ Southern University of Science and Technology
Ping Luo @ The University of Hong Kong
Jiebo Luo @ University of Rochester
Chenliang Xu @ University of Rochester