Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Video Understanding with Large Language Models: A Survey

Yunlong Tang<sup>1,*</sup>, Jing Bi<sup>1,*</sup>, Siting Xu<sup>2,*</sup>, Luchuan Song<sup>1</sup>, Susan Liang<sup>1</sup> , Teng Wang<sup>2,3</sup> , Daoan Zhang<sup>1</sup> , Jie An<sup>1</sup> , Jingyang Lin<sup>1</sup> , Rongyi Zhu<sup>1</sup> , Ali Vosoughi<sup>1</sup> , Chao Huang<sup>1</sup> , Zeliang Zhang<sup>1</sup> , Pinxin Liu<sup>1</sup> , Mingqian Feng<sup>1</sup> , Feng Zheng<sup>2</sup> , Jianguo Zhang<sup>2</sup> , Ping Luo<sup>3</sup> , Jiebo Luo<sup>1</sup>, Chenliang Xu<sup>1,†</sup>. (*Core Contributors, †Corresponding Authors)

<sup>1</sup>University of Rochester, <sup>2</sup>Southern University of Science and Technology, <sup>3</sup>The University of Hong Kong

<h5 align="center">

Paper | Project Page

</h5>


📢 News

[07/23/2024]

📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:

- ✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
- ✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
- ✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
- ✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
- ✅ All figures and tables have been redesigned.

Several minor updates will follow this major one, and the GitHub repository will be updated gradually. We hope you enjoy reading the survey and welcome your feedback ❤️

<font size=5><center><b> Table of Contents </b> </center></font>

Why do we need Vid-LLMs?


😎 Vid-LLMs: Models


📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🗒️ Taxonomy

<!-- Template: | [**Paper Title**](paper link) | Model Name | Date | [code](code link) | Venue | -->

🕹️ Video Analyzer × LLM

LLM as Summarizer

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Seeing the Unseen: Visual Metaphor Captioning for Videos | GIT-LLaVA | 06/2024 | code | arXiv |
| Zero-shot long-form video understanding through screenplay | MM-Screenplayer | 06/2024 | project page | CVPR |
| MoReVQA exploring modular reasoning models for video question answering | MoReVQA | 04/2024 | project page | CVPR |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | IG-VLM | 03/2024 | code | arXiv |
| Language repository for long video understanding | LangRepo | 03/2024 | code | arXiv |
| Understanding long videos in one multimodal language model pass | MVU | 03/2024 | code | arXiv |
| Video ReCap recursive captioning of hour-long videos | Video ReCap | 02/2024 | code | CVPR |
| A Simple LLM Framework for Long-Range Video Question-Answering | LLoVi | 12/2023 | code | arXiv |
| Grounding-prompter prompting LLM with multimodal information for temporal sentence grounding in long videos | Grounding-prompter | 12/2023 | code | arXiv |
| Learning object state changes in videos an open-world perspective | VIDOSC | 12/2023 | code | CVPR |
| AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? | AntGPT | 07/2023 | code | ICLR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |

LLM as Manager

<!-- Template: | [**Paper Title**](paper link) | Model Name | Date | [code](code link) | Venue | -->
| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| DrVideo: Document Retrieval Based Long Video Understanding | DrVideo | 06/2024 | code | arXiv |
| OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer | OmAgent | 06/2024 | code | arXiv |
| Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA | LVNet | 06/2024 | code | arXiv |
| VideoTree adaptive tree-based video representation for LLM reasoning on long videos | VideoTree | 05/2024 | code | arXiv |
| Harnessing Large Language Models for Training-free Video Anomaly Detection | LAVAD | 04/2024 | code | CVPR |
| TraveLER a multi-LMM agent framework for video question-answering | TraveLER | 04/2024 | code | arXiv |
| GPTSee enhancing moment retrieval and highlight detection via description-based similarity features | GPTSee | 03/2024 | code | arXiv |
| Reframe anything LLM agent for open world video reframing | RAVA | 03/2024 | code | arXiv |
| SCHEMA state CHangEs MAtter for procedure planning in instructional videos | SCHEMA | 03/2024 | code | ICLR |
| TV-TREES multimodal entailment trees for neuro-symbolic video reasoning | TV-TREES | 02/2024 | code | arXiv |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |
| VideoAgent long-form video understanding with large language model as agent | VideoAgent | 03/2024 | code | arXiv |
| VURF a general-purpose reasoning and self-refinement framework for video understanding | VURF | 03/2024 | code | arXiv |
| Why not use your textbook knowledge-enhanced procedure planning of instructional videos | KEPP | 03/2024 | code | CVPR |
| DoraemonGPT toward understanding dynamic scenes with large language models | DoraemonGPT | 01/2024 | code | arXiv |
| LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos | LifelongMemory | 12/2023 | code | arXiv |
| Zero-Shot Video Question Answering with Procedural Programs | ProViQ | 12/2023 | code | arXiv |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | AssistGPT | 06/2023 | code | arXiv |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| ViperGPT: Visual Inference via Python Execution for Reasoning | ViperGPT | 03/2023 | code | arXiv |
| Hawk: Learning to Understand Open-World Video Anomalies | Hawk | 05/2024 | code | arXiv |

🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | - | arXiv |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | NaVid | 02/2024 | project page | RSS |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |

👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens | VISTA-LLAMA | 12/2023 | - | arXiv |
| Audio-Visual LLM for Video Understanding | - | 12/2023 | - | arXiv |
| AutoAD: Movie Description in Context | AutoAD | 06/2023 | code | CVPR |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| AutoAD III: The Prequel - Back to the Pixels | AutoAD III | 04/2024 | - | CVPR |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |
| VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | VideoLLaMA2 | 06/2024 | code | arXiv |

Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | CVPR |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |

🦾 Training-free Methods

| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | SlowFast-LLaVA | 07/2024 | - | arXiv |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Charades | Hollywood in homes: Crowdsourcing data collection for activity understanding | 2016 | Link | ECCV |
| YouTube8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
| ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
| Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
| Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |

Captioning and Description

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Microsoft Research Video Description Corpus (MSVD) | Collecting Highly Parallel Data for Paraphrase Evaluation | 2011 | Link | ACL |
| Microsoft Research Video-to-Text (MSR-VTT) | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 2016 | Link | CVPR |
| Tumblr GIF (TGIF) | TGIF: A New Dataset and Benchmark on Animated GIF Description | 2016 | Link | CVPR |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| Charades-Ego | Actor and Observer: Joint Modeling of First and Third-Person Videos | 2018 | Link | CVPR |
| ActivityNet Captions | Dense-Captioning Events in Videos | 2017 | Link | ICCV |
| HowTo100M | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019 | Link | ICCV |
| Movie Audio Descriptions (MAD) | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | 2021 | Link | CVPR |
| YouCook2 | Towards Automatic Learning of Procedures from Web Instructional Videos | 2017 | Link | AAAI |
| MovieNet | MovieNet: A Holistic Dataset for Movie Understanding | 2020 | Link | ECCV |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| Video Timeline Tags (ViTT) | Multimodal Pretraining for Dense Video Captioning | 2020 | Link | AACL-IJCNLP |
| TVSum | TVSum: Summarizing web videos using titles | 2015 | Link | CVPR |
| SumMe | Creating Summaries from User Videos | 2014 | Link | ECCV |
| VideoXum | VideoXum: Cross-modal Visual and Textural Summarization of Videos | 2023 | Link | IEEE Trans Multimedia |
| Multi-Source Video Captioning (MSVC) | VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024 | Link | arXiv |

Grounding and Retrieval

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
| VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
| Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
| Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
| Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
| DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |

Question Answering

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
| ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
| Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
| TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |
| MAD-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |
| Ego-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |

Video Instruction Tuning

Pretraining Dataset

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |

Fine-tuning Dataset

| Name | Paper | Date | Link | Venue |
|------|-------|------|------|-------|
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
| VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |
| TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023 | Link | CVPR |

Video-based Large Language Models Benchmark

| Title | Date | Code | Venue |
|-------|------|------|-------|
| LVBench: An Extreme Long Video Understanding Benchmark | 06/2024 | code | - |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | 11/2023 | code | - |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | 05/2023 | code | NeurIPS 2023, ICCV 2023 Workshop |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 07/2023 | code | - |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | 11/2023 | code | NeurIPS 2023 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | 12/2023 | code | - |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 12/2023 | code | - |
| TempCompass: Do Video LLMs Really Understand Videos? | 03/2024 | code | ACL 2024 |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | 06/2024 | code | - |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | 06/2024 | code | - |

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and other helpful materials, or to correct any errors you find. Please make sure your pull requests follow the "Title | Model | Date | Code | Venue" format, as in the example below. Thank you for your valuable contributions!
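As a reference, a minimal sketch of what a new row added to one of the model tables could look like (the paper title, model name, and URLs here are placeholders for illustration only; replace them with the real entry):

```markdown
| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| [**An Example Vid-LLM Paper**](https://arxiv.org/abs/0000.00000) | ExampleModel | 01/2025 | [code](https://github.com/example/example-repo) | arXiv |
```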

🌟 Star History

Star History Chart

♥️ Contributors

<a href="https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding/graphs/contributors"> <img src="https://contrib.rocks/image?repo=yunlong10/Awesome-LLMs-for-Video-Understanding" /> </a>