Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Video Understanding with Large Language Models: A Survey

Yunlong Tang<sup>1,*</sup>, Jing Bi<sup>1,*</sup>, Siting Xu<sup>2,*</sup>, Luchuan Song<sup>1</sup>, Susan Liang<sup>1</sup>, Teng Wang<sup>2,3</sup>, Daoan Zhang<sup>1</sup>, Jie An<sup>1</sup>, Jingyang Lin<sup>1</sup>, Rongyi Zhu<sup>1</sup>, Ali Vosoughi<sup>1</sup>, Chao Huang<sup>1</sup>, Zeliang Zhang<sup>1</sup>, Pinxin Liu<sup>1</sup>, Mingqian Feng<sup>1</sup>, Feng Zheng<sup>2</sup>, Jianguo Zhang<sup>2</sup>, Ping Luo<sup>3</sup>, Jiebo Luo<sup>1</sup>, Chenliang Xu<sup>1,†</sup>. (*Core Contributors, †Corresponding Authors)

<sup>1</sup>University of Rochester, <sup>2</sup>Southern University of Science and Technology, <sup>3</sup>The University of Hong Kong

<h5 align="center">

Paper | Project Page

</h5>


📢 News

[07/23/2024]

📢 We've recently updated our survey: "Video Understanding with Large Language Models: A Survey"!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update: <br>✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024. <br>✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality. <br>✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section. <br>✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification. <br>✅ All figures and tables have been redesigned.

Multiple minor updates will follow this major update, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️

<font size=5><center><b> Table of Contents </b> </center></font>

Why do we need Vid-LLMs?


😎 Vid-LLMs: Models


📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

πŸ—’οΈ Taxonomy 1

πŸ•ΉοΈ Video Analyzer Γ— LLM

LLM as Summarizer
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Seeing the Unseen: Visual Metaphor Captioning for Videos | GIT-LLaVA | 06/2024 | code | arXiv |
| Zero-shot long-form video understanding through screenplay | MM-Screenplayer | 06/2024 | project page | CVPR |
| MoReVQA exploring modular reasoning models for video question answering | MoReVQA | 04/2024 | project page | CVPR |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | IG-VLM | 03/2024 | code | arXiv |
| Language repository for long video understanding | LangRepo | 03/2024 | code | arXiv |
| Understanding long videos in one multimodal language model pass | MVU | 03/2024 | code | arXiv |
| Video ReCap recursive captioning of hour-long videos | Video ReCap | 02/2024 | code | CVPR |
| A Simple LLM Framework for Long-Range Video Question-Answering | LLoVi | 12/2023 | code | arXiv |
| Grounding-prompter prompting LLM with multimodal information for temporal sentence grounding in long videos | Grounding-prompter | 12/2023 | code | arXiv |
| Learning object state changes in videos an open-world perspective | VIDOSC | 12/2023 | code | CVPR |
| AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? | AntGPT | 07/2023 | code | ICLR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
LLM as Manager
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| DrVideo: Document Retrieval Based Long Video Understanding | DrVideo | 06/2024 | code | arXiv |
| OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer | OmAgent | 06/2024 | code | arXiv |
| Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA | LVNet | 06/2024 | code | arXiv |
| VideoTree adaptive tree-based video representation for LLM reasoning on long videos | VideoTree | 05/2024 | code | arXiv |
| Harnessing Large Language Models for Training-free Video Anomaly Detection | LAVAD | 04/2024 | code | CVPR |
| TraveLER a multi-LMM agent framework for video question-answering | TraveLER | 04/2024 | code | arXiv |
| GPTSee enhancing moment retrieval and highlight detection via description-based similarity features | GPTSee | 03/2024 | code | arXiv |
| Reframe anything LLM agent for open world video reframing | RAVA | 03/2024 | code | arXiv |
| SCHEMA state CHangEs MAtter for procedure planning in instructional videos | SCHEMA | 03/2024 | code | ICLR |
| TV-TREES multimodal entailment trees for neuro-symbolic video reasoning | TV-TREES | 02/2024 | code | arXiv |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |
| VideoAgent long-form video understanding with large language model as agent | VideoAgent | 03/2024 | code | arXiv |
| VURF a general-purpose reasoning and self-refinement framework for video understanding | VURF | 03/2024 | code | arXiv |
| Why not use your textbook knowledge-enhanced procedure planning of instructional videos | KEPP | 03/2024 | code | CVPR |
| DoraemonGPT toward understanding dynamic scenes with large language models | DoraemonGPT | 01/2024 | code | arXiv |
| LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos | LifelongMemory | 12/2023 | code | arXiv |
| Zero-Shot Video Question Answering with Procedural Programs | ProViQ | 12/2023 | code | arXiv |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | AssistGPT | 06/2023 | code | arXiv |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| ViperGPT: Visual Inference via Python Execution for Reasoning | ViperGPT | 03/2023 | code | arXiv |
| Hawk: Learning to Understand Open-World Video Anomalies | Hawk | 05/2024 | code | arXiv |

👾 Video Embedder × LLM

LLM as Text Decoder
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | AuroraCap | 10/2024 | project page | arXiv |
| Artemis towards referential understanding in complex videos | Artemis | 06/2024 | code | arXiv |
| EmoLLM multimodal emotional understanding meets large language models | EmoLLM | 06/2024 | code | arXiv |
| Fewer tokens and fewer videos extending video understanding abilities in large vision-language models | FTFV-LLM | 06/2024 | - | arXiv |
| Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | Flash-VStream | 06/2024 | code | arXiv |
| LLAVIDAL benchmarking large language vision models for daily activities of living | LLAVIDAL | 06/2024 | code | arXiv |
| Long context transfer from language to vision | LongVA | 06/2024 | code | arXiv |
| ShareGPT4Video improving video understanding and generation with better captions | ShareGPT4Video | 06/2024 | code | arXiv |
| Towards event-oriented long video understanding | VIM | 06/2024 | code | arXiv |
| Video-SALMONN speech-enhanced audio-visual large language models | Video-SALMONN | 06/2024 | code | ICML |
| VideoGPT+ integrating image and video encoders for enhanced video understanding | VideoGPT+ | 06/2024 | code | arXiv |
| VideoLLaMA 2 advancing spatial-temporal modeling and audio understanding in video-LLMs | VideoLLaMA 2 | 06/2024 | code | arXiv |
| MotionLLM: Understanding Human Behaviors from Human Motions and Videos | MotionLLM | 05/2024 | project page | arXiv |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | VideoChat2 | 11/2023 | code | CVPR |
| Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Shotluck Holmes | 05/2024 | - | arXiv |
| Streaming long video understanding with large language models | VideoStreaming | 05/2024 | - | arXiv |
| Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline | VideoNarrator | 05/2024 | - | arXiv |
| TOPA extend large language models for video understanding via text-only pre-alignment | TOPA | 05/2024 | code | NeurIPS |
| MovieChat+: Question-aware Sparse Memory for Long Video Question Answering | MovieChat+ | 04/2024 | code | arXiv |
| AutoAD III: The Prequel – Back to the Pixels | AutoAD III | 04/2024 | project page | CVPR |
| Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward | LLaVA-Hound-DPO | 04/2024 | code | arXiv |
| From image to video, what do we need in multimodal LLMs | RED-VILLM | 04/2024 | - | arXiv |
| Koala key frame-conditioned long video-LLM | Koala | 04/2024 | project page | CVPR |
| LongVLM efficient long video understanding via large language models | LongVLM | 04/2024 | code | ECCV |
| MA-LMM memory-augmented large multimodal model for long-term video understanding | MA-LMM | 04/2024 | code | CVPR |
| MiniGPT4-video advancing multimodal LLMs for video understanding with interleaved visual-textual tokens | MiniGPT4-Video | 04/2024 | code | arXiv |
| Pegasus-v1 technical report | Pegasus-v1 | 04/2024 | code | arXiv |
| PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | PLLaVA | 04/2024 | code | arXiv |
| ST-LLM: Large Language Models Are Effective Temporal Learners | ST-LLM | 04/2024 | code | arXiv |
| Tarsier recipes for training and evaluating large video description models | Tarsier | 07/2024 | code | arXiv |
| X-VARS introducing explainability in football refereeing with multi-modal large language model | X-VARS | 04/2024 | code | arXiv |
| CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | CAT | 03/2024 | code | arXiv |
| InternVideo2 scaling video foundation models for multimodal video understanding | InternVideo2 | 03/2024 | code | ECCV |
| MovieLLM enhancing long video understanding with AI-generated movies | MovieLLM | 03/2024 | code | arXiv |
| LLMs meet long video advancing long video comprehension with an interactive visual adapter in LLMs | IVAwithLLM | 02/2024 | code | arXiv |
| LSTP language-guided spatial-temporal prompt learning for long-form video-text understanding | LSTP | 02/2024 | code | EMNLP |
| LVCHAT facilitating long video comprehension | LVCHAT | 02/2024 | code | arXiv |
| OSCaR: Object State Captioning and State Change Representation | OSCaR | 02/2024 | code | NAACL |
| Slot-VLM SlowFast slots for video-language modeling | Slot-VLM | 02/2024 | code | arXiv |
| COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training | COSMO | 01/2024 | code | arXiv |
| Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering | GCG | 01/2024 | code | ACMMM |
| Audio-Visual LLM for Video Understanding | AV-LLM | 12/2023 | code | arXiv |
| Generative Multimodal Models are In-Context Learners | Emu2 | 12/2023 | project page | CVPR |
| MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples | MMICT | 12/2023 | code | TOMM |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | VaQuitA | 12/2023 | code | arXiv |
| VILA: On Pre-training for Visual Language Models | VILA | 12/2023 | code | CVPR |
| Vista-LLaMA reliable video narrator via equal distance to visual tokens | Vista-LLaMA | 12/2023 | project page | arXiv |
| Chat-UniVi unified visual representation empowers large language models with image and video understanding | Chat-UniVi | 11/2023 | code | CVPR |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| Video-LLaVA learning united visual representation by alignment before projection | Video-LLaVA | 11/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | CVPR |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | project page | arXiv |
| Valley: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | arXiv |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | ACL |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | EMNLP |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | ChatBridge | 05/2023 | code | arXiv |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 05/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |
LLM as Regressor
<!-- | [**title**](link) | model | date | [code](link) | venue | -->
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM | Holmes-VAD | 06/2024 | code | arXiv |
| VideoLLM-online online video large language model for streaming video | VideoLLM-online | 06/2024 | code | CVPR |
| HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision | VLM4HOI | 04/2024 | project page | arXiv |
| V2Xum-LLM cross-modal video summarization with temporal prompt instruction tuning | V2Xum-LLaMA | 04/2024 | code | arXiv |
| AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue | AVicuna | 03/2024 | code | arXiv |
| Elysium exploring object-level perception in videos via MLLM | Elysium | 03/2024 | code | arXiv |
| HawkEye training video-text LLMs for grounding text in videos | HawkEye | 03/2024 | code | arXiv |
| LITA language instructed temporal-localization assistant | LITA | 03/2024 | code | arXiv |
| OmniViD: A Generative Framework for Universal Video Understanding | OmniViD | 03/2024 | code | CVPR |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model | GroundingGPT | 01/2024 | [code](https://github.com/lzw-lzw/GroundingGPT) | arXiv |
| TimeChat a time-sensitive multimodal large language model for long video understanding | TimeChat | 12/2023 | code | CVPR |
| Self-Chained Image-Language Model for Video Localization and Question Answering | SeViLA | 11/2023 | code | NeurIPS |
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
LLM as Hidden Layer
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| VTG-LLM integrating timestamp knowledge into video LLMs for enhanced video temporal grounding | VTG-LLM | 05/2024 | code | arXiv |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | VITRON | 04/2024 | project page | NeurIPS |
| VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT | VTG-GPT | 03/2024 | code | arXiv |
| Momentor advancing video large language model with fine-grained temporal reasoning | Momentor | 02/2024 | code | ICML |
| Detours for navigating instructional videos | VidDetours | 01/2024 | code | CVPR |
| OneLLM: One Framework to Align All Modalities with Language | OneLLM | 12/2023 | code | arXiv |
| GPT4Video a unified multimodal large language model for instruction-followed understanding and safety-aware generation | GPT4Video | 11/2023 | code | ACMMM |

🧭 (Analyzer + Embedder) × LLM

LLM as Manager
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
LLM as Summarizer
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Shot2Story20K a new benchmark for comprehensive understanding of multi-shot videos | SUM-shot | 12/2023 | code | arXiv |
LLM as Regressor
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Vript: A Video Is Worth Thousands of Words | Vriptor | 06/2024 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | project page | ECCV |
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code | arXiv |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
LLM as Text Decoder
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Contextual AD Narration with Interleaved Multimodal Sequence | Uni-AD | 03/2024 | code | arXiv |
| MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning | MM-Narrator | 11/2023 | project page | arXiv |
| Vamos: Versatile Action Models for Video Understanding | Vamos | 11/2023 | project page | ECCV |
| AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | project page | ICCV |
LLM as Hidden Layer
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |

πŸ—’οΈ Taxonomy 2

🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | - | arXiv |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | NaVid | 02/2024 | project page | RSS |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | VideoINSTA | 09/2024 | code | EMNLP |

🎥 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens | VISTA-LLAMA | 12/2023 | - | arXiv |
| Audio-Visual LLM for Video Understanding | - | 12/2023 | - | arXiv |
| AutoAD: Movie Description in Context | AutoAD | 06/2023 | code | CVPR |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| AutoAD III: The Prequel -- Back to the Pixels | AutoAD III | 04/2024 | - | CVPR |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |
| VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | VideoLLaMA2 | 06/2024 | code | arXiv |
Fine-tuning with Insertive Adapters
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |
Fine-tuning with Hybrid Adapters
| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | CVPR |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |

💎 Training-free Methods

| Title | Model | Date | Code | Venue |
| --- | --- | --- | --- | --- |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | SlowFast-LLaVA | 07/2024 | - | arXiv |
| TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | TS-LLaVA | 11/2024 | code | arXiv |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Name | Paper | Date | Link | Venue |
| --- | --- | --- | --- | --- |
| Charades | Hollywood in homes: Crowdsourcing data collection for activity understanding | 2016 | Link | ECCV |
| YouTube8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
| ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
| Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
| Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |

Captioning and Description

| Name | Paper | Date | Link | Venue |
| --- | --- | --- | --- | --- |
| Microsoft Research Video Description Corpus (MSVD) | Collecting Highly Parallel Data for Paraphrase Evaluation | 2011 | Link | ACL |
| Microsoft Research Video-to-Text (MSR-VTT) | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 2016 | Link | CVPR |
| Tumblr GIF (TGIF) | TGIF: A New Dataset and Benchmark on Animated GIF Description | 2016 | Link | CVPR |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| Charades-Ego | Actor and Observer: Joint Modeling of First and Third-Person Videos | 2018 | Link | CVPR |
| ActivityNet Captions | Dense-Captioning Events in Videos | 2017 | Link | ICCV |
| HowTo100M | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019 | Link | ICCV |
| Movie Audio Descriptions (MAD) | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | 2021 | Link | CVPR |
| YouCook2 | Towards Automatic Learning of Procedures from Web Instructional Videos | 2017 | Link | AAAI |
| MovieNet | MovieNet: A Holistic Dataset for Movie Understanding | 2020 | Link | ECCV |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| Video Timeline Tags (ViTT) | Multimodal Pretraining for Dense Video Captioning | 2020 | Link | AACL-IJCNLP |
| TVSum | TVSum: Summarizing web videos using titles | 2015 | Link | CVPR |
| SumMe | Creating Summaries from User Videos | 2014 | Link | ECCV |
| VideoXum | VideoXum: Cross-modal Visual and Textural Summarization of Videos | 2023 | Link | IEEE Trans Multimedia |
| Multi-Source Video Captioning (MSVC) | VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024 | Link | arXiv |

Grounding and Retrieval

| Name | Paper | Date | Link | Venue |
| --- | --- | --- | --- | --- |
| Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
| VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
| Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
| Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
| Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
| DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |

Question Answering

| Name | Paper | Date | Link | Venue |
| --- | --- | --- | --- | --- |
| MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
| ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
| Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
| TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |
| MAD-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |
| Ego-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |

Video Instruction Tuning

Pretraining Dataset
| Name | Paper | Date | Link | Venue |
| --- | --- | --- | --- | --- |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |
Fine-tuning Dataset
| Name | Paper | Date | Link | Venue |
| --- | --- | --- | --- | --- |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
| VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |
| TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023 | Link | CVPR |

Video-based Large Language Models Benchmark

| Title | Date | Code | Venue |
| --- | --- | --- | --- |
| LVBench: An Extreme Long Video Understanding Benchmark | 06/2024 | code | - |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | 11/2023 | code | - |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | 05/2023 | code | NeurIPS 2023, ICCV 2023 Workshop |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 07/2023 | code | - |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | 11/2023 | code | NeurIPS 2023 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | 12/2023 | code | - |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 12/2023 | code | - |
| TempCompass: Do Video LLMs Really Understand Videos? | 03/2024 | code | ACL 2024 |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | 06/2024 | code | - |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | 06/2024 | code | - |

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and other helpful materials, or to correct any errors you may find. Please make sure your pull requests follow the `Title | Model | Date | Code | Venue` table format (an example row is shown below). Thank you for your valuable contributions!
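For illustration, a new entry would be a single Markdown table row like the sketch below; the paper title, model name, date, and URLs here are placeholders, not a real paper:

```markdown
| [**An Example Vid-LLM Paper Title**](https://arxiv.org/abs/0000.00000) | ExampleModel | 01/2025 | [code](https://github.com/username/example-repo) | arXiv |
```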

🌟 Star History

Star History Chart

♥️ Contributors

Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.

Yunlong Tang @ University of Rochester
Jing Bi @ University of Rochester
Siting Xu @ Southern University of Science and Technology
Luchuan Song @ University of Rochester
Susan Liang @ University of Rochester
Teng Wang @ The University of Hong Kong
Daoan Zhang @ University of Rochester
Jie An @ University of Rochester
Jingyang Lin @ University of Rochester
Rongyi Zhu @ University of Rochester
Ali Vosoughi @ University of Rochester
Chao Huang @ University of Rochester
Zeliang Zhang @ University of Rochester
Pinxin Liu @ University of Rochester
Mingqian Feng @ University of Rochester
Feng Zheng @ Southern University of Science and Technology
Jianguo Zhang @ Southern University of Science and Technology
Ping Luo @ University of Hong Kong
Jiebo Luo @ University of Rochester
Chenliang Xu @ University of Rochester

<a href="https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding/graphs/contributors"> <img src="https://contrib.rocks/image?repo=yunlong10/Awesome-LLMs-for-Video-Understanding" /> </a>