Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Video Understanding with Large Language Models: A Survey

Yunlong Tang1,*, Jing Bi1,*, Siting Xu2,*, Luchuan Song1, Susan Liang1, Teng Wang2,3, Daoan Zhang1, Jie An1, Jingyang Lin1, Rongyi Zhu1, Ali Vosoughi1, Chao Huang1, Zeliang Zhang1, Pinxin Liu1, Mingqian Feng1, Feng Zheng2, Jianguo Zhang2, Ping Luo3, Jiebo Luo1, Chenliang Xu1,†. (*Core Contributors, †Corresponding Authors)

1University of Rochester, 2Southern University of Science and Technology, 3The University of Hong Kong

Paper | Project Page

📢 News

[07/23/2024]

📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:

✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
✅ All figures and tables have been redesigned.

Several minor updates will follow this major one, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️

<center> Table of Contents </center>

Awesome-LLMs-for-Video-Understanding

Why do we need Vid-LLMs?

😎 Vid-LLMs: Models

📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🗒️ Taxonomy 1

🕹️ Video Analyzer × LLM

LLM as Summarizer

Title	Model	Date	Code	Venue
Seeing the Unseen: Visual Metaphor Captioning for Videos	GIT-LLaVA	06/2024	code	arXiv
Zero-shot long-form video understanding through screenplay	MM-Screenplayer	06/2024	project page	CVPR
MoReVQA exploring modular reasoning models for video question answering	MoReVQA	04/2024	project page	CVPR
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	IG-VLM	03/2024	code	arXiv
Language repository for long video understanding	LangRepo	03/2024	code	arXiv
Understanding long videos in one multimodal language model pass	MVU	03/2024	code	arXiv
Video ReCap recursive captioning of hour-long videos	Video ReCap	02/2024	code	CVPR
A Simple LLM Framework for Long-Range Video Question-Answering	LLoVi	12/2023	code	arXiv
Grounding-prompter prompting LLM with multimodal information for temporal sentence grounding in long videos	Grounding-prompter	12/2023	code	arXiv
Learning object state changes in videos an open-world perspective	VIDOSC	12/2023	code	CVPR
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?	AntGPT	07/2023	code	ICLR
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	VAST	05/2023	code	NeurIPS
VLog: Video as a Long Document	VLog	04/2023	code	-
Learning Video Representations from Large Language Models	LaViLa	12/2022	code	CVPR

LLM as Manager

Title	Model	Date	Code	Venue
DrVideo: Document Retrieval Based Long Video Understanding	DrVideo	06/2024	code	arXiv
OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer	OmAgent	06/2024	code	arXiv
Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA	LVNet	06/2024	code	arXiv
VideoTree adaptive tree-based video representation for LLM reasoning on long videos	VideoTree	05/2024	code	arXiv
Harnessing Large Language Models for Training-free Video Anomaly Detection	LAVAD	04/2024	code	CVPR
TraveLER a multi-LMM agent framework for video question-answering	TraveLER	04/2024	code	arXiv
GPTSee enhancing moment retrieval and highlight detection via description-based similarity features	GPTSee	03/2024	code	arXiv
Reframe anything LLM agent for open world video reframing	RAVA	03/2024	code	arXiv
SCHEMA state CHangEs MAtter for procedure planning in instructional videos	SCHEMA	03/2024	code	ICLR
TV-TREES multimodal entailment trees for neuro-symbolic video reasoning	TV-TREES	02/2024	code	arXiv
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding	VideoAgent	03/2024	project page	arXiv
VideoAgent long-form video understanding with large language model as agent	VideoAgent	03/2024	code	arXiv
VURF a general-purpose reasoning and self-refinement framework for video understanding	VURF	03/2024	code	arXiv
Why not use your textbook knowledge-enhanced procedure planning of instructional videos	KEPP	03/2024	code	CVPR
DoraemonGPT toward understanding dynamic scenes with large language models	DoraemonGPT	01/2024	code	arXiv
LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos	LifelongMemory	12/2023	code	arXiv
Zero-Shot Video Question Answering with Procedural Programs	ProViQ	12/2023	code	arXiv
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn	AssistGPT	06/2023	code	arXiv
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System	ChatVideo	04/2023	project page	arXiv
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions	Video ChatCaptioner	04/2023	code	arXiv
ViperGPT: Visual Inference via Python Execution for Reasoning	ViperGPT	03/2023	code	arXiv
Hawk: Learning to Understand Open-World Video Anomalies	Hawk	05/2024	code	arXiv

👾 Video Embedder × LLM

LLM as Text Decoder

Title	Model	Date	Code	Venue
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark	AuroraCap	10/2024	project page	arXiv
Artemis towards referential understanding in complex videos	Artemis	06/2024	code	arXiv
EmoLLM multimodal emotional understanding meets large language models	EmoLLM	06/2024	code	arXiv
Fewer tokens and fewer videos extending video understanding abilities in large vision-language models	FTFV-LLM	06/2024	-	arXiv
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	Flash-VStream	06/2024	code	arXiv
LLAVIDAL benchmarking large language vision models for daily activities of living	LLAVIDAL	06/2024	code	arXiv
Long context transfer from language to vision	LongVA	06/2024	code	arXiv
ShareGPT4Video improving video understanding and generation with better captions	ShareGPT4Video	06/2024	code	arXiv
Towards event-oriented long video understanding	VIM	06/2024	code	arXiv
Video-SALMONN speech-enhanced audio-visual large language models	Video-SALMONN	06/2024	code	ICML
VideoGPT+ integrating image and video encoders for enhanced video understanding	VideoGPT+	06/2024	code	arXiv
VideoLLaMA 2 advancing spatial-temporal modeling and audio understanding in video-LLMs	VideoLLaMA 2	06/2024	code	arXiv
MotionLLM: Understanding Human Behaviors from Human Motions and Videos	MotionLLM	05/2024	project page	arXiv
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	VideoChat2	11/2023	code	CVPR
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization	Shotluck Holmes	05/2024	-	arXiv
Streaming long video understanding with large language models	VideoStreaming	05/2024	-	arXiv
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline	VideoNarrator	05/2024	-	arXiv
TOPA extend large language models for video understanding via text-only pre-alignment	TOPA	05/2024	code	NeurIPS
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering	MovieChat+	04/2024	code	arXiv
AutoAD III: The Prequel – Back to the Pixels	AutoAD III	04/2024	project page	CVPR
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward	LLaVA-Hound-DPO	04/2024	code	arXiv
From image to video, what do we need in multimodal LLMs	RED-VILLM	04/2024	-	arXiv
Koala key frame-conditioned long video-LLM	Koala	04/2024	project page	CVPR
LongVLM efficient long video understanding via large language models	LongVLM	04/2024	code	ECCV
MA-LMM memory-augmented large multimodal model for long-term video understanding	MA-LMM	04/2024	code	CVPR
MiniGPT4-video advancing multimodal LLMs for video understanding with interleaved visual-textual tokens	MiniGPT4-Video	04/2024	code	arXiv
Pegasus-v1 technical report	Pegasus-v1	04/2024	code	arXiv
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	PLLaVA	04/2024	code	arXiv
ST-LLM: Large Language Models Are Effective Temporal Learners	ST-LLM	04/2024	code	arXiv
Tarsier recipes for training and evaluating large video description models	Tarsier	07/2024	code	arXiv
X-VARS introducing explainability in football refereeing with multi-modal large language model	X-VARS	04/2024	code	arXiv
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios	CAT	03/2024	code	arXiv
InternVideo2 scaling video foundation models for multimodal video understanding	InternVideo2	03/2024	code	ECCV
MovieLLM enhancing long video understanding with AI-generated movies	MovieLLM	03/2024	code	arXiv
LLMs meet long video advancing long video comprehension with an interactive visual adapter in LLMs	IVAwithLLM	02/2024	code	arXiv
LSTP language-guided spatial-temporal prompt learning for long-form video-text understanding	LSTP	02/2024	code	EMNLP
LVCHAT facilitating long video comprehension	LVCHAT	02/2024	code	arXiv
OSCaR: Object State Captioning and State Change Representation	OSCaR	02/2024	code	NAACL
Slot-VLM SlowFast slots for video-language modeling	Slot-VLM	02/2024	code	arXiv
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training	COSMO	01/2024	code	arXiv
Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering	GCG	01/2024	code	ACMMM
Audio-Visual LLM for Video Understanding	AV-LLM	12/2023	code	arXiv
Generative Multimodal Models are In-Context Learners	Emu2	12/2023	project page	CVPR
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples	MMICT	12/2023	code	TOMM
VaQuitA : Enhancing Alignment in LLM-Assisted Video Understanding	VaQuitA	12/2023	code	arXiv
VILA: On Pre-training for Visual Language Models	VILA	12/2023	code	CVPR
Vista-LLaMA reliable video narrator via equal distance to visual tokens	Vista-LLaMA	12/2023	project page	arXiv
Chat-UniVi unified visual representation empowers large language models with image and video understanding	Chat-UniVi	11/2023	code	CVPR
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	LLaMA-VID	11/2023	code	arXiv
Video-LLaVA learning united visual representation by alignment before projection	Video-LLaVA	11/2023	code	arXiv
Large Language Models are Temporal and Causal Reasoners for Video Question Answering	LLaMA-VQA	10/2023	code	EMNLP
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	MovieChat	07/2023	code	CVPR
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning	LLMVA-GEBC	06/2023	code	CVPR
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	Macaw-LLM	06/2023	project page	arXiv
Valley: Video Assistant with Large Language model Enhanced abilitY	VALLEY	06/2023	code	arXiv
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Video-ChatGPT	06/2023	code	ACL
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	Video-LLaMA	06/2023	code	EMNLP
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	mPLUG-video	06/2023	code	arXiv
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst	ChatBridge	05/2023	code	arXiv
Otter: A Multi-Modal Model with In-Context Instruction Tuning	Otter	05/2023	code	arXiv
VideoLLM: Modeling Video Sequence with Large Language Models	VideoLLM	05/2023	code	arXiv

LLM as Regressor

Title	Model	Date	Code	Venue
Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM	Holmes-VAD	06/2024	code	arXiv
VideoLLM-online online video large language model for streaming video	VideoLLM-online	06/2024	code	CVPR
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision	VLM4HOI	04/2024	project page	arXiv
V2Xum-LLM cross-modal video summarization with temporal prompt instruction tuning	V2Xum-LLaMA	04/2024	code	arXiv
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue	AVicuna	03/2024	code	arXiv
Elysium exploring object-level perception in videos via MLLM	Elysium	03/2024	code	arXiv
HawkEye training video-text LLMs for grounding text in videos	HawkEye	03/2024	code	arXiv
LITA language instructed temporal-localization assistant	LITA	03/2024	code	arXiv
OmniViD: A Generative Framework for Universal Video Understanding	OmniViD	03/2024	code	CVPR
GroundingGPT: Language Enhanced Multi-modal Grounding Model	GroundingGPT	01/2024	[code](https://github.com/lzw-lzw/GroundingGPT)	arXiv
TimeChat a time-sensitive multimodal large language model for long video understanding	TimeChat	12/2023	code	CVPR
Self-Chained Image-Language Model for Video Localization and Question Answering	SeViLA	11/2023	code	NeurIPS
VTimeLLM: Empower LLM to Grasp Video Moments	VTimeLLM	11/2023	code	arXiv

LLM as Hidden Layer

Title	Model	Date	Code	Venue
VTG-LLM integrating timestamp knowledge into video LLMs for enhanced video temporal grounding	VTG-LLM	05/2024	code	arXiv
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	VITRON	04/2024	project page	NeurIPS
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT	VTG-GPT	03/2024	code	arXiv
Momentor advancing video large language model with fine-grained temporal reasoning	Momentor	02/2024	code	ICML
Detours for navigating instructional videos	VidDetours	01/2024	code	CVPR
OneLLM: One Framework to Align All Modalities with Language	OneLLM	12/2023	code	arXiv
GPT4Video a unified multimodal large language model for instruction-followed understanding and safety-aware generation	GPT4Video	11/2023	code	ACMMM

🧭 (Analyzer + Embedder) × LLM

LLM as Manager

Title	Model	Date	Code	Venue
MM-VID: Advancing Video Understanding with GPT-4V(ision)	MM-VID	10/2023	-	arXiv

LLM as Summarizer

Title	Model	Date	Code	Venue
Shot2Story20K a new benchmark for comprehensive understanding of multi-shot videos	SUM-shot	12/2023	code	arXiv

LLM as Regressor

Title	Model	Date	Code	Venue
Vript: A Video Is Worth Thousands of Words	Vriptor	06/2024	code	NeurIPS
Merlin: Empowering Multimodal LLMs with Foresight Minds	Merlin	12/2023	project page	ECCV
VideoChat: Chat-Centric Video Understanding	VideoChat	05/2023	code	arXiv
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	Vid2Seq	02/2023	code	CVPR

LLM as Text Decoder

Title	Model	Date	Code	Venue
Contextual AD Narration with Interleaved Multimodal Sequence	Uni-AD	03/2024	code	arXiv
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning	MM-Narrator	11/2023	project page	arXiv
Vamos: Versatile Action Models for Video Understanding	Vamos	11/2023	project page	ECCV
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description	AutoAD II	10/2023	project page	ICCV

LLM as Hidden Layer

Title	Model	Date	Code	Venue
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	PG-Video-LLaVA	11/2023	code	arXiv

🗒️ Taxonomy 2

🤖 LLM-based Video Agents

Title	Model	Date	Code	Venue
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language	Socratic Models	04/2022	project page	arXiv
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions	Video ChatCaptioner	04/2023	code	arXiv
VLog: Video as a Long Document	VLog	04/2023	code	-
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System	ChatVideo	04/2023	project page	arXiv
MM-VID: Advancing Video Understanding with GPT-4V(ision)	MM-VID	10/2023	-	arXiv
MISAR: A Multimodal Instructional System with Augmented Reality	MISAR	10/2023	project page	ICCV
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos	Grounding-Prompter	12/2023	-	arXiv
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation	NaVid	02/2024	project page	RSS
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding	VideoAgent	03/2024	project page	arXiv
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs	VideoINSTA	09/2024	code	EMNLP

🎥 Vid-LLM Pretraining

Title	Model	Date	Code	Venue
Learning Video Representations from Large Language Models	LaViLa	12/2022	code	CVPR
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	Vid2Seq	02/2023	code	CVPR
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	VAST	05/2023	code	NeurIPS
Merlin: Empowering Multimodal LLMs with Foresight Minds	Merlin	12/2023	-	arXiv

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

Title	Model	Date	Code	Venue
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	Video-LLaMA	06/2023	code	arXiv
VALLEY: Video Assistant with Large Language model Enhanced abilitY	VALLEY	06/2023	code	-
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Video-ChatGPT	06/2023	code	arXiv
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	Macaw-LLM	06/2023	code	arXiv
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning	LLMVA-GEBC	06/2023	code	CVPR
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	mPLUG-video	06/2023	code	arXiv
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	MovieChat	07/2023	code	arXiv
Large Language Models are Temporal and Causal Reasoners for Video Question Answering	LLaMA-VQA	10/2023	code	EMNLP
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	Video-LLaVA	11/2023	code	arXiv
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	Chat-UniVi	11/2023	code	arXiv
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	LLaMA-VID	11/2023	code	arXiv
VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens	VISTA-LLAMA	12/2023	-	arXiv
Audio-Visual LLM for Video Understanding	-	12/2023	-	arXiv
AutoAD: Movie Description in Context	AutoAD	06/2023	code	CVPR
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description	AutoAD II	10/2023	-	ICCV
AutoAD III: The Prequel -- Back to the Pixels	AutoAD III	04/2024	-	CVPR
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models	FAVOR	10/2023	code	arXiv
VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	VideoLLaMA2	06/2024	code	arXiv

Fine-tuning with Insertive Adapters

Title	Model	Date	Code	Venue
Otter: A Multi-Modal Model with In-Context Instruction Tuning	Otter	05/2023	code	arXiv
VideoLLM: Modeling Video Sequence with Large Language Models	VideoLLM	05/2023	code	arXiv

Fine-tuning with Hybrid Adapters

Title	Model	Date	Code	Venue
VTimeLLM: Empower LLM to Grasp Video Moments	VTimeLLM	11/2023	code	arXiv
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation	GPT4Video	11/2023	-	arXiv

🦾 Hybrid Methods

Title	Model	Date	Code	Venue
VideoChat: Chat-Centric Video Understanding	VideoChat	05/2023	code demo	arXiv
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	PG-Video-LLaVA	11/2023	code	arXiv
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	TimeChat	12/2023	code	CVPR
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Video-GroundingDINO	12/2023	code	arXiv
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot	Video4096	05/2023	-	EMNLP

💎 Training-free Methods

Title	Model	Date	Code	Venue
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	SlowFast-LLaVA	07/2024	-	arXiv
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	TS-LLaVA	11/2024	code	arXiv

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

Name	Paper	Date	Link	Venue
Charades	Hollywood in homes: Crowdsourcing data collection for activity understanding	2016	Link	ECCV
YouTube8M	YouTube-8M: A Large-Scale Video Classification Benchmark	2016	Link	-
ActivityNet	ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding	2015	Link	CVPR
Kinetics-GEBC	GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval	2022	Link	ECCV
Kinetics-400	The Kinetics Human Action Video Dataset	2017	Link	-
VidChapters-7M	VidChapters-7M: Video Chapters at Scale	2023	Link	NeurIPS

Captioning and Description

Name	Paper	Date	Link	Venue
Microsoft Research Video Description Corpus (MSVD)	Collecting Highly Parallel Data for Paraphrase Evaluation	2011	Link	ACL
Microsoft Research Video-to-Text (MSR-VTT)	MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	2016	Link	CVPR
Tumblr GIF (TGIF)	TGIF: A New Dataset and Benchmark on Animated GIF Description	2016	Link	CVPR
Charades	Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding	2016	Link	ECCV
Charades-Ego	Actor and Observer: Joint Modeling of First and Third-Person Videos	2018	Link	CVPR
ActivityNet Captions	Dense-Captioning Events in Videos	2017	Link	ICCV
HowTo100m	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019	Link	ICCV
Movie Audio Descriptions (MAD)	MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions	2021	Link	CVPR
YouCook2	Towards Automatic Learning of Procedures from Web Instructional Videos	2017	Link	AAAI
MovieNet	MovieNet: A Holistic Dataset for Movie Understanding	2020	Link	ECCV
Youku-mPLUG	Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	2023	Link	arXiv
Video Timeline Tags (ViTT)	Multimodal Pretraining for Dense Video Captioning	2020	Link	AACL-IJCNLP
TVSum	TVSum: Summarizing web videos using titles	2015	Link	CVPR
SumMe	Creating Summaries from User Videos	2014	Link	ECCV
VideoXum	VideoXum: Cross-modal Visual and Textural Summarization of Videos	2023	Link	IEEE Trans Multimedia
Multi-Source Video Captioning (MSVC)	VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024	Link	arXiv

Grounding and Retrieval

Name	Paper	Date	Link	Venue
Epic-Kitchens-100	Rescaling Egocentric Vision	2021	Link	IJCV
VCR (Visual Commonsense Reasoning)	From Recognition to Cognition: Visual Commonsense Reasoning	2019	Link	CVPR
Ego4D-MQ and Ego4D-NLQ	Ego4D: Around the World in 3,000 Hours of Egocentric Video	2021	Link	CVPR
Vid-STG	Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences	2020	Link	CVPR
Charades-STA	TALL: Temporal Activity Localization via Language Query	2017	Link	ICCV
DiDeMo	Localizing Moments in Video with Natural Language	2017	Link	ICCV

Question Answering

Name	Paper	Date	Link	Venue
MSVD-QA	Video Question Answering via Gradually Refined Attention over Appearance and Motion	2017	Link	ACM Multimedia
MSRVTT-QA	Video Question Answering via Gradually Refined Attention over Appearance and Motion	2017	Link	ACM Multimedia
TGIF-QA	TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering	2017	Link	CVPR
ActivityNet-QA	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	2019	Link	AAAI
Pororo-QA	DeepStory: Video Story QA by Deep Embedded Memory Networks	2017	Link	IJCAI
TVQA	TVQA: Localized, Compositional Video Question Answering	2018	Link	EMNLP
MAD-QA	Encoding and Controlling Global Semantics for Long-form Video Question Answering	2024	Link	EMNLP
Ego-QA	Encoding and Controlling Global Semantics for Long-form Video Question Answering	2024	Link	EMNLP

Video Instruction Tuning

Pretraining Dataset

Name	Paper	Date	Link	Venue
VidChapters-7M	VidChapters-7M: Video Chapters at Scale	2023	Link	NeurIPS
VALOR-1M	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023	Link	arXiv
Youku-mPLUG	Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	2023	Link	arXiv
InternVid	InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	2023	Link	arXiv
VAST-27M	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023	Link	NeurIPS

Fine-tuning Dataset

Name	Paper	Date	Link	Venue
MIMIC-IT	MIMIC-IT: Multi-Modal In-Context Instruction Tuning	2023	Link	arXiv
VideoInstruct100K	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023	Link	arXiv
TimeIT	TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	2023	Link	CVPR

Video-based Large Language Models Benchmark

Title	Date	Code	Venue
LVBench: An Extreme Long Video Understanding Benchmark	06/2024	code	-
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models	11/2023	code	-
Perception Test: A Diagnostic Benchmark for Multimodal Video Models	05/2023	code	NeurIPS 2023, ICCV 2023 Workshop
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	06/2023	code	-
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation	11/2023	code	NeurIPS 2023
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding	12/2023	code	-
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	11/2023	code	-
TempCompass: Do Video LLMs Really Understand Videos?	03/2024	code	ACL 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis	06/2024	code	-
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models	06/2024	code	-

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors that you may find. Please make sure that your pull requests follow the "Title|Model|Date|Code|Venue" format. Thank you for your valuable contributions!
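
Before opening a pull request, a quick local check of a new row against the "Title|Model|Date|Code|Venue" format can catch missing cells. The snippet below is only a minimal sketch and not part of this repository's tooling: the `check_row` helper is hypothetical, and the MM/YYYY date rule is an assumption inferred from the model tables above.

```python
import re

# Column order taken from the format string in the contributing note.
COLUMNS = ["Title", "Model", "Date", "Code", "Venue"]

# Assumption: dates follow the MM/YYYY pattern used in the model tables.
DATE_PATTERN = re.compile(r"^(0[1-9]|1[0-2])/\d{4}$")


def check_row(row: str) -> list[str]:
    """Return a list of problems found in one pipe-separated table row."""
    # Drop optional leading/trailing pipes, then split into trimmed cells.
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} cells, found {len(cells)}"]
    problems = []
    entry = dict(zip(COLUMNS, cells))
    if not entry["Title"]:
        problems.append("Title is empty")
    if not DATE_PATTERN.match(entry["Date"]):
        problems.append(f"Date {entry['Date']!r} is not MM/YYYY")
    return problems


if __name__ == "__main__":
    row = "| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code | arXiv |"
    print(check_row(row) or "row looks fine")
```

The sketch assumes no cell contains a literal `|`; a title or link with that character would need escaping before this check is meaningful.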

♥️ Contributors

Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.

Yunlong Tang @ University of Rochester
Jing Bi @ University of Rochester
Siting Xu @ Southern University of Science and Technology
Luchuan Song @ University of Rochester
Susan Liang @ University of Rochester
Teng Wang @ The University of Hong Kong
Daoan Zhang @ University of Rochester
Jie An @ University of Rochester
Jingyang Lin @ University of Rochester
Rongyi Zhu @ University of Rochester
Ali Vosoughi @ University of Rochester
Chao Huang @ University of Rochester
Zeliang Zhang @ University of Rochester
Pinxin Liu @ University of Rochester
Mingqian Feng @ University of Rochester
Feng Zheng @ Southern University of Science and Technology
Jianguo Zhang @ Southern University of Science and Technology
Ping Luo @ The University of Hong Kong
Jiebo Luo @ University of Rochester
Chenliang Xu @ University of Rochester