Foundation Models for Video Understanding: A Survey

Neelu Madan, Andreas Moegelmose, Rajat Modi, Yogesh S. Rawat, Thomas B. Moeslund

Motivation

The field of video understanding is advancing rapidly, as evidenced by the growing number of research publications on various video understanding tasks (figure below). This growth coincides with the development of large-scale pretraining techniques, which have demonstrated remarkable capabilities in adapting to diverse tasks with minimal additional training and robust generalization. Researchers are therefore actively investigating how these foundation models can address a broad spectrum of video understanding challenges. We surveyed more than 200 foundation models, analyzing their performance across several common video tasks. Our survey, Foundation Models for Video Understanding: A Survey, also provides an overview of 16 video understanding tasks, including their benchmarks and evaluation metrics.

[Figure: growth in the number of research publications on video understanding tasks over recent years]
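
Many of the image-based approaches catalogued below adapt a frozen image-text model such as CLIP to video by encoding sampled frames independently, pooling the frame embeddings over time, and matching the pooled embedding against text embeddings of candidate labels. The snippet below is a minimal, hypothetical sketch of that zero-shot recipe, assuming OpenAI's reference `clip` package and a list of pre-extracted PIL frames; the function name is illustrative and it does not reproduce any specific method surveyed here.

```python
# A minimal, hypothetical sketch of the common "image-based" adaptation recipe:
# encode sampled frames with a frozen CLIP image encoder, mean-pool them over
# time, and score the pooled video embedding against text embeddings of
# candidate action labels (zero-shot). Assumes OpenAI's reference `clip`
# package (pip install git+https://github.com/openai/CLIP) and pre-extracted
# PIL frames; not the implementation of any specific paper listed here.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_video_zero_shot(frames: list[Image.Image], class_names: list[str]) -> str:
    """Zero-shot video classification via temporal mean pooling of frame embeddings."""
    # Encode each sampled frame independently with the frozen image encoder.
    frame_batch = torch.stack([preprocess(f) for f in frames]).to(device)
    prompts = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(frame_batch)       # (T, D) per-frame features
        video_feat = frame_feats.mean(dim=0, keepdim=True)  # (1, D) temporal mean pooling
        text_feats = model.encode_text(prompts)             # (C, D) one embedding per label
    # Cosine similarity between the pooled video embedding and each label prompt.
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = (video_feat @ text_feats.T).squeeze(0)
    return class_names[scores.argmax().item()]
```

Most of the image-based entries below can be read as refinements of this recipe, e.g., replacing mean pooling with temporal modules, tuning lightweight adapters or prompts instead of the full backbone, or distilling the result into a video-native model.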

Main Classification

[Figure: classification of video foundation models into image-based, video-based, and universal categories]

Menu

Image-based

Distilling Vision-Language Models on Millions of Videos. (Distill-VLM). [CVPR, 2024]. <br> Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan.<br> [Paper]

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model. (COSA). [ICLR, 2024]. <br> Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu.<br> [Paper] [Code]

FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition. (FROSTER). [ICLR, 2024]. <br> Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han.<br> [Paper] [Code]

EZ-CLIP: Efficient Zeroshot Video Action Recognition. (EZ-CLIP). [arxiv, 2024]. <br> Shahzad Ahmad, Sukalpa Chanda, Yogesh S Rawat.<br> [Paper] [Code]

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition. (M2-CLIP). [arxiv, 2024]. <br> Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu.<br> [Paper]

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter. (PaLM2-VAdapter). [arxiv, 2024]. <br> Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang.<br> [Paper]

Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering. (Q-ViD). [arxiv, 2024]. <br> David Romero, Thamar Solorio.<br> [Paper]

Revealing Single Frame Bias for Video-and-Language Learning. (Singularity). [ACL, 2023]. <br> Jie Lei, Tamara L Berg, Mohit Bansal.<br> [Paper] [Code]

AdaCLIP: Towards Pragmatic Multimodal Video Retrieval. (AdaCLIP). [ACM-MM, 2023]. <br> Zhiming Hu, Angela Ning Ye, Salar Hosseini Khorasgani, Iqbal Mohomed.<br> [Paper] [Code]

CLIP4Caption: CLIP for Video Caption. (CLIP4Caption). [ACM-MM, 2023]. <br> Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li.<br> [Paper]

RTQ: Rethinking Video-language Understanding Based on Image-text Model. (RTQ). [ACM Multimedia, 2023]. <br> Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao.<br> [Paper] [Code]

Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning. [ACM Multimedia, 2023]. <br> Qiang Wang, Junlong Du, Ke Yan, Shouhong Ding.<br> [Paper]

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. (VideoLDM). [CVPR, 2023]. <br> Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis.<br> [Paper]

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models. (BIKE). [CVPR, 2023]. <br> Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang.<br> [Paper] [Code]

Dual-path Adaptation from Image to Video Transformers. (DualPath). [CVPR, 2023]. <br> Jungin Park, Jiyoung Lee, Kwanghoon Sohn.<br> [Paper] [Code]

Fine-tuned CLIP Models are Efficient Video Learners. (ViFi-CLIP). [CVPR, 2023]. <br> Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan.<br> [Paper] [Code]

VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval. (VoP). [CVPR, 2023]. <br> Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang.<br> [Paper] [Code]

Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting. (Vita-CLIP). [CVPR, 2023]. <br> Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah.<br> [Paper] [Code]

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning. (DiST). [ICCV, 2023]. <br> Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang.<br> [Paper] [Code]

Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models. (pyoco). [ICCV, 2023]. <br> Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji.<br> [Paper] [Demo]

Unmasked Teacher: Towards Training-Efficient Video Foundation Models. (UMT). [ICCV, 2023]. <br> Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao.<br> [Paper] [Code]

Alignment and Generation Adapter for Efficient Video-text Understanding. (AG-Adapter). [ICCVW, 2023]. <br> Han Fang, Zhifei Yang, Yuhan Wei, Xianghao Zang, Chao Ban, Zerun Feng, Zhongjiang He, Yongxiang Li, Hao Sun.<br> [Paper]

AIM: Adapting Image Models for Efficient Video Action Recognition. (AIM). [ICLR, 2023]. <br> Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, Mu Li.<br> [Paper] [Code]

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge. (MAXI). [ICCV, 2023]. <br> Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof.<br> [Paper] [Code]

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval. (PromptSwitch). [ICCV, 2023]. <br> Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu.<br> [Paper] [Code]

Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval. (ProST). [ICCV, 2023]. <br> Pandeng Li, Chen-Wei Xie, Liming Zhao, Hongtao Xie, Jiannan Ge, Yun Zheng, Deli Zhao, Yongdong Zhang.<br> [Paper] [Code]

Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer. (Tem-adapter). [ICCV, 2023]. <br> Guangyi Chen, Xiao Liu, Guangrun Wang, Kun Zhang, Philip H.S. Torr, Xiao-Ping Zhang, Yansong Tang.<br> [Paper] [Code]

Tracking Anything with Decoupled Video Segmentation. (DEVA). [ICCV, 2023]. <br> Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee.<br> [Paper] [Code]

Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts. (ViTiS). [ICCVW, 2023]. <br> Deniz Engin, Yannis Avrithis.<br> [Paper] [Code]

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. (CLIP-ViP). [ICLR, 2023]. <br> Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo.<br> [Paper] [Code]

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception. (IMP). [NeurIPS, 2023]. <br> Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam.<br> [Paper]

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. (P-Former). [NeurIPS, 2023]. <br> Yiren Jian, Chongyang Gao, Soroush Vosoughi.<br> [Paper] [Code]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. (InstructBLIP). [NeurIPS, 2023]. <br> Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.<br> [Paper] [Code]

Language-based Action Concept Spaces Improve Video Self-Supervised Learning. (LSS). [NeurIPS, 2023]. <br> Kanchana Ranasinghe, Michael Ryoo.<br> [Paper]

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks. (MaMMUT). [TMLR, 2023]. <br> Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova.<br> [Paper] [Code]

MEME: Multi-Encoder Multi-Expert Framework with Data Augmentation for Video Retrieval. (MEME). [SIGIR, 2023]. <br> Seong-Min Kang, Yoon-Sik Cho.<br> [Paper] [Code]

EVA-CLIP: Improved Training Techniques for CLIP at Scale. (EVA-CLIP). [arxiv, 2023]. <br> Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao.<br> [Paper] [Code]

Fine-grained Text-Video Retrieval with Frozen Image Encoders. (CrossVTR). [arxiv, 2023]. <br> Zuozhuo Dai, Fangtao Shao, Qingkun Su, Zilong Dong, Siyu Zhu.<br> [Paper]

Harvest Video Foundation Models via Efficient Post-Pretraining. (Harvest). [arxiv, 2023]. <br> Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo.<br> [Paper] [Code]

Motion-Conditioned Diffusion Model for Controllable Video Synthesis. (MCDiff). [arxiv, 2023]. <br> Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, Ming-Hsuan Yang.<br> [Paper]

Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding. (Mug-STAN). [arxiv, 2023]. <br> Ruyang Liu, Jingjia Huang, Wei Gao, Thomas H. Li, Ge Li.<br> [Paper] [Code]

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation. (RefSAM). [arxiv, 2023]. <br> Yonglin Li, Jing Zhang, Xiao Teng, Long Lan.<br> [Paper]

Segment and Track Anything. (SAM-Track). [arxiv, 2023]. <br> Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, Yi Yang.<br> [Paper] [Code]

Segment Anything Meets Point Tracking. (SAM-PT). [arxiv, 2023]. <br> Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu.<br> [Paper] [Code]

Track Anything: Segment Anything Meets Videos. (TAM). [arxiv, 2023]. <br> Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, Feng Zheng.<br> [Paper] [Code]

Tracking Anything in High Quality. (HQTrack). [arxiv, 2023]. <br> Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li.<br> [Paper] [Code]

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter. (TaCA). [arxiv, 2023]. <br> Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou.<br> [Paper]

UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model. (UVOSAM). [arxiv, 2023]. <br> Zhenghao Zhang, Shengfan Zhang, Zhichao Wei, Zuozhuo Dai, Siyu Zhu.<br> [Paper] [Code]

Videoprompter: an ensemble of foundational models for zero-shot video understanding. (Videoprompter). [arxiv, 2023]. <br> Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah.<br> [Paper]

ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection. (ZEETAD). [WACV, 2024]. <br> Thinh Phan, Khoa Vo, Duy Le, Gianfranco Doretto, Donald Adjeroh, Ngan Le.<br> [Paper] [Code]

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video. (ZeroI2V). [arxiv, 2023]. <br> Xinhao Li, Limin Wang.<br> [Paper] [Code]

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks. (FitCLIP). [BMVC, 2022]. <br> Santiago Castro and Fabian Caba.<br> [Paper] [Code]

Cross Modal Retrieval with Querybank Normalisation. (QB-Norm). [CVPR, 2022]. <br> Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie.<br> [Paper] [Code]

Revisiting the "Video" in Video-Language Understanding. (ATP). [CVPR, 2022]. <br> Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles.<br> [Paper] [Code]

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval. (X-Pool). [CVPR, 2022]. <br> Satya Krishna Gorti, Noel Vouitsis, Junwei Ma, Keyvan Golestan, Maksims Volkovs, Animesh Garg, Guangwei Yu.<br> [Paper] [Code]

Frozen CLIP models are Efficient Video Learners. (EVL). [ECCV, 2022]. <br> Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li.<br> [Paper] [Code]

Prompting Visual-Language Models for Efficient Video Understanding. (Prompt-CLIP). [ECCV, 2022]. <br> Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie.<br> [Paper] [Code]

Expanding Language-Image Pretrained Models for General Video Recognition. (X-CLIP). [ECCV, 2022]. <br> Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.<br> [Paper] [Code]

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning. (St-Adapter). [NeurIPS, 2022]. <br> Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li.<br> [Paper] [Code]

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. (CLIP4Clip). [Neurocomputing, 2022]. <br> Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li.<br> [Paper] [Code]

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations. (CLIP2Video). [TMM, 2022]. <br> Han Fang, Pengfei Xiong, Luhui Xu, Wenhan Luo.<br> [Paper]

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval. (CenterCLIP). [SIGIR, 2022]. <br> Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang.<br> [Paper] [Code]

Cross-Modal Adapter for Text-Video Retrieval. (Cross-Modal-Adapter). [arxiv, 2022]. <br> Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, Gao Huang.<br> [Paper] [Code]

Imagen Video: High Definition Video Generation with Diffusion Models. (Imagen Video). [arxiv, 2022]. <br> Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans.<br> [Paper] [Demo]

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. (MOV). [arxiv, 2022]. <br> Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui.<br> [Paper]

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. (Video-CoCa). [arxiv, 2022]. <br> Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu.<br> [Paper]

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. (Frozen). [ICCV, 2021]. <br> Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman.<br> [Paper] [Code]

ActionCLIP: A New Paradigm for Video Action Recognition. (ActionCLIP). [arxiv, 2021]. <br> Mengmeng Wang, Jiazheng Xing, Yong Liu.<br> [Paper] [Code]

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss. (CAMoE). [arxiv, 2021]. <br> Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen.<br> [Paper]

End-to-End Learning of Visual Representations from Uncurated Instructional Videos. (MIL-NCE). [CVPR, 2020]. <br> Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman.<br> [Paper] [Code]

Video-based

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding. (MovieChat). [CVPR, 2024]. <br> Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang.<br> [Paper] [Code]

SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling. (SRTube). [CVPR, 2024]. <br> Ju-Hee Lee and Je-Won Kang.<br> [Paper]

VidLA: Video-Language Alignment at Scale. (VidLA). [CVPR, 2024]. <br> Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi.<br> [Paper]

ControlVideo: Training-free Controllable Text-to-Video Generation. (ControlVideo). [ICLR, 2024]. <br> Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian.<br> [Paper] [Code]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. (ViCLIP). [ICLR, 2024]. <br> Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao.<br> [Paper] [Code]

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework. (MORA). [arxiv, 2024]. <br> Zhengqing Yuan, Ruoxi Chen, Zhaoxu Li, Haolong Jia, Lifang He, Chi Wang, Lichao Sun.<br> [Paper] [Code]

Memory Consolidation Enables Long-Context Video Understanding. (MC-ViT). [ICML, 2024]. <br> Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff.<br> [Paper]

Lumiere: A Space-Time Diffusion Model for Video Generation. (Lumiere). [arxiv, 2024]. <br> Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri.<br> [Paper] [Demo]

VideoChat: Chat-Centric Video Understanding. (VideoChat). [arxiv, 2024]. <br> KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao.<br> [Paper] [Code]

VILA: On Pre-training for Visual Language Models. (VILA). [arxiv, 2024]. <br> Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han.<br> [Paper] [Code]

World Model on Million-Length Video And Language With Blockwise RingAttention. (LWM). [arxiv, 2024]. <br> Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel.<br> [Paper] [Code]

Video generation models as world simulators. (SORA). [OpenAI, 2024]. <br> OpenAI.<br> [Link]

All in One: Exploring Unified Video-Language Pre-training. (All-in-One). [CVPR, 2023]. <br> Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.<br> [Paper] [Code]

Clover: Towards A Unified Video-Language Alignment and Fusion Model. (Clover). [CVPR, 2023]. <br> Jingjia Huang, Yinan Li, Jiashi Feng, Xinglong Wu, Xiaoshuai Sun, Rongrong Ji.<br> [Paper] [Code]

HierVL: Learning Hierarchical Video-Language Embeddings. (HierVL). [CVPR, 2023]. <br> Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman.<br> [Paper] [Code]

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling. (LAVENDER). [CVPR, 2023]. <br> Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang.<br> [Paper] [Code]

Learning Video Representations from Large Language Models. (LaViLa). [CVPR, 2023]. <br> Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar.<br> [Paper] [Code]

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning. (MVD). [CVPR, 2023]. <br> Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang.<br> [Paper] [Code]

MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models. (MELTR). [CVPR, 2023]. <br> Dohwan Ko, Joonmyung Choi, Hyeong Kyu Choi, Kyoung-Woon On, Byungseok Roh, Hyunwoo J. Kim.<br> [Paper] [Code]

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation. (MMVG). [CVPR, 2023]. <br> Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell.<br> [Paper] [Code]

Test of Time: Instilling Video-Language Models with a Sense of Time. (TACT). [CVPR, 2023]. <br> Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek.<br> [Paper] [Code]

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. (VideoMAEv2). [CVPR, 2023]. <br> Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao.<br> [Paper] [Code]

VindLU: A Recipe for Effective Video-and-Language Pretraining. (VindLU). [CVPR, 2023]. <br> Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius.<br> [Paper] [Code]

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling. (Violetv2). [CVPR, 2023]. <br> Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu.<br> [Paper] [Code]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. (Video-LLaMA). [EMNLP, 2023]. <br> Hang Zhang, Xin Li, Lidong Bing.<br> [Paper] [Code]

Audiovisual Masked Autoencoders. (AudVis MAE). [ICCV, 2023]. <br> Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab.<br> [Paper]

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. (FateZero). [ICCV, 2023]. <br> Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, Qifeng Chen.<br> [Paper] [Code]

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training. (Hitea). [ICCV, 2023]. <br> Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang.<br> [Paper]

MGMAE: Motion Guided Masking for Video Masked Autoencoding. (MGMAE). [ICCV, 2023]. <br> Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, Limin Wang.<br> [Paper] [Code]

StableVideo: Text-driven Consistency-aware Diffusion Video Editing. (StableVideo). [ICCV, 2023]. <br> Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu.<br> [Paper] [Code]

Verbs in Action: Improving verb understanding in video-language models. (VFC). [ICCV, 2023]. <br> Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid.<br> [Paper] [Code]

Paxion: Patching Action Knowledge in Video-Language Foundation Models. (PAXION). [NeurIPS, 2023]. <br> Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji.<br> [Paper] [Code]

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System. (ChatVideo). [arxiv, 2023]. <br> Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang.<br> [Paper]

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models. (Control-A-Video). [arxiv, 2023]. <br> Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin.<br> [Paper] [Code]

Dreamix: Video Diffusion Models are General Video Editors. (Dreamix). [arxiv, 2023]. <br> Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen.<br> [Paper]

MM-VID: Advancing Video Understanding with GPT-4V(ision). (MM-VID). [arxiv, 2023]. <br> Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang.<br> [Paper] [Demo]

MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling. (MuLTI). [arxiv, 2023]. <br> Jiaqi Xu, Bo Liu, Yunkuo Chen, Mengli Cheng, Xing Shi.<br> [Paper]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. (SVD). [arxiv, 2023]. <br> Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach.<br> [Paper] [Code]

Valley: Video Assistant with Large Language model Enhanced abilitY. (Valley). [arxiv, 2023]. <br> Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei.<br> [Paper]

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models. (Video-Bench). [arxiv, 2023]. <br> Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan.<br> [Paper] [Code]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. (Video-ChatGPT). [arxiv, 2023]. <br> Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan.<br> [Paper] [Code]

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning. (VideoDirectorGPT). [arxiv, 2023]. <br> Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal.<br> [Paper] [Code]

VideoGLUE: Video General Understanding Evaluation of Foundation Models. (Video-GLUE). [arxiv, 2023]. <br> Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong.<br> [Paper] [Code]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. (Video-LLaVA). [arxiv, 2023]. <br> Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan.<br> [Paper] [Code]

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. (HD-VILA). [CVPR, 2022]. <br> Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo.<br> [Paper] [Code]

Align and Prompt: Video-and-Language Pre-training with Entity Prompts. (ALPRO). [CVPR, 2022]. <br> Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi.<br> [Paper] [Code]

BEVT: BERT Pretraining of Video Transformers. (Bevt). [CVPR, 2022]. <br> Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, Lu Yuan.<br> [Paper] [Code]

Bridging Video-text Retrieval with Multiple Choice Questions. (MCQ). [CVPR, 2022]. <br> Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, Ping Luo.<br> [Paper] [Code]

End-to-end Generative Pretraining for Multimodal Video Captioning. (MV-GPT). [CVPR, 2022]. <br> Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid.<br> [Paper]

Object-aware Video-language Pre-training for Retrieval. (OA-Trans). [CVPR, 2022]. <br> Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou.<br> [Paper] [Code]

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. (SwinBERT). [CVPR, 2022]. <br> Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, Lijuan Wang.<br> [Paper] [Code]

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval. (MILES). [ECCV, 2022]. <br> Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo.<br> [Paper] [Code]

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning. (LF-VILA). [NeurIPS, 2022]. <br> Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu.<br> [Paper] [Code]

Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval. (TMVM). [NeurIPS, 2022]. <br> Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi Zheng, Chunhua Shen.<br> [Paper]

Masked Autoencoders As Spatiotemporal Learners. (ST-MAE). [NeurIPS, 2022]. <br> Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He.<br> [Paper] [Code]

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. (VideoMAE). [NeurIPS, 2022]. <br> Zhan Tong, Yibing Song, Jue Wang, Limin Wang.<br> [Paper] [Code]

TVLT: Textless Vision-Language Transformer. (TVLT). [NeurIPS, 2022]. <br> Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal.<br> [Paper] [Code]

It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training. (MAM<sup>2</sup>). [arxiv, 2022]. <br> Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang.<br> [Paper]

Make-A-Video: Text-to-Video Generation without Text-Video Data. (Make-A-Video). [arxiv, 2022]. <br> Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman.<br> [Paper]

SimVTP: Simple Video Text Pre-training with Masked Autoencoders. (SimVTP). [arxiv, 2022]. <br> Yue Ma, Tianyu Yang, Yin Shan, Xiu Li.<br> [Paper] [Code]

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling. (Violet). [arxiv, 2022]. <br> Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu.<br> [Paper] [Code]

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. (VideoCLIP). [EMNLP, 2021]. <br> Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer.<br> [Paper] [Code]

Just Ask: Learning to Answer Questions from Millions of Narrated Videos. (Just-Ask). [ICCV, 2021]. <br> Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid.<br> [Paper] [Code]

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning. (Vimpac). [arxiv, 2021]. <br> Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal.<br> [Paper] [Code]

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. (UniVL). [arxiv, 2020]. <br> Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou.<br> [Paper] [Code]

Universal

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. (Chat-UniVi). [CVPR, 2024]. <br> Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan.<br> [Paper] [Code]

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs. (Dysen-VDM). [CVPR, 2024]. <br> Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua.<br> [Paper] [Code]

General Object Foundation Model for Images and Videos at Scale. (GLEE). [CVPR, 2024]. <br> Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai.<br> [Paper] [Code]

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding. (MA-LMM). [CVPR, 2024]. <br> Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim.<br> [Paper] [Code]

Timechat: A time-sensitive multimodal large language model for long video understanding. (TimeChat). [CVPR, 2024]. <br> Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou.<br> [Paper] [Code]

VideoLLM-online: Online Video Large Language Model for Streaming Video. (VideoLLM-online). [CVPR, 2024]. <br> Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou.<br> [Paper] [Code]

LongVLM: Efficient Long Video Understanding via Large Language Models (LongVLM). [ECCV, 2024]. <br> Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang.<br> [Paper] [Code]

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. (LanguageBind). [ICLR, 2024]. <br> Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan.<br> [Paper] [Code]

X<sup>2</sup>-VLM: All-In-One Pre-trained Model For Vision-Language Tasks. (X<sup>2</sup>-VLM). [TPAMI, 2024]. <br> Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou.<br> [Paper] [Code]

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding. (InternVideo2). [arxiv, 2024]. <br> Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang.<br> [Paper] [Code]

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization. (Video-LaVIT). [arxiv, 2024]. <br> Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu.<br> [Paper] [Code]

VideoPoet: A Large Language Model for Zero-Shot Video Generation. (VideoPoet). [arxiv, 2024]. <br> Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang.<br> [Paper]

VideoPrism: A Foundational Visual Encoder for Video Understanding. (VideoPrism). [arxiv, 2024]. <br> Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong.<br> [Paper]

OmniMAE: Single Model Masked Pretraining on Images and Videos. (OmniMAE). [CVPR, 2023]. <br> Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra.<br> [Paper] [Code]

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding. (TESTA). [EMNLP, 2023]. <br> Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou.<br> [Paper] [Code]

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training. (Smaug). [ICCV, 2023]. <br> Yuanze Lin, Chen Wei, Huiyu Wang, Alan Yuille, Cihang Xie.<br> [Paper]

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video. (mPLUG-2). [ICML, 2023]. <br> Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou.<br> [Paper] [Code]

Contrastive Audio-Visual Masked Autoencoder. (CAV-MAE). [ICLR, 2023]. <br> Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass.<br> [Paper] [Code]

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset. (VAST). [NeurIPS, 2023]. <br> Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu.<br> [Paper] [Code]

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention. (Perceiver-VL). [WACV, 2023]. <br> Zineng Tang, Jaemin Cho, Jie Lei, Mohit Bansal.<br> [Paper] [Code]

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models. (FAVOR). [arxiv, 2023]. <br> Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang.<br> [Paper] [Code]

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. (Macaw-LLM). [arxiv, 2023]. <br> Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu.<br> [Paper] [Code]

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models. (PG-Video-LLaVA). [arxiv, 2023]. <br> Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan.<br> [Paper] [Code]

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale. (TVTSv2). [arxiv, 2023]. <br> Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan.<br> [Paper] [Code]

ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders. (Vic-MAE). [arxiv, 2023]. <br> Jefferson Hernandez, Ruben Villegas, Vicente Ordonez.<br> [Paper]

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset. (VALOR). [arxiv, 2023]. <br> Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu.<br> [Paper] [Code]

Masked Feature Prediction for Self-Supervised Visual Pre-Training. (MaskFeat). [CVPR, 2022]. <br> Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer.<br> [Paper] [Code]

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks. (OmniVL). [NeurIPS, 2022]. <br> Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan.<br> [Paper]

InternVideo: General Video Foundation Models via Generative and Discriminative Learning. (InternVideo). [arxiv, 2022]. <br> Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao.<br> [Paper] [Code]

Self-supervised video pretraining yields human-aligned visual representations. (VITO). [arxiv, 2022]. <br> Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff.<br> [Paper]

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. (VLM). [ACL, 2021]. <br> Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer.<br> [Paper] [Code]

MERLOT: Multimodal Neural Script Knowledge Models. (MERLOT). [NeurIPS, 2021]. <br> Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi.<br> [Paper] [Code]

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. (VATT). [NeurIPS, 2021]. <br> Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong.<br> [Paper] [Code]