Awesome-Text-to-Video-Generation
A curated (continually updated) list of Text-to-Video studies. It is based on our survey paper: From Sora What We Can See: A Survey of Text-to-Video Generation. In this survey, we conduct a comprehensive exploration of existing work in the Text-to-Video field using OpenAI's Sora as a clue, and we summarize 24 datasets and 9 evaluation metrics in this field. Specifically, we discuss the open problems in this research area and in Sora itself, and combine Sora's advantages with the characteristics of related fields to suggest future research directions. If our work inspires you, feel free to cite our paper and star our repo.
This project is curated and maintained by Rui Sun and Yumin Zhang.
@article{sun2024sora,
title={From Sora What We Can See: A Survey of Text-to-Video Generation},
author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
journal={arXiv preprint arXiv:2405.10674},
year={2024}
}
Topics of this repo cover: Text-to-Seq-Image, Text-to-Video. <br>
Table of Contents
<a name="text_to_seq_image"></a> Text-to-Seq-Image
- LivePhoto: Real Image Animation with Text-guided Motion Control <br> Team: HKU, Alibaba Group, Ant Group. <br> Xi Chen, Zhiheng Liu, Mengting Chen, et al., Hengshuang Zhao <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Demo (Video)], [Home Page] <br>
- Scalable Diffusion Models with Transformers
Sequential Images
<br> Team: UC Berkeley, NYU. <br> William Peebles, Saining Xie <br> ICCV'23(Oral), arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
<a name="text_to_video"></a> Text-to-Video
-
Zero-Shot Video Editing through Adaptive Sliding Score Distillation
Video Editing
<br> Team: Nanjing University. <br> Lianghan Zhu, Yanqi Bao, Jing Huo, et al., Yang Gao <br> arXiv, 2024.06 [Paper], [PDF], [Home Page] <br>
- CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion <br> Team: University of Science and Technology of China. <br> Xingrui Wang, Xin Li, Zhibo Chen <br> arXiv, 2024.06 [Paper], [PDF], [Home Page] <br>
-
VideoTetris: Towards Compositional Text-to-Video Generation <br> Team: Peking University. <br> Ye Tian, Ling Yang, Haotian Yang, et al., Bin Cui <br> arXiv, 2024.06 [Paper], [PDF], [Code], [Home Page] <br>
-
Searching Priors Makes Text-to-Video Synthesis Better <br> Team: Zhejiang University. <br> Haoran Cheng, Liang Peng, Linxuan Xia, et al., Boxi Wu <br> arXiv, 2024.06 [Paper], [PDF], [Home Page] <br>
-
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
3DGS Task
<br> Team: KAIST, ByteDance. <br> Inkyu Shin, Qihang Yu, Xiaohui Shen, et al., Liang-Chieh Chen <br> arXiv, 2024.06 [Paper], [PDF], [Home Page] <br>
- ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation <br> Team: Tsinghua University. <br> Tianchen Zhao, Tongcheng Fang, Enshu Liu, et al., Yu Wang <br> arXiv, 2024.06 [Paper], [PDF], [Home Page] <br>
-
FIFO-Diffusion: Generating Infinite Videos from Text without Training <br> Team: Computer Vision Laboratory, ECE & IPAI, Seoul National University <br> Jihwan Kim, Junoh Kang, Jinyoung Cho, Bohyung Han <br> arXiv, 2024.05 [Paper], [PDF], [Code], [Home Page]
-
TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation <br> Team: UCLA, Google. <br> Hritik Bansal, Yonatan Bitton, Michal Yarom, et al., Kai-Wei Chang <br> arXiv, 2024.05 [Paper], [PDF], [Code], [Dataset], [Pretrained Model], [Home Page] <br>
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Robotics
<br> Team: Tsinghua University. <br> Jialong Wu, Shaofeng Yin, Ningya Feng, et al., Mingsheng Long <br> arXiv, 2024.05 [Paper], [PDF], [Code], [Home Page] <br>
- MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators <br> Team: Peking University, University of Rochester. <br> Shenghai Yuan, Jinfa Huang, Yujun Shi, et al., Li Yuan, Jiebo Luo <br> arXiv, 2024.04 [Paper], [PDF], [Code], [Home Page] <br>
-
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis <br> Team: Snap Inc, University of Trento.<br> Willi Menapace, Aliaksandr Siarohin, et al., Sergey Tulyakov <br> arXiv, 2024.02 [Paper], [PDF], [Home Page]
-
Video generation models as world simulators <br> Team: Sora, OpenAI. <br> Tim Brooks, Bill Peebles, Connor Holmes, et al., Aditya Ramesh <br> online page, 2024.02 [Paper], [Home Page] <br>
-
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation <br> Team: University of Waterloo. <br> Weiming Ren, Harry Yang, Ge Zhang, et al., Wenhu Chen <br> arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
World Model on Million-Length Video And Language With RingAttention
Long Video
<br> Team: UC Berkeley. <br> Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel <br> arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
- 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model <br> Team: Peking University. <br> Qian Wang, Weiqi Li, Chong Mou, et al., Jian Zhang <br> arXiv, 2024.01 [Paper], [PDF], [Code], [Home Page] <br>
-
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation <br> Team: Bytedance Inc. <br> Weimin Wang, Jiawei Liu, Zhijie Lin, et al., Jiashi Feng <br> arXiv, 2024.01 [Paper], [PDF], [Home Page] <br>
-
UniVG: Towards UNIfied-modal Video Generation <br> Team: Baidu Inc. <br> Ludan Ruan, Lei Tian, Chuanwei Huang, et al., Xinyan Xiao <br> arXiv, 2024.01 [Paper], [PDF], [Home Page] <br>
-
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM <br> Team: HiDream.ai Inc. <br> Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei <br> arXiv, 2024.01 [Paper], [PDF], [Home Page] <br>
-
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models <br> Team: Tencent AI Lab. <br> Haoxin Chen, Yong Zhang, Xiaodong Cun, et al., Ying Shan <br> arXiv, 2024.01 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
Lumiere: A Space-Time Diffusion Model for Video Generation <br> Team: Google Research, Weizmann Institute, Tel-Aviv University, Technion. <br> Omer Bar-Tal, Hila Chefer, Omer Tov, et al., Inbar Mosseri <br> arXiv, 2024.01 [Paper], [PDF], [Home Page] <br>
-
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion <br> Team: Fudan University, Alibaba Group, HUST, Zhejiang University. <br> Yujie Wei, Shiwei Zhang, Zhiwu Qing, et al., Hongming Shan <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] <br>
-
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation <br> Team: Peking University, Microsoft Research. <br> Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu <br> arXiv, 2023.12 [Paper], [PDF] <br>
-
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
Training-free
<br> Team: Victoria University of Wellington, NVIDIA <br> Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)] <br>
- FreeInit: Bridging Initialization Gap in Video Diffusion Models
Training-free
<br> Team: Nanyang Technological University <br> Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)] <br>
- MTVG: Multi-text Video Generation with Text-to-Video Models
Training-free
<br> Team: Korea University, NVIDIA <br> Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, et al., Sangpil Kim <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)] <br>
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos <br> Team: HUST, Alibaba Group, Zhejiang University, Ant Group <br> Xiang Wang, Shiwei Zhang, Hangjie Yuan, et al., Nong Sang <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] <br>
-
InstructVideo: Instructing Video Diffusion Models with Human Feedback <br> Team: Zhejiang University, Alibaba Group, Tsinghua University <br> Hangjie Yuan, Shiwei Zhang, Xiang Wang, et al., Dong Ni <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]<br>
-
VideoLCM: Video Latent Consistency Model <br> Team: HUST, Alibaba Group, SJTU <br> Xiang Wang, Shiwei Zhang, Han Zhang, et al., Nong Sang <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]<br>
-
Photorealistic Video Generation with Diffusion Models <br> Team: Stanford University, Google. <br> Agrim Gupta, Lijun Yu, Kihyuk Sohn, et al., José Lezama <br> arXiv, 2023.12 [Paper], [PDF], [Home Page] <br>
-
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation <br> Team: HUST, Alibaba Group, Fudan University. <br> Zhiwu Qing, Shiwei Zhang, Jiayu Wang, et al., Nong Sang <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation <br> Team: HKU, Meta. <br> Shoufa Chen, Mengmeng Xu, Jiawei Ren, et al., Juan-Manuel Perez-Rua <br> arXiv, 2023.12 [Paper], [PDF], [Home Page] <br>
-
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter <br> Team: Tsinghua University, Tencent AI Lab, CUHK. <br> Gongye Liu, Menghan Xia, Yong Zhang, et al., Ying Shan <br> arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)] <br>
-
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Multimodal
<br> Team: Tencent. <br> Zhanyu Wang, Longyue Wang, Zhen Zhao, et al., Zhaopeng Tu <br> arXiv, 2023.11 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
- F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis
Training-free
<br> Team: University of Electronic Science and Technology of China. <br> Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song <br> arXiv, 2023.11 [Paper], [PDF] <br>
- AdaDiff: Adaptive Step Selection for Fast Diffusion
Training-free
<br> Team: Fudan University. <br> Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang <br> arXiv, 2023.11 [Paper], [PDF] <br>
- FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
Training-free
<br> Team: University of Technology Sydney. <br> Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang <br> arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page] <br>
- GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Training-free
<br> Team: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences. <br> Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, et al., Shifeng Chen <br> arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page] <br>
- MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation <br> Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University. <br> Yanhui Wang, Jianmin Bao, Wenming Weng, et al., Baining Guo <br> arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)] <br>
-
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation <br> Team: Peking University, Huawei Noah's Ark Lab. <br> Yuanxin Liu, Lei Li, Shuhuai Ren, et al., Lu Hou <br> arXiv, 2023.11 [Paper], [PDF], [Code], [Dataset] <br>
-
ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models <br> Team: University of Science and Technology of China, Microsoft. <br> Wenming Weng, Ruoyu Feng, Yanhui Wang, et al., Zhiwei Xiong <br> arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page], [Demo(video)] <br>
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets <br> Team: Stability AI. <br> Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al., Robin Rombach <br> arXiv, 2023.11 [Paper], [PDF], [Code] <br>
-
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline <br> Team: Sber AI. <br> Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, et al., Denis Dimitrov <br> arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page], [Demo(live)] <br>
-
MoVideo: Motion-Aware Video Generation with Diffusion Models <br> Team: ETH, Meta. <br> Jingyun Liang, Yuchen Fan, Kai Zhang, et al., Rakesh Ranjan <br> arXiv, 2023.11 [Paper], [PDF], [Home Page] <br>
-
Optimal Noise pursuit for Augmenting Text-to-Video Generation <br> Team: Zhejiang Lab. <br> Shijie Ma, Huayi Xu, Mengjian Li, et al., Yaxiong Wang <br> arXiv, 2023.11 [Paper], [PDF] <br>
-
Make Pixels Dance: High-Dynamic Video Generation <br> Team: ByteDance. <br> Yan Zeng, Guoqiang Wei, Jiani Zheng, et al., Hang Li <br> arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)] <br>
-
Learning Universal Policies via Text-Guided Video Generation <br> Team: MIT, Google DeepMind, UC Berkeley. <br> Yilun Du, Mengjiao Yang, Bo Dai, et al., Pieter Abbeel <br> NeurIPS'23 (Spotlight), arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page] <br>
-
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning <br> Team: Meta. <br> Rohit Girdhar, Mannat Singh, Andrew Brown, et al., Ishan Misra <br> arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(live)] <br>
-
MotionDirector: Motion Customization of Text-to-Video Diffusion Models <br> Team: Show Lab, National University of Singapore, Zhejiang University <br> Rui Zhao, Yuchao Gu, et al., Mike Zheng Shou <br> ECCV'24 (Oral), 2023.10 [Paper], [PDF], [Code], [Home Page]
-
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
Training-free
<br> Team: Nanyang Technological University. <br> Haonan Qiu, Menghan Xia, Yong Zhang, et al., Ziwei Liu <br> ICLR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] <br>
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation
Training-free
<br> Team: Shanghai Artificial Intelligence Laboratory. <br> Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao <br> arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] <br>
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation <br> Team: Tencent AI Lab. <br> Haoxin Chen, Menghan Xia, Yingqing He, et al., Ying Shan <br> arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] <br>
-
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction <br> Team: Shanghai Artificial Intelligence Laboratory. <br> Xinyuan Chen, Yaohui Wang, Lingjun Zhang, et al., Ziwei Liu <br> arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] <br>
-
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors <br> Team: The Chinese University of Hong Kong. <br> Jinbo Xing, Menghan Xia, Yong Zhang, et al., Ying Shan <br> arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page], [Demo(live)], [Demo(video)] <br>
-
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation <br> Team: Nankai University, MEGVII Technology. <br> Ruiqi Wu, Liangyu Chen, Tong Yang, et al., Xiangyu Zhang <br> arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
LLM-grounded Video Diffusion Models
Training-free
<br> Team: UC Berkeley. <br> Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li <br> arXiv, 2023.09 [Paper], [PDF], [Code(coming)], [Home Page] <br>
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning <br> Team: UNC Chapel Hill. <br> Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal <br> arXiv, 2023.09 [Paper], [PDF], [Code] <br>
-
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation <br> Team: Baidu Inc. <br> Xin Li, Wenqing Chu, Ye Wu, et al., Jingdong Wang <br> arXiv, 2023.09 [Paper], [PDF], [Home Page] <br>
-
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models <br> Team: Shanghai Artificial Intelligence Laboratory. <br> Yaohui Wang, Xinyuan Chen, Xin Ma, et al., Ziwei Liu <br> arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page] <br>
-
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation <br> Team: Huawei. <br> Jiaxi Gu, Shicong Wang, Haoyu Zhao, et al., Hang Xu <br> arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page] <br>
-
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
Training-free
<br> Team: School of Information Science and Technology, ShanghaiTech University. <br> Hanzhuo Huang, Yufan Feng, Cheng Shi, et al., Sibei Yang <br> NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Home Page] <br>
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation <br> Team: Show Lab, National University of Singapore <br> David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, et al., Mike Zheng Shou <br> arXiv, 2023.09 [Paper], [PDF], [Home Page], [Code], [Pretrained Model] <br>
-
GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER <br> Team: Institute of Automation, Chinese Academy of Sciences (CASIA). <br> Mingzhen Sun, Weining Wang, Zihan Qin, et al., Jing Liu <br> NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page], [Demo(video)] <br>
-
DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
Training-free
<br> Team: East China Normal University. <br> Zhongjie Duan, Lizhou You, Chengyu Wang, et al., Jun Huang <br> arXiv, 2023.08 [Paper], [PDF], [Home Page] <br>
- SimDA: Simple Diffusion Adapter for Efficient Video Generation <br> Team: Fudan University, Microsoft. <br> Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang <br> arXiv, 2023.08 [Paper], [PDF], [Code(coming)], [Home Page] <br>
-
Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models <br> Team: National University of Singapore. <br> Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua <br> arXiv, 2023.08 [Paper], [PDF], [Code] <br>
-
ModelScope Text-to-Video Technical Report <br> Team: Alibaba Group. <br> Jiuniu Wang, Hangjie Yuan, Dayou Chen, et al., Shiwei Zhang <br> arXiv, 2023.08 [Paper], [PDF], [Code], [Home Page], [Demo(live)] <br>
-
Dual-Stream Diffusion Net for Text-to-Video Generation <br> Team: Nanjing University of Science and Technology. <br> Binhui Liu, Xin Liu, Anbo Dai, et al., Jian Yang <br> arXiv, 2023.08 [Paper], [PDF] <br>
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning <br> Team: The Chinese University of Hong Kong. <br> Yuwei Guo, Ceyuan Yang, Anyi Rao, et al., Bo Dai <br> ICLR'24 (spotlight), arXiv, 2023.07 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation <br> Team: HKUST. <br> Yingqing He, Menghan Xia, Haoxin Chen, et al., Qifeng Chen <br> arXiv, 2023.07 [Paper], [PDF], [Code], [Home Page], [Demo(video)] <br>
-
Probabilistic Adaptation of Text-to-Video Models <br> Team: Google, UC Berkeley. <br> Mengjiao Yang, Yilun Du, Bo Dai, et al., Pieter Abbeel <br> arXiv, 2023.06 [Paper], [PDF], [Home Page] <br>
-
ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation <br> Team: School of Artificial Intelligence, University of Chinese Academy of Sciences. <br> Jiawei Liu, Weining Wang, Wei Liu, Qian He, Jing Liu <br> IJCNN'23, 2023.06 [Paper], [PDF] <br>
-
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance <br> Team: CUHK. <br> Jinbo Xing, Menghan Xia, Yuxin Liu, et al., Tien-Tsin Wong <br> arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
VideoComposer: Compositional Video Synthesis with Motion Controllability <br> Team: Alibaba Group. <br> Xiang Wang, Hangjie Yuan, Shiwei Zhang, et al., Jingren Zhou <br> NeurIPS'23, arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] <br>
-
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation <br> Team: University of Chinese Academy of Sciences (UCAS), Alibaba Group. <br> Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al., Tieniu Tan <br> CVPR'23, arXiv, 2023.06 [Paper], [PDF]<br>
-
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
Training-free
<br> Team: Korea University. <br> Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim <br> arXiv, 2023.05 [Paper], [PDF] <br>
- Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models <br> Team: Carnegie Mellon University. <br> Rohan Dhesikan, Vignesh Rajmohan <br> arXiv, 2023.05 [Paper], [PDF], [Code(coming)] <br>
-
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models <br> Team: University of Maryland. <br> Songwei Ge, Seungjun Nah, Guilin Liu, et al., Yogesh Balaji <br> ICCV'23, arXiv, 2023.05 [Paper], [PDF], [Home Page] <br>
-
Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity <br> Team: NUS, CUHK. <br> Zijiao Chen, Jiaxin Qing, Juan Helen Zhou <br> NeurIPS'23, arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page] <br>
-
VideoPoet: A Large Language Model for Zero-Shot Video Generation <br> Team: Google Research <br> Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al., Lu Jiang <br> arXiv, 2023.05 [Paper], [PDF], [Home Page], [Blog] <br>
-
VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning <br> Team: Tsinghua University, Beijing Film Academy <br> Hong Chen, Xin Wang, Guanning Zeng, et al., Wenwu Zhu <br> arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page] <br>
-
Text2Performer: Text-Driven Human Video Generation <br> Team: Nanyang Technological University <br> Yuming Jiang, Shuai Yang, Tong Liang Koh, et al., Ziwei Liu <br> arXiv, 2023.04 [Paper], [PDF], [Code], [Home Page], [Demo(video)] <br>
-
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation <br> Team: University of Rochester, Meta. <br> Jie An, Songyang Zhang, Harry Yang, et al., Xi Yin <br> arXiv, 2023.04 [Paper], [PDF], [Home Page] <br>
-
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos <br> Team: Tsinghua University, HKUST. <br> Yue Ma, Yingqing He, Xiaodong Cun, et al., Qifeng Chen <br> AAAI'24, arXiv, 2023.04 [Paper], [PDF], [Home Page], [Code] <br>
-
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models <br> Team: NVIDIA. <br> Andreas Blattmann, Robin Rombach, Huan Ling, et al., Karsten Kreis <br> CVPR'23, arXiv, 2023.04 [Paper], [PDF], [Home Page] <br>
-
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation <br> Team: University of Science and Technology of China, Microsoft. <br> Shengming Yin, Chenfei Wu, Huan Yang, et al., Nan Duan <br> arXiv, 2023.03 [Paper], [PDF], [Home Page] <br>
-
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators <br> Team: Picsart AI Research (PAIR). <br> Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al., Humphrey Shi <br> arXiv, 2023.03 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)] <br>
-
Structure and Content-Guided Video Synthesis with Diffusion Models <br> Team: Runway <br> Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis <br> ICCV'23, arXiv, 2023.02 [Paper], [PDF], [Home Page] <br>
-
SceneScape: Text-Driven Consistent Scene Generation <br> Team: Weizmann Institute of Science, NVIDIA Research <br> Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel <br> NeurIPS'23, arXiv, 2023.02 [Paper], [PDF], [Code], [Home Page] <br>
-
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation <br> Team: Renmin University of China, Peking University, Microsoft Research <br> Ludan Ruan, Yiyang Ma, Huan Yang, et al., Baining Guo <br> CVPR'23, arXiv, 2022.12 [Paper], [PDF], [Code] <br>
-
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation <br> Team: Show Lab, National University of Singapore. <br> Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou <br> ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model] <br>
-
MagicVideo: Efficient Video Generation With Latent Diffusion Models <br> Team: ByteDance Inc. <br> Daquan Zhou, Weimin Wang, Hanshu Yan, et al., Jiashi Feng <br> arXiv, 2022.11 [Paper], [PDF], [Home Page] <br>
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Long Video
<br> Team: HKUST, Tencent AI Lab. <br> Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen <br> arXiv, 2022.10 [Paper], [PDF], [Code], [Home Page] <br>
- Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation <br> Team: UC Santa Barbara, Meta. <br> Tsu-Jui Fu, Licheng Yu, Ning Zhang, et al., Sean Bell <br> CVPR'23, arXiv, 2022.11 [Paper], [PDF] <br>
-
Phenaki: Variable Length Video Generation From Open Domain Textual Description <br> Team: Google. <br> Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al., Dumitru Erhan <br> ICLR'23, arXiv, 2022.10 [Paper], [PDF], [Home Page] <br>
-
Imagen Video: High Definition Video Generation with Diffusion Models <br> Team: Google. <br> Jonathan Ho, William Chan, Chitwan Saharia, et al., Tim Salimans <br> arXiv, 2022.10 [Paper], [PDF], [Home Page] <br>
-
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
Story Visualization
<br> Team: UNC Chapel Hill. <br> Adyasha Maharana, Darryl Hannan, Mohit Bansal <br> ECCV'22, arXiv, 2022.09 [Paper], [PDF], [Code], [Demo(live)] <br>
- Make-A-Video: Text-to-Video Generation without Text-Video Data <br> Team: Meta AI. <br> Uriel Singer, Adam Polyak, Thomas Hayes, et al., Yaniv Taigman <br> ICLR'23, arXiv, 2022.09 [Paper], [PDF], [Code] <br>
-
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model <br> Team: S-Lab, SenseTime. <br> Mingyuan Zhang, Zhongang Cai, Liang Pan, et al., Ziwei Liu <br> TPAMI'24, arXiv, 2022.08 [Paper], [PDF], [Code], [Home Page], [Demo] <br>
-
Word-Level Fine-Grained Story Visualization
Story Visualization
<br> Team: University of Oxford. <br> Bowen Li, Thomas Lukasiewicz <br> ECCV'22, arXiv, 2022.08 [Paper], [PDF], [Code], [Pretrained Model] <br>
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers <br> Team: Tsinghua University. <br> Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang <br> ICLR'23, arXiv, 2022.05 [Paper], [PDF], [Code], [Home Page], [Demo(video)] <br>
-
CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers <br> Team: Tsinghua University. <br> Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang <br> NeurIPS'22, arXiv, 2022.04 [Paper], [PDF], [Code], [Home Page] <br>
-
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer <br> Team: Meta AI. <br> Songwei Ge, Thomas Hayes, Harry Yang, et al., Devi Parikh <br> ECCV'22, arXiv, 2022.04 [Paper], [PDF], [Home Page], [Code] <br>
-
Video Diffusion Models
text-conditioned
<br> Team: Google. <br> Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al., David J. Fleet <br> arXiv, 2022.04 [Paper], [PDF], [Home Page] <br>
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
Long Video
<br> Team: Microsoft. <br> Chenfei Wu, Jian Liang, Xiaowei Hu, et al., Nan Duan <br> NeurIPS'22, arXiv, 2022.02 [Paper], [PDF], [Code], [Home Page] <br>
- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion <br> Team: Microsoft. <br> Chenfei Wu, Jian Liang, Lei Ji, et al., Nan Duan <br> ECCV'22, arXiv, 2021.11 [Paper], [PDF], [Code] <br>
-
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions <br> Team: Microsoft, Duke University. <br> Chenfei Wu, Lun Huang, Qianxi Zhang, et al., Nan Duan <br> arXiv, 2021.04 [Paper], [PDF] <br>
-
Cross-Modal Dual Learning for Sentence-to-Video Generation <br> Team: Tsinghua University. <br> Yue Liu, Xin Wang, Yitian Yuan, Wenwu Zhu <br> ACM MM'19 [Paper], [PDF]
-
IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-Video Generation <br> Team: Peking University. <br> Kangle Deng, Tianyi Fei, Xin Huang, Yuxin Peng <br> IJCAI'19 [Paper], [PDF]
-
Imagine This! Scripts to Compositions to Videos <br> Team: University of Illinois Urbana-Champaign, AI2, University of Washington. <br> Tanmay Gupta, Dustin Schwenk, Ali Farhadi, et al., Aniruddha Kembhavi <br> ECCV'18, arXiv, 2018.04 [Paper], [PDF]
-
To Create What You Tell: Generating Videos from Captions <br> Team: USTC, Microsoft Research. <br> Yingwei Pan, Zhaofan Qiu, Ting Yao, et al., Tao Mei <br> ACM MM'17, arXiv, 2018.04 [Paper], [PDF]
-
Neural Discrete Representation Learning <br> Team: DeepMind. <br> Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu <br> NeurIPS'17, arXiv, 2017.11 [Paper], [PDF] <br>
-
Video Generation From Text <br> Team: Duke University, NEC Labs America. <br> Yitong Li, Martin Renqiang Min, Dinghan Shen, et al., Lawrence Carin <br> AAAI'18, arXiv, 2017.10 [Paper], [PDF] <br>
-
Attentive Semantic Video Generation Using Captions <br> Team: IIT Hyderabad. <br> Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian <br> ICCV'17, arXiv, 2017.08 [Paper], [PDF] <br>
-
Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
VAE
<br> Team: IIT Hyderabad. <br> Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian <br> ACM MM'17, arXiv, 2016.11 [Paper], [PDF] <br>
<a name="dataset_and_metrics"></a> Datasets & Metrics
Datasets are divided according to their collected domains: Face, Open, Movie, Action, Instruct, Cooking. <br>
Metrics are divided into image-level and video-level. <br>
Dataset | Domain | Annotated | #Clips | #Sent | Len_C(s) | Len_S | #Videos | Resolution | FPS | Dur(h) | Year | Source |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CV-Text | Face | Generated | 70K | 1400K | - | 67.2 | - | 480P | - | - | 2023 | Online |
MSR-VTT | Open | Manual | 10K | 200K | 15.0 | 9.3 | 7.2K | 240P | 30 | 40 | 2016 | YouTube |
DiDeMo | Open | Manual | 27K | 41K | 6.9 | 8.0 | 10.5K | - | - | 87 | 2017 | Flickr |
Y-T-180M | Open | ASR | 180M | - | - | - | 6M | - | - | - | 2021 | YouTube |
WVid2M | Open | Alt-text | 2.5M | 2.5M | 18.0 | 12.0 | 2.5M | 360P | - | 13K | 2021 | Web |
H-100M | Open | ASR | 103M | - | 13.4 | 32.5 | 3.3M | 720P | - | 371.5K | 2022 | YouTube |
InternVid | Open | Generated | 234M | - | 11.7 | 17.6 | 7.1M | *720P | - | 760.3K | 2023 | YouTube |
H-130M | Open | Generated | 130M | 130M | - | 10.0 | - | 720P | - | - | 2023 | YouTube |
Y-mP | Open | Manual | 10M | 10M | 54.2 | - | - | - | - | 150K | 2023 | Youku |
V-27M | Open | Generated | 27M | 135M | 12.5 | - | - | - | - | - | 2024 | YouTube |
P-70M | Open | Generated | - | 70.8M | 8.5 | 13.2 | 70.8M | 720P | - | 166.8K | 2024 | YouTube |
ChronoMagic-Pro | Open | Generated | - | - | 234.1 | - | 460K | 720P | - | 30.0K | 2024 | YouTube |
LSMDC | Movie | Manual | 118K | 118K | 4.8 | 7.0 | 200 | 1080P | - | 158 | 2017 | Movie |
MAD | Movie | Manual | - | 384K | - | 12.7 | 650 | - | - | 1.2K | 2022 | Movie |
UCF-101 | Action | Manual | 13K | - | 7.2 | - | - | 240P | 25 | 27 | 2012 | YouTube |
ANet-200 | Action | Manual | 100K | - | - | 13.5 | 2K | *720P | 30 | 849 | 2015 | YouTube |
Charades | Action | Manual | 10K | 16K | - | - | 10K | - | - | 82 | 2016 | Home |
Kinetics | Action | Manual | 306K | - | 10.0 | - | 306K | - | - | - | 2017 | YouTube |
ActNet | Action | Manual | 100K | 100K | 36.0 | 13.5 | 20K | - | - | 849 | 2017 | YouTube |
C-Ego | Action | Manual | - | - | - | - | 8K | 240P | - | 69 | 2018 | Home |
SS-V2 | Action | Manual | - | - | - | - | 220.1K | - | 12 | - | 2018 | Daily |
How2 | Instruct | Manual | 80K | 80K | 90.0 | 20.0 | 13.1K | - | - | 2000 | 2018 | YouTube |
HT100M | Instruct | ASR | 136M | 136M | 3.6 | 4.0 | 1.2M | 240P | - | 134.5K | 2019 | YouTube |
YCook2 | Cooking | Manual | 14K | 14K | 19.6 | 8.8 | 2K | - | - | 176 | 2018 | YouTube |
E-Kit | Cooking | Manual | 40K | 40K | - | - | 432 | *1080P | 60 | 55 | 2018 | Home |
-
(ShareGPT4Video) ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Dataset (adding)
<br> Team: USTC, CUHK, PKU, Shanghai AI Lab. <br> Lin Chen, Xilin Wei, Jinsong Li, et al., Jiaqi Wang <br> arXiv, 2024.06 [Paper], [PDF], [Code], [Dataset], [Home Page] -
(ChronoMagic-Pro) ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation <br> Team: Peking University, University of Rochester. <br> Shenghai Yuan, Jinfa Huang, et al., Jiebo Luo, Li Yuan <br> arXiv, 2024.04 [Paper], [PDF], [Code], [Home Page] <br>
-
(VideoPhy) VideoPhy: Evaluating Physical Commonsense for Video Generation
Dataset (adding)
<br> Team: University of California Los Angeles, Google Research. <br> Hritik Bansal, Zongyu Lin, Tianyi Xie, et al., Aditya Grover <br> arXiv, 2024.06 [Paper], [PDF], [Code], [Hugging Face], [Home Page] -
(GenAI-Bench) Evaluating Text-to-Visual Generation with Image-to-Text Generation
Dataset (adding)
<br> Team: CMU, Meta. <br> Zhiqiu Lin, Deepak Pathak, Baiqi Li, et al., Deva Ramanan <br> arXiv, 2024.04 [Paper], [PDF], [Code], [Home Page] -
(VidProM) VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Dataset (adding)
<br> Team: ReLER Lab. <br> Wenhao Wang, Yi Yang <br> arXiv, 2024.03 [Paper], [PDF], [Code], [Hugging Face] -
(ECTV) EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Dataset (adding)
<br> Team: Tencent AI Lab, CUHK. <br> Yaofang Liu, Xiaodong Cun, Xuebo Liu, et al., Ying Shan <br> CVPR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Dataset], [Home Page] -
(CV-Text) CelebV-Text: A large-scale facial text-video dataset
Dataset (Domain:Face)
<br> Team: University of Sydney, SenseTime Research. <br> Jianhui Yu, Hao Zhu, Liming Jiang, et al., Wayne Wu <br> CVPR'23, arXiv, 2023.03 [Paper], [PDF], [Code], [Demo], [Home Page] -
(MSR-VTT) MSR-VTT: A large video description dataset for bridging video and language
Dataset (Domain:Open)
<br> Team: Microsoft Research. <br> Jun Xu, Tao Mei, Ting Yao, Yong Rui <br> CVPR'16 [Paper], [PDF] -
(DideMo) Localizing moments in video with natural language
Dataset (Domain:Open)
<br> Team: UC Berkeley, Adobe <br> Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, et al., Bryan Russell <br> ICCV'17, arXiv, 2017.08 [Paper], [PDF] -
(YT-Tem-180M) Merlot: Multimodal neural script knowledge models
Dataset (Domain:Open)
<br> Team: University of Washington <br> Rowan Zellers, Ximing Lu, Jack Hessel, et al., Yejin Choi <br> NeurIPS'21, arXiv, 2021.06 [Paper], [PDF], [Code], [Home Page] -
(WebVid2M) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Dataset (Domain:Open)
<br> Team: University of Oxford, CNRS. <br> Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman <br> ICCV'21, arXiv, 2021.04 [Paper], [PDF], [Dataset], [Code], [Demo], [Home Page] -
(HD-VILA-100M) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Dataset (Domain:Open)
<br> Team: Microsoft Research Asia. <br> Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al., Baining Guo <br> CVPR'22, arXiv, 2021.11 [Paper], [PDF], [Code] -
(InternVid) InternVid: A large-scale video-text dataset for multimodal understanding and generation
Dataset (Domain:Open)
<br> Team: Shanghai AI Laboratory. <br> Yi Wang, Yinan He, Yizhuo Li, et al., Yu Qiao <br> arXiv, 2023.07 [Paper], [PDF], [Code] -
(HD-VG-130M) VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Dataset (Domain:Open)
<br> Team: Peking University, Microsoft Research. <br> Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu <br> arXiv, 2023.05 [Paper], [PDF] -
(Youku-mPLUG) Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
Dataset (Domain:Open)
<br> Team: DAMO Academy, Alibaba Group. <br> Haiyang Xu, Qinghao Ye, Xuan Wu, et al., Fei Huang <br> arXiv, 2023.06 [Paper], [PDF] -
(VAST-27M) VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Dataset (Domain:Open)
<br> Team: UCAS, CAS <br> Sihan Chen, Handong Li, Qunbo Wang, et al., Jing Liu <br> NeurIPS'23, arXiv, 2023.05 [Paper], [PDF] -
(Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Dataset (Domain:Open)
<br> Team: Snap Inc., University of California, University of Trento. <br> Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, et al., Sergey Tulyakov <br> arXiv, 2024.02 [Paper], [PDF], [Code], [Home Page] -
(LSMDC) Movie description
Dataset (Domain:Movie)
<br> Team: Max Planck Institute for Informatics. <br> Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al., Bernt Schiele <br> IJCV'17, arXiv, 2016.05 [Paper], [PDF], [Home Page] -
(MAD) MAD: A scalable dataset for language grounding in videos from movie audio descriptions
Dataset (Domain:Movie)
<br> Team: KAUST, Adobe Research. <br> Mattia Soldan, Alejandro Pardo, Juan León Alcázar, et al., Bernard Ghanem <br> CVPR'22, arXiv, 2021.12 [Paper], [PDF], [Code] -
(UCF-101) UCF101: A dataset of 101 human actions classes from videos in the wild
Dataset (Domain:Action)
<br> Team: University of Central Florida. <br> Khurram Soomro, Amir Roshan Zamir, Mubarak Shah <br> arXiv, 2012.12 [Paper], [PDF], [Data] -
(ActNet-200) ActivityNet: A large-scale video benchmark for human activity understanding
Dataset (Domain:Action)
<br> Team: Universidad del Norte, KAUST <br> Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, Juan Carlos Niebles <br> CVPR'15, [Paper], [PDF], [Home Page] -
(Charades) Hollywood in homes: Crowdsourcing data collection for activity understanding
Dataset (Domain:Action)
<br> Team: Carnegie Mellon University <br> Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, et al., Abhinav Gupta <br> ECCV'16, arXiv, 2016.04, [Paper], [PDF], [Home Page] -
(Kinetics) The Kinetics human action video dataset
Dataset (Domain:Action)
<br> Team: Google <br> Will Kay, Joao Carreira, Karen Simonyan, et al., Andrew Zisserman <br> arXiv, 2017.05, [Paper], [PDF], [Home Page] -
(ActivityNet) Dense-captioning events in videos
Dataset (Domain:Action)
<br> Team: Stanford University <br> Ranjay Krishna, Kenji Hata, Frederic Ren, et al., Juan Carlos Niebles <br> ICCV'17, arXiv, 2017.05, [Paper], [PDF], [Home Page] -
(Charades-Ego) Charades-Ego: A large-scale dataset of paired third and first person videos
Dataset (Domain:Action)
<br> Team: Carnegie Mellon University <br> Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, et al., Karteek Alahari <br> arXiv, 2018.04, [Paper], [PDF], [Home Page] -
(SS-V2) The "something something" video database for learning and evaluating visual common sense
Dataset (Domain:Action)
<br> Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., Roland Memisevic <br> ICCV'17, arXiv, 2017.06 [Paper], [PDF], [Home Page] -
(How2) How2: a large-scale dataset for multimodal language understanding
Dataset (Domain:Instruct)
<br> Team: Carnegie Mellon University. <br> Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al., Florian Metze <br> arXiv, 2018.11 [Paper], [PDF] -
(HowTo100M) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Dataset (Domain:Instruct)
<br> Team: Ecole Normale Superieure, Inria, CIIRC. <br> Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al., Josef Sivic <br> arXiv, 2019.06 [Paper], [PDF], [Home Page] <br> -
(YouCook2) Towards automatic learning of procedures from web instructional video
Dataset (Domain:Cooking)
<br> Team: University of Michigan, University of Rochester <br> Luowei Zhou, Chenliang Xu, Jason J. Corso <br> AAAI'18, arXiv, 2017.03, [Paper], [PDF], [Home Page] -
(Epic-Kitchens) Scaling egocentric vision: The EPIC-KITCHENS dataset
Dataset (Domain:Cooking)
<br> Team: Uni. of Bristol, Uni. of Catania, Uni. of Toronto. <br> Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al., Michael Wray <br> ECCV'18, arXiv, 2018.04, [Paper], [PDF], [Home Page] <br> -
(PSNR/SSIM) Image quality assessment: from error visibility to structural similarity
Metric (image-level)
<br> Team: New York University. <br> Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, Eero P. Simoncelli <br> IEEE TIP, 2004.04 [Paper], [PDF] <br> -
(IS) Improved techniques for training GANs
Metric (image-level)
<br> Team: OpenAI <br> Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al., Xi Chen <br> NeurIPS'16, arXiv, 2016.06, [Paper], [PDF], [Code] <br> -
(FID) GANs trained by a two time-scale update rule converge to a local Nash equilibrium
Metric (image-level)
<br> Team: Johannes Kepler University Linz <br> Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al., Sepp Hochreiter <br> NeurIPS'17, arXiv, 2017.06 [Paper], [PDF] -
(CLIP Score) Learning transferable visual models from natural language supervision
Metric (image-level)
<br> Team: OpenAI. <br> Alec Radford, Jong Wook Kim, Chris Hallacy, et al., Ilya Sutskever <br> ICML'21, arXiv, 2021.02 [Paper], [PDF], [Code] <br> -
(Video IS) Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN
Metric (video-level)
<br> Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi <br> IJCV'20, arXiv, 2018.11 [Paper], [PDF], [Code]<br> -
(FVD/KVD) FVD: A new metric for video generation
Metric (video-level)
<br> Team: Johannes Kepler University, Google <br> Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al., Sylvain Gelly <br> ICLR'19, arXiv, 2018.12 [Paper], [PDF], [Code] -
(FCS) Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation
Metric (video-level)
<br> Team: Show Lab, National University of Singapore. <br> Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou <br> ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model] <br>
Acknowledgement and References
Citation
If you find this repository useful, please consider citing our paper and this list:
@article{sun2024sora,
title={From Sora What We Can See: A Survey of Text-to-Video Generation},
author={Sun, Rui and Zhang, Yumin and Shah, Tejal and Sun, Jiahao and Zhang, Shuoying and Li, Wenqi and Duan, Haoran and Wei, Bo and Ranjan, Rajiv},
journal={arXiv preprint arXiv:2405.10674},
year={2024}
}
@misc{sun2024t2vgenerationlist,
title={Awesome-Text-to-Video-Generation},
author={Sun, Rui and Zhang, Yumin},
year={2024},
publisher={GitHub},
howpublished={\url{https://github.com/soraw-ai/Awesome-Text-to-Video-Generation}},
}