Awesome Text-to-Video Generation

This repository contains a curated list of text-to-video generation papers and their BibTeX entries (covering work up to Dec. 2023).

Paper summary

| Name | Date | Affiliation | Train set | Test set | Other experiments |
| --- | --- | --- | --- | --- | --- |
| GODIVA | 21.04 | Microsoft | HowTo100M | MSR-VTT | user study |
| NUWA (website) | ECCV22 | Microsoft | 241k VATEX | Kinetics, MSR-VTT | sketch2video, edit |
| Video Diffusion (website) | NIPS22 | Google | 10M | - | unconditional, longer |
| Imagen Video (website) | 22.10 | Google | 14M | - | - |
| MagicVideo (website) | 22.11 | ByteDance | WebVid-10M (+ 10M from HD-VILA-100M + 7M) | UCF-101, MSR-VTT | user study |
| LVDM (website, code) | 22.11 | HKUST | 2M from WebVid-10M | UCF-101, Sky Time-lapse, Taichi | unconditional, long |
| Make-A-Video (website) | ICLR23 | Meta | WebVid-10M + 10M from HD-VILA-100M | UCF-101, MSR-VTT | user study |
| Phenaki (website) | ICLR23 | Google | ~15M | Kinetics-400 | img conditioned |
| CogVideo (demo, website, code) | ICLR23 | THU | 5.4M | UCF-101, Kinetics-600 | user study |
| Video LDM (website) | CVPR23 23.04 | NVIDIA | WebVid-10M (+ 683k driving) | UCF-101, MSR-VTT | personalized |
| Gen1 (demo, website) | ICCV23 | Runway | 6.4M | - | user study, edit, iv2v, customization |
| PYoCo (website) | ICCV23 23.05 | NVIDIA | 22.5M | UCF-101, MSR-VTT | unconditional |
| VideoComposer (website, code) | NIPS23 | Alibaba | WebVid-10M | MSR-VTT | compositional i2v, sketch, motion control |
| GLOBER (website, code) | NIPS23 | CASIA | WebVid-10M or less | UCF-101, Sky Time-lapse, Taichi, WebVid-10M | unconditional |
| VideoFusion | 23.03 | CASIA | WebVid-10M or less | UCF-101, Sky Time-lapse, Taichi, WebVid-10M | unconditional, long |
| Latent-Shift (website) | 23.04 | Meta | WebVid-10M | UCF-101, MSR-VTT | user study |
| VideoFactory | 23.05 | PKU | HD-VG-130M + WebVid-10M | UCF-101, MSR-VTT, WebVid-10M | user study, personalized |
| Make-Your-Video (website, code) | 23.06 | CUHK | WebVid-10M | UCF-101 | depth, re-rendering, user study |
| Animate-A-Story (website) | 23.07 | HKUST | WebVid-10M | UCF-101 | storytelling, personalized |
| InternVid | ICLR24 23.07 | Shanghai | WebVid10M + InternVid18M | UCF-101, MSR-VTT | dialogue |
| ModelScopeT2V (demo, website) | 23.08 | Alibaba | WebVid-10M | MSR-VTT | - |
| Dysen-VDM (website) | 23.08 | NUS | WebVid-10M | UCF-101, MSR-VTT | user study |
| VidRD (website, code) | 23.09 | Huawei | WebVid-2M, TGIF, VATEX, Pexels (5.3M) | UCF-101 | - |
| LaVie (demo, demo2, website, code) | 23.09 | Shanghai | WebVid-10M + Vimeo25M | UCF-101, MSR-VTT | user study, long, personalized |
| Show-1 (demo, demo2, website, code) | 23.09 | NUS | WebVid-10M | UCF-101, MSR-VTT | user study |
| VideoCrafter (demo, demo2, website, code) | 23.10 | Tencent | WebVid-10M + 10M | - | user study, img conditioned, i2v |
| Emu Video (website) | 23.11 | Meta | 34M | UCF-101 | user study, longer |
| SVD (demo, website1, website2, code) | 23.11 | Stability | LVD (580M) / LVD-F (152M) | UCF-101 | i2v, user study, camera motion, multi-view |
| PixelDance (website) | 23.11 | ByteDance | WebVid-10M + 500k watermark-free | UCF-101, MSR-VTT | long, sketch instruction, edit |
| W.A.L.T (website) | 23.12 | Google | 89M | UCF-101, Kinetics-600 | class-conditional, i2v |
| VideoPoet (website) | 23.12 | Google | ~270M (100M paired) | UCF-101, MSR-VTT | user study, stylization, edit, i2v, long, camera motion |

A bold dataset indicates zero-shot evaluation.

Models without a technical report, such as Gen-2, Pika 1.0, and zeroscope, are not included.

A bold experiment indicates that it was evaluated quantitatively.

VideoComposer (NeurIPS23), PixelDance: 4 fps, 16 frames; VideoPoet: 8 fps, 17 frames; Emu Video: input 4/8 fps with 8 frames, output 16 fps with 37 frames
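To put the frame-rate settings above on a common scale, the clip length in seconds can be estimated as frames divided by fps. The sketch below is only illustrative: the helper name `clip_duration_seconds` is hypothetical, and it assumes every frame is shown for exactly 1/fps seconds.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Approximate clip length, assuming each frame lasts 1 / fps seconds."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# (frames, fps) pairs taken from the note above.
settings = {
    "VideoComposer / PixelDance": (16, 4),
    "VideoPoet": (17, 8),
    "Emu Video (output)": (37, 16),
}

for name, (frames, fps) in settings.items():
    print(f"{name}: {clip_duration_seconds(frames, fps):.3f} s")
```

Under this assumption the three settings cover roughly 4 s, 2.1 s, and 2.3 s of video respectively.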

Zero-shot leaderboard

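| Name | Date | Data | MSR-VTT CLIPSIM | MSR-VTT FID | MSR-VTT FVD | UCF-101 FID | UCF-101 FVD | UCF-101 IS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CogVideo | ICLR23 | 5.4M | 0.2631 | 23.59 | 1294 | 179.00 | 701.59 | 25.27 |
| MagicVideo | 22.11 | 10M | | | 998 | 145.00 | 655.00 | |
| LVDM | 22.11 | 2M | 0.2381 | | 742 | | 641.80 | |
| VideoFusion | 23.03 | 10M | 0.2795 | | | 75.77 | 639.90 | 17.49 |
| Latent-Shift | 23.04 | 10M | 0.2773 | 15.23 | | | | |
| VideoCrafter | 23.10 | 20M | 0.2875 | | | 66.95 | 910.87 | 18.26 |
| Video LDM | CVPR23 23.04 | 10M | 0.2929 | | | | 550.61 | 33.45 |
| VideoComposer | NIPS23 | 10M | 0.2932 | | 580 | | | |
| InternVid | ICLR24 23.07 | 28M | 0.2951 | | | 60.25 | 616.51 | 21.04 |
| Animate-A-Story | 23.07 | 10M | | | | | 516.15 | |
| ModelScopeT2V | 23.08 | 10M | 0.2930 | 11.09 | 550 | | | |
| LaVie | 23.09 | 35M | 0.2949 | | | | 526.30 | |
| Emu Video | 23.11 | 34M | | | | | 606.20 | 42.70 |
| Make-A-Video | ICLR23 | 20M | 0.3049 | 13.17 | | | 367.23 | 33.00 |
| VideoFactory | 23.05 | 140M | 0.3005 | | | | 410.00 | |
| Show-1 | 23.09 | 10M | 0.3072 | 13.08 | 538 | | 394.46 | 35.42 |
| VidRD | 23.09 | 5.3M | | | | | 363.19 | 39.37 |
| Dysen-VDM | 23.11 | 10M | 0.3204 | 12.64 | | | 325.42 | 35.57 |
| W.A.L.T | 23.12 | 89M | | | | | 258.10 | 35.10 |
| VideoPoet | 23.12 | 270M | 0.3049 / 0.3123 | | 213 | | 355.00 | 38.44 |
| PYoCo | ICCV23 23.05 | 22.5M | | 9.73 / 22.14 | | | 355.19 | 47.76 |
| Make-Your-Video | 23.06 | 10M | | | | | 330.49 | |
| PixelDance | 23.11 | 10.5M | 0.3125 | | 381 | 49.36 | 242.82 | 42.10 |
| SVD | 23.11 | 152M | | | | | 242.02 | |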

Bold indicates that open-source code or a demo has been released.

Strikethrough indicates that private data is involved.

Dataset summary

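| Name | Size | Type | Date | Affiliation |
| --- | --- | --- | --- | --- |
| UCF-101 | 13k | class | 2013 | UCF |
| MSR-VTT | 10K | text | CVPR16 | Microsoft |
| Kinetics | 650k | class | CVPR17 | DeepMind |
| HowTo100M | 136M | text | ICCV19 | ENS |
| WebVid-10M | 10M | text | ICCV21 | Oxford |
| HD-VILA-100M | 103M | text | CVPR22 | Microsoft |
| HD-VG-130M | 130M | text | 23.05 | Microsoft |
| InternVid | 234M (10M) | text | 23.07 | Shanghai AI Lab |
| Vimeo25M | 25M | text | 23.09 | Shanghai AI Lab |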

Strikethrough indicates that the dataset has not yet been released.

UCF-101: 320x240 25fps

MSR-VTT: resize to 320x240 30fps

Evaluation protocol

Evaluate CLIPSIM, FID, and FVD on MSR-VTT, and FVD and IS on UCF-101.
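As a rough illustration of the CLIPSIM metric named above: it is commonly computed as the cosine similarity between the CLIP text embedding of the prompt and the CLIP image embedding of each generated frame, averaged over frames and then over videos. The sketch below assumes the embeddings have already been extracted with some CLIP backbone. The helper names `cosine_similarity`, `clipsim`, and `dataset_clipsim` are hypothetical, and individual papers may differ in backbone, frame sampling, or scaling.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clipsim(text_embedding: Vector, frame_embeddings: Sequence[Vector]) -> float:
    """Mean text-frame similarity for one generated video."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    scores = [cosine_similarity(text_embedding, f) for f in frame_embeddings]
    return sum(scores) / len(scores)


def dataset_clipsim(samples: Sequence[tuple[Vector, Sequence[Vector]]]) -> float:
    """Mean per-video CLIPSIM over (text_embedding, frame_embeddings) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(clipsim(text, frames) for text, frames in samples) / len(samples)
```

In practice the embeddings would come from a CLIP text and image encoder. The evaluation table in this section records which backbone (for example CLIP16 or CLIP32) and how many samples each paper used, which is worth checking before comparing numbers.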

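Number of evaluation samples and feature backbone used for each metric:

| Name | MSR-VTT CLIPSIM | MSR-VTT FVD | MSR-VTT FID | UCF-101 IS | UCF-101 FVD |
| --- | --- | --- | --- | --- | --- |
| CogVideo | - | - | - | 10k | 2048 |
| Video LDM | 2990 CLIP32 | - | - | 10k | |
| VideoComposer | - | - | - | | |
| InternVid | 2990 CLIP32 | - | - | 2020 | 2020 |
| Make-A-Video | 59794 | - | 59794 | 10k | 10k |
| VideoPoet | 59794 CLIP16/CLIP32 | 40960 | - | 10k | 10k training |
| PYoCo | - | - | 59794 CLIP32/Inception | 2020 | 2048 |
| SVD | - | - | - | - | 13320 script 240x320 |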
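The sample counts in the UCF-101 IS column matter because the Inception Score is an estimate over the generated set: IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's class distribution for one generated video and p(y) is the average of those distributions. The sketch below is a minimal single-split version that assumes the per-video class probabilities are already available. The function name `inception_score` is hypothetical, and the choice of classifier and any averaging over splits are left out.

```python
import math
from typing import Sequence


def inception_score(probs: Sequence[Sequence[float]]) -> float:
    """exp(mean KL(p(y|x) || p(y))) over rows of class probabilities."""
    if not probs:
        raise ValueError("need at least one probability vector")
    n = len(probs)
    num_classes = len(probs[0])
    if any(len(p) != num_classes for p in probs):
        raise ValueError("all probability vectors must have the same length")

    # Marginal class distribution p(y): the mean of the per-sample predictions.
    marginal = [sum(p[c] for p in probs) / n for c in range(num_classes)]

    total_kl = 0.0
    for p in probs:
        for c in range(num_classes):
            # Terms with p[c] == 0 contribute nothing to the KL divergence;
            # p[c] > 0 also guarantees marginal[c] > 0.
            if p[c] > 0.0:
                total_kl += p[c] * math.log(p[c] / marginal[c])
    return math.exp(total_kl / n)
```

With uniform predictions the score is 1, and with confident predictions spread evenly over K classes it approaches K, so the score is bounded above by the number of classes.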