Awesome Text-to-Video Generation
This repository contains a curated list of text-to-video generation papers with their BibTeX entries, covering work up to Dec. 2023.
Paper summary
Name | Date | Affiliation | Train set | Test set | Other experiments |
---|---|---|---|---|---|
GODIVA | 21.04 | Microsoft | HowTo100M | MSR-VTT | user study |
NUWA (website) | ECCV22 | Microsoft | 241k VATEX | Kinetics, MSR-VTT | sketch2video, edit |
Video Diffusion (website) | NIPS22 | | 10M | - | unconditional, longer |
Imagen Video (website) | 22.10 | | 14M | - | - |
MagicVideo (website) | 22.11 | ByteDance | WebVid-10M (+ 10M from HD-VILA-100M + 7M) | UCF-101, MSR-VTT | user study |
LVDM (website, code) | 22.11 | HKUST | 2M from WebVid-10M | UCF-101, Sky Time-lapse, Taichi | unconditional, long |
Make-A-Video (website) | ICLR23 | Meta | WebVid-10M + 10M from HD-VILA-100M | UCF-101, MSR-VTT | user study |
Phenaki (website) | ICLR23 | | ~15M | Kinetics-400 | img conditioned |
CogVideo (demo, website, code) | ICLR23 | THU | 5.4M | UCF-101, Kinetics-600 | user study |
Video LDM (website) | CVPR23 23.04 | NVIDIA | WebVid-10M (+ 683k driving) | UCF-101, MSR-VTT | personalized |
Gen1 (demo, website) | ICCV23 | Runway | 6.4M | - | user study, edit, iv2v, customization |
PYoCo (website) | ICCV23 23.05 | NVIDIA | 22.5M | UCF-101, MSR-VTT | unconditional |
VideoComposer (website, code) | NIPS23 | Alibaba | WebVid-10M | MSR-VTT | compositional i2v, sketch, motion control |
GLOBER (website, code) | NIPS23 | CASIA | WebVid-10M or less | UCF-101, Sky Time-lapse, Taichi, WebVid-10M | unconditional |
VideoFusion | 23.03 | CASIA | WebVid-10M or less | UCF-101, Sky Time-lapse, Taichi, WebVid-10M | unconditional, long |
Latent-Shift (website) | 23.04 | Meta | WebVid-10M | UCF-101, MSR-VTT | user study |
VideoFactory | 23.05 | PKU | HD-VG-130M + WebVid-10M | UCF-101, MSR-VTT, WebVid-10M | user study, personalized |
Make-Your-Video (website, code) | 23.06 | CUHK | WebVid-10M | UCF-101 | depth, re-rendering, user study |
Animate-A-Story (website) | 23.07 | HKUST | WebVid-10M | UCF-101 | storytelling, personalized |
InternVid | ICLR24 23.07 | Shanghai | WebVid-10M + InternVid18M | UCF-101, MSR-VTT | dialogue |
ModelScopeT2V (demo, website) | 23.08 | Alibaba | WebVid-10M | MSR-VTT | - |
Dysen-VDM (website) | 23.08 | NUS | WebVid-10M | UCF-101, MSR-VTT | user study |
VidRD (website, code) | 23.09 | Huawei | WebVid-2M, TGIF, VATEX, Pexels (5.3M) | UCF-101 | - |
LaVie (demo, demo2, website, code) | 23.09 | Shanghai | WebVid-10M + Vimeo25M | UCF-101, MSR-VTT | user study, long, personalized |
Show-1 (demo, demo2, website, code) | 23.09 | NUS | WebVid-10M | UCF-101, MSR-VTT | user study |
VideoCrafter (demo, demo2, website, code) | 23.10 | Tencent | WebVid-10M + 10M | - | user study, img conditioned, i2v |
Emu Video (website) | 23.11 | Meta | 34M | UCF-101 | user study, longer |
SVD (demo, website1, website2, code) | 23.11 | Stability | LVD (580M) / LVD-F (152M) | UCF-101 | i2v, user study, camera motion, multi-view |
PixelDance (website) | 23.11 | ByteDance | WebVid-10M + 500k watermark-free | UCF-101, MSR-VTT | long, sketch instruction, edit |
W.A.L.T (website) | 23.12 | | 89M | UCF-101, Kinetics-600 | class-conditional, i2v |
VideoPoet (website) | 23.12 | | ~270M (100M paired) | UCF-101, MSR-VTT | user study, stylization, edit, i2v, long, camera motion |
Bold dataset indicates zero-shot evaluation.
Models without a technical report, such as Gen-2, Pika 1.0, and zeroscope, are not included.
Bold experiments indicate quantitative evaluation.
VideoComposer (NeurIPS23), PixelDance: 16 frames at 4 fps; VideoPoet: 17 frames at 8 fps; Emu Video: input 8 frames at 4/8 fps, output 37 frames at 16 fps
Zero-shot leaderboard
Name | Date | Data | MSR-VTT CLIPSIM | MSR-VTT FID | MSR-VTT FVD | UCF-101 FID | UCF-101 FVD | UCF-101 IS |
---|---|---|---|---|---|---|---|---|
CogVideo | ICLR23 | | 0.2631 | 23.59 | 1294 | 179.00 | 701.59 | 25.27 |
MagicVideo | 22.11 | 10M | | | 998 | 145.00 | 655.00 | |
LVDM | 22.11 | 2M | 0.2381 | | 742 | | 641.80 | |
VideoFusion | 23.03 | 10M | 0.2795 | | | 75.77 | 639.90 | 17.49 |
Latent-Shift | 23.04 | 10M | 0.2773 | 15.23 | | | | |
VideoCrafter | 23.10 | | 0.2875 | | | 66.95 | 910.87 | 18.26 |
Video LDM | CVPR23 23.04 | 10M | 0.2929 | | | | 550.61 | 33.45 |
VideoComposer | NIPS23 | 10M | 0.2932 | | 580 | | | |
InternVid | ICLR24 23.07 | | 0.2951 | | | 60.25 | 616.51 | 21.04 |
Animate-A-Story | 23.07 | 10M | | | | | 516.15 | |
ModelScopeT2V | 23.08 | 10M | 0.2930 | 11.09 | 550 | | | |
LaVie | 23.09 | | 0.2949 | | | | 526.30 | |
Emu Video | 23.11 | | | | | | 606.20 | 42.70 |
Make-A-Video | ICLR23 | 20M | 0.3049 | 13.17 | | | 367.23 | 33.00 |
VideoFactory | 23.05 | | 0.3005 | | | | 410.00 | |
Show-1 | 23.09 | 10M | 0.3072 | 13.08 | 538 | | 394.46 | 35.42 |
VidRD | 23.09 | 5.3M | | | | | 363.19 | 39.37 |
Dysen-VDM | 23.11 | 10M | 0.3204 | 12.64 | | | 325.42 | 35.57 |
W.A.L.T | 23.12 | | | | | | 258.10 | 35.10 |
VideoPoet | 23.12 | | 0.3049 / 0.3123 | | 213 | | 355.00 | 38.44 |
PYoCo | ICCV23 23.05 | | | 9.73 / 22.14 | | | 355.19 | 47.76 |
Make-Your-Video | 23.06 | 10M | | | | | 330.49 | |
PixelDance | 23.11 | | 0.3125 | | 381 | 49.36 | 242.82 | 42.10 |
SVD | 23.11 | | | | | | 242.02 | |
Bold indicates open-source code or demo release.
Strikethrough indicates private data involved.
Dataset summary
Name | Size | Type | Date | Affiliation |
---|---|---|---|---|
UCF-101 | 13k | class | 2013 | UCF |
MSR-VTT | 10k | text | CVPR16 | Microsoft |
Kinetics | 650k | class | CVPR17 | DeepMind |
HowTo100M | 136M | text | ICCV19 | ENS |
WebVid-10M | 10M | text | ICCV21 | Oxford |
HD-VILA-100M | 103M | text | CVPR22 | Microsoft |
~~HD-VG-130M~~ | 130M | text | 23.05 | Microsoft |
InternVid | 234M (10M) | text | 23.07 | Shanghai AI Lab |
~~Vimeo25M~~ | 25M | text | 23.09 | Shanghai AI Lab |
Strikethrough indicates not yet released.
UCF-101: 320x240 at 25 fps
MSR-VTT: resized to 320x240 at 30 fps
Evaluation protocol
Evaluate CLIPSIM, FID, and FVD on MSR-VTT, and FVD and IS on UCF-101.
- CLIP similarity (CLIPSIM): clipscore (TorchMetrics); backbone: CLIP ViT-B/32, ViT-B/16
- Fréchet inception distance (FID): pytorch-fid (TorchMetrics); backbone: CLIP ViT-B/32
- Fréchet video distance (FVD): TATS (LVDM, stylegan-v); backbone: I3D trained on Kinetics-400
- Inception score (IS): tgan2 (TorchMetrics, stylegan-v); backbone: C3D trained on UCF-101
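
As a rough illustration of what the two simplest metrics above measure, here is a minimal sketch that assumes the backbone outputs are already available: CLIP text and frame embeddings for CLIPSIM, and per-video class probabilities from the classifier for IS. The function names `clipsim` and `inception_score` and the toy inputs are hypothetical and are not taken from any of the implementations listed above, which may differ in frame sampling, data splits, or score scaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clipsim(text_emb, frame_embs):
    """Mean cosine similarity between one prompt embedding and each frame embedding."""
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


def inception_score(probs):
    """exp(mean KL(p(y|x) || p(y))), where each row of probs is p(y|x) for one sample."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(total_kl / n)


# Toy inputs (assumed, for illustration only).
print(clipsim([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))      # approximately 2.0
```

The IS toy case shows the upper bound for two classes: confident predictions spread evenly over both classes give a score close to the number of classes, while identical predictions for every sample give 1.0.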
Number of evaluation samples and backbone used by each method
Name | MSR-VTT CLIPSIM | MSR-VTT FVD | MSR-VTT FID | UCF-101 IS | UCF-101 FVD |
---|---|---|---|---|---|
CogVideo | - | - | - | 10k | 2048 |
Video LDM | 2990 CLIP32 | - | - | 10k | |
VideoComposer | - | - | - | | |
InternVid | 2990 CLIP32 | - | - | 2020 | 2020 |
Make-A-Video | 59794 | - | 59794 | 10k | 10k |
VideoPoet | 59794 CLIP16/CLIP32 | 40960 | - | 10k | 10k training |
PYoCo | - | - | 59794 CLIP32/Inception | 2020 | 2048 |
SVD | - | - | - | - | 13320 script 240x320 |
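
As background for why the table above records sample counts and backbones: FID and FVD are both Fréchet distances between Gaussians fitted to backbone features of real and generated samples (image features for FID, I3D video features for FVD). In the standard formulation, with feature means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ for real and generated data (symbols introduced here for illustration, not taken from this repository):

$$
d^2 = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

The means and covariances are estimated from a finite set of samples, so scores computed with different sample counts or different backbones are generally not directly comparable.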