
MiniSora Community


<!-- PROJECT LOGO --> <div align="center"> <img src="assets/logo.jpg" width="600"/> <div>&nbsp;</div> <div align="center"> </div> </div> <div align="center">

English | 简体中文

</div> <p align="center"> 👋 join us on <a href="https://cdn.vansin.top/minisora.jpg" target="_blank">WeChat</a> </p>

The MiniSora open-source community is a community-driven initiative organized spontaneously by its members. It aims to explore the implementation path and future development direction of Sora.


Reproduction Group of MiniSora Community

Sora Reproduction Goals of MiniSora

  1. GPU-Friendly: Ideally, it should have low requirements for GPU memory and GPU count, e.g., trainable and inferable with compute power such as 8× A100 80G, 8× A6000 48G, or RTX 4090 24G cards.
  2. Training-Efficiency: It should achieve good results without requiring extensive training time.
  3. Inference-Efficiency: Generated videos need not be long or high-resolution; 3-10 seconds in length at 480p is acceptable (see the token-count sketch below).
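To make the inference target concrete, here is a back-of-envelope sketch (ours, not an official MiniSora calculation) of how many latent tokens a DiT-style backbone would process for a single clip at these settings; the 16:9 width, frame rate, VAE downsampling factor, and patch size are assumed illustrative values.

```python
# Rough token budget for one clip at the inference target above.
# All factors below are illustrative assumptions, not MiniSora settings.

def latent_token_count(height: int = 480, width: int = 854, seconds: int = 10,
                       fps: int = 8, vae_down: int = 8, patch: int = 2) -> int:
    """Spatio-temporal tokens a DiT-style transformer attends over per clip."""
    frames = seconds * fps                # assumed sampling rate: 8 fps
    h = height // vae_down // patch       # patch grid along height
    w = width // vae_down // patch        # patch grid along width
    return frames * h * w

print(latent_token_count())  # 80 * 30 * 53 = 127,200 tokens under these assumptions
```

Even this modest setting yields a sequence on the order of 127K tokens, which is why the goals above emphasize memory-friendly training and short 480p clips.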

MiniSora-DiT: Reproducing the DiT Paper with XTuner

https://github.com/mini-sora/minisora-DiT

Requirements

We are recruiting MiniSora Community contributors to reproduce DiT using XTuner.

We hope that community members have the following qualifications:

  1. Familiarity with the OpenMMLab MMEngine mechanism.
  2. Familiarity with DiT.

Background

  1. The DiT paper shares an author with Sora: William Peebles, first author of DiT, co-leads the Sora project.
  2. XTuner provides core technology for efficiently training on sequences of up to 1000K tokens.

Support

  1. Computational resources: 2× A100 GPUs.
  2. Strong support from XTuner core developer @pppppM.

Recent Round-Table Discussions

Paper Interpretation of Stable Diffusion 3: MM-DiT

Speaker: MMagic Core Contributors

Live Streaming Time: 03/12 20:00

Highlights: MMagic core contributors will lead an interpretation of the Stable Diffusion 3 paper, discussing its architectural details and design principles.

PPT: FeiShu Link

<!-- Please scan the QR code with WeChat to book a live video session. <div align="center"> <img src="assets/SD3论文领读.png" width="100"/> <div>&nbsp;</div> <div align="center"> </div> </div> -->

Highlights from Previous Discussions

Night Talk with Sora: Video Diffusion Overview

ZhiHu Notes: An Overview of Generative Diffusion Models (notes on the paper "A Survey on Generative Diffusion Model")

Paper Reading Program

Recruitment of Presenters

Related Work

<h3 id="diffusion-models">01 Diffusion Models</h3>

| Paper | Link |
| --- | --- |
| 1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis | NeurIPS 21 Paper, GitHub |
| 2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | CVPR 22 Paper, GitHub |
| 3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models | NeurIPS 22 Paper, GitHub |
| 4) DDPM: Denoising Diffusion Probabilistic Models | NeurIPS 20 Paper, GitHub |
| 5) DDIM: Denoising Diffusion Implicit Models | ICLR 21 Paper, GitHub |
| 6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations | ICLR 21 Paper, GitHub, Blog |
| 7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | ICLR 24 Paper, GitHub, Blog |
| 8) Diffusion Models in Vision: A Survey | TPAMI 23 Paper, GitHub |
| 9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models | ICML 21 Paper, GitHub |
| 10) Classifier-free diffusion guidance | NeurIPS 21 Paper |
| 11) Glide: Towards photorealistic image generation and editing with text-guided diffusion models | Paper, GitHub |
| 12) VQ-DDM: Global Context with Discrete Diffusion in Vector Quantised Modelling for Image Generation | CVPR 22 Paper, GitHub |
| 13) Diffusion Models for Medical Anomaly Detection | Paper, GitHub |
| 14) Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems | Paper |
| 15) DiffusionDet: Diffusion Model for Object Detection | ICCV 23 Paper, GitHub |
| 16) Label-efficient semantic segmentation with diffusion models | ICLR 22 Paper, GitHub, Project |

<h3 id="diffusion-transformer">02 Diffusion Transformer</h3>

| Paper | Link |
| --- | --- |
| 1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models | CVPR 23 Paper, GitHub, ModelScope |
| 2) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, GitHub, Project, ModelScope |
| 3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | ArXiv 23, GitHub, ModelScope |
| 4) FiT: Flexible Vision Transformer for Diffusion Model | ArXiv 24, GitHub |
| 5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | ArXiv 24, GitHub |
| 6) Large-DiT: Large Diffusion Transformer | GitHub |
| 7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks | ArXiv 24, GitHub |
| 8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Paper, Blog |
| 9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | ArXiv 24, Project |
| 10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | ArXiv 23, GitHub, ModelScope |
| 11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model | ArXiv 24 |
| 12) Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers | ArXiv 24, GitHub |
| 13) DDM: Deconstructing Denoising Diffusion Models for Self-Supervised Learning | ArXiv 24 |
| 14) Autoregressive Image Generation without Vector Quantization | ArXiv 24, GitHub |
| 15) Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ArXiv 24 |

<h3 id="baseline-video-generation-models">03 Baseline Video Generation Models</h3>

| Paper | Link |
| --- | --- |
| 1) ViViT: A Video Vision Transformer | ICCV 21 Paper, GitHub |
| 2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | CVPR 23 Paper |
| 3) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
| 4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | ArXiv 23, GitHub |
| 5) Latte: Latent Diffusion Transformer for Video Generation | ArXiv 24, GitHub, Project, ModelScope |

<h3 id="diffusion-unet">04 Diffusion UNet</h3>

| Paper | Link |
| --- | --- |
| 1) Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, GitHub, Project |
| 2) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | ArXiv 24, GitHub |

<h3 id="video-generation">05 Video Generation</h3>

| Paper | Link |
| --- | --- |
| 1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | ICLR 24 Paper, GitHub, ModelScope |
| 2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | ArXiv 23, GitHub, ModelScope |
| 3) Imagen Video: High Definition Video Generation with Diffusion Models | ArXiv 22 |
| 4) MoCoGAN: Decomposing Motion and Content for Video Generation | CVPR 18 Paper |
| 5) Adversarial Video Generation on Complex Datasets | Paper |
| 6) W.A.L.T: Photorealistic Video Generation with Diffusion Models | ArXiv 23, Project |
| 7) VideoGPT: Video Generation using VQ-VAE and Transformers | ArXiv 21, GitHub |
| 8) Video Diffusion Models | ArXiv 22, GitHub, Project |
| 9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 22 Paper, GitHub, Project, Blog |
| 10) VideoPoet: A Large Language Model for Zero-Shot Video Generation | ArXiv 23, Project, Blog |
| 11) MAGVIT: Masked Generative Video Transformer | CVPR 23 Paper, GitHub, Project, Colab |
| 12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | ArXiv 24, GitHub, Project |
| 13) SimDA: Simple Diffusion Adapter for Efficient Video Generation | Paper, GitHub, Project |
| 14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
| 15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Paper, GitHub |
| 16) ADD: Adversarial Diffusion Distillation | Paper, GitHub |
| 17) GenTron: Diffusion Transformers for Image and Video Generation | CVPR 24 Paper, Project |
| 18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | CVPR 23 Paper, GitHub |
| 19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models | ArXiv 23, GitHub |
| 20) TGAN-ODE: Latent Neural Differential Equations for Video Generation | Paper, GitHub |
| 21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | ArXiv 23, GitHub |
| 22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | ArXiv 24, GitHub |
| 23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation | ArXiv 22, GitHub |
| 24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | ArXiv 23, GitHub, Project |
| 25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | ICCV 23 Paper, Project |
| 26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | CVPR 23 Paper |
| 27) Movie Gen: A Cast of Media Foundation Models | Paper, Project |

<h3 id="dataset">06 Dataset</h3>
<h4 id="dataset_paper">6.1 Public Datasets</h4>

| Dataset Name - Paper | Link |
| --- | --- |
| 1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers<br><small>70M Clips, 720P, Downloadable</small> | CVPR 24 Paper, GitHub, Project, ModelScope |
| 2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation<br><small>10M Clips, 720P, Downloadable</small> | ArXiv 24, GitHub |
| 3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset<br><small>70K Clips, 720P, Downloadable</small> | CVPR 23 Paper, GitHub, Project |
| 4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation<br><small>130M Clips, 720P, Downloadable</small> | ArXiv 23, GitHub, Tool |
| 5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions<br><small>100M Clips, 720P, Downloadable</small> | CVPR 22 Paper, GitHub |
| 6) VideoCC - Learning Audio-Video Modalities from Image Captions<br><small>10.3M Clips, 720P, Downloadable</small> | ECCV 22 Paper, GitHub |
| 7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models<br><small>180M Clips, 480P, Downloadable</small> | NeurIPS 21 Paper, GitHub, Project |
| 8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips<br><small>136M Clips, 240P, Downloadable</small> | ICCV 19 Paper, GitHub, Project |
| 9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild<br><small>13K Clips, 240P, Downloadable</small> | CVPR 12 Paper, Project |
| 10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation<br><small>122K Clips, 240P, Downloadable</small> | ACL 11 Paper, Project |
| 11) Fashion-Text2Video - A human video dataset with rich label and text annotations<br><small>600 Videos, 480P, Downloadable</small> | ArXiv 23, Project |
| 12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M<br><small>5B Pairs, Downloadable</small> | NeurIPS 22 Paper, Project |
| 13) ActivityNet Captions - 20K videos amounting to 849 video hours with 100K total descriptions, each with a unique start and end time<br><small>20K Videos, Downloadable</small> | ArXiv 17 Paper, Project |
| 14) MSR-VTT - A large-scale video benchmark for video understanding<br><small>10K Clips, Downloadable</small> | CVPR 16 Paper, Project |
| 15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling<br><small>Downloadable</small> | ArXiv 16 Paper, Project |
| 16) Youku-mPLUG - First open-source large-scale Chinese video-text dataset<br><small>Downloadable</small> | ArXiv 23, Project, ModelScope |
| 17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models<br><small>6.69M, Downloadable</small> | ArXiv 24, GitHub |
| 18) Pixabay100 - A video dataset collected from Pixabay<br><small>Downloadable</small> | GitHub |
| 19) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from stock footage sites<br><small>10M video-text pairs</small> | ArXiv 21, Project, ModelScope |
| 20) MiraData (Mini-Sora Data) - A Large-Scale Video Dataset with Long Durations and Structured Captions<br><small>Long Durations and Structured Captions</small> | GitHub, Project |
| 21) IDForge - A video dataset featuring scenes of people speaking<br><small>300K Clips, Downloadable</small> | ArXiv 24, GitHub |

<h4 id="video_aug">6.2 Video Augmentation Methods</h4>
<h5 id="video_aug_basic">6.2.1 Basic Transformations</h5>

| Paper | Link |
| --- | --- |
| Three-stream CNNs for action recognition | PRL 17 Paper |
| Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | EL 19 Paper |
| Intra-clip Aggregation for Video Person Re-identification | ICIP 20 Paper |
| VideoMix: Rethinking Data Augmentation for Video Classification | CVPR 20 Paper |
| mixup: Beyond Empirical Risk Minimization | ICLR 18 Paper |
| CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | ICCV 19 Paper |
| Video Salient Object Detection via Fully Convolutional Networks | ICIP 18 Paper |
| Illumination-Based Data Augmentation for Robust Background Subtraction | SKIMA 19 Paper |
| Image editing-based data augmentation for illumination-insensitive background subtraction | EIM 20 Paper |

<h5 id="video_aug_feature">6.2.2 Feature Space</h5>

| Paper | Link |
| --- | --- |
| Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | ACM 18 Paper |
| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 21 Paper |

<h5 id="video_aug_gan">6.2.3 GAN-based Augmentation</h5>

| Paper | Link |
| --- | --- |
| Deep Video-Based Performance Cloning | CVPR 18 Paper |
| Adversarial Action Data Augmentation for Similar Gesture Action Recognition | IJCNN 19 Paper |
| Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | MM 20 Paper |
| GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer (-) | Trans 20 Paper |
| Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | TPAMI 20 Paper |
| CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | TPAMI 22 Paper |

<h5 id="video_aug_ed">6.2.4 Encoder/Decoder Based</h5>

| Paper | Link |
| --- | --- |
| Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | ECCV 20 Paper |
| Autoencoder-based Data Augmentation for Deepfake Detection | ACM 23 Paper |

<h5 id="video_aug_simulation">6.2.5 Simulation</h5>

| Paper | Link |
| --- | --- |
| A data augmentation methodology for training machine/deep learning gait recognition algorithms | CVPR 16 Paper |
| ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications | IEEE 21 Paper |
| Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | CVPR 19 Paper |
| Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | IJCV 19 Paper |
| Using synthetic data for person tracking under adverse weather conditions | IVC 21 Paper |
| Unlimited Road-scene Synthetic Annotation (URSA) Dataset | ITSC 18 Paper |
| SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | CVPR 21 Paper |
| Universal Semantic Segmentation for Fisheye Urban Driving Images | SMC 20 Paper |

<h3 id="patchifying-methods">07 Patchifying Methods</h3>

| Paper | Link |
| --- | --- |
| 1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ICLR 21 Paper, GitHub |
| 2) MAE: Masked Autoencoders Are Scalable Vision Learners | CVPR 22 Paper, GitHub |
| 3) ViViT: A Video Vision Transformer (-) | ICCV 21 Paper, GitHub |
| 4) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
| 5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) | CVPR 23 Paper, GitHub, ModelScope |
| 6) FlexiViT: One Model for All Patch Sizes | Paper, GitHub |
| 7) Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | ArXiv 23, GitHub |
| 8) VQ-VAE: Neural Discrete Representation Learning | Paper, GitHub |
| 9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis (-) | CVPR 21 Paper, GitHub |
| 10) LVT: Latent Video Transformer | Paper, GitHub |
| 11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) | ArXiv 21, GitHub |
| 12) Predicting Video with VQVAE | ArXiv 21 |
| 13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ICLR 23 Paper, GitHub |
| 14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ECCV 22 Paper, GitHub |
| 15) MAGVIT: Masked Generative Video Transformer (-) | CVPR 23 Paper, GitHub, Project, Colab |
| 16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR 24 Paper, GitHub |
| 17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
| 18) CLIP: Learning Transferable Visual Models From Natural Language Supervision | ICML 21 Paper, GitHub |
| 19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ArXiv 22, GitHub |
| 20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ArXiv 23, GitHub |

<h3 id="long-context">08 Long-context</h3>

| Paper | Link |
| --- | --- |
| 1) World Model on Million-Length Video And Language With RingAttention | ArXiv 24, GitHub |
| 2) Ring Attention with Blockwise Transformers for Near-Infinite Context | ArXiv 23, GitHub |
| 3) Extending LLMs' Context Window with 100 Samples | ArXiv 24, GitHub |
| 4) Efficient Streaming Language Models with Attention Sinks | ICLR 24 Paper, GitHub |
| 5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | Paper |
| 6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | CVPR 24 Paper, GitHub, Project |
| 7) MemoryBank: Enhancing Large Language Models with Long-Term Memory | Paper, GitHub |

<h3 id="audio-related-resource">09 Audio Related Resource</h3>

| Paper | Link |
| --- | --- |
| 1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | ArXiv 24, GitHub, Blog |
| 2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | CVPR 23 Paper, GitHub |
| 3) Pengi: An Audio Language Model for Audio Tasks | NeurIPS 23 Paper, GitHub |
| 4) VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | NeurIPS 23 Paper, GitHub |
| 5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | ArXiv 23, GitHub |
| 6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | TPAMI 24 Paper, GitHub |
| 7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | ICLR 24 Paper, GitHub |
| 8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation | ArXiv 23, GitHub |
| 9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation | TASLP 22 Paper |
| 10) AudioGen: Textually Guided Audio Generation | ICLR 23 Paper, Project |
| 11) AudioLDM: Text-to-audio generation with latent diffusion models | ICML 23 Paper, GitHub, Project, Huggingface |
| 12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining | ArXiv 23, GitHub, Project, Huggingface |
| 13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ICML 23 Paper, GitHub |
| 14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | ArXiv 23 |
| 15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | ArXiv 23, GitHub, Project, Huggingface |
| 16) AudioLM: a Language Modeling Approach to Audio Generation | ArXiv 22 |
| 17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | ArXiv 23, GitHub |
| 18) MusicGen: Simple and Controllable Music Generation | NeurIPS 23 Paper, GitHub |
| 19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | ArXiv 23 |
| 20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 24 Paper |
| 21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | EMNLP 23 Paper |
| 22) Audio-Visual LLM for Video Understanding | ArXiv 23 |
| 23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | ArXiv 23, Project, Blog |
| 24) Movie Gen: A Cast of Media Foundation Models (-) | Paper, Project |

<h3 id="consistency">10 Consistency</h3>

| Paper | Link |
| --- | --- |
| 1) Consistency Models | Paper, GitHub |
| 2) Improved Techniques for Training Consistency Models | ArXiv 23 |
| 3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-) | ICLR 21 Paper, GitHub, Blog |
| 4) Improved Techniques for Training Score-Based Generative Models | NeurIPS 20 Paper, GitHub |
| 5) Generative Modeling by Estimating Gradients of the Data Distribution | NeurIPS 19 Paper, GitHub |
| 6) Maximum Likelihood Training of Score-Based Diffusion Models | NeurIPS 21 Paper, GitHub |
| 7) Layered Neural Atlases for Consistent Video Editing | TOG 21 Paper, GitHub, Project |
| 8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing (-) | ICCV 23 Paper, GitHub, Project |
| 9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | Paper, GitHub, Project |
| 10) Sora Generates Videos with Stunning Geometrical Consistency | Paper, GitHub, Project |
| 11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | ECCV 22 Paper, GitHub |
| 12) Bootstrap Motion Forecasting With Self-Consistent Constraints | ICCV 23 Paper |
| 13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | Paper |
| 14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | CVPRW 23 Paper, GitHub |
| 15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | ArXiv 21 |
| 16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | TCSVT 23 Paper |
| 17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | CVPRW 19 Paper |
| 18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | ArXiv 23 |
| 19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) | ArXiv 24 |
| 20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask | ArXiv 23 |

<h3 id="prompt-engineering">11 Prompt Engineering</h3>

| Paper | Link |
| --- | --- |
| 1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | ArXiv 24, GitHub, Project |
| 2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs | ArXiv 24, GitHub |
| 3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | TMLR 23 Paper, GitHub |
| 4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | ICLR 24 Paper, GitHub |
| 5) Progressive Text-to-Image Diffusion with Soft Latent Direction | ArXiv 23 |
| 6) Self-correcting LLM-controlled Diffusion Models | CVPR 24 Paper, GitHub |
| 7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation | MM 23 Paper |
| 8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | NeurIPS 23 Paper, GitHub |
| 9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition | ArXiv 24, GitHub |
| 10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | ArXiv 23, GitHub |
| 11) Controllable Text-to-Image Generation with GPT-4 | ArXiv 23 |
| 12) LLM-grounded Video Diffusion Models | ICLR 24 Paper |
| 13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | ArXiv 23 |
| 14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | ArXiv 23, GitHub, Project |
| 15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | ArXiv 24 |
| 16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | NeurIPS 23 Paper |
| 17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | ArXiv 23 |
| 18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | ArXiv 23 |
| 19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | ArXiv 23 |
| 20) Multimodal Procedural Planning via Dual Text-Image Prompting | ArXiv 23, GitHub |
| 21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | ICLR 24 Paper, GitHub |
| 22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | ArXiv 23 |
| 23) TaleCrafter: Interactive Story Visualization with Multiple Characters | SIGGRAPH Asia 23 Paper |
| 24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | ArXiv 23, GitHub |
| 25) COLE: A Hierarchical Generation Framework for Graphic Design | ArXiv 23 |
| 26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | ArXiv 23 |
| 27) Vlogger: Make Your Dream A Vlog | CVPR 24 Paper, GitHub |
| 28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | Paper |
| 29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | ArXiv 24 |

<h4 id="theoretical-foundations-and-model-architecture">Recaption</h4>
PaperLink
1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion ModelsArXiv 23, GitHub
2) Reuse and Diffuse: Iterative Denoising for Text-to-Video GenerationArXiv 23, GitHub
3) CoCa: Contrastive Captioners are Image-Text Foundation ModelsArXiv 22, Github
4) CogView3: Finer and Faster Text-to-Image Generation via Relay DiffusionArXiv 24
5) VideoChat: Chat-Centric Video UnderstandingCVPR 24 Paper, Github
6) De-Diffusion Makes Text a Strong Cross-Modal InterfaceArXiv 23
7) HowToCaption: Prompting LLMs to Transform Video Annotations at ScaleArXiv 23
8) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated DataArXiv 24
9) LLMGA: Multimodal Large Language Model based Generation AssistantArXiv 23, Github
10) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic AlignmentArXiv 24, Github
11) MyVLM: Personalizing VLMs for User-Specific QueriesArXiv 24
12) A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image GenerationArXiv 23, Github
13) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs(-)ArXiv 24, Github
14) FlexCap: Generating Rich, Localized, and Flexible Captions in ImagesArXiv 24
15) Video ReCap: Recursive Captioning of Hour-Long VideosArXiv 24, Github
16) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and GenerationICML 22, Github
17) PromptCap: Prompt-Guided Task-Aware Image CaptioningICCV 23, Github
18) CIC: A framework for Culturally-aware Image CaptioningArXiv 24
19) Improving Image Captioning Descriptiveness by Ranking and LLM-based FusionArXiv 24
20) FuseCap: Leveraging Large Language Models for Enriched Fused Image CaptionsWACV 24, Github
<h3 id="security">12 Security</h3>

| Paper | Link |
| --- | --- |
| 1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, GitHub |
| 2) LIMA: Less Is More for Alignment | NeurIPS 23 Paper |
| 3) Jailbroken: How Does LLM Safety Training Fail? | NeurIPS 23 Paper |
| 4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 23 Paper |
| 5) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
| 6) Ablating concepts in text-to-image diffusion models | ICCV 23 Paper |
| 7) Diffusion art or digital forgery? Investigating data replication in diffusion models | ICCV 23 Paper, Project |
| 8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks | ICCV 20 Paper |
| 9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | ICML 20 Paper |
| 10) A pilot study of query-free adversarial attack against stable diffusion | ICCV 23 Paper |
| 11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | ICCV 23 Paper |
| 12) Erasing Concepts from Diffusion Models | ICCV 23 Paper, Project |
| 13) Ablating Concepts in Text-to-Image Diffusion Models (-) | ICCV 23 Paper, Project |
| 14) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset (-) | NeurIPS 23 Paper, Project |
| 15) Stable Bias: Evaluating Societal Representations in Diffusion Models (-) | NeurIPS 23 Paper |
| 16) Threat Model-Agnostic Adversarial Defense using Diffusion Models | Paper |
| 17) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? | Paper, GitHub |
| 18) Differentially Private Diffusion Models Generate Useful Synthetic Images | Paper |
| 19) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | SIGSAC 23 Paper, GitHub |
| 20) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | Paper, GitHub |
| 21) Unified Concept Editing in Diffusion Models | WACV 24 Paper, Project |
| 22) Diffusion Model Alignment Using Direct Preference Optimization | ArXiv 23 |
| 23) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | TMLR 23 Paper, GitHub |
| 24) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | Paper, GitHub, Project |

<h3 id="world-model">13 World Model</h3>

| Paper | Link |
| --- | --- |
| 1) NExT-GPT: Any-to-Any Multimodal LLM | ArXiv 23, GitHub |

<h3 id="video-compression">14 Video Compression</h3>

| Paper | Link |
| --- | --- |
| 1) H.261: Video codec for audiovisual services at p x 64 kbit/s | Paper |
| 2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video | Paper |
| 3) H.263: Video coding for low bit rate communication | Paper |
| 4) H.264: Overview of the H.264/AVC video coding standard | Paper |
| 5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard | Paper |
| 6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications | Paper |
| 7) DVC: An End-to-end Deep Video Compression Framework | CVPR 19 Paper, GitHub |
| 8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method | Paper, GitHub |
| 9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | CVPR 20 Paper, GitHub |
| 10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | J-STSP 21 Paper, GitHub |
| 11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN | IJCAI 22 Paper, GitHub |
| 12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction | T-CSVT 22 Paper, GitHub |
| 13) DCVC: Deep Contextual Video Compression | NeurIPS 21 Paper, GitHub |
| 14) DCVC-TCM: Temporal Context Mining for Learned Video Compression | TM 22 Paper, GitHub |
| 15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | MM 22 Paper, GitHub |
| 16) DCVC-DC: Neural Video Compression with Diverse Contexts | CVPR 23 Paper, GitHub |
| 17) DCVC-FM: Neural Video Compression with Feature Modulation | CVPR 24 Paper, GitHub |
| 18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression | CVPR 20 Paper, GitHub |

<h3 id="Mamba">15 Mamba</h3>
<h4 id="theoretical-foundations-and-model-architecture">15.1 Theoretical Foundations and Model Architecture</h4>
PaperLink
1) Mamba: Linear-Time Sequence Modeling with Selective State SpacesArXiv 23, Github
2) Efficiently Modeling Long Sequences with Structured State SpacesICLR 22 Paper, Github
3) Modeling Sequences with Structured State SpacesPaper
4) Long Range Language Modeling via Gated State SpacesArXiv 22, GitHub
<h4 id="image-generation-and-visual-applications">15.2 Image Generation and Visual Applications</h4>

| Paper | Link |
| --- | --- |
| 1) Diffusion Models Without Attention | ArXiv 23 |
| 2) Pan-Mamba: Effective Pan-Sharpening with State Space Model | ArXiv 24, GitHub |
| 3) Pretraining Without Attention | ArXiv 22, GitHub |
| 4) Block-State Transformers | NeurIPS 23 Paper |
| 5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model | ArXiv 24, GitHub |
| 6) VMamba: Visual State Space Model | ArXiv 24, GitHub |
| 7) ZigMa: Zigzag Mamba Diffusion Model | ArXiv 24, GitHub |
| 8) MambaVision: A Hybrid Mamba-Transformer Vision Backbone | ArXiv 24, GitHub |

<h4 id="video-processing-and-understanding">15.3 Video Processing and Understanding</h4>

| Paper | Link |
| --- | --- |
| 1) Long Movie Clip Classification with State-Space Video Models | ECCV 22 Paper, GitHub |
| 2) Selective Structured State-Spaces for Long-Form Video Understanding | CVPR 23 Paper |
| 3) Efficient Movie Scene Detection Using State-Space Transformers | CVPR 23 Paper, GitHub |
| 4) VideoMamba: State Space Model for Efficient Video Understanding | Paper, GitHub |

<h4 id="medical-image-processing">15.4 Medical Image Processing</h4>

| Paper | Link |
| --- | --- |
| 1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining | ArXiv 24, GitHub |
| 2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model | ArXiv 24, GitHub |
| 3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | ArXiv 24, GitHub |

<h3 id="existing-high-quality-resources">16 Existing high-quality resources</h3>

| Resources | Link |
| --- | --- |
| 1) Datawhale - AI Video Generation Learning | Feishu doc |
| 2) A Survey on Generative Diffusion Model | TKDE 24 Paper, GitHub |
| 3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | ArXiv 23, GitHub |
| 4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis | GitHub |
| 5) video-generation-survey: A reading list of video generation | GitHub |
| 6) Awesome-Video-Diffusion | GitHub |
| 7) Video Generation Task in Papers With Code | Task |
| 8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | ArXiv 24, GitHub |
| 9) Open-Sora-Plan (PKU-YuanGroup) | GitHub |
| 10) State of the Art on Diffusion Models for Visual Computing | Paper |
| 11) Diffusion Models: A Comprehensive Survey of Methods and Applications | CSUR 24 Paper, GitHub |
| 12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | Paper |
| 13) On the Design Fundamentals of Diffusion Models: A Survey | Paper |
| 14) Efficient Diffusion Models for Vision: A Survey | Paper |
| 15) Text-to-Image Diffusion Models in Generative AI: A Survey | Paper |
| 16) Awesome-Diffusion-Transformers | GitHub, Project |
| 17) Open-Sora (HPC-AI Tech) | GitHub, Blog |
| 18) LAVIS - A Library for Language-Vision Intelligence | ACL 23 Paper, GitHub, Project |
| 19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | GitHub |
| 20) Awesome-Long-Context | GitHub 1, GitHub 2 |
| 21) Lite-Sora | GitHub |
| 22) Mira: A Mini-step Towards Sora-like Long Video Generation | GitHub, Project |

<h3 id="train">17 Efficient Training</h3>
<h4 id="train_paral">17.1 Parallelism based Approach</h4>
<h5 id="train_paral_dp">17.1.1 Data Parallelism (DP)</h5>

| Paper | Link |
| --- | --- |
| 1) A bridging model for parallel computation | Paper |
| 2) PyTorch Distributed: Experiences on Accelerating Data Parallel Training | VLDB 20 Paper |

<h5 id="train_paral_mp">17.1.2 Model Parallelism (MP)</h5>

| Paper | Link |
| --- | --- |
| 1) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | ArXiv 19 |
| 2) TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | PMLR 21 Paper |

<h5 id="train_paral_pp">17.1.3 Pipeline Parallelism (PP)</h5>

| Paper | Link |
| --- | --- |
| 1) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 Paper |
| 2) PipeDream: generalized pipeline parallelism for DNN training | SOSP 19 Paper |

<h5 id="train_paral_gp">17.1.4 Generalized Parallelism (GP)</h5>

| Paper | Link |
| --- | --- |
| 1) Mesh-TensorFlow: Deep Learning for Supercomputers | ArXiv 18 |
| 2) Beyond Data and Model Parallelism for Deep Neural Networks | MLSys 19 Paper |

<h5 id="train_paral_zp">17.1.5 ZeRO Parallelism (ZP)</h5>

| Paper | Link |
| --- | --- |
| 1) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | ArXiv 20 |
| 2) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters | ACM 20 Paper |
| 3) ZeRO-Offload: Democratizing Billion-Scale Model Training | ArXiv 21 |
| 4) PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | ArXiv 23 |

<h4 id="train_non">17.2 Non-parallelism based Approach</h4>
<h5 id="train_non_reduce">17.2.1 Reducing Activation Memory</h5>

| Paper | Link |
| --- | --- |
| 1) Gist: Efficient Data Encoding for Deep Neural Network Training | IEEE 18 Paper |
| 2) Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | MLSys 20 Paper |
| 3) Training Deep Nets with Sublinear Memory Cost | ArXiv 16 |
| 4) Superneurons: dynamic GPU memory management for training deep neural networks | ACM 18 Paper |

<h5 id="train_non_cpu">17.2.2 CPU-Offloading</h5>

| Paper | Link |
| --- | --- |
| 1) Training Large Neural Networks with Constant Memory using a New Execution Algorithm | ArXiv 20 |
| 2) vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design | IEEE 16 Paper |

<h5 id="train_non_mem">17.2.3 Memory Efficient Optimizer</h5>

| Paper | Link |
| --- | --- |
| 1) Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | PMLR 18 Paper |
| 2) Memory-Efficient Adaptive Optimization for Large-Scale Learning | Paper |

<h4 id="train_struct">17.3 Novel Structure</h4>

| Paper | Link |
| --- | --- |
| 1) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (-) | ArXiv 24, GitHub |

<h3 id="infer">18 Efficient Inference</h3>
<h4 id="infer_reduce">18.1 Reduce Sampling Steps</h4>
<h5 id="infer_reduce_continuous">18.1.1 Continuous Steps</h4>
1) Generative Modeling by Estimating Gradients of the Data DistributionNeurIPS 19 Paper
2) WaveGrad: Estimating Gradients for Waveform GenerationArXiv 20
3) Noise Level Limited Sub-Modeling for Diffusion Probabilistic VocodersICASSP 21 Paper
4) Noise Estimation for Generative Diffusion ModelsArXiv 21
<h5 id="infer_reduce_fast">18.1.2 Fast Sampling</h5>

| Paper | Link |
| --- | --- |
| 1) Denoising Diffusion Implicit Models (-) | ICLR 21 Paper |
| 2) DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 21 Paper |
| 3) On Fast Sampling of Diffusion Probabilistic Models | ArXiv 21 |
| 4) DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps | NeurIPS 22 Paper |
| 5) DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models | ArXiv 22 |
| 6) Fast Sampling of Diffusion Models with Exponential Integrator | ICLR 22 Paper |

<h5 id="infer_reduce_dist">18.1.3 Step distillation</h5>

| Paper | Link |
| --- | --- |
| 1) On Distillation of Guided Diffusion Models | CVPR 23 Paper |
| 2) Progressive Distillation for Fast Sampling of Diffusion Models | ICLR 22 Paper |
| 3) SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds | NeurIPS 23 Paper |
| 4) Tackling the Generative Learning Trilemma with Denoising Diffusion GANs | ICLR 22 Paper |

<h4 id="infer_opt">18.2 Optimizing Inference</h4>
<h5 id="infer_opt_low">18.2.1 Low-bit Quantization</h5>

| Paper | Link |
| --- | --- |
| 1) Q-Diffusion: Quantizing Diffusion Models | CVPR 23 Paper |
| 2) Q-DM: An Efficient Low-bit Quantized Diffusion Model | NeurIPS 23 Paper |
| 3) Temporal Dynamic Quantization for Diffusion Models | NeurIPS 23 Paper |

<h5 id="infer_opt_ps">18.2.2 Parallel/Sparse inference</h5>

| Paper | Link |
| --- | --- |
| 1) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | CVPR 24 Paper |
| 2) Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models | NeurIPS 22 Paper |
| 3) PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models | ArXiv 24 |


Citation

If this project is helpful to your work, please cite it using one of the following BibTeX entries:

```bibtex
@misc{minisora,
    title={MiniSora},
    author={MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}

@misc{minisora_survey,
    title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},
    author={Survey Paper Group of MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}
```

MiniSora Community WeChat Group

<div align="center"> <img src="assets/qrcode.png" width="200"/> <div>&nbsp;</div> <div align="center"> </div> </div>

Star History

Star History Chart

How to Contribute to the MiniSora Community

We greatly appreciate your contributions to the MiniSora open-source community and your help in making it even better!

For more details, please refer to the Contribution Guidelines.

Community contributors

<a href="https://github.com/mini-sora/minisora/graphs/contributors"> <img src="https://contrib.rocks/image?repo=mini-sora/minisora" /> </a>