<h1 align="left"> <a href="">Open-Sora Plan</a></h1>

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system. Models trained on Huawei Ascend can also output video quality comparable to industry standards.

This project, jointly initiated by the PKU–Tuzhan (北大-兔展) AIGC Joint Lab, hopes to reproduce Sora through the power of the open-source community. The current version still falls well short of that goal and needs continuous improvement and rapid iteration. Pull requests are welcome! The code currently supports complete training and inference on a domestic AI computing system (Huawei Ascend), and models trained on Ascend can also output video quality on par with the industry.

<h5 align="left">

slack badge WeChat badge Twitter <br> License GitHub repo contributors GitHub Commit Pr GitHub issues GitHub closed issues <br> GitHub repo stars  GitHub repo forks  GitHub repo watchers  GitHub repo size

<h5 align="left"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>

📣 News

😍 Gallery

93×1280×720 Text-to-Video Generation. The videos are compressed for playback on GitHub.

<table class="center"> <tr> <td><video src="https://github.com/user-attachments/assets/1c84bc92-d585-46c9-ae7c-e5f79cefea88" autoplay></td> </tr> </table>

😮 Highlights

Open-Sora Plan shows excellent performance in video generation.

🔥 High-performance CausalVideoVAE with lower training cost

🚀 Video diffusion model based on 3D attention, jointly learning spatiotemporal features.

<p align="center"> <img src="https://s21.ax1x.com/2024/07/22/pk7cob8.png" width="650" style="margin-bottom: 0.2;"/> <p>

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo with the following command.

python -m opensora.serve.gradio_web_server --model_path "path/to/model" --ae_path "path/to/causalvideovae"

ComfyUI

Coming soon...

🐳 Resource

| Version | Architecture | Diffusion Model | CausalVideoVAE | Data |
|---|---|---|---|---|
| v1.2.0 | 3D | 93x720p, 29x720p[1], 93x480p[1,2], 29x480p, 1x480p, 93x480p_i2v | Anysize | Annotations |
| v1.1.0 | 2+1D | 221x512x512, 65x512x512 | Anysize | Data and Annotations |
| v1.0.0 | 2+1D | 65x512x512, 65x256x256, 17x256x256 | Anysize | Data and Annotations |

[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.

[2] We fine-tuned for 3.5k steps from 93×720p to obtain 93×480p for community research use.

[!Warning]

<div align="left"> <b> 🚨 For version 1.2.0, we no longer support 2+1D models. </b> </div>

⚙️ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder:
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  2. Install the required packages. We recommend the following environment:
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
  3. Install additional packages for training:
pip install -e ".[train]"
  4. Install optional requirements such as static type checking:
pip install -e '.[dev]'

🗝️ Training & Validating

🗜️ CausalVideoVAE

Data prepare

Organizing the training data is easy: just place all the videos, recursively, under a single directory. This makes training with multiple datasets more convenient.

Training Dataset
|——sub_dataset1
    |——sub_sub_dataset1
        |——video1.mp4
        |——video2.mp4
        ......
    |——sub_sub_dataset2
        |——video3.mp4
        |——video4.mp4
        ......
|——sub_dataset2
    |——video5.mp4
    |——video6.mp4
    ......
|——video7.mp4
|——video8.mp4

Training

bash scripts/causalvae/train.sh

We describe the important arguments for training below.

| Argparse | Usage |
|---|---|
| **Training size** | |
| --num_frames | The number of frames used from each training video |
| --resolution | The resolution of the input to the VAE |
| --batch_size | The local batch size on each GPU |
| --sample_rate | The frame interval when loading training videos |
| **Data processing** | |
| --video_path | /path/to/dataset |
| **Load weights** | |
| --model_config | /path/to/config.json. The model config of the VAE. Use this parameter if you want to train from scratch. |
| --pretrained_model_name_or_path | A directory containing a model checkpoint and its config. Using this parameter loads only the weights, not the optimizer state. |
| --resume_from_checkpoint | /path/to/checkpoint. Resumes training from the checkpoint, including both the weights and the optimizer state. |
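
For reference, the sketch below shows how these flags might be combined when customizing scripts/causalvae/train.sh. The launcher and the entry-point module are assumptions for illustration; check the script itself for the actual module name and default values.

```bash
# Sketch only: the entry-point module is hypothetical; the flags follow the table above.
torchrun --nnodes=1 --nproc_per_node=8 opensora/train/train_causalvae.py \
  --video_path /path/to/dataset \
  --num_frames 25 \
  --resolution 256 \
  --batch_size 1 \
  --sample_rate 1 \
  --model_config /path/to/config.json
```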

Inference

bash scripts/causalvae/rec_video.sh

We describe the important arguments for inference below.

| Argparse | Usage |
|---|---|
| **Output video size** | |
| --num_frames | The number of frames of the generated videos |
| --height | The height of the generated videos |
| --width | The width of the generated videos |
| **Data processing** | |
| --video_path | The path to the original video |
| --rec_path | The path to the generated video |
| **Load weights** | |
| --ae_path | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.json |
| **Other** | |
| --enable_tiling | Use tiling to handle videos with high resolution and long duration |
| --save_memory | Save memory during inference, at a slight cost to quality |
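
Similarly, a filled-in reconstruction command inside scripts/causalvae/rec_video.sh might look roughly like this; the Python module named below is an assumption, so verify it against the actual script.

```bash
# Sketch only: the module path is hypothetical; the flags follow the table above.
python opensora/sample/rec_video.py \
  --ae_path /path/to/model_dir \
  --video_path /path/to/input_video.mp4 \
  --rec_path /path/to/reconstructed_video.mp4 \
  --num_frames 65 \
  --height 480 \
  --width 640 \
  --enable_tiling
```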

Evaluation

For evaluation, you should save the original video clips by using --output_origin.

bash scripts/causalvae/prepare_eval.sh

We describe the important arguments for generating the evaluation videos below.

| Argparse | Usage |
|---|---|
| **Output video size** | |
| --num_frames | The number of frames of the generated videos |
| --resolution | The resolution of the generated videos |
| **Data processing** | |
| --real_video_dir | The directory of the original videos |
| --generated_video_dir | The directory of the generated videos |
| **Load weights** | |
| --ckpt | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config |
| **Other** | |
| --enable_tiling | Use tiling to handle videos with high resolution and long duration |
| --output_origin | Also output the original video clips fed into the VAE |

Then we run the evaluation. We describe the important arguments in the evaluation script below.

bash scripts/causalvae/eval.sh

| Argparse | Usage |
|---|---|
| --metric | The metric to compute, such as psnr, ssim, or lpips |
| --real_video_dir | The directory of the original videos |
| --generated_video_dir | The directory of the generated videos |
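
Putting the two steps together, an evaluation run could be sketched as follows; the module path and flag values are placeholders, so check scripts/causalvae/eval.sh for the authoritative command.

```bash
# Step 1: reconstruct the videos and also dump the original clips (--output_origin).
bash scripts/causalvae/prepare_eval.sh

# Step 2 (sketch only, hypothetical module path): score reconstructions against originals.
python opensora/eval/eval_common_metric.py \
  --metric psnr \
  --real_video_dir /path/to/origin_clips \
  --generated_video_dir /path/to/reconstructed_clips
```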

📜 Text-to-Video

Data prepare

We use a data.txt file to specify all the training data. Each line in the file consists of DATA_ROOT and DATA_JSON. An example data.txt is as follows.

/path/to/data_root_1,/path/to/data_json_1.json
/path/to/data_root_2,/path/to/data_json_2.json
...

Next, we describe the format of the annotation JSON file. The absolute path of each data sample is the concatenation of DATA_ROOT and the "path" field in the annotation JSON file; for example, DATA_ROOT /path/to/data_root_1 combined with "path" 00168/001680102.jpg resolves to /path/to/data_root_1/00168/001680102.jpg.

For image

The format of the image annotation file is as follows.

[
  {
    "path": "00168/001680102.jpg",
    "cap": [
      "xxxxx."
    ],
    "resolution": {
      "height": 512,
      "width": 683
    }
  },
  ...
]

For video

The format of the video annotation file is as follows. For more details, refer to the HF dataset.

[
  {
    "path": "panda70m_part_5565/qLqjjDhhD5Q/qLqjjDhhD5Q_segment_0.mp4",
    "cap": [
      "A man and a woman are sitting down on a news anchor talking to each other."
    ],
    "resolution": {
      "height": 720,
      "width": 1280
    },
    "fps": 29.97002997002997,
    "duration": 11.444767
  },
  ...
]

Training

bash scripts/text_condition/gpu/train_t2v.sh

We describe some key parameters below to help you customize your training process.

| Argparse | Usage |
|---|---|
| **Training size** | |
| --num_frames 61 | To train videos of different durations, e.g., 29, 61, 93, 125... |
| --max_height 640 | To train videos of different resolutions |
| --max_width 480 | To train videos of different resolutions |
| **Data processing** | |
| --data /path/to/data.txt | Specify your training data. |
| --speed_factor 1.25 | Speed up the videos by 1.25x. |
| --drop_short_ratio 1.0 | If you do not want to train on videos of dynamic durations, discard all video data whose frame count is not equal to --num_frames. |
| --group_frame | If you want to train with videos of dynamic durations, we highly recommend specifying --group_frame as well. It improves computational efficiency during training. |
| **Multi-stage transfer learning** | |
| --interpolation_scale_h 1.0 | When training a base model such as 240p (--max_height 240, --interpolation_scale_h 1.0) and you want to initialize a higher-resolution model such as 480p (height 480) from the 240p weights, adjust --max_height 480 and --interpolation_scale_h 2.0, and set --pretrained to your 240p weights path (path/to/240p/xxx.safetensors). |
| --interpolation_scale_w 1.0 | Same as --interpolation_scale_h 1.0 |
| **Load weights** | |
| --pretrained | Typically used for loading pretrained weights across stages, such as using 240p weights to initialize 480p training, or when switching datasets and you do not want to keep the previous optimizer state. |
| --resume_from_checkpoint | Resumes training from the latest checkpoint in --output_dir. Typically we set --resume_from_checkpoint="latest", which is useful in case of unexpected interruptions during training. |
| **Sequence Parallelism** | |
| --sp_size 8 --train_sp_batch_size 2 | Runs a batch size of 2 across 8 GPUs (8 GPUs on the same node). |
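
As a concrete illustration, a customized scripts/text_condition/gpu/train_t2v.sh might combine these flags roughly as follows; the launcher and entry-point module are assumptions for illustration, and only the flags come from the table above.

```bash
# Sketch only: launcher and entry point are hypothetical; the flags follow the table above.
torchrun --nnodes=1 --nproc_per_node=8 opensora/train/train_t2v_diffusers.py \
  --data /path/to/data.txt \
  --num_frames 93 \
  --max_height 480 \
  --max_width 640 \
  --group_frame \
  --interpolation_scale_h 1.0 \
  --interpolation_scale_w 1.0 \
  --pretrained path/to/240p/xxx.safetensors \
  --output_dir /path/to/output_dir
```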

[!Warning]

<div align="left"> <b> 🚨 We have two ways to load weights: `--pretrained` and `--resume_from_checkpoint`. The latter will override the former. </b> </div>

Inference

We provide multiple inference scripts to support various requirements. We recommend the configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method EulerAncestralDiscrete for sampling.

For inference on 93×720p, we report speed on an H100.

| Size | 1 GPU | 8 GPUs (sp) |
|---|---|---|
| 29×720p | 420s/100step | 80s/100step |
| 93×720p | 3400s/100step | 450s/100step |
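
For illustration, the recommended sampling configuration could be wired into a sampling command roughly as follows; the module path is hypothetical, --model_path and --ae_path mirror the Gradio command above, and the exact flag set in scripts/text_condition/gpu/sample_t2v.sh may differ.

```bash
# Sketch only: module path is hypothetical; the sampling flags are the recommended ones.
python opensora/sample/sample_t2v.py \
  --model_path "path/to/model" \
  --ae_path "path/to/causalvideovae" \
  --guidance_scale 7.5 \
  --num_sampling_steps 100 \
  --sample_method EulerAncestralDiscrete
```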

🖥️ 1 GPU

If you only have one GPU, it will perform inference on each sample sequentially, one at a time.

bash scripts/text_condition/gpu/sample_t2v.sh

🖥️🖥️ Multi-GPUs

If you want to batch infer a large number of samples, each GPU will infer one sample.

bash scripts/text_condition/gpu/sample_t2v_ddp.sh

🖥️🖥️ Multi-GPUs & Sequence Parallelism

If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.

bash scripts/text_condition/gpu/sample_t2v_sp.sh

🖼️ Image-to-Video

Data prepare

Same as Text-to-Video.

Training

bash scripts/text_condition/gpu/train_inpaint.sh

In addition to the parameters shared with the Text-to-Video mode, there are some unique parameters specific to the Image-to-Video mode that you need to be aware of.

| Argparse | Usage |
|---|---|
| **Training size** | |
| --use_vae_preprocessed_mask | Whether to use the VAE (Variational Autoencoder) to encode the mask in order to achieve frame-level mask alignment. |
| **Data processing** | |
| --i2v_ratio 0.5 | The proportion of training data allocated to the image-to-video task. |
| --transition_ratio 0.4 | The proportion of training data allocated to the transition task. |
| --v2v_ratio 0.1 | The proportion of training data allocated to the video continuation task. |
| --default_text_ratio 0.5 | When training with CFG (classifier-free guidance) enabled, a portion of the text is replaced with a default text, while another portion is set to an empty string. |
| **Load weights** | |
| --pretrained_transformer_model_path | This parameter functions the same as the --pretrained parameter. |
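
Combining the shared Text-to-Video flags with the I2V-specific ones, a customized scripts/text_condition/gpu/train_inpaint.sh might look roughly like this; the launcher and entry-point module are assumptions for illustration.

```bash
# Sketch only: launcher and entry point are hypothetical; the flags come from the
# Text-to-Video table and the I2V-specific table above.
torchrun --nnodes=1 --nproc_per_node=8 opensora/train/train_inpaint.py \
  --data /path/to/data.txt \
  --num_frames 93 \
  --max_height 480 \
  --max_width 640 \
  --i2v_ratio 0.5 \
  --transition_ratio 0.4 \
  --v2v_ratio 0.1 \
  --default_text_ratio 0.5 \
  --pretrained_transformer_model_path /path/to/t2v_weights
```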

Inference

In the current version, we have only open-sourced the 480p version of the Image-to-Video (I2V) model. We recommend configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method PNDM for sampling. Please note that due to the addition of frame-controllable fine-tuning, using the other samplers may not yield satisfactory results.

For inference on 93×480p, we report speed on an H100.

| Size | 1 GPU | 8 GPUs (sp) |
|---|---|---|
| 93×480p | 427s/100step | 81s/100step |

Before inference, you need to create two text files: one named prompt.txt and another named conditional_images_path.txt. Each line in prompt.txt should correspond to the paths on the same line of conditional_images_path.txt.

For example, if the content of prompt.txt is:

this is a prompt of i2v task.
this is a prompt of transition task.

Then the content of conditional_images_path.txt should be:

/path/to/image_0.png
/path/to/image_1_0.png,/path/to/image_1_1.png

This means we will execute an image-to-video task using /path/to/image_0.png and "this is a prompt of i2v task." For the transition task, we'll use /path/to/image_1_0.png and /path/to/image_1_1.png (note that these two paths are separated by a comma without any spaces) along with "this is a prompt of transition task."
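
If it helps, the two files from this example can be created directly from the shell; the paths are placeholders.

```bash
# Line i of prompt.txt pairs with line i of conditional_images_path.txt.
cat > prompt.txt <<'EOF'
this is a prompt of i2v task.
this is a prompt of transition task.
EOF

cat > conditional_images_path.txt <<'EOF'
/path/to/image_0.png
/path/to/image_1_0.png,/path/to/image_1_1.png
EOF
```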

After creating the files, make sure to specify their paths in the sample_inpaint.sh script.

🖥️ 1 GPU

If you only have one GPU, it will perform inference on each sample sequentially, one at a time.

bash scripts/text_condition/gpu/sample_inpaint.sh

🖥️🖥️ Multi-GPUs

If you want to batch infer a large number of samples, each GPU will infer one sample.

bash scripts/text_condition/gpu/sample_inpaint_ddp.sh

🖥️🖥️ Multi-GPUs & Sequence Parallelism

If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.

bash scripts/text_condition/gpu/sample_inpaint_sp.sh

💡 How to Contribute

We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

👍 Acknowledgement

🔒 License

<!-- ## ✨ Star History [![Star History](https://api.star-history.com/svg?repos=PKU-YuanGroup/Open-Sora-Plan)](https://star-history.com/#PKU-YuanGroup/Open-Sora-Plan&Date) -->

✏️ Citing

BibTeX

@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
  author       = {PKU-Yuan Lab and Tuzhan AI etc.},
  title        = {Open-Sora-Plan},
  month        = apr,
  year         = 2024,
  publisher    = {GitHub},
  doi          = {10.5281/zenodo.10948109},
  url          = {https://doi.org/10.5281/zenodo.10948109}
}

Latest DOI


🤝 Community contributors

<a href="https://github.com/PKU-YuanGroup/Open-Sora-Plan/graphs/contributors"> <img src="https://contrib.rocks/image?repo=PKU-YuanGroup/Open-Sora-Plan" /> </a>