
<p align="center"> <img src="./assets/readme/icon.png" width="250"/> </p> <div align="center"> <a href="https://github.com/hpcaitech/Open-Sora/stargazers"><img src="https://img.shields.io/github/stars/hpcaitech/Open-Sora?style=social"></a> <a href="https://hpcaitech.github.io/Open-Sora/"><img src="https://img.shields.io/badge/Gallery-View-orange?logo=&amp"></a> <a href="https://discord.gg/kZakZzrSUT"><img src="https://img.shields.io/badge/Discord-join-blueviolet?logo=discord&amp"></a> <a href="https://join.slack.com/t/colossalaiworkspace/shared_invite/zt-247ipg9fk-KRRYmUl~u2ll2637WRURVA"><img src="https://img.shields.io/badge/Slack-ColossalAI-blueviolet?logo=slack&amp"></a> <a href="https://twitter.com/yangyou1991/status/1769411544083996787?s=61&t=jT0Dsx2d-MS5vS9rNM5e5g"><img src="https://img.shields.io/badge/Twitter-Discuss-blue?logo=twitter&amp"></a> <a href="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/WeChat.png"><img src="https://img.shields.io/badge/微信-小助手加群-green?logo=wechat&amp"></a> <a href="https://hpc-ai.com/blog/open-sora-v1.0"><img src="https://img.shields.io/badge/Open_Sora-Blog-blue"></a> <a href="https://huggingface.co/spaces/hpcai-tech/open-sora"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Gradio Demo-blue"></a> </div>

# Open-Sora: Democratizing Efficient Video Production for All

We design and implement Open-Sora, an initiative dedicated to efficiently producing high-quality video. We hope to make the model, tools and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the field of content creation.

[Documentation in Chinese] [Luchen Cloud | Open-Sora image | video tutorials]

<div align="center"> <a href="https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-opensora"> <img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/1.gif" width="850" /> </a> </div> ## 📰 News

## 🎥 Latest Demo

🔥 You can experience Open-Sora on our 🤗 Gradio application on Hugging Face. More samples and corresponding prompts are available in our Gallery.

| 4s 720×1280 | 4s 720×1280 | 4s 720×1280 |
| --- | --- | --- |
| <img src="assets/demo/v1.2/sample_0013.gif"> | <img src="assets/demo/v1.2/sample_1718.gif"> | <img src="assets/demo/v1.2/sample_0087.gif"> |
| <img src="assets/demo/v1.2/sample_0052.gif"> | <img src="assets/demo/v1.2/sample_1719.gif"> | <img src="assets/demo/v1.2/sample_0002.gif"> |
| <img src="assets/demo/v1.2/sample_0011.gif"> | <img src="assets/demo/v1.2/sample_0004.gif"> | <img src="assets/demo/v1.2/sample_0061.gif"> |
<details> <summary>OpenSora 1.1 Demo</summary>

| 2s 240×426 | 2s 240×426 |
| --- | --- |
| <img src="assets/demo/sample_16x240x426_9.gif"> | <img src="assets/demo/sora_16x240x426_26.gif"> |
| <img src="assets/demo/sora_16x240x426_27.gif"> | <img src="assets/demo/sora_16x240x426_40.gif"> |

| 2s 426×240 | 4s 480×854 |
| --- | --- |
| <img src="assets/demo/sora_16x426x240_24.gif"> | <img src="assets/demo/sample_32x480x854_9.gif"> |

| 16s 320×320 | 16s 224×448 | 2s 426×240 |
| --- | --- | --- |
| <img src="assets/demo/sample_16s_320x320.gif"> | <img src="assets/demo/sample_16s_224x448.gif"> | <img src="assets/demo/sora_16x426x240_3.gif"> |

</details> <details> <summary>OpenSora 1.0 Demo</summary>

| 2s 512×512 | 2s 512×512 | 2s 512×512 |
| --- | --- | --- |
| <img src="assets/readme/sample_0.gif"> | <img src="assets/readme/sample_1.gif"> | <img src="assets/readme/sample_2.gif"> |
| A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. | A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. | The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall. |
| <img src="assets/readme/sample_3.gif"> | <img src="assets/readme/sample_4.gif"> | <img src="assets/readme/sample_5.gif"> |
| A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] | The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] | A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...] |

Videos are downsampled to .gif for display. Click for the original videos. Prompts are trimmed for display; see here for the full prompts.

</details>

## 🔆 New Features/Updates

<details> <summary>View more</summary> </details>

TODO list sorted by priority

<details> <summary>View more</summary> </details>

## Contents

Other useful documents and links are listed below.

## Installation

### Install from Source

For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, please refer to the Installation Documentation for instructions on other CUDA versions and for the additional dependencies needed for data preprocessing, VAE, and model evaluation.

# create a virtual env and activate (conda as an example)
conda create -n opensora python=3.9
conda activate opensora

# download the repo
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora

# install torch, torchvision and xformers
pip install -r requirements/requirements-cu121.txt

# the default installation is for inference only
pip install -v . # for development mode, `pip install -v -e .`

(Optional, recommended for faster speed, especially for training) To enable layernorm_kernel and flash_attn, you need to install apex and flash-attn with the following commands.

# install flash attention
# set enable_flash_attn=False in config to disable flash attention
pip install packaging ninja
pip install flash-attn --no-build-isolation

# install apex
# set enable_layernorm_kernel=False in config to disable apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
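
If you skip apex or flash-attn, keep the corresponding features disabled in the config. A minimal sketch of the relevant entries in a Python config file, based on the flag names in the comments above (all other fields omitted):

# excerpt of a Python config file; only these two fields are shown
enable_flash_attn = False        # set True only if flash-attn is installed
enable_layernorm_kernel = False  # set True only if apex is installed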

### Use Docker

Run the following command to build a Docker image from the provided Dockerfile.

docker build -t opensora .

Run the following command to start the docker container in interactive mode.

docker run -ti --gpus all -v .:/workspace/Open-Sora opensora

## Model Weights

### Open-Sora 1.2 Model Weights

| Model | Model Size | Data | #iterations | Batch Size | URL |
| --- | --- | --- | --- | --- | --- |
| Diffusion | 1.1B | 30M | 70k | Dynamic | :link: |
| VAE | 384M | 3M | 1M | 8 | :link: |

See our report 1.2 for more information. Weights will be downloaded automatically when you run the inference script.

For users in mainland China, set the following environment variable so that the weights can be downloaded successfully:
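
export HF_ENDPOINT=https://hf-mirror.com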

### Open-Sora 1.1 Model Weights

<details> <summary>View more</summary>
| Resolution | Model Size | Data | #iterations | Batch Size | URL |
| --- | --- | --- | --- | --- | --- |
| mainly 144p & 240p | 700M | 10M videos + 2M images | 100k | dynamic | :link: |
| 144p to 720p | 700M | 500K HQ videos + 1M images | 4k | dynamic | :link: |

See our report 1.1 for more information.

:warning: LIMITATION: This version contains known issues that we plan to fix in the next version (we are saving computation resources for the next release). In addition, video generation may fail for long durations, and high resolutions will produce noisy results due to this problem.

</details>

### Open-Sora 1.0 Model Weights

<details> <summary>View more</summary>
| Resolution | Model Size | Data | #iterations | Batch Size | GPU days (H800) | URL |
| --- | --- | --- | --- | --- | --- | --- |
| 16×512×512 | 700M | 20K HQ | 20k | 2×64 | 35 | :link: |
| 16×256×256 | 700M | 20K HQ | 24k | 8×64 | 45 | :link: |
| 16×256×256 | 700M | 366K | 80k | 8×64 | 117 | :link: |

Training order: 16x256x256 $\rightarrow$ 16x256x256 HQ $\rightarrow$ 16x512x512 HQ.

Our model's weights are partially initialized from PixArt-α. The number of parameters is 724M. More information about training can be found in our report. More about the dataset can be found in datasets.md. HQ means high quality.

:warning: LIMITATION: Our model was trained on a limited budget. The quality and text alignment are relatively poor. The model performs poorly, especially at generating humans, and cannot follow detailed instructions. We are working on improving the quality and text alignment.

</details>

## Gradio Demo

🔥 You can experience Open-Sora online via our 🤗 Gradio application on Hugging Face.

### Local Deployment

If you want to deploy Gradio locally, we also provide a Gradio application in this repository. You can use the following command to start an interactive web application and experience video generation with Open-Sora.

pip install gradio spaces
python gradio/app.py

This will launch a Gradio application on your localhost. If you want to know more about the Gradio application, you can refer to the Gradio README.

To enable prompt enhancement and input in other languages (e.g., Chinese), you need to set OPENAI_API_KEY in the environment. Check OpenAI's documentation to get your API key.

export OPENAI_API_KEY=YOUR_API_KEY

### Getting Started

In the Gradio application, the basic options are as follows:

[Screenshot: Gradio demo, basic options]

The easiest way to generate a video is to input a text prompt and click the "Generate video" button (scroll down if you cannot find it). The generated video will be displayed in the right panel. Checking "Enhance prompt with GPT4o" will use GPT-4o to refine the prompt, while the "Random Prompt" button will have GPT-4o generate a random prompt for you. Due to OpenAI's API limits, the prompt refinement result has some randomness.

Then, you can choose the resolution, duration, and aspect ratio of the generated video. Different resolutions and video lengths affect the generation speed. On an 80GB H100 GPU, the generation speed (with num_sampling_step=30) and peak memory usage are:

| Resolution | Image | 2s | 4s | 8s | 16s |
| --- | --- | --- | --- | --- | --- |
| 360p | 3s, 24G | 18s, 27G | 31s, 27G | 62s, 28G | 121s, 33G |
| 480p | 2s, 24G | 29s, 31G | 55s, 30G | 108s, 32G | 219s, 36G |
| 720p | 6s, 27G | 68s, 41G | 130s, 39G | 260s, 45G | 547s, 67G |

Note that besides text-to-video, you can also use image-to-video generation. You can upload an image and then click the "Generate video" button to generate a video with that image as the first frame. Alternatively, you can fill in the text prompt, click the "Generate image" button to generate an image from the prompt, and then click the "Generate video" button to turn the generated image into a video with the same model.

[Screenshot: Gradio demo, more options]

Then you can specify more options, including "Motion Strength", "Aesthetic", and "Camera Motion". If "Enable" is not checked or the choice is "none", the corresponding information is not passed to the model. Otherwise, the model will generate videos with the specified motion strength, aesthetic score, and camera motion.

For the aesthetic score, we recommend using values higher than 6. For motion strength, a smaller value will lead to a smoother but less dynamic video, while a larger value will lead to a more dynamic but likely more blurry video. Thus, you can try without it and then adjust it according to the generated video. For the camera motion, sometimes the model cannot follow the instruction well, and we are working on improving it.

You can also adjust "Sampling steps", which is directly related to the generation speed since it is the number of denoising steps. A value smaller than 30 usually leads to poor generation results, while a value larger than 100 usually brings no significant improvement. The "Seed" is used for reproducibility; you can set it to a fixed number to generate the same video. The "CFG Scale" controls how closely the model follows the text prompt; a smaller value leads to a more random video, while a larger value leads to a video that follows the text more closely (7 is recommended).

For more advanced usage, you can refer to Gradio README.

## Inference

### Open-Sora 1.2 Command Line Inference

The basic command line inference is as follows:

# text to video
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --prompt "a beautiful waterfall"

You can add more options to the command line to customize the generation.

python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --num-sampling-steps 30 --flow 5 --aes 6.5 \
  --prompt "a beautiful waterfall"

For image to video generation and other functionalities, the API is compatible with Open-Sora 1.1. See here for more instructions.

If your installation does not contain apex and flash-attn, you need to disable them in the config file or via the following command.

python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p \
  --layernorm-kernel False --flash-attn False \
  --prompt "a beautiful waterfall"

#### Sequence Parallelism Inference

To enable sequence parallelism, you need to use torchrun to run the inference script. The following command will run the inference with 2 GPUs.

# text to video
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --prompt "a beautiful waterfall"

:warning: LIMITATION: Sequence parallelism is not supported for Gradio deployment. For now, sequence parallelism is only supported when the dimension is divisible by the number of GPUs, so it may fail in some cases. We tested 4 GPUs for 720p and 2 GPUs for 480p.

#### GPT-4o Prompt Refinement

We find that GPT-4o can refine the prompt and improve the quality of the generated video. With this feature, you can also use other languages (e.g., Chinese) for the prompt. To enable this feature, you need to prepare your OpenAI API key in the environment:

export OPENAI_API_KEY=YOUR_API_KEY

Then you can run inference with --llm-refine True to enable GPT-4o prompt refinement, or leave the prompt empty to get a random prompt generated by GPT-4o.

python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --llm-refine True

### Open-Sora 1.1 Command Line Inference

<details> <summary>View more</summary>

Since Open-Sora 1.1 supports inference with dynamic input size, you can pass the input size as an argument.

# text to video
python scripts/inference.py configs/opensora-v1-1/inference/sample.py --prompt "A beautiful sunset over the city" --num-frames 32 --image-size 480 854

If your installation does not contain apex and flash-attn, you need to disable them in the config file or via the following command.

python scripts/inference.py configs/opensora-v1-1/inference/sample.py --prompt "A beautiful sunset over the city" --num-frames 32 --image-size 480 854 --layernorm-kernel False --flash-attn False

See here for more instructions including text-to-image, image-to-video, video-to-video, and infinite time generation.

</details>

### Open-Sora 1.0 Command Line Inference

<details> <summary>View more</summary>

We have also provided an offline inference script. Run the following commands to generate samples; the required model weights will be downloaded automatically. To change the sampling prompts, modify the txt file passed to --prompt-path. See here to customize the configuration.

# Sample 16x512x512 (20s/sample, 100 time steps, 24 GB memory)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path OpenSora-v1-HQ-16x512x512.pth --prompt-path ./assets/texts/t2v_samples.txt

# Sample 16x256x256 (5s/sample, 100 time steps, 22 GB memory)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt

# Sample 64x512x512 (40s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt

# Sample 64x512x512 with sequence parallelism (30s/sample, 100 time steps)
# sequence parallelism is enabled automatically when nproc_per_node is larger than 1
torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt

The speed is tested on H800 GPUs. For inference with other models, see here for more instructions. To lower the memory usage, set a smaller vae.micro_batch_size in the config (at the cost of slightly lower sampling speed).
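
As a hedged illustration of the setting above, vae.micro_batch_size refers to a field inside the vae dict of the Python config; the value below is hypothetical and all other VAE fields are omitted:

# hypothetical excerpt of an inference config
vae = dict(
    # keep the other VAE settings from the original config unchanged
    micro_batch_size=4,  # smaller values lower peak memory at a slightly lower sampling speed
)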

</details>

## Data Processing

High-quality data is crucial for training good generation models. To this end, we have established a complete pipeline for data processing, which can seamlessly convert raw videos into high-quality video-text pairs. The pipeline is shown below. For detailed information, please refer to data processing. Also check out the datasets we use.

[Figure: data processing pipeline]

## Training

### Open-Sora 1.2 Training

The training process is the same as for Open-Sora 1.1.

# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
    configs/opensora-v1-2/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
    configs/opensora-v1-2/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
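
For the multi-node command, colossalai run reads the participating machines from the file passed via --hostfile. As a minimal sketch, a ColossalAI hostfile is a plain-text file listing one SSH-reachable hostname or IP per line (the values below are hypothetical):

node-01
node-02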

### Open-Sora 1.1 Training

<details> <summary>View more</summary>

Once you have prepared the data in a CSV file, run the following commands to launch training.

# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
    configs/opensora-v1-1/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
    configs/opensora-v1-1/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
</details>

### Open-Sora 1.0 Training

<details> <summary>View more</summary>

Once you prepare the data in a csv file, run the following commands to launch training on a single node.

# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

To launch training on multiple nodes, prepare a hostfile according to ColossalAI, and run the following commands.

colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

For training other models and advanced usage, see here for more instructions.

</details>

## Evaluation

We support evaluation based on:

All the evaluation code is released in the eval folder. Check the README for more details. Our report also provides more information about evaluation during training. The following table shows that Open-Sora 1.2 greatly improves over Open-Sora 1.0.

| Model | Total Score | Quality Score | Semantic Score |
| --- | --- | --- | --- |
| Open-Sora V1.0 | 75.91% | 78.81% | 64.28% |
| Open-Sora V1.2 | 79.23% | 80.71% | 73.30% |

## VAE Training & Evaluation

We train a VAE pipeline that consists of a spatial VAE followed by a temporal VAE. For more details, refer to VAE Documentation. Before you run the following commands, follow our Installation Documentation to install the required dependencies for VAE and Evaluation.

If you want to train your own VAE, you need to prepare data in a CSV following the data processing pipeline, then run the following commands. Note that you need to adjust the number of training epochs (epochs) in the config file according to the size of your own CSV data.

# stage 1 training, 380k steps, 8 GPUs
torchrun --nnodes=1 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage1.py --data-path YOUR_CSV_PATH
# stage 2 training, 260k steps, 8 GPUs
torchrun --nnodes=1 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage2.py --data-path YOUR_CSV_PATH
# stage 3 training, 540k steps, 24 GPUs
torchrun --nnodes=3 --nproc_per_node=8 scripts/train_vae.py configs/vae/train/stage3.py --data-path YOUR_CSV_PATH
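
As a hedged sketch of the adjustment mentioned above, the epoch count is a plain field in each stage config; the value below is hypothetical and should be chosen to match your CSV data size:

# hypothetical override in a VAE stage config (e.g. configs/vae/train/stage1.py)
epochs = 100  # illustrative value only; scale to the size of your dataset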

To evaluate the VAE performance, you need to run VAE inference first to generate the videos, then calculate scores on the generated videos:

# video generation
torchrun --standalone --nnodes=1 --nproc_per_node=1 scripts/inference_vae.py configs/vae/inference/video.py --ckpt-path YOUR_VAE_CKPT_PATH --data-path YOUR_CSV_PATH --save-dir YOUR_VIDEO_DIR
# the original videos will be saved to `YOUR_VIDEO_DIR_ori`
# the reconstructed videos through the pipeline will be saved to `YOUR_VIDEO_DIR_rec`
# the reconstructed videos through the spatial VAE only will be saved to `YOUR_VIDEO_DIR_spatial`

# score calculation
python eval/vae/eval_common_metric.py --batch_size 2 --real_video_dir YOUR_VIDEO_DIR_ori --generated_video_dir YOUR_VIDEO_DIR_rec --device cuda --sample_fps 24 --crop_size 256 --resolution 256 --num_frames 17 --sample_rate 1 --metric ssim psnr lpips flolpips

## Contribution

Thanks goes to these wonderful contributors:

<a href="https://github.com/hpcaitech/Open-Sora/graphs/contributors"> <img src="https://contrib.rocks/image?repo=hpcaitech/Open-Sora" /> </a>

If you wish to contribute to this project, please refer to the Contribution Guideline.

## Acknowledgement

Here we only list a few of the projects. For other works and datasets, please refer to our report.

We are grateful for their exceptional work and generous contribution to open source. Special thanks go to the authors of MiraData and Rectified Flow for their valuable advice and help. We wish to express gratitude towards AK for sharing this project on social media and Hugging Face for providing free GPU resources for our online Gradio demo.

## Citation

@software{opensora,
  author = {Zangwei Zheng and Xiangyu Peng and Tianji Yang and Chenhui Shen and Shenggui Li and Hongxin Liu and Yukun Zhou and Tianyi Li and Yang You},
  title = {Open-Sora: Democratizing Efficient Video Production for All},
  month = {March},
  year = {2024},
  url = {https://github.com/hpcaitech/Open-Sora}
}

## Star History

[Figure: Star History Chart]