Awesome

Lite-Sora

Introduction

The lite-sora project is an initiative to replicate Sora, co-launched by East China Normal University and the ModelScope community. It aims to explore the minimal reproduction and streamlined implementation of the video generation algorithms behind Sora. We hope to provide concise and readable code to facilitate collective experimentation and improvement, continuously pushing the boundaries of open-source video generation technology.

Roadmap

Usage

Python Environment

conda env create -f environment.yml
conda activate litesora

Download Models

models/text_encoder/model.safetensors: Stable Diffusion XL's Text Encoder. download
models/denoising_model/model.safetensors：We trained a denoising model using a small dataset Pixabay100. This model serves to demonstrate that our training code is capable of fitting the training data properly, with a resolution of 64*64. Obviously this model is overfitting due to the limited amount of training data, and thus it lacks generalization capability at this stage. Its purpose is solely for verifying the correctness of the training algorithm. download
models/vae/model.safetensors: Stable Video Diffusion's VAE. download

Training

from litesora.data import TextVideoDataset
from litesora.models import SDXLTextEncoder2
from litesora.trainers.v1 import LightningVideoDiT
import lightning as pl
import torch


if __name__ == '__main__':
    # dataset and data loader
    dataset = TextVideoDataset("data/pixabay100", "data/pixabay100/metadata.json",
                               num_frames=64, height=64, width=64)
    train_loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=1, num_workers=8)

    # model
    model = LightningVideoDiT(learning_rate=1e-5)
    model.text_encoder.load_state_dict_from_diffusers("models/text_encoder/model.safetensors")

    # train
    trainer = pl.Trainer(max_epochs=100000, accelerator="gpu", devices="auto", callbacks=[
        pl.pytorch.callbacks.ModelCheckpoint(save_top_k=-1)
    ])
    trainer.fit(model=model, train_dataloaders=train_loader)

While the training program is running, you can launch tensorboard to see the training loss.

tensorboard --logdir .

Inference

Synthesize a video in the pixel space.

from litesora.models import SDXLTextEncoder2, VideoDiT
from litesora.pipelines import PixelVideoDiTPipeline
from litesora.data import save_video
import torch


# models
text_encoder = SDXLTextEncoder2.from_diffusers("models/text_encoder/model.safetensors")
denoising_model = VideoDiT.from_pretrained("models/denoising_model/model.safetensors")

# pipeline
pipe = PixelVideoDiTPipeline(torch_dtype=torch.float16, device="cuda")
pipe.fetch_models(text_encoder, denoising_model)

# generate a video
prompt = "woman, flowers, plants, field, garden"
video = pipe(prompt=prompt, num_inference_steps=100)

# save the video (the resolution is 64*64, we enlarge it to 512*512 here)
save_video(video, "output.mp4", upscale=8)

Encode a video into the latent space, and then decode it.

from litesora.models import SDVAEEncoder, SVDVAEDecoder
from litesora.data import load_video, tensor2video, concat_video, save_video
import torch
from tqdm import tqdm


frames = load_video("data/pixabay100/videos/168572 (Original).mp4",
                    num_frames=1024, height=1024, width=1024, random_crop=False)
frames = frames.to(dtype=torch.float16, device="cpu")

encoder = SDVAEEncoder.from_diffusers("models/vae/model.safetensors").to(dtype=torch.float16, device="cuda")
decoder = SVDVAEDecoder.from_diffusers("models/vae/model.safetensors").to(dtype=torch.float16, device="cuda")

with torch.no_grad():
    print(frames.shape)
    latents = encoder.encode_video(frames, progress_bar=tqdm)
    print(latents.shape)
    decoded_frames = decoder.decode_video(latents, progress_bar=tqdm)

video = tensor2video(concat_video([frames, decoded_frames]))
save_video(video, "video.mp4", fps=24)

Results (Experimental)

We trained a denoising model using a small dataset Pixabay100. This model serves to demonstrate that our training code is capable of fitting the training data properly, with a resolution of 64*64. Obviously this model is overfitting due to the limited amount of training data, and thus it lacks generalization capability at this stage. Its purpose is solely for verifying the correctness of the training algorithm. download

airport, people, crowd, busy	beach, ocean, waves, water, sand	bee, honey, insect, beehive, nature	coffee, beans, caffeine, coffee, shop

fish, underwater, aquarium, swim	forest, woods, mystical, morning	ocean, beach, sunset, sea, atmosphere	hair, wind, girl, woman, people

reeds, grass, wind, golden, sunshine	sea, ocean, seagulls, birds, sunset	woman, flowers, plants, field, garden	wood, anemones, wildflower, flower

We leverage the VAE model from Stable-Video-Diffusion to encode videos to the latent space. Our code supports extremely long high-resolution videos!

https://github.com/modelscope/lite-sora/assets/35051019/dc205719-d0bc-4bca-b117-ff5aa19ebd86