
<div align="center"> <h1>Autoregressive Video Generation without Vector Quantization</h1> <p align="center"> <a href="https://arxiv.org/abs/2412.14169"><img src="https://img.shields.io/badge/ArXiv-2412.14169-%23840707.svg" alt="ArXiv"></a> <a href="https://huggingface.co/spaces/BAAI/nova-d48w1024-sdxl1024"><img src="https://img.shields.io/badge/🤗 Demo-T2I-%26840707.svg" alt="T2IDemo"></a> <a href="https://huggingface.co/spaces/BAAI/nova-d48w1024-osp480"><img src="https://img.shields.io/badge/🤗 Demo-T2V-%26840707.svg" alt="T2VDemo"></a> <a href="https://novamodel.github.io/"><img src="https://img.shields.io/badge/Webpage-NOVA-%237CB4F7.svg" alt="Webpage"></a> </p>

Haoge Deng<sup>1,4*</sup>, Ting Pan<sup>2,4*</sup>, Haiwen Diao<sup>3,4*</sup>, Zhengxiong Luo<sup>4*</sup>, Yufeng Cui<sup>4</sup><br> Huchuan Lu<sup>3</sup>, Shiguang Shan<sup>2</sup>, Yonggang Qi<sup>1</sup>, Xinlong Wang<sup>4†</sup><br>

BUPT<sup>1</sup>, ICT-CAS<sup>2</sup>, DLUT<sup>3</sup>, BAAI<sup>4</sup><br> <sup>*</sup> Equal Contribution, <sup>†</sup> Corresponding Author <br><br><img src="assets/model_overview.png"/>

</div>

We present NOVA (NOn-Quantized Video Autoregressive Model), an efficient autoregressive model for image and video generation. NOVA reformulates video generation as non-quantized autoregressive modeling: temporal frame-by-frame prediction combined with spatial set-by-set prediction. It generalizes well and supports diverse zero-shot generation abilities within one unified model.
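For intuition, below is a self-contained toy sketch of this two-level factorization. All names, shapes, and the random "denoiser" are illustrative stand-ins, not the actual NOVA implementation (which lives in the `diffnext` package):

```python
import torch

# Toy sketch of NOVA's two-level autoregression: frames are predicted one by
# one in time, and each frame's continuous (non-quantized) token sets are
# predicted set by set, conditioned on everything generated so far.

def denoise_token_set(context: torch.Tensor, set_dim: int = 16) -> torch.Tensor:
    """Stand-in for the per-set diffusion head over continuous tokens."""
    return torch.randn(set_dim) + 0.0 * context.sum()

def generate(num_frames: int = 3, sets_per_frame: int = 4) -> torch.Tensor:
    tokens = []  # all previously generated token sets (the AR context)
    for _ in range(num_frames):          # temporal: frame-by-frame
        for _ in range(sets_per_frame):  # spatial: set-by-set
            context = torch.stack(tokens) if tokens else torch.zeros(1, 16)
            tokens.append(denoise_token_set(context))
    # Regroup the flat list of token sets into per-frame groups.
    return torch.stack(tokens).view(num_frames, sets_per_frame, -1)

video_tokens = generate()
print(video_tokens.shape)  # torch.Size([3, 4, 16])
```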

🚀News

✨Highlights

🗄️Model Zoo

<a id="model-zoo"></a>

See the detailed description in the Model Zoo.

Text to Image

<a id="text-to-image-weight"></a>

| Model | Parameters | Resolution | Data | Weight | GenEval | DPG-Bench |
|---|---|---|---|---|---|---|
| NOVA-0.6B | 0.6B | 512x512 | 16M | 🤗 HF link | 0.75 | 81.76 |
| NOVA-0.3B | 0.3B | 1024x1024 | 600M | 🤗 HF link | 0.67 | 80.60 |
| NOVA-0.6B | 0.6B | 1024x1024 | 600M | 🤗 HF link | 0.69 | 82.25 |
| NOVA-1.4B | 1.4B | 1024x1024 | 600M | 🤗 HF link | 0.71 | 83.01 |

Text to Video

<a id="text-to-video-weight"></a>

| Model | Parameters | Resolution | Data | Weight | VBench |
|---|---|---|---|---|---|
| NOVA-0.6B | 0.6B | 33x768x480 | 20M | 🤗 HF link | 80.12 |
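Every checkpoint above can be loaded through the same pipeline API shown in the Quick Start; for example (using the text-to-video model ID that appears later in this README, swap in the Hugging Face model ID from the row you want):

```python
import torch
from diffnext.pipelines import NOVAPipeline

# Load a checkpoint from the tables above by its Hugging Face model ID.
pipe = NOVAPipeline.from_pretrained(
    "BAAI/nova-d48w1024-osp480",  # 0.6B text-to-video checkpoint
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")
```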

📖Table of Contents

1. Installation

1.1 From Source

<a id="from-source"></a> Clone this repository to local disk and install:

```bash
pip install diffusers transformers accelerate imageio[ffmpeg]
git clone https://github.com/baaivision/NOVA.git
cd NOVA && pip install .
```

1.2 From Git

<a id="from-git"></a>

You can also install directly from the remote repository if you have set up your GitHub SSH key:

```bash
pip install diffusers transformers accelerate imageio[ffmpeg]
pip install git+ssh://git@github.com/baaivision/NOVA.git
```
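Either route should leave the `diffnext` package importable; here is a quick sanity check, assuming the package name matches the imports used in the Quick Start below:

```python
# Verify the installation by importing the package and its main pipeline.
import diffnext
from diffnext.pipelines import NOVAPipeline  # used throughout the Quick Start

print(diffnext.__name__)  # prints "diffnext" if the install succeeded
```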

2. Quick Start

2.1 Text to Image

<a id="text-to-image-quickstart"></a>

```python
import torch
from diffnext.pipelines import NOVAPipeline

model_id = "BAAI/nova-d48w768-sdxl1024"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)
pipe = pipe.to("cuda")

prompt = "a shiba inu wearing a beret and black turtleneck."
image = pipe(prompt).images[0]
image.save("shiba_inu.jpg")
```
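The loaded pipeline can be reused across prompts without reloading weights; a small illustrative loop (the prompts and file names here are arbitrary):

```python
# Reuse the pipeline object from above for several prompts.
prompts = [
    "a corgi surfing a wave at sunset.",
    "an oil painting of a lighthouse in a storm.",
]
for i, p in enumerate(prompts):
    pipe(p).images[0].save(f"sample_{i}.jpg")
```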

2.2 Text to Video

<a id="text-to-video-quickstart"></a>

```python
import os

import torch
from diffnext.pipelines import NOVAPipeline
from diffnext.utils import export_to_image, export_to_video

model_id = "BAAI/nova-d48w1024-osp480"
model_args = {"torch_dtype": torch.float16, "trust_remote_code": True}
pipe = NOVAPipeline.from_pretrained(model_id, **model_args)

# Standard device routine.
pipe = pipe.to("cuda")
# If you run out of GPU memory, use the CPU model offload routine
# and the expandable allocator instead:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# pipe.enable_model_cpu_offload()

# Text to video.
prompt = "Many spotted jellyfish pulsating under water."
video = pipe(prompt, max_latent_length=9).frames[0]
export_to_video(video, "jellyfish.mp4", fps=12)

# Increase the AR and diffusion steps for better video quality.
video = pipe(
    prompt,
    max_latent_length=9,
    num_inference_steps=128,  # default: 64
    num_diffusion_steps=100,  # default: 25
).frames[0]
export_to_video(video, "jellyfish_v2.mp4", fps=12)

# You can also generate a single image from text by sampling only the first frame.
image = pipe(prompt, max_latent_length=1).frames[0, 0]
export_to_image(image, "jellyfish.jpg")
```
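Since `.frames[0]` holds the full frame sequence, individual frames can also be exported as stills. A short sketch reusing the `video` object and imports from above (file names are arbitrary):

```python
# Save every generated frame as a numbered still image.
for i, frame in enumerate(video):
    export_to_image(frame, f"jellyfish_frame_{i:02d}.jpg")
```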

3. Gradio Demo

```bash
# For the text-to-image demo
python scripts/app_nova_t2i.py --model "BAAI/nova-d48w1024-sdxl1024" --device 0

# For the text-to-video demo
python scripts/app_nova_t2v.py --model "BAAI/nova-d48w1024-osp480" --device 0
```

4. Train

5. Inference

6. Evaluation

📋Todo List

Citation

If you find this repository useful, please consider giving it a star ⭐ and a citation 🦖:

```bibtex
@article{deng2024nova,
  title={Autoregressive Video Generation without Vector Quantization},
  author={Deng, Haoge and Pan, Ting and Diao, Haiwen and Luo, Zhengxiong and Cui, Yufeng and Lu, Huchuan and Shan, Shiguang and Qi, Yonggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2412.14169},
  year={2024}
}
```

Acknowledgement

We thank the authors of the following repositories: MAE, MAR, MaskGIT, DiT, Open-Sora-Plan, CogVideo, and CodeWithGPU.

License

Code and models are licensed under the Apache License 2.0.