<p align="center"> <img src="assets/logo-modified.png" width="23%"> <br> </p> <div align="center"> <h1>AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising</h1> <div align="center"> <a href="https://opensource.org/licenses/Apache-2.0"> <img alt="License: Apache 2.0" src="https://img.shields.io/badge/License-Apache%202.0-4E94CE.svg"> </a> <a href="https://arxiv.org/abs/2406.06911"> <img src="https://img.shields.io/badge/Conference-NeurIPS'24-924E7D.svg" alt="Paper"> </a> <a href="https://czg1225.github.io/asyncdiff_page/"> <img src="https://img.shields.io/badge/Project-Page-FFB000.svg" alt="Project"> </a> <a href="https://pytorch.org/"> <img src="https://img.shields.io/badge/PyTorch-%3E=v2.0.1-EE4C2C.svg" alt="PyTorch>=v2.0.1"> </a> </div> </div>

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang
Learning and Vision Lab, National University of Singapore
🥯[Paper](https://arxiv.org/abs/2406.06911) 🎄[Project Page](https://czg1225.github.io/asyncdiff_page/)
Code Contributors: Zigeng Chen, Zhenxiong Tan

<div align="center"> <img src="assets/combined.png" width="100%" ></img> <br> <em> 2.8x Faster on SDXL with 4 devices. Top: 50 step original (13.81s). Bottom: 50 step AsyncDiff (4.98s) </em> </div> <br> <div align="center"> <img src="assets/combined.gif" width="100%" ></img> <br> <em> 1.8x Faster on AnimateDiff with 2 devices. Top: 50 step original (43.5s). Bottom: 50 step AsyncDiff (24.5s) </em> </div> <br>

Updates

Supported Diffusion Models:

Introduction

We introduce AsyncDiff, a universal, plug-and-play acceleration scheme for diffusion models that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components and assigns each one to a different device. To break the dependency chain between these components, it turns the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. As a result, the components can compute in parallel on separate devices. This strategy significantly reduces inference latency with minimal impact on generative quality.

Overview of the asynchronous denoising process: the denoising model ϵθ is divided into four components for clarity. Following the warm-up stage, each component's input is prepared in advance, which breaks the dependency chain and allows the components to be processed in parallel.
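As a rough illustration of the schedule described above, the following toy sketch simulates it in a single process with NumPy. It is not the AsyncDiff implementation: the components are small random tanh layers, the update `x - step_size * eps` stands in for a real scheduler, and the names `make_components`, `sequential_denoise`, and `async_denoise` are invented for this example. After `warm_up` sequential steps, each component reads the output its predecessor produced in the previous step, so every input is known before the step starts and the components could run on separate devices.

```python
import numpy as np


def make_components(n, dim, seed=0):
    """Build n toy components (hypothetical stand-ins for slices of the denoiser)."""
    rng = np.random.default_rng(seed)
    weights = [np.eye(dim) + 0.05 * rng.standard_normal((dim, dim)) for _ in range(n)]
    return [lambda h, w=w: np.tanh(h @ w) for w in weights]


def sequential_denoise(components, x, steps, step_size=0.1):
    """Baseline: every step runs all components one after another."""
    for _ in range(steps):
        h = x
        for f in components:
            h = f(h)
        x = x - step_size * h  # toy update, not a real scheduler
    return x


def async_denoise(components, x, steps, warm_up=1, step_size=0.1):
    """Toy asynchronous schedule: after warm-up, component i consumes the
    output that component i-1 produced in the previous step."""
    if warm_up < 1:
        raise ValueError("at least one warm-up step is needed to fill the cache")
    cached = [None] * len(components)  # cached[i]: latest output of component i
    for t in range(steps):
        if t < warm_up:
            # Warm-up: ordinary sequential pass that fills the cache.
            h = x
            for i, f in enumerate(components):
                h = f(h)
                cached[i] = h
        else:
            # All inputs are available up front, so these calls are independent
            # of each other and could be dispatched to different devices.
            inputs = [x] + cached[:-1]
            cached = [f(inp) for f, inp in zip(components, inputs)]
        x = x - step_size * cached[-1]
    return x


components = make_components(n=4, dim=8)
x_init = np.ones(8)
reference = sequential_denoise(components, x_init, steps=20)
approx = async_denoise(components, x_init, steps=20, warm_up=1)
print("max abs difference:", np.max(np.abs(reference - approx)))
```

With a single component, or when every step is a warm-up step, the asynchronous loop reduces to the sequential one.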

🔧 Quick Start

Installation

Usage Example

Simply add two lines of code to enable asynchronous parallel inference for the diffusion model.

import torch
import torch.distributed as dist
from diffusers import StableDiffusionPipeline
from asyncdiff.async_sd import AsyncDiff

pipeline = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16, use_safetensors=True, low_cpu_mem_usage=True)

# Added line 1: wrap the pipeline with AsyncDiff
async_diff = AsyncDiff(pipeline, model_n=2, stride=1, time_shift=False)

# Added line 2: reset AsyncDiff's state and set the number of warm-up steps
async_diff.reset_state(warm_up=1)
image = pipeline(<prompts>).images[0]
# Every process runs the pipeline; only rank 0 writes the image
if dist.get_rank() == 0:
    image.save("output.jpg")

Here, we use the Stable Diffusion pipeline as an example. You can replace pipeline with any variant of the Stable Diffusion pipeline, such as SD 2.1, SD 1.5, SDXL, or SVD. We also provide the implementation of AsyncDiff for AnimateDiff in asyncdiff.async_animate.
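The rank check at the end of the usage example matters because a distributed launch starts one process per device, and each of them runs the whole script. The sketch below shows the same guard without importing `torch`, assuming the script is started by `torch.distributed.run`, which exports a `RANK` environment variable to every worker. The helpers `is_main_process` and `save_once` are hypothetical names for this illustration and are not part of AsyncDiff.

```python
import os


def is_main_process() -> bool:
    """Return True on the process whose global rank is 0.

    Assumes a torch.distributed.run launch, which sets RANK for each worker.
    If RANK is absent, this is treated as a single-process run.
    """
    return int(os.environ.get("RANK", "0")) == 0


def save_once(save_fn, path) -> bool:
    """Call save_fn(path) on the main process only; report whether it ran."""
    if is_main_process():
        save_fn(path)
        return True
    return False
```

In the usage example this would correspond to `save_once(image.save, "output.jpg")`, so that the workers do not all write the same file.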

Inference

We offer detailed scripts in examples/ for accelerating inference of SD 2.1, SD 1.5, SDXL, SD 3, ControlNet, SD_Upscaler, AnimateDiff, and SVD using our AsyncDiff framework.

🚀 Accelerate Stable Diffusion XL:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --run-path examples/run_sdxl.py

🚀 Accelerate Stable Diffusion 2.1 or 1.5:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --nproc_per_node=4 --run-path examples/run_sd.py

🚀 Accelerate Stable Diffusion 3 Medium:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --run-path examples/run_sd3.py

🚀 Accelerate Stable Diffusion x4 Upscaler:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --run-path examples/run_sd_upscaler.py

🚀 Accelerate SDXL Inpainting:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --run-path examples/run_sdxl_inpaint.py

🚀 Accelerate ControlNet+SDXL:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --run-path examples/run_sdxl_controlnet.py

🚀 Accelerate AnimateDiff:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --run-path examples/run_animatediff.py

🚀 Accelerate Stable Video Diffusion:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --run-path examples/run_svd.py
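All of the commands above follow one pattern: the GPUs listed in `CUDA_VISIBLE_DEVICES` and the value of `--nproc_per_node` agree, so one process is started per visible device. The helper below is a small sketch that assembles such a command line for a given script and GPU list. `build_launch_command` is a hypothetical convenience function, not something shipped with the repository.

```python
import shlex


def build_launch_command(script, gpu_ids):
    """Assemble a launch command in the same form as the examples above,
    starting one process per listed GPU."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    devices = ",".join(str(g) for g in gpu_ids)
    args = [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={len(gpu_ids)}",
        "--run-path", script,
    ]
    return f"CUDA_VISIBLE_DEVICES={devices} " + " ".join(shlex.quote(a) for a in args)


print(build_launch_command("examples/run_sdxl.py", [0, 1, 2, 3]))
```

For example, passing `"examples/run_svd.py"` with `[0, 1]` reproduces the Stable Video Diffusion command listed above.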

Qualitative Results

Qualitative results on SDXL and SD 2.1. More qualitative results can be found in our paper.

Quantitative Results

Quantitative evaluations of AsyncDiff on three text-to-image diffusion models, showcasing various configurations. More quantitative results can be found in our paper.

BibTeX

@article{chen2024asyncdiff,
  title={AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising},
  author={Chen, Zigeng and Ma, Xinyin and Fang, Gongfan and Tan, Zhenxiong and Wang, Xinchao},
  journal={arXiv preprint arXiv:2406.06911},
  year={2024}
}