ComfyUI_stable_fast

Experimental usage of stable-fast and TensorRT.

[!NOTE]

The official TensorRT node is available at https://github.com/comfyanonymous/ComfyUI_TensorRT.
This repo is still experimental; its goal is to try a TensorRT setup whose engines don't need to be rebuilt repeatedly.

Installation

```bash
git clone https://github.com/gameltb/ComfyUI_stable_fast custom_nodes/ComfyUI_stable_fast
```

stable-fast

You'll need to follow the guide below to enable the stable fast node.

stable-fast installation

[!NOTE]

Requires stable-fast >= 1.0.0.
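
Installation usually means picking the prebuilt wheel that matches your Python / torch / CUDA versions from the stable-fast releases page; the wheel name below is only a hypothetical example:

```bash
# Hypothetical wheel name -- substitute the one matching your environment
# from https://github.com/chengzeyi/stable-fast/releases
pip install stable_fast-1.0.0+torch211cu121-cp310-cp310-manylinux2014_x86_64.whl
```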

TensorRT (testing)

[!NOTE]

Currently tested only on Linux; not tested on Windows.

The following packages need to be installed to use TensorRT.

```bash
pip install onnx zstandard onnxscript --upgrade
pip install --pre --upgrade --extra-index-url https://pypi.nvidia.com tensorrt==10.2.0
pip install onnx-graphsurgeon polygraphy --extra-index-url https://pypi.ngc.nvidia.com
```
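
After installing, a quick sanity check (not part of the original guide) confirms the TensorRT Python bindings load and match the pinned version:

```bash
# Should print 10.2.0 given the pin above.
python -c "import tensorrt; print(tensorrt.__version__)"
```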

Usage

Please refer to the screenshots in the Screenshot section below.

stable-fast

It works with LoRA, ControlNet, and LCM. SD1.5 and SSD-1B are supported; SDXL should work.
Running ComfyUI with --disable-cuda-malloc may improve speed further.
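
For example, assuming ComfyUI's standard main.py entry point, a launch command might look like:

```bash
# Run from the ComfyUI root directory.
python main.py --disable-cuda-malloc
```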

TensorRT

Run ComfyUI with --disable-xformers --force-fp16 --fp16-vae and use the Apply TensorRT Unet node the same way as Apply StableFast Unet.
The engine will be cached in tensorrt_engine_cache.
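
Combining the flags above, a launch command might look like this (again assuming ComfyUI's standard main.py entry point):

```bash
# fp16 UNet and VAE, xformers disabled, as recommended above.
python main.py --disable-xformers --force-fp16 --fp16-vae
```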

Apply TensorRT Unet Node

[!NOTE]

When you use ControlNet, different control image sizes will currently cause the engine to be rebuilt each time.

Tables

Features

| | Stable Fast | TensorRT (UNET) | TensorRT (UNET_BLOCK) |
| --- | --- | --- | --- |
| SD1.5 | ✓ | ✓ | ✓ |
| SDXL | ✓ | untested (should work) | untested |
| SSD-1B | ✓ | ✓ | ✓ |
| Lora | ✓ | ✓ | ✓ |
| ControlNet Unet | ✓ | ✓ | ✓ |
| VAE decode | ✓ | WIP | - |
| ControlNet Model | WIP | WIP | - |

Nodes Tested

| | Stable Fast | TensorRT (UNET) | TensorRT (UNET_BLOCK) |
| --- | --- | --- | --- |
| Load LoRA | ✓ | ✓ | ✓ |
| FreeU (FreeU_V2) | ✓ | ✓ | ✓ |
| PatchModelAddDownscale | ✓ | ✓ | WIP |

Speed Test

GeForce RTX 3060 Mobile

GeForce RTX 3060 Mobile (80W) 6GB, Linux, torch 2.1.1, stable-fast 0.0.14, TensorRT 9.2.0.post12.dev5, xformers 0.0.23.
Workflow: SD1.5, 512x512, batch_size 1, euler_ancestral karras, 20 steps, fp16.

Stable Fast and xformers were tested with ComfyUI run with --disable-cuda-malloc.
TensorRT and pytorch were tested with ComfyUI run with --disable-xformers.

TensorRT Note

On first launch, TensorRT takes up to 10 minutes to build the engine; with the timing cache this drops to about 2–3 minutes, and with the engine cache to about 20–30 seconds.

Avg it/s

| workflow | Stable Fast (enable_cuda_graph) | TensorRT (UNET) | TensorRT (UNET_BLOCK) | pytorch cross attention | xformers |
| --- | --- | --- | --- | --- | --- |
| base | 10.10 it/s | 10.95 it/s | 10.66 it/s | 7.02 it/s | 7.90 it/s |
| enable FreeU | 9.42 it/s | 10.04 it/s | - | 6.75 it/s | 7.54 it/s |
| enable Patch Model Add Downscale | 10.81 it/s | 11.30 it/s | - | 7.46 it/s | 8.41 it/s |

Avg time spent

| workflow | Stable Fast (enable_cuda_graph) | TensorRT (UNET) | TensorRT (UNET_BLOCK) | pytorch cross attention | xformers |
| --- | --- | --- | --- | --- | --- |
| base | 2.21s (first 17s) | 2.05s | 2.10s | 3.06s | 2.76s |
| enable FreeU | 2.35s (first 18.5s) | 2.24s | - | 3.18s | 2.88s |
| enable Patch Model Add Downscale | 2.08s (first 31.37s) | 2.03s | - | 2.89s | 2.61s |

Screenshot

(Screenshots: SD1.5 and SSD-1B example workflows.)