# 🍰 Tiny AutoEncoder for Stable Diffusion

## What is TAESD?
TAESD is a very tiny autoencoder that uses the same "latent API" as Stable Diffusion's VAE*. TAESD can decode Stable Diffusion's latents into full-size images at (nearly) zero cost. Here's a comparison on my laptop:
TAESD is compatible with SD1/2-based models (using the `taesd_*` weights), SDXL-based models (using the `taesdxl_*` weights), SD3-based models (using the `taesd3_*` weights), and FLUX.1-based models (using the `taef1_*` weights).
## Where can I get TAESD?
- TAESD is already available in
  - A1111
    - As a previewer, thanks to Sakura-Luna (enable it in Settings > Live Previews)
    - As an encoder / decoder, thanks to Kohaku-Blueleaf (try it in Settings > VAE)
  - vladmandic, thanks to vladmandic
  - ComfyUI
    - As a previewer, thanks to space-nuko (follow the instructions under "How to show high-quality previews", then launch ComfyUI with `--preview-method taesd`)
    - As a standalone VAE (download both `taesd_encoder.pth` and `taesd_decoder.pth` into `models/vae_approx`, then add a `Load VAE` node and set `vae_name` to `taesd`)
- TAESD is also available for 🧨 Diffusers in `safetensors` format
- TAESD's original weights are in this repo
## What can I use TAESD for?
Since TAESD is very fast, you can use TAESD to watch Stable Diffusion's image generation progress in real time. Here's a minimal example notebook that adds TAESD previewing to the 🧨 Diffusers implementation of SD2.1.
Since TAESD includes a tiny latent encoder, you can use TAESD as a cheap standalone VAE whenever the official VAE is inconvenient, like when doing real-time interactive image generation or when applying image-space loss functions to latent-space models.
Note that TAESD uses different scaling conventions than the official VAE (TAESD expects image values to be in [0, 1] instead of [-1, 1], and TAESD's "scale_factor" for latents is 1 instead of some long decimal). Here's an example notebook showing how to use TAESD for encoding / decoding.
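As a sketch of those scaling conventions, here are two small conversion helpers. The function names are hypothetical, and `0.18215` is assumed to be the SD1/2-style latent scale factor (the "long decimal" mentioned above); check your pipeline's config for the exact value.

```python
import numpy as np

def sd_image_to_taesd(image):
    """Map an SD-convention image in [-1, 1] to TAESD's expected [0, 1]."""
    return image * 0.5 + 0.5

def sd_latents_to_taesd(latents, sd_scale_factor=0.18215):
    """SD pipelines multiply raw VAE latents by a scale factor
    (~0.18215 for SD1/2); TAESD's scale factor is 1, so undo it."""
    return latents / sd_scale_factor
```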
## How does TAESD work?
TAESD is a tiny, distilled version of Stable Diffusion's VAE*, which consists of an encoder and decoder. The encoder turns full-size images into small "latent" ones (with 48x lossy compression), and the decoder then generates new full-size images based on the encoded latents by making up new details.
The original / decoded images are of shape `3xHxW` with values in approximately `[0, 1]`, and the latents are of shape `4x(H/8)x(W/8)` with values in approximately `[-3, 3]`. You can clip and quantize TAESD latents into 8-bit PNGs without much loss of quality. TAESD latents should look pretty much like Stable Diffusion latents.
Internally, TAESD is a bunch of Conv+ReLU resblocks and 2x upsample layers:
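A minimal PyTorch sketch of the kind of blocks described above (illustrative only: the real model's channel widths, block counts, and exact layer arrangement differ):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A Conv+ReLU residual block, as described above."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        # Residual connection, then a final ReLU.
        return torch.relu(self.net(x) + x)

# A 2x nearest-neighbor upsample layer sits between groups of blocks
# in the decoder, growing the latents back toward full resolution.
up = nn.Upsample(scale_factor=2)
x = torch.randn(1, 64, 8, 8)
y = up(Block(64)(x))  # spatial size doubles: 8x8 -> 16x16
```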
## What are the limitations of TAESD?
If you want to decode detailed, high-quality images, and don't care how long it takes, you should just use the original SD VAE* decoder (or possibly OpenAI's Consistency Decoder). TAESD is very tiny and trying to work very quickly, so it tends to fudge fine details. Example:
TAESD trades a (modest) loss in quality for a (substantial) gain in speed and convenience.
## Does TAESD work with video generators?
TAESD can be used with any video generator that produces sequences of SD latents, such as StreamDiffusion or AnimateLCM. However, TAESD generates new details for each frame so the results will flicker a bit. For smooth realtime video decoding you want TAESDV, and for slow-but-high-quality video decoding you want the SVD VAE.
## Comparison table
| | SD VAE* | TAESD |
|---|---|---|
| Parameters in Encoder | 34,163,592 | 1,222,532 |
| Parameters in Decoder | 49,490,179 | 1,222,531 |
| ONNX Ops | Add, Cast, Concat, Constant, ConstantOfShape, Conv, Div, Gather, InstanceNormalization, MatMul, Mul, Pad, Reshape, Resize, Shape, Sigmoid, Slice, Softmax, Transpose, Unsqueeze | Add, Constant, Conv, Div, Mul, Relu, Resize, Tanh |
| Runtime / memory scales linearly with size of the latents | No | Yes |
| Bounded receptive field so you can split decoding work into tiles without, like, weird seams and stuff | No | Yes |
| High-quality details | Yes | No |
| Tiny | No | Yes |
\* VQGAN? AutoencoderKL? `first_stage_model`? This thing. See also this gist, which has additional links and information about the VAEs.