This is the development repository. Please see CompVis/stable-diffusion for the Stable Diffusion release.


Latent Diffusion Models

arXiv | BibTeX

<p align="center"> <img src=assets/results.gif /> </p>

High-Resolution Image Synthesis with Latent Diffusion Models<br/> Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer<br/> * equal contribution

<p align="center"> <img src=assets/modelfigure.png /> </p>

News

April 2022

Requirements

A suitable conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

Pretrained Models

A general list of all available checkpoints is available via our model zoo. If you use any of these models in your work, we are always happy to receive a citation.

Text-to-Image

(Figure: text-to-image samples)

Download the pre-trained weights (5.7GB)

mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt

and sample with

python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0  --ddim_steps 50

This will save each sample individually as well as a grid of size n_iter x n_samples at the specified output location (default: outputs/txt2img-samples). Quality, sampling speed and diversity are best controlled via the scale, ddim_steps and ddim_eta arguments. As a rule of thumb, higher values of scale produce better samples at the cost of a reduced output diversity.
Furthermore, increasing ddim_steps generally also gives higher quality samples, but returns are diminishing for values > 250. Fast sampling (i.e. low values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0.
Faster sampling (i.e. even lower values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0 and --plms (see Pseudo Numerical Methods for Diffusion Models on Manifolds).
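
As a minimal sketch of this faster setting, the PLMS sampler can be enabled by appending --plms to the same call (the step count can then be reduced further):

python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50 --plms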

Beyond 256²

For certain inputs, simply running the model in a convolutional fashion on larger features than the ones it was trained on can sometimes yield interesting results. To try it out, tune the H and W arguments (which will be integer-divided by 8 in order to calculate the corresponding latent size), e.g. run

python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  

to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting.
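
For example, --H 384 --W 1024 corresponds to a spatial latent size of 48x128 (384/8 x 1024/8).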

The example below was generated using the above command.

(Figure: convolutional text-to-image sample at 384x1024)

Inpainting

(Figure: inpainting results)

Download the pre-trained weights

wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1

and sample with

python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results

indir should contain images *.png and masks <image_fname>_mask.png like the examples provided in data/inpainting_examples.
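
For example, a custom input directory could be laid out as follows (the file names here are placeholders; only the <name>.png / <name>_mask.png pairing matters):

data/my_inpainting_inputs/
├── photo1.png
├── photo1_mask.png
├── photo2.png
└── photo2_mask.png

and passed via --indir data/my_inpainting_inputs/.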

Class-Conditional ImageNet

Available via a notebook.

(Figure: class-conditional ImageNet samples)

Unconditional Models

We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via

CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim steps> -e <#eta>
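
As a concrete sketch, to draw 50 FFHQ samples on GPU 0 with 200 DDIM steps and eta=0 (assuming the checkpoint has been extracted to models/ldm/ffhq256/ by the download script in the model zoo below; adjust the path to your local layout):

CUDA_VISIBLE_DEVICES=0 python scripts/sample_diffusion.py -r models/ldm/ffhq256/model.ckpt -l outputs/ffhq_samples -n 50 --batch_size 10 -c 200 -e 0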

Train your own LDMs

Data preparation

Faces

For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the taming-transformers repository.

LSUN

The LSUN datasets can be conveniently downloaded via the script available here. We performed a custom split into training and validation images, and provide the corresponding filenames at https://ommer-lab.com/files/lsun.zip. After downloading, extract them to ./data/lsun. The beds/cats/churches subsets should also be placed/symlinked at ./data/lsun/bedrooms, ./data/lsun/cats, and ./data/lsun/churches, respectively.
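
A possible layout, assuming the LSUN images have already been downloaded to /path/to/lsun (paths here are illustrative):

mkdir -p data/lsun
wget https://ommer-lab.com/files/lsun.zip
unzip lsun.zip -d data/lsun
ln -s /path/to/lsun/bedrooms data/lsun/bedrooms
ln -s /path/to/lsun/cats data/lsun/cats
ln -s /path/to/lsun/churches data/lsun/churches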

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar / ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ and ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, respectively; they will then be extracted into the above structure without downloading again. Note that this will only happen if neither the folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor the file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exists. Remove them if you want to force running the dataset preparation again.
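
If ImageNet is already extracted on disk in the per-synset layout shown above, a minimal way to expose it to the loader is to symlink it into the cache directory (here /path/to/imagenet is a placeholder for your local copy and ~/.cache stands in for the default ${XDG_CACHE}):

mkdir -p ~/.cache/autoencoders/data/ILSVRC2012_train ~/.cache/autoencoders/data/ILSVRC2012_validation
ln -s /path/to/imagenet/train ~/.cache/autoencoders/data/ILSVRC2012_train/data
ln -s /path/to/imagenet/val ~/.cache/autoencoders/data/ILSVRC2012_validation/data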

Model Training

Logs and checkpoints for trained models are saved to logs/<START_DATE_AND_TIME>_<config_spec>.

Training autoencoder models

Configs for training a KL-regularized autoencoder on ImageNet are provided at configs/autoencoder. Training can be started by running

CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,    

where <config_spec> is one of {autoencoder_kl_8x8x64 (f=32, d=64), autoencoder_kl_16x16x16 (f=16, d=16), autoencoder_kl_32x32x4 (f=8, d=4), autoencoder_kl_64x64x3 (f=4, d=3)}.
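
For example, to train the f=8, d=4 KL-regularized autoencoder on GPU 0:

CUDA_VISIBLE_DEVICES=0 python main.py --base configs/autoencoder/autoencoder_kl_32x32x4.yaml -t --gpus 0,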

For training VQ-regularized models, see the taming-transformers repository.

Training LDMs

In configs/latent-diffusion/ we provide configs for training LDMs on the LSUN-, CelebA-HQ, FFHQ and ImageNet datasets. Training can be started by running

CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,

where <config_spec> is one of {celebahq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), ffhq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_bedrooms-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_churches-ldm-vq-4 (f=8, KL-reg. autoencoder, spatial size 32x32x4), cin-ldm-vq-8 (f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.
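
For example, to train the CelebA-HQ LDM on GPU 0:

CUDA_VISIBLE_DEVICES=0 python main.py --base configs/latent-diffusion/celebahq-ldm-vq-4.yaml -t --gpus 0,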

Model Zoo

Pretrained Autoencoding Models

(Figure: reconstructions of the pretrained autoencoders)

All models were trained until convergence (no further substantial improvement in rFID).

| Model | rFID vs val | train steps | PSNR | PSIM | Link | Comments |
|---|---|---|---|---|---|---|
| f=4, VQ (Z=8192, d=3) | 0.58 | 533066 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | |
| f=4, VQ (Z=8192, d=3) | 1.06 | 658131 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
| f=8, VQ (Z=16384, d=4) | 1.14 | 971043 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | |
| f=8, VQ (Z=256, d=4) | 1.49 | 1608649 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
| f=16, VQ (Z=16384, d=8) | 5.15 | 1101166 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | |
| f=4, KL | 0.27 | 176991 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | |
| f=8, KL | 0.90 | 246803 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | |
| f=16, KL (d=16) | 0.87 | 442998 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | |
| f=32, KL (d=64) | 2.04 | 406763 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip | |

Get the models

Running the following script downloads and extracts all available pretrained autoencoding models.

bash scripts/download_first_stages.sh

The first stage models can then be found in models/first_stage_models/<model_spec>.

Pretrained LDMs

| Dataset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
|---|---|---|---|---|---|---|---|---|
| CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | |
| FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | |
| LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0) | 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | |
| LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 2.95 (3.0) | 2.22 (2.23) | 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | |
| ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77 (7.76)* / 15.82** | 201.56 (209.52)* / 78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10; **: w/o guiding, scores in brackets calculated with script provided by ADM |
| Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION |
| OpenImages | Super-resolution | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation |
| OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | |
| Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip | |
| Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | finetuned on resolution 512x512 |

Get the models

The LDMs listed above can jointly be downloaded and extracted via

bash scripts/download_models.sh

The models can then be found in models/ldm/<model_spec>.

Coming Soon...

Comments

BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}