

On Memorization in Diffusion Models

We run our experiments on the CIFAR-10 and ImageNet datasets.

CIFAR-10 can be downloaded and saved to datasets/cifar10 with the following commands:

mkdir datasets
mkdir datasets/cifar10
wget -P datasets/cifar10 https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

Prepare the full training dataset of CIFAR-10 with $|\mathcal{D}|=50\text{k}$:

python dataset_tool.py --source=datasets/cifar10/cifar-10-python.tar.gz --dest=datasets/cifar10/cifar10-train.zip

To download ImageNet, please refer to the ImageNet Object Localization Challenge and save the data to datasets/imagenet.

Optimal Diffusion Model

First, we compare images generated by the theoretical optimum with images generated by a state-of-the-art diffusion model (EDM). These experiments are run on a single A100 GPU.

The implementation of the theoretical optimum is in training/optim.py. Use the following command to generate images with the theoretical optimum:

torchrun --standalone --nproc_per_node=1 generate_optim.py --outdir=fid-tmp-optim --seeds=0-49999 --subdirs --network=datasets/cifar10/cifar10-train.zip
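
For intuition about what the theoretical optimum computes: for a finite training set $\{y_i\}$ and noise level $\sigma$, the minimizer of the denoising objective has a closed form, $D^*(x;\sigma)=\sum_i w_i y_i$ with $w_i \propto \exp(-\|x-y_i\|^2/(2\sigma^2))$, that is, a softmax-weighted average of the training images. The NumPy sketch below illustrates that formula only. It is not the code in training/optim.py, and the function name `optimal_denoiser` and its argument layout are assumptions made for this example.

```python
import numpy as np

def optimal_denoiser(x, sigma, train):
    """Closed-form optimal denoiser over a finite training set (illustrative).

    x:     (B, D) array of noisy inputs, flattened
    sigma: positive scalar noise level
    train: (N, D) array of training points, flattened
    Returns a (B, D) array of denoised outputs.
    """
    # Squared Euclidean distance from every input to every training point: (B, N).
    d2 = ((x[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian log-likelihoods up to a constant shared across training points.
    logits = -d2 / (2.0 * sigma ** 2)
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Softmax-weighted average of the training points.
    return weights @ train
```

At small $\sigma$ the weights concentrate on the nearest training point, so the output collapses onto a training image. At large $\sigma$ the output approaches the mean of the training set.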

Use the following command to generate images with EDM:

torchrun --standalone --nproc_per_node=1 generate.py --outdir=fid-tmp-edm --seeds=0-49999 --subdirs --network=https://nvlabs-fi-cdn.nvidia.com/edm/pretrained/edm-cifar10-32x32-uncond-vp.pkl

Empirical Study

The basic procedure for evaluating how a factor contributes to memorization in diffusion models is as follows:

Step I: Sample a training dataset of a chosen size $|\mathcal{D}|$. The code is in dataset_utils, which will be introduced later. The sampled dataset is saved to $data_path.
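
As a rough illustration of Step I, the sketch below draws a reproducible random subset of image indices. It is not the code in dataset_utils: the function name `sample_subset_indices`, the seeding scheme, and the use of uniform sampling without replacement are all assumptions made for this example.

```python
import random

def sample_subset_indices(total, size, seed=0):
    """Pick `size` distinct indices out of `total` images (hypothetical helper).

    A fixed seed makes the subset reproducible across runs.
    """
    if not 0 < size <= total:
        raise ValueError("size must be between 1 and total")
    rng = random.Random(seed)
    # Sample without replacement, then sort for a stable on-disk order.
    return sorted(rng.sample(range(total), size))
```

For example, `sample_subset_indices(50000, 1000)` selects 1,000 of the 50k CIFAR-10 training images. The corresponding images would then be written to $data_path.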

Step II: Train a diffusion model on the training data.

All model-training experiments are run on 8 A100 GPUs using DDP with multi-node training. The basic command is:

torchrun --nproc_per_node 1 \
         --nnodes $WORLD_SIZE \
         --node_rank $RANK \
         --master_addr $MASTER_ADDR \
         --master_port $MASTER_PORT \
         train.py --outdir=$savedir --argument=$argument

Alternatively, you can use the following command for DDP with single-node training:

torchrun --standalone --nproc_per_node=8 train.py --outdir=$savedir --argument=$argument

We suggest providing a unique $savedir for each experiment. $argument stands for all hyper-parameters.

Step III: Evaluate the snapshots of this trained diffusion model and report the highest memorization ratio.

torchrun --standalone --nproc_per_node=$num_gpu mem_ratio.py --expdir=$outdir --knn-ref=$data_path --log=$outdir/mem_traj.log --seeds=0-9999 --subdirs --batch=512

$outdir refers to the folder containing all model snapshots.
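
To make the metric concrete, here is a minimal sketch of a nearest-neighbor memorization ratio. It assumes a generated image counts as memorized when its $\ell_2$ distance to its nearest training image is less than one third of its distance to the second-nearest training image. That criterion, the function name `memorization_ratio`, and the brute-force distance computation are assumptions for illustration and should be checked against mem_ratio.py.

```python
import numpy as np

def memorization_ratio(generated, train, threshold=1.0 / 3.0):
    """Fraction of generated samples flagged as memorized (illustrative).

    generated: (B, ...) array of generated images
    train:     (N, ...) array of training images, N >= 2
    threshold: assumed ratio between nearest and second-nearest distances
    """
    g = generated.reshape(len(generated), -1).astype(np.float64)
    t = train.reshape(len(train), -1).astype(np.float64)
    # Pairwise squared distances via ||g||^2 + ||t||^2 - 2 g.t, shape (B, N).
    d2 = (g ** 2).sum(axis=1)[:, None] + (t ** 2).sum(axis=1)[None, :] - 2.0 * g @ t.T
    d = np.sqrt(np.maximum(d2, 0.0))
    # After partitioning at index 1, column 0 holds the smallest distance
    # and column 1 the second smallest.
    two_nearest = np.partition(d, 1, axis=1)[:, :2]
    memorized = two_nearest[:, 0] < threshold * two_nearest[:, 1]
    return float(memorized.mean())
```

A sample that sits almost on top of one training image is flagged, while a sample roughly equidistant from several training images is not.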

Step IV: Gradually increase the training dataset size $|\mathcal{D}|$, repeat Steps I to III for each size, and find the Effective Model Memorization (EMM).
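
One way to turn the sweep in Step IV into a single number is sketched below. It assumes EMM is read off as the dataset size at which the highest memorization ratio falls below a target level, and it locates that size by linear interpolation between the two neighboring measurements. The function name `estimate_emm`, the default target of 0.9, and the interpolation scheme are assumptions for this example, not the paper's procedure.

```python
def estimate_emm(sizes, ratios, target=0.9):
    """Estimate the dataset size where the memorization ratio crosses `target`.

    sizes:  training dataset sizes |D| that were evaluated
    ratios: highest memorization ratio measured at each size
    Returns the interpolated size, or None if the ratio never crosses `target`.
    """
    pairs = sorted(zip(sizes, ratios))
    for (n0, r0), (n1, r1) in zip(pairs, pairs[1:]):
        # Look for the first interval where the ratio drops below the target.
        if r0 >= target > r1:
            # Linear interpolation between the two measurements.
            return n0 + (r0 - target) * (n1 - n0) / (r0 - r1)
    return None
```

If the function returns None, the sweep has not yet reached a dataset size large enough for the ratio to drop below the target, so larger sizes need to be evaluated.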

Step V: Modify the value of the evaluated factor, then repeat Steps I to IV to observe the effect of this factor on memorization.

We provide all the scripts needed to reproduce the experimental results in our paper.

<p align="center"> <img src="docs/memorize_data_dim.png" alt="" data-canonical-src="docs/memorize_data_dim.pdf" width="50%"/> </p> <p align="center"> <img src="docs/memorize_model_skip.png" alt="" data-canonical-src="docs/memorize_model_skip.pdf" width="100%"/> </p> <p align="center"> <img src="docs/memorize_model_cond.png" alt="" data-canonical-src="docs/memorize_model_cond.pdf" width="100%"/> </p>

Finally, we highlight that, in contrast to unconditional EDM, conditional EDM with unique labels as conditions can memorize a large portion of the training data at $|\mathcal{D}|=50\text{k}$.

<p align="center"> <img src="docs/memorize_model_unique.png" alt="" data-canonical-src="docs/memorize_model_unique.pdf" width="50%"/> </p>


