Evaluation of Deep Generative Models

This is the codebase for evaluating deep generative models, as presented in "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models", accepted to NeurIPS 2023.

We studied 41 generative models across a diverse range of image datasets and found:

The state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics when using the default Inception-V3 network.
Supervised networks do not provide a perceptual space that generalizes well for image evaluation, and neither do self-supervised methods from particular families.
DINOv2 provides such a generalized representation space and allows for much richer evaluation of generative models. Researchers should replace Inception-V3 in all evaluation metrics. We provide an extensive DINOv2 leaderboard below and have added the results to paperswithcode.com.
Generative models directly memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that currently proposed diagnostic metrics do not properly detect memorization.

Here we provide code to compute the following 15 generative evaluation metrics using 8 different encoder networks:

Metrics:

Encoders:


Our multifaceted investigation of generative evaluation shows that diffusion models are unfairly punished by the Inception network: they synthesize more realistic images as judged by humans and their diversity more closely resembles the training data, yet are consistently ranked worse than GANs on metrics computed in Inception-V3 representation space.

Installation & Usage

Installation

First clone this repository, then navigate to the directory and pip install to install all required packages.

git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .

We recommend you do this in a conda environment:

conda create --name dgm-eval pip python==3.10
conda activate dgm-eval
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .

Usage

Computing metrics only requires the paths to either locally hosted image datasets or torchvision.datasets. Encoders are automatically downloaded. For example, the following will compute the Fréchet distance (fd), kernel distance (kd), precision/recall/density/coverage (prdc), and the C<sub>T</sub> score (ct) using DINOv2 (default) as the encoder.

python -m dgm_eval path/to/training_dataset path/to/generated_dataset \
				--test_path path/to/test_dataset \
				--model dinov2 \
				--metrics fd kd prdc ct

See scripts/run_experiments.sh or run python -m dgm_eval -h for further details on command-line parameters. As we suggest in the paper, final leaderboard values should be reported using the default model size (DINOv2-ViT-L/14), but tracking progress during training is a factor of 4 more efficient with DINOv2-ViT-B/14. To use this architecture instead, simply add --arch vitb14 as a command-line parameter.
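
For intuition about what the fd metric measures, the following is a minimal NumPy sketch of the Fréchet distance between Gaussians fitted to two sets of encoder features. It is an illustration only and is not taken from this repository's implementation: the function name frechet_distance and the eigenvalue-based computation of the trace term are choices made for this sketch, and the inputs are assumed to be arrays of shape (n_samples, feature_dim).

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature arrays.

    Illustrative sketch, not the repository's code. Both inputs are
    assumed to have shape (n_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of zero up to floating-point error, and shifting one set by a constant vector adds the squared norm of that shift.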

Local datasets should be either unconditional:

local/path/
	000000.png
	000001.png
	...

or conditional:

local/path/
	0/
		000000.png
		000001.png
		...
	1/
		000000.png
		000001.png
		...
	...
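
The two layouts above can be checked before launching a long evaluation run. The helper below is a hypothetical sketch and not part of dgm_eval: the names dataset_layout and IMAGE_EXTS are made up for illustration, and the set of accepted file extensions is an assumption.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your datasets use.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def _only_images(folder: Path) -> bool:
    """True if the folder is non-empty and holds nothing but image files."""
    entries = list(folder.iterdir())
    return bool(entries) and all(
        p.is_file() and p.suffix.lower() in IMAGE_EXTS for p in entries
    )


def dataset_layout(root) -> str:
    """Classify a dataset folder as 'unconditional' or 'conditional'.

    Hypothetical helper: raises ValueError when the folder mixes files and
    class subfolders, is empty, or contains non-image files.
    """
    root = Path(root)
    entries = list(root.iterdir())
    if _only_images(root):
        return "unconditional"
    if entries and all(p.is_dir() and _only_images(p) for p in entries):
        return "conditional"
    raise ValueError(f"{root} matches neither the unconditional nor the conditional layout")
```

Calling dataset_layout("local/path") on a correctly organized folder returns the layout name, and any stray non-image file triggers the ValueError.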

The directory should only include image files. To download and use a dataset from torchvision.datasets, just specify the dataset and train/test string:

python -m dgm_eval CIFAR10:train CIFAR10:test

A full example is as follows:

python -m dgm_eval CIFAR10:train CIFAR10:test \
					--model dinov2 \
					--metrics fd kd prdc \
					--device cuda \
					--batch_size 256 \
					--nsample 512

>>> ....
>>> Num real: 512 Num fake: 512
>>> fd: 862.53745
>>> kd_value: 0.01095
>>> kd_variance: 0.00000
>>> precision: 0.90430
>>> recall: 0.91797
>>> density: 0.97969
>>> coverage: 0.94141
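
To help interpret the precision, recall, density, and coverage values printed above, here is a small NumPy sketch based on the commonly used k-nearest-neighbour definitions of these metrics. It is not this repository's implementation: the function names, the default k=5, and the brute-force distance computation are assumptions made for illustration, and the package may use a different k or a more memory-efficient routine.

```python
import numpy as np


def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between all rows of a and all rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def prdc(real: np.ndarray, fake: np.ndarray, k: int = 5) -> dict:
    """k-NN precision, recall, density, and coverage (illustrative sketch).

    real and fake are feature arrays of shape (n_samples, feature_dim).
    """
    d_rf = pairwise_dist(real, fake)  # shape (n_real, n_fake)

    # Radius of each sample's ball: distance to its k-th nearest neighbour
    # within its own set. Index 0 of each sorted row is the zero self-distance.
    r_real = np.sort(pairwise_dist(real, real), axis=1)[:, k]
    r_fake = np.sort(pairwise_dist(fake, fake), axis=1)[:, k]

    in_real_ball = d_rf < r_real[:, None]  # fake j lies in the ball of real i
    in_fake_ball = d_rf < r_fake[None, :]  # real i lies in the ball of fake j

    return {
        "precision": float(in_real_ball.any(axis=0).mean()),
        "recall": float(in_fake_ball.any(axis=1).mean()),
        "density": float(in_real_ball.sum(axis=0).mean() / k),
        "coverage": float(in_real_ball.any(axis=1).mean()),
    }
```

Under these definitions precision, recall, and coverage lie in [0, 1], while density counts how many real-sample balls contain each generated sample and can therefore exceed 1. The brute-force distance matrix needs memory proportional to n_real × n_fake × feature_dim, so this sketch is only suitable for small sample counts.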

Data Access

Images

All generated data shown in this work can be accessed at the following link:

drive.google.com/drive/folders/1X0MFaUta90d3zF9xG4KchjR-8SE0cT_7?usp=sharing

Including:

Datasets of 100,000 image samples from 41 generative models across CIFAR10/, imagenet256/, LSUN Bedroom/, and FFHQ256/.
Training & test data at 256 x 256 resolution
Generated datasets for controlled experiments presented in the Appendix can be found in toy-datasets/

Human Evaluation

Data for human evaluation of image realism can be found at data/human-evaluation-realism/

DINOv2 Leaderboard


DINOv2 is the best suited model for generative evaluation. Our extensive quantitative and qualitative assessments showed that it distills a generalized representation space suitable for evaluation of diverse image datasets. Metrics computed in DINOv2 space show much better alignment with human evaluation than those in Inception-V3 space.

We have included leaderboard values on paperswithcode (links), and list all metrics in a table below:

Visualizing Heatmaps

Heatmaps can be visualized for each model on any image dataset by adding the --heatmaps flag, as in the following command; example heatmaps follow.

python -m dgm_eval CIFAR10:train CIFAR10:test \
					 --model inception \
					 --metrics fd \
					 --device cuda \
					 --batch_size 256 \
					 --nsample 50000 \
					 --heatmaps

Images	Inception	DINOv2

Citing

If you use any part of this repository in your research, please cite the associated paper with the following BibTeX entry:

Authors: George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem

@inproceedings{stein2023exposing,
  title={Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models},
  author={Stein, George and Cresswell, Jesse and Hosseinzadeh, Rasa and Sui, Yi and Ross, Brendan and Villecroze, Valentin and Liu, Zhaoyan and Caterini, Anthony L and Taylor, Eric and Loaiza-Ganem, Gabriel},
  booktitle={Advances in Neural Information Processing Systems},
  volume={36},
  year={2023}
}

License

The data and code are licensed under the MIT License, copyright Layer 6 AI.