# Awesome Evaluation of Visual Generation

This repository collects methods for evaluating visual generation.

<i>Figure: overall structure of this repository.</i>

## Overview

### What You'll Find Here

Within this repository, we collect works that aim to answer critical questions in the field of evaluating visual generation.

### Updates

This repository is updated periodically. If you have suggestions for additional resources, updates on methodologies, or fixes for broken links, please feel free to open an issue or submit a pull request.

## Table of Contents

<a name="1."></a>

1. Evaluation Metrics of Generative Models

<a name="1.1."></a>

1.1. Evaluation Metrics of Image Generation

| Metric | Paper | Code |
| --- | --- | --- |
| Inception Score (IS) | Improved Techniques for Training GANs (NeurIPS 2016) | - |
| Fréchet Inception Distance (FID) | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) | Code Code |
| Kernel Inception Distance (KID) | Demystifying MMD GANs (ICLR 2018) | Code Code |
| CLIP-FID | The Role of ImageNet Classes in Fréchet Inception Distance (ICLR 2023) | Code Code |
| Precision-and-Recall | Assessing Generative Models via Precision and Recall (2018-05-31, NeurIPS 2018) <br> Improved Precision and Recall Metric for Assessing Generative Models (NeurIPS 2019) | Code Code |
| Rényi Kernel Entropy (RKE) | An Information-Theoretic Evaluation of Generative Models in Learning Multi-modal Distributions (NeurIPS 2023) | Code |
| CLIP Maximum Mean Discrepancy (CMMD) | Rethinking FID: Towards a Better Evaluation Metric for Image Generation (CVPR 2024) | Code |
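
For quick reference, below is a minimal sketch of computing IS, FID, and KID with the `torchmetrics` package (one of the implementations referenced in the table above; it requires `pip install torchmetrics[image]`). The random tensors are placeholders for your real and generated image batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image.kid import KernelInceptionDistance

# Placeholder batches: uint8 images in [0, 255], shaped (N, 3, H, W).
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

# FID: Frechet distance between Inception features of real vs. generated sets.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# KID: unbiased MMD estimate over random subsets of Inception features.
kid = KernelInceptionDistance(subset_size=32)
kid.update(real, real=True)
kid.update(fake, real=False)
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean.item():.4f} +/- {kid_std.item():.4f}")

# IS: needs only generated images (no reference set).
inception = InceptionScore()
inception.update(fake)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```
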

<a name="1.2."></a>

1.2. Evaluation Metrics of Video Generation

| Metric | Paper | Code |
| --- | --- | --- |
| FID-vid | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) | - |
| Fréchet Video Distance (FVD) | Towards Accurate Generative Models of Video: A New Metric & Challenges (arXiv 2018) <br> FVD: A new Metric for Video Generation (2019-05-04) <i>(ICLR 2019 DeepGenStruct Workshop)</i> | Code |
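
Both FID-vid and FVD compute the same Fréchet distance; they differ only in the feature extractor (2D Inception features per frame vs. I3D video features). A minimal sketch of the shared distance computation, assuming the features have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets:
    d^2 = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 * sqrt(C_r @ C_f))."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Placeholder features (replace with Inception / I3D features of your data).
d = frechet_distance(np.random.randn(512, 400), np.random.randn(512, 400))
print("Frechet distance:", d)
```
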

<a name="1.3."></a>

1.3. Evaluation Metrics for Latent Representation

<a name="2."></a>

2. Evaluation Metrics of Condition Consistency

<a name="2.1."></a>

2.1 Evaluation Metrics of Multi-Modal Condition Consistency

| Metric | Condition | Pipeline | Code | References |
| --- | --- | --- | --- | --- |
| CLIP Score (a.k.a. CLIPSIM) | Text | cosine similarity between the CLIP image and text embeddings | Code PyTorch Lightning | CLIP Paper (ICML 2021). Metric first used in the CLIPScore Paper (arXiv 2021); the GODIVA Paper (arXiv 2021) applies it to video evaluation. |
| Mask Accuracy | Segmentation Mask | predict the segmentation mask and compute pixel-wise accuracy against the ground-truth segmentation mask | any segmentation method for your setting | - |
| DINO Similarity | Image of a Subject (human / object, etc.) | cosine similarity between the DINO embeddings of the generated image and the condition image | Code | DINO Paper. Metric proposed in DreamBooth. |
<!-- | Identity Consistency | Image of a Face | | - | --> <!-- Papers for CLIP Similarity: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) (ICML 2021), [CLIPScore: A Reference-free Evaluation Metric for Image Captioning](https://arxiv.org/abs/2104.08718) (arXiv 2021), [GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions](https://arxiv.org/abs/2104.14806) (arXiv 2021) | [![Code](https://img.shields.io/github/stars/openai/CLIP.svg?style=social&label=CLIP)](https://github.com/openai/CLIP) [PyTorch Lightning](https://lightning.ai/docs/torchmetrics/stable/multimodal/clip_score.html) -->
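
A minimal sketch of CLIP Score using the torchmetrics implementation referenced in the table above (the model name follows the Hugging Face Hub convention and requires `transformers` to be installed; images and prompts are placeholders):

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

# torchmetrics reports 100 * max(cosine_similarity, 0) between the CLIP
# image embedding and the CLIP text embedding.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

images = torch.randint(0, 256, (4, 3, 224, 224), dtype=torch.uint8)  # generated images
prompts = ["a photo of a cat"] * 4  # the text conditions used for generation

score = metric(images, prompts)
print("CLIP Score:", score.item())
```
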

<a name="2.2."></a>

2.2. Evaluation Metrics of Image Similarity

| Metric | Paper | Code |
| --- | --- | --- |
| Learned Perceptual Image Patch Similarity (LPIPS) | The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (2018-01-11, CVPR 2018) | Code Website |
| Structural Similarity Index (SSIM) | Image quality assessment: from error visibility to structural similarity (TIP 2004) | Code Code |
| Peak Signal-to-Noise Ratio (PSNR) | - | Code |
| Multi-Scale Structural Similarity Index (MS-SSIM) | Multiscale structural similarity for image quality assessment (SSC 2004) | PyTorch-Metrics |
| Feature Similarity Index (FSIM) | FSIM: A Feature Similarity Index for Image Quality Assessment (TIP 2011) | Code |
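
Several of these metrics are available in `torchmetrics`; below is a minimal sketch, assuming float inputs in [0, 1] shaped (N, 3, H, W), where `generated` stands in for your model outputs and `reference` for the ground truth:

```python
import torch
from torchmetrics.image import (
    MultiScaleStructuralSimilarityIndexMeasure,
    PeakSignalNoiseRatio,
    StructuralSimilarityIndexMeasure,
)
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

generated = torch.rand(4, 3, 256, 256)  # placeholder model outputs
reference = torch.rand(4, 3, 256, 256)  # placeholder ground truth

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

print("PSNR:", psnr(generated, reference).item())    # higher = more similar
print("SSIM:", ssim(generated, reference).item())    # higher = more similar
print("MS-SSIM:", ms_ssim(generated, reference).item())
print("LPIPS:", lpips(generated, reference).item())  # lower = more similar
```
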

The community has also been using DINO or CLIP features to measure the semantic similarity of two images / frames.
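
A minimal sketch of such an image-image semantic similarity, using CLIP image embeddings via Hugging Face `transformers` (the file paths are illustrative; DINO similarity follows the same pattern with a DINO backbone):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Illustrative file paths: a generated image and a reference image / frame.
image_a = Image.open("generated.png")
image_b = Image.open("reference.png")

inputs = processor(images=[image_a, image_b], return_tensors="pt")
with torch.no_grad():
    emb = model.get_image_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
similarity = (emb[0] @ emb[1]).item()       # cosine similarity in [-1, 1]
print("CLIP image-image similarity:", similarity)
```
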

There are also recent works on new methods to measure visual similarity (more will be added).

<a name="3."></a>

3. Evaluation Systems of Generative Models

<a name="3.1."></a>

3.1. Evaluation of Unconditional Image Generation

<i>Note: introduces the Skew Inception Distance</i>

<i>Note: introduces the Class-Aware Fréchet Distance</i>

<a name="3.2."></a>

3.2. Evaluation of Text-to-Image Generation

<!-- + [JourneyDB: A Benchmark for Generative Image Understanding](https://arxiv.org/abs/2307.00716) (2023-07-03, NeurIPS 2023) [![Code](https://img.shields.io/github/stars/JourneyDB/JourneyDB.svg?style=social&label=Official)](https://github.com/JourneyDB/JourneyDB) [![Website](https://img.shields.io/badge/Website-9cf)](https://journeydb.github.io/) -->

<a name="3.3."></a>

3.3. Evaluation of Text-Based Image Editing

<a name="3.4."></a>

3.4. Evaluation of Neural Style Transfer

<a name="3.5."></a>

3.5. Evaluation of Video Generation

3.5.1. Evaluation of Text-to-Video Generation

3.5.2. Evaluation of Image-to-Video Generation

3.5.3. Evaluation of Talking Face Generation

<a name="3.6."></a>

3.6. Evaluation of Text-to-Motion Generation

<a name="3.7."></a>

3.7. Evaluation of Model Trustworthiness

3.7.1. Evaluation of Visual-Generation-Model Trustworthiness

3.7.2. Evaluation of Non-Visual-Generation-Model Trustworthiness

<i>Note: these works are not for visual generation, but cover related evaluations of other models such as LLMs.</i>

<a name="3.8."></a>

3.8. Evaluation of Entity Relation

<a name="4."></a>

4. Improving Visual Generation with Evaluation / Feedback / Reward

<!-- ## Evaluation Datasets - UCF101 - ImageNet - COCO -->

<a name="5."></a>

5. Quality Assessment for AIGC

5.1. Image Quality Assessment for AIGC

5.2. Aesthetic Predictors for Generated Images

<!-- ## Video Quality Assessment for AIGC - To be added -->

<a name="6."></a>

6. Study and Rethinking

6.1. Evaluation of Evaluations

6.2. Survey

6.3. Study

6.4. Competition

<a name="7."></a>

7. Other Useful Resources

<!-- Papers to read and to organize: - Rethinking FID: Towards a Better Evaluation Metric for Image Generation - Wasserstein Distortion: Unifying Fidelity and Realism -->