
STELLAR: Evaluation Metrics

Code for our paper: Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods

Authors: Panos Achlioptas, Alexandros Benetatos, Iordanis Fostiropoulos, Dimitris Skourtis

The codebase is maintained by Iordanis Fostiropoulos. For any questions please reach out.

License

Before downloading or using any part of the code in this repository, please review and acknowledge the terms and conditions set forth in both the "License Terms" and "Third Party License Terms" included in this repository. Continuing to download and use any part of the code in this repository confirms you agree with these terms and conditions.

Introduction

[Figure: metrics_explanation] Note: the "Input Image" and "Additional Image" shown are from the CelebAMask-HQ dataset.

This work is based on our technical manuscript Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods. We propose 5 metrics for evaluating human-centric personalized Text-2-Image models. The repository also provides implementations of 8 additional baseline metrics for Text-2-Image and Image-2-Image methods.

Evaluation metrics

Several of the provided metrics come from the literature. Those introduced by our work are marked with ⭐.

We provide our own implementations of the existing metrics and refer the reader to the corresponding papers for technical details.

Name	Evaluation Type	Code Name	Reference
Aesth.	Image2Image	`aesth`	Link
$CLIP_I$	Image2Image	`clip`	Link
DreamSim	Image2Image	`dreamsim`	Link

$CLIP_T$	Text2Image	`clip`	Link
HPSv1	Text2Image	`hps`	Link
HPSv2	Text2Image	`hps`	Link
ImageReward	Text2Image	`im_reward`	Link
PickScore	Text2Image	`pick`	Link

APS ⭐	Personalized Text2Image	`aps`	Link
GoA ⭐	Object-centric	`goa`	Link
IPS ⭐	Personalized Text2Image	`ips`	Link
~~RFS~~* ⭐	Relation-centric	`rfs`	Link
SIS ⭐	Personalized Text2Image	`sis`	Link

*RFS is pending a merge into the current branch.

Usage

pip install git+https://github.com/stellar-gen-ai/stellar-metrics.git

1. Compute the Metrics for each Image

This step computes the metric for each individual image, which helps diagnose the failure cases of a method.

$ python -m stellar_metrics --metric code_name --stellar-path ./stellar-dataset --syn-path ./model-output --save-dir ./save-dir

Optionally, you can specify --device, --batch-size, and --clip-version (for the backbone).

NOTE: There must be a one-to-one correspondence between model-output and stellar-dataset. The stellar-dataset is used to calculate some of the metrics, such as identity preservation, where the original image is required. A mismatch between syn-path and stellar-path can lead to incorrect results.
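A quick sanity check before running the metrics can catch a mismatch early. The sketch below is not part of stellar_metrics. It assumes (hypothetically) that the model output mirrors the dataset's relative file names up to the image extension, which may differ from the layout the package actually expects. The helper names `image_keys` and `compare_dirs` and the extension list are illustrative.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def image_keys(root):
    """Return image paths under `root`, relative to it and without extension."""
    root = Path(root)
    return {
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }


def compare_dirs(stellar_path, syn_path):
    """Return (missing_from_syn, unexpected_in_syn) as sorted lists of keys."""
    expected = image_keys(stellar_path)
    produced = image_keys(syn_path)
    return sorted(expected - produced), sorted(produced - expected)


if __name__ == "__main__":
    missing, unexpected = compare_dirs("./stellar-dataset", "./model-output")
    print(f"missing from model-output: {len(missing)}")
    print(f"unexpected in model-output: {len(unexpected)}")
```

Two empty lists suggest the directories line up under this naming assumption; anything else is worth inspecting before trusting the scores.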

Mock Example

Calculate IPS

$ python -m stellar_metrics --metric ips --stellar-path ./tests/assets/mock_stellar_dataset --syn-path ./tests/assets/stellar_net --save-dir ./save-dir

Calculate CLIP

$ python -m stellar_metrics --metric clip --stellar-path ./tests/assets/mock_stellar_dataset --syn-path ./tests/assets/stellar_net --save-dir ./save-dir
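To run several metrics over the same paths, the command above can be generated in a loop. This is a minimal sketch that only uses the flags shown in this README; the code names come from the metrics table, and `rfs` is left out because it is not merged yet. The helper `build_command` is illustrative and not part of the package.

```python
import shlex

# Code names from the metrics table (rfs omitted: pending merge).
METRICS = ["aesth", "clip", "dreamsim", "hps", "im_reward", "pick", "aps", "goa", "ips", "sis"]


def build_command(metric, stellar_path, syn_path, save_dir):
    """Build the argument list for one per-image metric run."""
    return [
        "python", "-m", "stellar_metrics",
        "--metric", metric,
        "--stellar-path", stellar_path,
        "--syn-path", syn_path,
        "--save-dir", save_dir,
    ]


if __name__ == "__main__":
    for metric in METRICS:
        cmd = build_command(
            metric,
            "./tests/assets/mock_stellar_dataset",
            "./tests/assets/stellar_net",
            "./save-dir",
        )
        # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the argument list separately from executing it makes it easy to review the commands first, or to add the optional flags in one place.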

2. Produce Analysis Table

$ python -m stellar_metrics.analysis --save-dir ./save-dir

Starter Example

Evaluating identity-centric qualities

Identity Preservation

Assess the facial resemblance between the input identity and the generated images in a rather coarse but specialized way. Our metric uses a face detector to isolate the identity's face in both input and generated images. It then employs a specialized face detection model to extract facial representation embeddings from the detected regions.
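Once face embeddings are extracted, resemblance is commonly measured by comparing them. The sketch below assumes cosine similarity as the comparison; this is a common choice for face embeddings, not a statement of how IPS is defined (see the paper for that). The embedding vectors are placeholders for whatever the face model returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    input_face = [0.2, 0.9, 0.1]      # placeholder embedding of the input identity
    generated_face = [0.1, 0.8, 0.3]  # placeholder embedding of the generated face
    print(cosine_similarity(input_face, generated_face))
```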

Attribute Preservation

Assess how well the generated images maintain specific fine-grained attributes of the identity in question, such as age, gender, and other invariant facial features (e.g., high cheekbones). Leveraging the annotations in Stellar images, we can evaluate these binary facial characteristics.
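The idea of checking binary attributes can be illustrated as a simple agreement rate. This sketch assumes hard (True/False) predictions from some attribute classifier and hypothetical attribute names; the actual APS computation may differ, for example by using soft predictions.

```python
def attribute_agreement(annotated, predicted):
    """Fraction of annotated binary attributes whose predicted value matches."""
    if not annotated:
        raise ValueError("no annotated attributes")
    matches = sum(
        1 for name, value in annotated.items() if predicted.get(name) == value
    )
    return matches / len(annotated)


if __name__ == "__main__":
    # Hypothetical attribute names, for illustration only.
    annotated = {"young": False, "high_cheekbones": True, "eyeglasses": False}
    predicted = {"young": True, "high_cheekbones": True, "eyeglasses": False}
    print(attribute_agreement(annotated, predicted))
```

An attribute missing from `predicted` counts as a mismatch here, which is a deliberate (assumed) conservative choice.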

Stability of Identity Score

Measures how sensitive a model is to different images of the same individual, favoring models that capture the subject's identity consistently regardless of identity-irrelevant variations in the input image (e.g., lighting conditions, subject's pose).

To achieve this, SIS requires access to multiple images of the human subject (a condition met in Stellar's dataset by design), and it is our only evaluation metric with this more demanding requirement.
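To build intuition for sensitivity to the input image, one can look at how an identity score varies when the same subject is provided through different input images. The sketch below is an illustration only and is not the SIS formula from the paper; it assumes one identity score per input image is already available.

```python
from statistics import mean, pstdev


def identity_score_spread(scores_per_input):
    """Summarize identity scores obtained from different input images of one subject."""
    if len(scores_per_input) < 2:
        raise ValueError("need scores from at least two input images")
    return {
        "mean": mean(scores_per_input),
        "min": min(scores_per_input),
        "std": pstdev(scores_per_input),  # population standard deviation
    }


if __name__ == "__main__":
    # Placeholder scores for one subject, one per input image.
    print(identity_score_spread([0.71, 0.68, 0.74]))
```

A small spread with a high minimum suggests the identity is captured consistently across inputs; a large spread points to sensitivity to the choice of input image.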

Object-centric Context Evaluation

We introduce specialized and interpretable metrics to evaluate two key aspects of the alignment between image and prompt: object representation faithfulness and the fidelity of depicted relationships.

Relation Fidelity Score

Evaluate how successfully the generated image represents the object interactions described in the prompt. Considering the difficulty that even specialized Scene Graph Generation (SGG) models have in understanding visual relations, this metric offers a valuable localized insight into the ability of the personalized model to faithfully depict the prompted relations.
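As a rough illustration of the idea, relation fidelity can be thought of as checking which prompted relations show up among the relations an SGG model detects in the image. The sketch below assumes relations are given as exact (subject, predicate, object) triplets and scores the fraction of prompted triplets that are detected. This is a simplification for intuition and not the RFS implementation, which is not yet merged.

```python
def relation_recall(prompt_relations, detected_relations):
    """Fraction of prompted (subject, predicate, object) triplets found among detections."""
    prompt_set = set(prompt_relations)
    if not prompt_set:
        raise ValueError("prompt has no relations")
    return len(prompt_set & set(detected_relations)) / len(prompt_set)


if __name__ == "__main__":
    # Hypothetical triplets, for illustration only.
    prompted = [("person", "riding", "horse"), ("person", "holding", "cup")]
    detected = [("person", "riding", "horse"), ("horse", "on", "grass")]
    print(relation_recall(prompted, detected))
```

Exact triplet matching is strict; a practical variant would likely need to handle synonyms and near-equivalent predicates.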