Implementation for PopAlign

Environment Setup

pip install -r requirements.txt

Additional Dependencies:

- hpsv2: https://github.com/tgxs002/HPSv2
- ImageReward: https://github.com/THUDM/ImageReward
- deepface: https://github.com/serengil/deepface

Experiments

First, run

python 1_generate_divers_prompts.py

to generate diverse prompts from the basic prompts in data/training_prompts.csv.
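
Conceptually, this step expands each basic prompt with demographic attributes to cover a population of identities. A minimal sketch of the idea, assuming a one-column CSV of basic prompts; the attribute lists and string templating here are illustrative, not the script's actual logic:

import csv
import itertools

GENDERS = ["male", "female"]                      # illustrative attribute lists
ETHNICITIES = ["Asian", "Black", "White", "Hispanic"]

def diversify(basic_prompt):
    # "a photo of a doctor" -> "a photo of a female Asian doctor", etc.
    for gender, ethnicity in itertools.product(GENDERS, ETHNICITIES):
        yield basic_prompt.replace("a photo of a",
                                   f"a photo of a {gender} {ethnicity}")

with open("data/training_prompts.csv") as f:
    for row in csv.reader(f):
        for prompt in diversify(row[0]):
            print(prompt)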

Then, run

bash 2_generate_images.sh

to generate images from both the diverse and the basic prompts. This script also runs a classifier on the generated images.
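
Under the hood this amounts to sampling SDXL per prompt and running a face-attribute classifier on each output. A rough sketch of the two calls involved (the model ID and file names are assumptions, not the script's exact settings):

import torch
from diffusers import StableDiffusionXLPipeline
from deepface import DeepFace

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="a photo of a doctor").images[0]
image.save("out.png")

# Predict perceived demographic attributes of the generated face.
result = DeepFace.analyze(img_path="out.png",
                          actions=["gender", "race"],
                          enforce_detection=False)  # tolerate missed detections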

Then, run

python 3_generate_preferences.py

to generate preference data.
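
Conceptually, each preference record couples an image generated from an identity-specific prompt (preferred) with one generated from the corresponding identity-neutral prompt (rejected), both keyed to the neutral prompt. A hedged sketch of one record; the field names and file layout are illustrative, and the paper defines the exact construction:

import json

def make_pair(neutral_prompt, specific_image, neutral_image):
    # The demographically specified sample is preferred ("winner") over
    # the unconditioned one ("loser") under the same neutral prompt.
    return {
        "prompt": neutral_prompt,
        "image_w": specific_image,
        "image_l": neutral_image,
    }

pairs = [make_pair("a photo of a doctor",
                   "imgs/doctor_female_asian_0.png",
                   "imgs/doctor_neutral_0.png")]
with open("preferences.json", "w") as f:
    json.dump(pairs, f, indent=2)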

Then, run

bash 4_train.sh

to train the model with the PopAlign objective.
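
As intuition for the training step, a minimal sketch of a Diffusion-DPO-style pairwise loss is below; the beta value is arbitrary and the actual PopAlign objective operates at the population level (see the paper), so this is a simplification:

import torch.nn.functional as F

def preference_loss(pred_w, pred_l, ref_pred_w, ref_pred_l,
                    noise_w, noise_l, beta=5000.0):
    # Denoising errors for the trained U-Net and a frozen reference U-Net.
    err_w = (pred_w - noise_w).pow(2).mean(dim=(1, 2, 3))
    err_l = (pred_l - noise_l).pow(2).mean(dim=(1, 2, 3))
    ref_w = (ref_pred_w - noise_w).pow(2).mean(dim=(1, 2, 3))
    ref_l = (ref_pred_l - noise_l).pow(2).mean(dim=(1, 2, 3))
    # Push the winner's error down relative to the reference model.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()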

Finally, run

bash eval.sh

which will evaluate the model on identity-specific and identity-neutral prompts. It will also run the classifier to compute the discrepancy metric, as well as a series of scoring models to compute the image quality metrics.
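
The quality metrics reduce to per-image scoring calls from the dependencies listed above. A small sketch for HPS v2 and ImageReward (the paths and prompt are placeholders; PickScore, CLIP, and aesthetics scoring follow the same pattern):

import hpsv2
import ImageReward as RM

prompt = "a photo of a doctor"
images = ["eval/doctor_0.png", "eval/doctor_1.png"]

hps_scores = hpsv2.score(images, prompt, hps_version="v2.1")  # HPS v2
rm_model = RM.load("ImageReward-v1.0")
ir_scores = rm_model.score(prompt, images)                    # ImageReward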

Model Card

Implementation Details

This codebase trains and evaluates SDXL using the PopAlign objective.

The assets used in this work (datasets, preference models) are publicly available and are used according to their respective licenses. This code is released privately for the purposes of our submission, and will eventually be made public under the Apache 2.0 License (LICENSE).

Training Details

Training Data

The training data consists of images generated by SDXL.

Training Procedure

Preprocessing

Training samples are generated with identity-neutral prompts (e.g., "a photo of a doctor") and identity-specific prompts (e.g., "a photo of a female Asian doctor"). They are paired to create preference data. See the paper for more details.

Training Hyperparameters

We train with the following hyperparameters (a minimal optimizer sketch follows the list):

- Learning Rate: 5e-7
- Batch size: 8
- Steps: 750
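
For reference, these settings map to an optimizer setup along these lines; AdamW is an assumption, and the training scripts define the actual optimizer and schedule:

import torch
from diffusers import UNet2DConditionModel

# Hypothetical: load the SDXL U-Net to be fine-tuned.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet")
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-7)  # learning rate
BATCH_SIZE = 8     # batch size
MAX_STEPS = 750    # training steps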

Evaluation

We evaluate image quality using HPS v2, PickScore, CLIP, and LAION Aesthetics, and fairness using a DeepFace classifier.
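
One plausible formulation of the discrepancy metric, assuming it measures how far the classifier's predicted demographic distribution deviates from uniform over a prompt's generations (the paper defines the exact metric):

from collections import Counter

def discrepancy(labels):
    # labels: DeepFace predictions (e.g., gender) for one prompt's images.
    counts = Counter(labels)
    k = len(counts)
    return max(abs(c / len(labels) - 1.0 / k) for c in counts.values())

print(discrepancy(["Man", "Man", "Woman", "Man"]))  # 0.25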

Testing Data, Factors & Metrics

Testing Data

We curate our own test data, which is included in this repository under data/.

Technical Specifications

Model Architecture and Objective

We use the SDXL architecture (U-Net, VAE, CLIP text encoder) and only fine-tune the U-Net with our objective.
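
In diffusers terms, U-Net-only fine-tuning amounts to freezing every other component. A minimal sketch, with the base model ID assumed:

from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0")

# Freeze the VAE and both CLIP text encoders; train the U-Net only.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.text_encoder_2.requires_grad_(False)
pipe.unet.requires_grad_(True)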

Compute

We train with 4 NVIDIA A5000 GPUs for less than one day per experiment.