LCM-Lookahead for Encoder-based Text-to-Image Personalization
Rinon Gal<sup>1,2</sup>, Or Lichter<sup>1</sup>, Elad Richardson<sup>1</sup>, Or Patashnik<sup>1</sup>, Amit H. Bermano<sup>1</sup>, Gal Chechik<sup>2</sup>, Daniel Cohen-Or<sup>1</sup> <br> <sup>1</sup>Tel Aviv University, <sup>2</sup>NVIDIA
Abstract: <br> Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both.
Description
This repo contains the official code, data and models for our LCM-Lookahead paper.
Updates
30/04/2024 Code released!
TODO:
- Release code!
- Add native Diffusers support.
- Add option to skip shared-attention features for faster inference.
- Add InstantID support?
Setup
Our code builds on Simo Ryu's excellent MinSDXL implementation. To set up the environment, please run:
```bash
conda env create -f environment.yaml
conda activate lcm_lh
```
Prepare pre-trained models
The following model checkpoints should be downloaded and placed somewhere accessible. If you only want to run inference, you only need the IP-Adapter CLIP-extractor.
Path | Description |
---|---|
IR-SE50 Model | Pretrained IR-SE50 model, taken from TreB1eN, used for our ID loss and as the encoder backbone on the human face domain. |
IP-Adapter Model | Download the `ip-adapter-plus-face_sdxl_vit-h.bin` checkpoint. Used to initialize the adapter backbone. |
IP-Adapter CLIP-extractor | Download the entire directory. Feature extractor used for IP-Adapter models. Note: the plus-face_sdxl model uses the non-SDXL CLIP extractor, so please download the one from this link, not the one in the `sdxl_models` directory. |
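One possible layout for the downloaded files is sketched below. This is only a suggestion (the filenames under `weights/` are placeholders): both training and inference take explicit paths, so any location works.

```text
weights/
├── model_ir_se50.pth                        # IR-SE50 checkpoint (filename may differ)
└── ip_adapter/
    ├── ip-adapter-plus-face_sdxl_vit-h.bin  # IP-Adapter checkpoint
    └── image_encoder/                       # IP-Adapter CLIP feature extractor directory
```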
Training
Update configs
Update your loss config file (`config_files/losses.yaml`) to point `pretrained_arcface_path` at the IR-SE50 checkpoint.
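For example, the relevant entry would look roughly like the following (a minimal sketch: only the `pretrained_arcface_path` key comes from the step above; the rest of the shipped `losses.yaml` is left untouched):

```yaml
# config_files/losses.yaml (excerpt)
# Point this key at the downloaded IR-SE50 checkpoint.
pretrained_arcface_path: /path/to/model_ir_se50.pth
```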
Prepare data
See `config_files/celeba_example.yaml`, `config_files/generated_single_set_example.yaml` and `config_files/generated_multi_set_example.yaml` for examples of how to set up a data configuration file for one or more sets.
We provide an example data generation script in `scripts/faces_generation.py`. Example pre-generated data is available here.
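A simple way to set up your own data config (a sketch; the keys to edit are exactly those used in the example files above) is to copy one of the examples and point its paths at your generated or downloaded data:

```bash
# Start from a provided example and adapt it to your own data.
cp config_files/generated_single_set_example.yaml config_files/my_data.yaml
# Edit config_files/my_data.yaml so its dataset paths point at your image sets,
# then pass it to training via --data_config_path config_files/my_data.yaml.
```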
Train the model
An example command for training the model:
```bash
accelerate launch --num_processes <number_of_gpus> --multi_gpu train_encoder.py \
    --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
    --data_config_path config_files/<data_config.yaml> \
    --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix \
    --losses_config_path config_files/<loss_config.yaml> \
    --ip_adapter_feature_extractor_path <path_to_ip_adapter/image_encoder directory> \
    --ip_adapter_model_path <path_to_ip_adapter/ip-adapter-plus-face_sdxl_vit-h.bin> \
    --output_dir <path_to_model_output_directory> \
    --mixed_precision=fp16 \
    --instance_prompt "a photo of a face" \
    --train_batch_size 4 \
    --learning_rate 5e-6 \
    --max_train_steps 5001 \
    --validation_prompt "a painting in the style of monet" \
    --seed 0 \
    --checkpointing_steps 5000 \
    --checkpoints_total_limit 2 \
    --optimize_adapter \
    --adapter_lr 1e-5 \
    --lcm_every_k_steps 1 \
    --lcm_batch_size 4 \
    --gradient_checkpointing \
    --kv_drop_chance 0.05 \
    --adapter_drop_chance 0.1 \
    --adam_weight_decay 0.01 \
    --noisy_encoder_input \
    --encoder_lora_rank 4 \
    --kvcopy_lora_rank 4 \
    --lcm_sample_scale_every_k_steps 5 \
    --lcm_sample_full_lcm_prob 0.5
```
Note: If you're training on a single GPU, remove the `--multi_gpu` argument.
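For reference, a single-GPU launch only changes the `accelerate` arguments; every `train_encoder.py` flag stays exactly as in the command above (the placeholder follows the same style as the rest of this README):

```bash
accelerate launch --num_processes 1 train_encoder.py <same flags as above>
```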
Inference
Our pretrained models are available here.
To perform inference, run:
```bash
python infer.py \
    --checkpoint_path <path_to_ckpt_dir>/model_ckpt.pt \
    --ip_adapter_feature_extractor_path <path_to_ip_adapter/image_encoder directory> \
    --input_path <path_to_input_images (file/directory)> \
    --prompts <list_of_inference_prompts> \
    --out_dir <path_to_output_dir> \
    --num_images_per_prompt 1 \
    --use_freeu True
```
See `infer.py` for additional config options.
Note: If you want to use the model version that doesn't employ extended attention, use the `--no_kv` flag and optionally discard `--use_freeu`.
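As an illustration, a filled-in call might look like the following. All paths, the input image and the prompts are placeholders, and this assumes `--prompts` accepts a space-separated list of quoted strings, as suggested by `<list_of_inference_prompts>` above:

```bash
python infer.py \
    --checkpoint_path checkpoints/model_ckpt.pt \
    --ip_adapter_feature_extractor_path weights/ip_adapter/image_encoder \
    --input_path examples/my_face.jpg \
    --prompts "a photo of a face as an astronaut" "an oil painting of a face" \
    --out_dir outputs \
    --num_images_per_prompt 1 \
    --use_freeu True
```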
Citation
If you make use of our work, please cite our paper:
```bibtex
@misc{gal2024lcmlookahead,
      title={LCM-Lookahead for Encoder-based Text-to-Image Personalization},
      author={Rinon Gal and Or Lichter and Elad Richardson and Or Patashnik and Amit H. Bermano and Gal Chechik and Daniel Cohen-Or},
      year={2024},
      eprint={2404.03620},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```