Deformable One-shot Face Stylization via DINO Semantic Guidance
Yang Zhou, Zichong Chen, Hui Huang
Shenzhen University
<p> <img src="assets/teaser.jpg" width="2873" alt=""/> </p>[project page] [paper] [supplementary]
Abstract
This paper addresses the complex issue of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference, opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization, achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate the superiority of our approach over existing state-of-the-art one-shot face stylization methods.
Overview
<p> <img src="assets/pipeline.jpg" width="2873" alt=""/> </p>Given a single real-style paired reference, we fine-tune a deformation-aware generator $G_t$ that simultaneously realizes geometry deformation and appearance transfer.
Getting started
Requirements
We have tested on:
- Both Linux and Windows
- NVIDIA GPU + CUDA 11.6
- Python 3.9
- PyTorch 1.13.0
- torchvision 0.14.0
Install all the required libraries with pip install -r requirements.txt
Pretrained Models
Please download the pre-trained models from Google Drive.
Model | Description |
---|---|
StyleGANv2 | StyleGANv2 model pretrained on FFHQ with 1024x1024 output resolution. |
e4e_ffhq_encode | FFHQ e4e encoder. |
alexnet | Pretrained alexnet, alex.pth and alexnet-owt-7be5be79.pth. |
shape_predictor_68_face_landmarks | Face detector for extracting face landmarks. |
style1 | Generator with STNs trained on one-shot paired data source1.png and target1.png. |
style2 | Generator with STNs trained on one-shot paired data source2.png and target2.png. |
style3 | Generator with STNs trained on one-shot paired data source3.png and target3.png. |
style4 | Generator with STNs trained on one-shot paired data source4.png and target4.png. |
style5 | Generator with STNs trained on one-shot paired data source5.png and target5.png. |
style6 | Generator with STNs trained on one-shot paired data source6.png and target6.png. |
style7 | Generator with STNs trained on one-shot paired data source7.png and target7.png. |
style8 | Generator with STNs trained on one-shot paired data source8.png and target8.png. |
By default, we assume that all models are downloaded and saved to the directory ./checkpoints.
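Assuming the filenames mentioned elsewhere in this README, the checkpoint folder should look roughly as follows. Names in angle brackets are not stated explicitly in this README; use whatever filenames the Google Drive download provides.

```
checkpoints/
├── stylegan2-ffhq-config-f.pt              # StyleGANv2 pretrained on FFHQ
├── alex.pth                                # pretrained alexnet
├── alexnet-owt-7be5be79.pth                # pretrained alexnet
├── shape_predictor_68_face_landmarks.dat   # face landmark model
├── <e4e encoder weights>                   # e4e_ffhq_encode
└── <style1 ... style8 generator weights>   # pretrained styles with STNs
```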
Inference
Transfer the pretrained style onto a given image. Results are saved in the ./outputs/inference folder by default.
python inference.py --style=style3 --input_image=./data/test_inputs/002.png --alpha=0.8
Note: We use the pretrained e4e encoder for input-image inversion; make sure it has been downloaded and placed in ./checkpoints. Although using e4e saves inference time, the final results sometimes differ from the input images.
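If you want to stylize many images at once, one option is to loop over the documented CLI. The sketch below only uses the flags shown above; the input folder path is hypothetical.

```python
# Batch-stylize every image in a folder by repeatedly invoking inference.py
# with its documented flags (--style, --input_image, --alpha).
import subprocess
from pathlib import Path

for img in sorted(Path('./data/test_inputs').glob('*.png')):  # hypothetical input folder
    subprocess.run(['python', 'inference.py',
                    '--style=style3',
                    f'--input_image={img}',
                    '--alpha=0.8'],
                   check=True)
```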
Generation
Random stylized faces
Generate random face images using pretrained styles. Results are saved in the ./outputs/generate folder by default.
python generate_samples.py --style=style1 --seed=2024 --alpha=0.8
<p>
<img src="assets/generate.jpg" width="2873" alt=""/>
</p>
Controllable face deformation
Generate random face images using pretrained styles with varying degrees of deformation control. Results are saved in the ./outputs/control folder by default.
python deformation_control.py --style=style1 --alpha0=-0.5 --alpha1=1.
<p>
<img src="assets/deformation.jpg" width="2873" alt=""/>
</p>
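Presumably --alpha0 and --alpha1 bound the range of deformation strengths the script sweeps over; this is an assumption based on the flags above. Conceptually, the intermediate degrees would be sampled like this:

```python
import torch

# Hypothetical sweep of deformation degrees between --alpha0 and --alpha1;
# the actual number of steps is decided inside deformation_control.py.
alphas = torch.linspace(-0.5, 1.0, steps=5)  # tensor([-0.5, -0.125, 0.25, 0.625, 1.0])
```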
Train on your own style images
Prepare your (aligned) paired images as real-style samples and place them in the ./data/style_images_aligned folder.
Make sure the pretrained StyleGANv2 stylegan2-ffhq-config-f.pt and the alexnet weights alex.pth and alexnet-owt-7be5be79.pth have been downloaded and placed in ./checkpoints.
To start training on your own style images, run:
python train.py --style=[STYLE_NAME] --source=[REAL_IMAGE_PATH] --target=[TARGET_IMAGE_PATH]
For example,
python train.py --style=style1 --source=source1.png --target=target1.png
Note:
- If your face images are not aligned, check that the face model shape_predictor_68_face_landmarks.dat has been downloaded and placed in ./checkpoints, then run the following command for face alignment: python face_align.py --path=[YOUR_IMAGE_PATH] --output=[PATH_TO_SAVE]
- DINO-ViT is downloaded automatically. We use dino_vitb8 in our experiments (see the loading sketch after these notes).
- Training requires ~22 GB of VRAM and takes about 13 minutes on average on a single NVIDIA RTX 3090.
- The trained generator will be saved in ./outputs/models.
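For reference, the dino_vitb8 backbone mentioned above can be fetched through torch.hub; train.py handles this automatically, so the snippet below is only to show what gets downloaded.

```python
import torch

# ViT-B/8 backbone from the official DINO repository (downloaded via torch.hub).
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
dino.eval()
```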
Citation
@inproceedings{zhou2024deformable,
  title     = {Deformable One-shot Face Stylization via DINO Semantic Guidance},
  author    = {Zhou, Yang and Chen, Zichong and Huang, Hui},
  booktitle = {CVPR},
  year      = {2024}
}
Acknowledgments
The StyleGANv2 code is borrowed from this PyTorch implementation by @rosinality. The e4e projection is also heavily based on encoder4editing. This code also contains submodules inspired by Splice, few-shot-gan-adaptation, and tps_stn_pytorch.