
Deformable One-shot Face Stylization via DINO Semantic Guidance

Yang Zhou, Zichong Chen, Hui Huang

Shenzhen University

<p> <img src="assets/teaser.jpg" width="2873" alt=""/> </p>

[project page] [paper] [supplementary]

Abstract

This paper addresses the complex issue of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference, opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization, achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate the superiority of our approach over existing state-of-the-art one-shot face stylization methods.
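For intuition, the two DINO-guided constraints can be pictured as in the sketch below. This is a minimal, non-authoritative sketch rather than the paper's exact formulation; it assumes a frozen DINO-ViT wrapped as dino_tokens(img), a hypothetical helper returning patch-token features of shape (N, C).

import torch.nn.functional as F

def directional_deformation_loss(real_src, style_ref, gen_src, gen_out, dino_tokens):
    # The real-to-style direction of the reference pair in DINO space should
    # match the direction from each generated source face to its stylized output.
    d_ref = dino_tokens(style_ref).mean(0) - dino_tokens(real_src).mean(0)
    d_gen = dino_tokens(gen_out).mean(0) - dino_tokens(gen_src).mean(0)
    return 1.0 - F.cosine_similarity(d_ref, d_gen, dim=0)

def self_similarity(tokens):
    # Token self-similarity matrix: an appearance-insensitive structure descriptor.
    t = F.normalize(tokens, dim=-1)
    return t @ t.t()

def structural_consistency_loss(gen_src, gen_out, dino_tokens):
    # Preserve the relative structure of each generated face after stylization.
    return F.mse_loss(self_similarity(dino_tokens(gen_out)),
                      self_similarity(dino_tokens(gen_src)))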

Overview

<p> <img src="assets/pipeline.jpg" width="2873" alt=""/> </p>

Given a single real-style paired reference, we fine-tune a deformation-aware generator $G_t$ that simultaneously realizes geometric deformation and appearance transfer.
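To illustrate what "deformation-aware" means, a spatial transformer that warps intermediate feature maps could look like the affine sketch below. The repository's own STN modules differ (see tps_stn_pytorch in the acknowledgments), so treat this purely as an assumption-laden illustration, not the actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Small localization network predicting a 2x3 affine transform.
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(channels * 64, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize to the identity transform so fine-tuning starts
        # from the unmodified generator.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, feat):
        theta = self.loc(feat).view(-1, 2, 3)
        grid = F.affine_grid(theta, feat.size(), align_corners=False)
        return F.grid_sample(feat, grid, align_corners=False)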

Getting started

Requirements

We have tested on:

Install all the libraries through pip install -r requirements.txt

Pretrained Models

Please download the pre-trained models from Google Drive.

Model | Description
--- | ---
StyleGANv2 | StyleGANv2 model pretrained on FFHQ with 1024x1024 output resolution.
e4e_ffhq_encode | FFHQ e4e encoder.
alexnet | Pretrained alexnet, alex.pth and alexnet-owt-7be5be79.pth.
shape_predictor_68_face_landmarks | Face detector for extracting face landmarks.
style1 | Generator with STNs trained on one-shot paired data source1.png and target1.png.
style2 | Generator with STNs trained on one-shot paired data source2.png and target2.png.
style3 | Generator with STNs trained on one-shot paired data source3.png and target3.png.
style4 | Generator with STNs trained on one-shot paired data source4.png and target4.png.
style5 | Generator with STNs trained on one-shot paired data source5.png and target5.png.
style6 | Generator with STNs trained on one-shot paired data source6.png and target6.png.
style7 | Generator with STNs trained on one-shot paired data source7.png and target7.png.
style8 | Generator with STNs trained on one-shot paired data source8.png and target8.png.

By default, we assume that all models are downloaded and saved to the directory ./checkpoints.

Inference

Transfer the pretrained style onto a given image. Results are saved in the ./outputs/inference folder by default.

python inference.py --style=style3 --input_image=./data/test_inputs/002.png --alpha=0.8

Note: We use the pretrained e4e encoder for input-image inversion; make sure it has been downloaded and placed in ./checkpoints. Although using e4e saves inference time, the final results sometimes differ slightly from the input images.
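Conceptually, inference.py does something like the sketch below: invert the aligned input into W+ with the pretrained e4e encoder, then decode it with the fine-tuned deformation-aware generator. Here e4e_encoder, g_t, and the alpha keyword are hypothetical stand-ins for the repository's actual modules and arguments.

import torch

def stylize(input_image, e4e_encoder, g_t, alpha=0.8):
    with torch.no_grad():
        # Approximate W+ inversion, which is why results can deviate from the input.
        w_plus = e4e_encoder(input_image)
        # alpha is a hypothetical deformation-strength keyword.
        out = g_t(w_plus, alpha=alpha)
    return out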

Generation

Random stylized faces

Generate random face images using pretrained styles. Results are saved in the ./outputs/generate folder by default.

python generate_samples.py --style=style1 --seed=2024 --alpha=0.8
<p> <img src="assets/generate.jpg" width="2873" alt=""/> </p>

Controllable face deformation

Generate random face images using pretrained styles, with control over the degree of deformation. Results are saved in the ./outputs/control folder by default.

python deformation_control.py --style=style1 --alpha0=-0.5 --alpha1=1. 
<p> <img src="assets/deformation.jpg" width="2873" alt=""/> </p>

Train on your own style images

Prepare your (aligned) paired images as real-style samples and place them in the ./data/style_images_aligned folder. Make sure the pretrained StyleGANv2 stylegan2-ffhq-config-f.pt and alexnet alex.pth, alexnet-owt-7be5be79.pth have been downloaded and placed in ./checkpoints.

To start training on your own style images, run:

python train.py --style=[STYLE_NAME] --source=[REAL_IMAGE_PATH] --target=[TARGET_IMAGE_PATH] 

For example,

python train.py --style=style1 --source=source1.png --target=target1.png
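At a high level, the fine-tuning loop combines the two loss sketches from the Abstract above. Everything in the sketch (w_sampler, g_frozen, dino_tokens, the step count and learning rate) is an assumption for illustration; consult train.py for the actual objectives, style-mixing, and schedule.

import torch

def finetune(g_t, g_frozen, w_sampler, real_src, style_ref, dino_tokens,
             steps=600, lr=2e-3):
    opt = torch.optim.Adam(g_t.parameters(), lr=lr)
    for _ in range(steps):
        w = w_sampler()                                      # random latent codes
        with torch.no_grad():
            src, _ = g_frozen([w], input_is_latent=True)     # frozen source-domain face
        out, _ = g_t([w], input_is_latent=True)              # deformed + stylized face
        loss = (directional_deformation_loss(real_src, style_ref, src, out, dino_tokens)
                + structural_consistency_loss(src, out, dino_tokens))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return g_t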

Note:

Citation

@inproceedings{zhou2024deformable,
    title = {Deformable One-shot Face Stylization via DINO Semantic Guidance},
    author = {Yang Zhou and Zichong Chen and Hui Huang},
    booktitle = {CVPR},
    year = {2024}}

Acknowledgments

The StyleGANv2 code is borrowed from this PyTorch implementation by @rosinality. The implementation of e4e projection is also largely borrowed from encoder4editing. This code also contains submodules inspired by Splice, few-shot-gan-adaptation, and tps_stn_pytorch.