Deformable One-shot Face Stylization via DINO Semantic Guidance
Yang Zhou, Zichong Chen, Hui Huang
Shenzhen University
<p> <img src="assets/teaser.jpg" width="2873" alt=""/> </p>[project page] [paper] [supplementary]
Abstract
This paper addresses the complex issue of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference, opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization, achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate the superiority of our approach over existing state-of-the-art one-shot face stylization methods.
Overview
<p> <img src="assets/pipeline.jpg" width="2873" alt=""/> </p>Given a single real-style paired reference, we fine-tune a deformation-aware generator $G_t$ that simultaneously realizes geometry deformation and appearance transfer.
Getting started
Requirements
We have tested on:
- Both Linux and Windows
- NVIDIA GPU + CUDA 11.6
- Python 3.9
- PyTorch 1.13.0
- torchvision 0.14.0
Install all the required libraries with pip install -r requirements.txt
Pretrained Models
Please download the pre-trained models from Google Drive.
Model | Description |
---|---|
StyleGANv2 | StyleGANv2 model pretrained on FFHQ with 1024x1024 output resolution. |
e4e_ffhq_encode | FFHQ e4e encoder. |
alexnet | Pretrained alexnet, alex.pth and alexnet-owt-7be5be79.pth. |
shape_predictor_68_face_landmarks | Face detector for extracting face landmarks. |
style1 | Generator with STNs trained on one-shot paired data source1.png and target1.png. |
style2 | Generator with STNs trained on one-shot paired data source2.png and target2.png. |
style3 | Generator with STNs trained on one-shot paired data source3.png and target3.png. |
style4 | Generator with STNs trained on one-shot paired data source4.png and target4.png. |
style5 | Generator with STNs trained on one-shot paired data source5.png and target5.png. |
style6 | Generator with STNs trained on one-shot paired data source6.png and target6.png. |
style7 | Generator with STNs trained on one-shot paired data source7.png and target7.png. |
style8 | Generator with STNs trained on one-shot paired data source8.png and target8.png. |
By default, we assume that all models are downloaded and saved to the directory ./checkpoints.
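Assuming the filenames mentioned elsewhere in this README, the checkpoint folder should look roughly as follows. Names in angle brackets are not stated explicitly in this README; use whatever filenames the Google Drive download provides.

```
checkpoints/
├── stylegan2-ffhq-config-f.pt              # StyleGANv2 pretrained on FFHQ
├── alex.pth                                # pretrained alexnet
├── alexnet-owt-7be5be79.pth                # pretrained alexnet
├── shape_predictor_68_face_landmarks.dat   # face landmark model
├── <e4e encoder weights>                   # e4e_ffhq_encode
└── <style1 ... style8 generator weights>   # pretrained styles with STNs
```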
Inference
Transfer the pretrained style onto a given image. Results are saved in the ./outputs/inference folder by default.
python inference.py --style=style3 --input_image=./data/test_inputs/002.png --alpha=0.8
Note: We use the pretrained e4e encoder for input-image inversion; make sure it has been downloaded and placed in ./checkpoints. Although using e4e saves inference time, the final results sometimes differ from the input images.
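If you want to stylize many images at once, one option is to loop over the documented CLI. The sketch below only uses the flags shown above; the input folder path is hypothetical.

```python
# Batch-stylize every image in a folder by repeatedly invoking inference.py
# with its documented flags (--style, --input_image, --alpha).
import subprocess
from pathlib import Path

for img in sorted(Path('./data/test_inputs').glob('*.png')):  # hypothetical input folder
    subprocess.run(['python', 'inference.py',
                    '--style=style3',
                    f'--input_image={img}',
                    '--alpha=0.8'],
                   check=True)
```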
Generation
Random stylized faces
Generate random face images using pretrained styles. Results are saved in the ./outputs/generate folder by default.
python generate_samples.py --style=style1 --seed=2024 --alpha=0.8
<p>
<img src="assets/generate.jpg" width="2873" alt=""/>
</p>
Controllable face deformation
Generate random face images using pretrained styles with varying degrees of deformation control. Results are saved in the ./outputs/control folder by default.
python deformation_control.py --style=style1 --alpha0=-0.5 --alpha1=1.
<p>
<img src="assets/deformation.jpg" width="2873" alt=""/>
</p>
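Presumably --alpha0 and --alpha1 bound the range of deformation strengths the script sweeps over; this is an assumption based on the flags above. Conceptually, the intermediate degrees would be sampled like this:

```python
import torch

# Hypothetical sweep of deformation degrees between --alpha0 and --alpha1;
# the actual number of steps is decided inside deformation_control.py.
alphas = torch.linspace(-0.5, 1.0, steps=5)  # tensor([-0.5, -0.125, 0.25, 0.625, 1.0])
```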
Train on your own style images
Prepare your (aligned) paired images as real-style samples and place them in the ./data/style_images_aligned folder.
Make sure the pretrained StyleGANv2 stylegan2-ffhq-config-f.pt and the alexnet weights alex.pth and alexnet-owt-7be5be79.pth have been downloaded and placed in ./checkpoints.
To start training on your own style images, run:
python train.py --style=[STYLE_NAME] --source=[REAL_IMAGE_PATH] --target=[TARGET_IMAGE_PATH]
For example,
python train.py --style=style1 --source=source1.png --target=target1.png
Note:
- If your face images are not aligned, check that the face model shape_predictor_68_face_landmarks.dat has been downloaded and placed in ./checkpoints, then run the following command for face alignment: python face_align.py --path=[YOUR_IMAGE_PATH] --output=[PATH_TO_SAVE]
- DINO-ViT is downloaded automatically. We use dino_vitb8 in our experiments (see the loading sketch after these notes).
- Training requires ~22 GB of VRAM and takes about 13 minutes on average on a single NVIDIA RTX 3090.
- The trained generator will be saved in ./outputs/models.
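For reference, the dino_vitb8 backbone mentioned above can be fetched through torch.hub; train.py handles this automatically, so the snippet below is only to show what gets downloaded.

```python
import torch

# ViT-B/8 backbone from the official DINO repository (downloaded via torch.hub).
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
dino.eval()
```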
Citation
@inproceedings{zhou2024deformable,
  title     = {Deformable One-shot Face Stylization via DINO Semantic Guidance},
  author    = {Zhou, Yang and Chen, Zichong and Huang, Hui},
  booktitle = {CVPR},
  year      = {2024}
}
Acknowledgments
The StyleGANv2 code is borrowed from this PyTorch implementation by @rosinality. The e4e projection is also heavily based on encoder4editing. This code also contains submodules inspired by Splice, few-shot-gan-adaptation, and tps_stn_pytorch.