DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model<br><sub>Official PyTorch implementation of the CVPR 2023 paper</sub>

<p align="center"> <img src="assets/datid3d_result.gif"/> </p>

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model<br> Gwanghyun Kim, Se Young Chun <br> CVPR 2023 <br>

gwang-kim.github.io/datid_3d

Abstract: <br> Recent 3D generative models have achieved remarkable performance in synthesizing high resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information.
Text-guided domain adaptation methods have shown impressive performance on converting the 2D generative model on one domain into the models on other domains with different styles by leveraging the CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of them is that the sample diversity in the original generative model is not well-preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation will be even more challenging for 3D generative models not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a novel pipeline of text-guided domain adaptation tailored for 3D generative models using text-to-image diffusion models that can synthesize diverse images per text prompt without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high resolution, multi-view consistent images in text-guided targeted domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to fully enjoy diversity in text.

Recent Updates

Requirements

Demo

Gradio Demo

python datid3d_gradio_app.py
<p align="center"> <img src="assets/datid3d_gradio.gif" /> </p>

Colab Demo

Download Fine-tuned 3D Generative Models

3D generative models fine-tuned with the DATID-3D pipeline are stored as *.pkl files. You can download them from our Hugging Face model page.

mkdir -p finetuned
wget https://huggingface.co/gwang-kim/datid3d-finetuned-eg3d-models/resolve/main/finetuned_models/ffhq-pixar.pkl -O finetuned/ffhq-pixar.pkl
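
The same download can be scripted in Python. This is a minimal sketch that assumes other checkpoints follow the URL pattern of the command above; `model_url` and `download_model` are hypothetical helper names, not part of this repository.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the wget command above.
BASE_URL = (
    "https://huggingface.co/gwang-kim/datid3d-finetuned-eg3d-models"
    "/resolve/main/finetuned_models"
)


def model_url(name: str) -> str:
    """Build the download URL for a checkpoint such as 'ffhq-pixar'."""
    return f"{BASE_URL}/{name}.pkl"


def download_model(name: str, dest_dir: str = "finetuned") -> Path:
    """Download one checkpoint into dest_dir and return its local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{name}.pkl"
    if not target.exists():  # skip files that are already present
        urllib.request.urlretrieve(model_url(name), str(target))
    return target
```

Calling `download_model("ffhq-pixar")` should leave the file at `finetuned/ffhq-pixar.pkl`, which is the path the sampling commands expect.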

Sample Images, Shapes and Videos

You can sample images, shapes (as .mrc files), and pose-controlled videos using the shifted 3D generative model. For example:

# Sample images and shapes (as .mrc files) using the shifted 3D generative model

python datid3d_test.py --mode image \
--generator_type='ffhq' \
--outdir='test_runs' \
--seeds='100-200' \
--trunc='0.7' \
--shape=True \
--network=finetuned/ffhq-pixar.pkl 
# Sample pose-controlled videos using the shifted 3D generative model

python datid3d_test.py --mode video \
--generator_type='ffhq' \
--outdir='test_runs' \
--seeds='100-200' \
--trunc='0.7' \
--grid=4x4 \
--network=finetuned/ffhq-pixar.pkl 

The results are saved to ~/test_runs/image or ~/test_runs/video.
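
To sample from several checkpoints in a row, the two commands above can be assembled programmatically. The sketch below uses only the flags shown in the examples; `build_sample_cmd` and `run` are hypothetical helpers, and pairing `--shape` with image mode and `--grid` with video mode simply mirrors those examples.

```python
import shlex
import subprocess
from typing import List


def build_sample_cmd(mode: str, network: str, seeds: str = "100-200",
                     trunc: float = 0.7, outdir: str = "test_runs") -> List[str]:
    """Assemble a datid3d_test.py call using only the flags shown above."""
    if mode not in ("image", "video"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = [
        "python", "datid3d_test.py", "--mode", mode,
        "--generator_type=ffhq",
        f"--outdir={outdir}",
        f"--seeds={seeds}",
        f"--trunc={trunc}",
        f"--network={network}",
    ]
    # Mirror the examples: --shape for image sampling, --grid for video sampling.
    cmd.append("--shape=True" if mode == "image" else "--grid=4x4")
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command and, unless dry_run is set, execute it."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
```

For instance, `run(build_sample_cmd("video", "finetuned/ffhq-pixar.pkl"))` prints the video command without executing it; pass `dry_run=False` to launch it from the repository root.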

Following EG3D, we visualize our .mrc shape files with UCSF ChimeraX.

To visualize a shape in ChimeraX do the following:

  1. Import the .mrc file with File > Open
  2. Find the selected shape in the Volume Viewer tool
    1. The Volume Viewer tool is located under Tools > Volume Data > Volume Viewer
  3. Change volume type to "Surface"
  4. Change step size to 1
  5. Change level set to 10
    1. Note that the optimal level varies from object to object, but is usually between 2 and 20. Adjusting it per shape can make some shapes slightly sharper.
  6. In the Lighting menu in the top bar, change lighting to "Full"
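
Before choosing a level in ChimeraX, it can help to check the value range stored in a shape file. The following sketch reads the standard 1024-byte MRC header and the voxel data using only the Python standard library. It assumes a little-endian, mode-2 (float32) file and a little-endian host; whether the files written by this repository match that layout is an assumption, and `read_mrc_stats` is a hypothetical helper.

```python
import struct
from array import array

HEADER_BYTES = 1024  # fixed size of the main MRC header


def read_mrc_stats(path):
    """Return ((nx, ny, nz), mode, min, max) for a mode-2 (float32) MRC file."""
    with open(path, "rb") as f:
        header = f.read(HEADER_BYTES)
        # Words 1-4 of the header: NX, NY, NZ, MODE (32-bit integers).
        nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
        # Word 24: size of the optional extended header in bytes.
        (ext_bytes,) = struct.unpack_from("<i", header, 92)
        if mode != 2:
            raise ValueError(f"only mode 2 (float32) is handled, got {mode}")
        f.seek(HEADER_BYTES + ext_bytes)
        values = array("f")  # native-endian float32
        values.fromfile(f, nx * ny * nz)
    return (nx, ny, nz), mode, min(values), max(values)
```

If the reported maximum is well below 10, a lower level than the default suggested above is probably needed. For large volumes a NumPy-based reader would be considerably faster than this pure-Python version.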

Single-shot Text-guided 2D-to-3D

Text-guided Manipulated 3D Reconstruction

This mode runs alignment -> pose extraction -> 3D GAN inversion -> image generation with the fine-tuned generator. We use Deep3DFaceRecon as the pose estimation model. The pretrained pose estimation model is downloaded automatically for convenience. Alternatively, you can download the pretrained pose estimation model and BFM files manually, put epoch_20.pth in ~/pose_estimation/checkpoints/pretrained/, and unzip BFM.zip in ~/pose_estimation/. For example:

# Text-guided manipulated 3D reconstruction from images using the shifted 3D generative model

python datid3d_test.py --mode manip \
--indir='input_imgs' \
--generator_type='ffhq' \
--outdir='test_runs' \
--trunc='0.7' \
--network=finetuned/ffhq-pixar.pkl 

The results are saved to ~/test_runs/manip_3D_recon/4_manip_result.
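
If you install the pose estimation files manually, a quick check of the expected locations can catch a misplaced file before the run starts. This sketch assumes that `~` in the paths above denotes the repository root and that unzipping BFM.zip produces a `BFM` directory; both are assumptions, and `missing_pose_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_pose_files(repo_root="."):
    """List the manually installed pose-estimation paths that are absent."""
    root = Path(repo_root) / "pose_estimation"
    required = [
        root / "checkpoints" / "pretrained" / "epoch_20.pth",
        root / "BFM",  # assumed name of the directory extracted from BFM.zip
    ]
    return [str(p) for p in required if not p.exists()]
```

An empty list means both paths exist; any returned entries name what still needs to be put in place.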

Text-guided Domain Adaptation of 3D Generator

You can perform text-guided domain adaptation of a 3D generator with your own text prompt using datid3d_train.py. For example:

python datid3d_train.py \
   --mode='ft' \
   --pdg_prompt='a FHD photo of face of beautiful Elf with silver hair in the live action movie' \
   --pdg_generator_type='ffhq' \
   --pdg_strength=0.7 \
   --pdg_num_images=1000 \
   --pdg_sd_model_id='stabilityai/stable-diffusion-2-1-base' \
   --pdg_num_inference_steps=50 \
   --ft_generator_type='same' \
   --ft_batch=20 \
   --ft_kimg=200

The results of each training run are saved to a newly created directory, for example ~/training_runs/00011-ffhq-data_ffhq_a_FHD_photo_of_face_of_beautiful_Elf_with_silver_hair_in_the_live_action_movie-gpus1-batch20-gamma5.
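
Because every run gets its own numbered directory, a small helper can locate the most recent one. The sketch below assumes run directories begin with a numeric prefix followed by a hyphen, as in the example name above; `latest_run_dir` is a hypothetical helper, and the default root `training_runs` is taken from that example.

```python
import re
from pathlib import Path


def latest_run_dir(root="training_runs"):
    """Return the run directory with the highest numeric prefix, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    best, best_id = None, -1
    for d in root.iterdir():
        m = re.match(r"(\d+)-", d.name)
        if d.is_dir() and m and int(m.group(1)) > best_id:
            best, best_id = d, int(m.group(1))
    return best
```

Entries without a numeric prefix are ignored, so notes or logs stored alongside the runs do not affect the result.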

Citation

@inproceedings{kim2022datid3d,
  author = {Gwanghyun Kim and Se Young Chun},
  title = {DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model},
  booktitle = {CVPR},
  year = {2023}
}

Acknowledgements

We thank the authors of the following public projects for sharing their code. We apply our pipeline to EG3D, one of the 3D generative models, and adopt Stable Diffusion as our text-to-image diffusion model and Deep3DFaceRecon as our pose estimation model. We also utilize part of the code from HFGI3D.