<p align="center"> <font color=#008000>Fantasia3D</font>: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation </p>

<p align="center"> Rui Chen*, Yongwei Chen*, Ningxin Jiao, Kui Jia</p>
<p align="center"> ICCV2023
<p align="center"> *equal contribution

<p align="center">Paper | ArXiv | Project Page | Supp_material | Video</p>

<p align="center"> <img width="40%" src="assets/head_figure.jpg"/> </p>

https://user-images.githubusercontent.com/128572637/fe5a05d3-33af-41c4-a5c0-e74485797f08

https://user-images.githubusercontent.com/128572637/691a1c2d-0c55-4b2e-8dd6-82fddc2685a6

https://user-images.githubusercontent.com/128572637/99ab7e61-eb81-4b75-8138-3321b6633d78

https://user-images.githubusercontent.com/128572637/405fe77e-25c0-410f-b463-e1e3ded2f065

Update log

Please pull the latest code for improved performance!

FAQs

Q1: Why use normal and mask images as the input to the Stable Diffusion model, and what is the analysis behind this choice?

Answer: Our initial hypothesis is that normal and mask images, which represent the local and silhouette information of a shape respectively, can benefit geometry learning. In addition, the value range of the normal map is normalized to (-1, 1), which aligns with the data range expected by latent-space diffusion. Our empirical studies validate this hypothesis. Further support comes from the presence of normal images in the LAION-5B dataset used to train Stable Diffusion (see Website for retrieval of normal data in LAION-5B), so normal data is not an out-of-distribution (OOD) input for Stable Diffusion. To handle the rough, coarse geometry in the early stage of learning, we directly use the concatenated 64 $\times$ 64 $\times$ 4 (normal, mask) images as the latent code, inspired by Latent-NeRF, to achieve better convergence. However, using the world-space normal map without VAE encoding may be inconsistent with the distribution of the VAE-trained latent space; this mismatch can cause the generated geometry to deviate from the text description in some cases. To address this, we apply a data augmentation that randomly rotates the normal map rendered from the current view, which brings its distribution closer to that of the latent-space data; we experimentally observe that it improves the alignment between the generated geometry and the text description. As learning progresses, rendering a 512 $\times$ 512 $\times$ 3 high-resolution normal image becomes essential for capturing finer geometric details, so we switch to using the normal image only in the later stage. This strategy strikes an accuracy-efficiency balance throughout the geometry optimization process.
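
For concreteness, here is a minimal PyTorch sketch (not the repository's exact implementation) of the early-stage input described above: the rendered normal and mask are concatenated into a 64 $\times$ 64 $\times$ 4 pseudo-latent, and a random rotation is applied to the normals as augmentation. The function name and the choice of rotating the normal vectors about a single axis are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def early_stage_latent(normal, mask, angle):
    """Assemble the 64x64x4 pseudo-latent used in the early geometry stage.

    `normal` is an (H, W, 3) world-space normal map in [-1, 1], `mask` an
    (H, W, 1) silhouette in {0, 1}, both rendered from the current view;
    `angle` (radians) is the random rotation used as data augmentation.
    """
    # Apply a random rotation to the normal vectors (one possible way to
    # realize the augmentation described above).
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]], dtype=normal.dtype)
    normal = normal @ rot.T

    # Concatenate to a 4-channel image and resize to the 64x64 latent resolution.
    x = torch.cat([normal, mask], dim=-1).permute(2, 0, 1).unsqueeze(0)
    latent = F.interpolate(x, size=(64, 64), mode="bilinear", align_corners=False)
    return latent  # fed to the latent-diffusion U-Net in place of a VAE encoding
```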

Q2: Hypothesis-verification analysis of the disentangled representation

Answer: Previous methods (e.g., DreamFusion and Magic3D) couple geometry and appearance generation together, following NeRF. Our adoption of a disentangled representation is mainly motivated by the different natures of the problems of generating surface geometry and appearance. In fact, for finer recovery of surface geometry from multi-view images, methods that explicitly take surface modeling into account (e.g., VolSDF and nvdiffrec) prevail; our disentangled representation enjoys a similar benefit. The disentangled representation also enables us to include a BRDF material representation in the appearance modeling, achieving more photo-realistic rendering thanks to the BRDF physical prior.

Q3: Can Fantasia3D directly fine-tune the mesh given by the user?

Answer: Yes, it can. Fantasia3D can take any mesh provided by the user and fine-tune it with our user-guided generation method. It also interfaces naturally with 3D generative methods such as Shap-E and Point-E. In short, Fantasia3D can generate highly detailed, high-fidelity 3D content starting from either a low-quality user-provided mesh or the default ellipsoid.

Q4: Why might the official configs fail to reproduce the same results with 4 or fewer GPUs?

Answer: The official configs are intended for 8 GPUs. The sampling algorithm proposed in the supplementary material contributes to global consistency in appearance and geometry modeling, and it depends on a large batch size. With fewer GPUs, the overall batch size is significantly smaller, which can prevent reproducing the results of the official configs. One possible solution is to manually increase the batch size in the configs.

Q5: How did strategies 0, 1, and 2 in appearance modeling come about?

Answer: The strategy weight is a hyperparameter. The weight used in DreamFusion, e.g. $\omega(t) = \sigma_t^{2}$, increases as the time step $t$ increases. This may suit volume rendering, but it is less suitable for surface rendering. In practice, I found that using the original weight makes the rendered image over-saturated and lacking in detail and realism, likely because the weight at large $t$ is excessive. I therefore wanted a weight that gradually decreases as $t$ increases, which gives strategy 0. As for strategy 1, I observed that in some cases strategy 0 produces a more realistic appearance but often introduces strange colors, so I looked for a more suitable weight. I then realized that the score function is essentially a directional gradient pointing towards the target distribution, and that it can be expressed in terms of the estimated noise, which led to strategy 1, i.e.

$$s(z_{t}; t) = -\frac{1}{\sigma_{t}}\,\varepsilon(z_{t}; t),$$

where $s$ is the score function. In practice, I observed that strategy 1 effectively alleviates the strange-color problem when used in conjunction with a time-step range of [0.02, 0.98]. However, in some cases strategy 1 leads to unrealistic results because the weight at small $t$ is too large, resulting in only small steps towards the target distribution and leaving the sample in an out-of-distribution (OOD) state all the time. Hence, strategy 2 was proposed to combine the advantages of strategies 0 and 1.
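
To make the role of the weighting concrete, the sketch below shows how a time-dependent weight could scale the SDS residual. Only the $1/\sigma_t$ factor of strategy 1 follows from the score relation above; the decreasing weight shown for strategy 0 and the blend shown for strategy 2 are illustrative placeholders, not the exact schedules used in the paper or the code.

```python
def weighted_sds_residual(eps_pred, eps, sigma_t, strategy):
    """Scale the SDS residual (eps_pred - eps) by a time-dependent weight w(t).

    eps_pred and eps are the predicted and injected noise (e.g., torch tensors);
    sigma_t is the noise level at time step t.
    """
    if strategy == 0:
        w = 1.0 - sigma_t ** 2          # placeholder: decreases as t (and sigma_t) grows
    elif strategy == 1:
        w = 1.0 / sigma_t               # implied by s(z_t; t) = -eps(z_t; t) / sigma_t
    else:
        w = 0.5 * (1.0 - sigma_t ** 2) + 0.5 / sigma_t  # placeholder blend of 0 and 1
    return w * (eps_pred - eps)
```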

Q6: How can I make the generated results more diverse?

Answer: Unlike NeRF-based volume rendering, directly using surface rendering to generate 3D assets can already achieve diversity. You only need to change parameters in the configuration file; for example, different values of "sdf_init_shape_scale", "translation_y", "camera_random_jitter", "fovy_range", "negative_text", etc. give different results, as sketched below.
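
For instance, a small Python script (a hypothetical helper, not part of the repository) could load one of the provided configs and tweak the parameters listed above. The key names come from the answer above; the example values and their types are assumptions, not recommended settings.

```python
import json

# Start from one of the provided configs (this one appears in the Start section).
with open("configs/pineapple_geometry_single_gpu.json") as f:
    cfg = json.load(f)

# The keys below are the ones mentioned above; the values are only examples.
cfg["sdf_init_shape_scale"] = [0.6, 0.6, 0.6]   # size of the initial shape
cfg["translation_y"] = 0.1                      # vertical offset of the initial shape
cfg["camera_random_jitter"] = 0.4               # strength of random camera jitter
cfg["fovy_range"] = [25.0, 45.0]                # field-of-view sampling range
cfg["negative_text"] = "shadow, oversaturated, low quality, unrealistic"

# Save as a new variant and train with it as usual.
with open("configs/pineapple_geometry_variant.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

You can then train with the new file, e.g. `python3 train.py --config configs/pineapple_geometry_variant.json`.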

Q7: Can Fantasia3D generate a photorealistic appearance without oversaturation and over-smoothing?

Answer: Yes, it can. The original SDS loss, combined with a negative prompt and strategy 2 proposed in Fantasia3D, is enough to address the oversaturation and over-smoothing problem. The appearance of the DMTet-based gallery in SweetDreamer is generated by the appearance-modeling code of Fantasia3D; you can see that all of the results are highly detailed and do not suffer from oversaturation or over-smoothing. I think the key is the disentangled representation together with the adoption of the negative prompt and strategy 2. The recommended negative prompt is "shadow, oversaturated, low quality, unrealistic". The recommended positive prompt is "a DSLR photo of ...".

Q8: What is the difference between the official code and the reproduction in threestudio?

Answer: For geometry modeling, the official code has stronger generalization ability, a more stable training process, and smoother geometry. For appearance modeling, the official code does not suffer from oversaturation or over-smoothing and achieves state-of-the-art text-to-texture performance, as mentioned in Q7.

What do you want?

Considering that parameter tuning may require some experience, what kind of object would you like me to generate? Please feel free to post requests in the Issues area. I will take some time to implement some of the requests and update the corresponding configuration files to make reproduction easier for you.

Contribute to Fantasia3D

First, convert your GIFs to videos using the Website and upload them, showing the geometry or the appearance, to the Gallery. Note down the text prompt used to generate the object, the performance, the tetrahedron resolution used for geometry modeling, and the strategy adopted for appearance modeling.

Next, upload the configuration file under the configs directory. If you upload a file for user-guided generation, the guided mesh should also be uploaded under the data directory. The file naming rules are as follows.

For the file of zero-shot geometry modeling:

{The key word of the text}_geometry_zero_shot_{the number of gpu}_gpu.json

For the file of user-guided geometry modeling:

{The key word of the text}_geometry_user_guided_{the number of gpu}_gpu.json

For the file of appearance modeling:

{The key word of the text}_appearance_strategy{the strategy adopted}_{the number of gpu}_gpu.json

Install

We provide two options for installing the environment.

After the environment is set up, clone the Fantasia3D repository and get started.

git clone https://github.com/Gorilla-Lab-SCUT/Fantasia3D.git
cd Fantasia3D

Start

All results in the paper were generated using eight NVIDIA RTX 3090 GPUs. We cannot guarantee that fewer than 8 GPUs will achieve the same results.

# Multi-GPU training
...
# Geometry modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/car_geometry.json
# Geometry modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/car_geometry.json
# Appearance modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/car_appearance_strategy0.json
# Appearance modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/car_appearance_strategy0.json
...
# Single-GPU training (only tested on the pineapple)
# Geometry modeling. It takes about 15 minutes on a 3090 GPU.
python3 train.py --config configs/pineapple_geometry_single_gpu.json
# Appearance modeling. It takes about 15 minutes on a 3090 GPU.
python3 train.py --config configs/pineapple_appearance_strategy0_single_gpu.json
# Multi-GPU training
...
# Geometry modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/Gundam_geometry.json
# Geometry modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/Gundam_geometry.json
# Appearance modeling using 8 GPUs
python3 -m torch.distributed.launch --nproc_per_node=8 train.py --config configs/Gundam_appearance.json
# Appearance modeling using 4 GPUs
python3 -m torch.distributed.launch --nproc_per_node=4 train.py --config configs/Gundam_appearance.json
...
# Single-GPU training
# Geometry modeling (untested)
python3 train.py --config configs/Gundam_geometry.json
# Appearance modeling (untested)
python3 train.py --config configs/Gundam_appearance.json

Tips

Coordinate System

<img width="30%" src="assets/coordinate_system.jpg"/>

Demos

You can download and watch the training process of some demos on Google Drive. For more demos, see here.

https://user-images.githubusercontent.com/128572637/e3e8bb82-6be0-42d0-9da3-1e59664354dd

https://user-images.githubusercontent.com/128572637/856c12bf-f100-47fc-a22c-80f123bd0a6d

https://user-images.githubusercontent.com/128572637/5872edbf-f87f-4dfe-9f71-f3941b84b8d7

https://user-images.githubusercontent.com/128572637/17ce275c-26bc-482e-ab61-a61f442de458

https://user-images.githubusercontent.com/128572637/a0d6fe70-b055-44a9-a34d-1449672dca7f

https://user-images.githubusercontent.com/128572637/ed0c303c-7554-4589-a1f5-56c9c1916aef

https://user-images.githubusercontent.com/128572637/c9867d8e-8e61-4a09-afd2-5599d6a85074

https://user-images.githubusercontent.com/128572637/01b1cc2c-5c5f-478a-83d0-1dd0ae2ee9e2

https://user-images.githubusercontent.com/128572637/0a909afb-e18c-4450-8ac8-35d50ced754a

https://user-images.githubusercontent.com/128572637/dcc5c159-fc3e-4eb8-9017-72153196f5b4

https://user-images.githubusercontent.com/128572637/244950828-21956cae-e6c4-42ce-89cd-a912c271de51.mp4

https://user-images.githubusercontent.com/128572637/244950909-0eb363f6-9bf3-4553-9090-fd1fd0003d67.mp4

https://user-images.githubusercontent.com/128572637/af266a61-afd4-451b-b4b8-89e77e96233e

https://user-images.githubusercontent.com/128572637/c0a09f43-c07f-43e9-ab9f-c49aa3bc3e2c

https://user-images.githubusercontent.com/128572637/0071b97a-93ce-4332-9f80-a3297b54f8c3

https://user-images.githubusercontent.com/128572637/27d2bce3-f126-4f91-9bcd-1199563618e8

https://user-images.githubusercontent.com/128572637/4c3e3783-2297-4b52-b67d-3c5cff4db4f4

https://user-images.githubusercontent.com/128572637/5d8f7b7f-141d-4800-8772-8fc132522390

https://user-images.githubusercontent.com/128572637/3e23c5f1-31d8-49a8-9013-123a6e97ac3b

https://user-images.githubusercontent.com/128572637/162adc7d-a416-49e5-8dde-73590119b1a9

https://user-images.githubusercontent.com/128572637/2b20a978-df20-4150-b272-5dac58d64908

Todo

Acknowledgement

BibTeX

@InProceedings{chen2023fantasia3d,
  author={Chen, Rui and Chen, Yongwei and Jiao, Ningxin and Jia, Kui},
  title={Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month={October},
  year={2023},
  pages={22246-22256}
}