Official NeRFFaceSpeech Code
NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior, CVPR 2024 Workshop on AI for Content Creation (AI4CC)
Setting
We have confirmed that the code runs under the following conditions.
Python 3.7.16 // CUDA 11.7 // GPU: RTX 3090
git clone https://github.com/rlgnswk/NeRFFaceSpeech_Code.git
cd NeRFFaceSpeech_Code/
conda env create -f environment.yml
conda activate nerffacespeech
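After activating the environment, a quick sanity check can confirm that the interpreter matches the tested configuration. The following is a minimal sketch that uses only the standard library; the helper names `python_matches` and `find_cuda_tools` are illustrative and not part of this repository, and finding `nvcc` or `nvidia-smi` on the PATH only suggests, but does not prove, that CUDA 11.7 is installed.

```python
import shutil
import sys

# Major/minor version of the configuration confirmed above (Python 3.7.16).
TESTED_PYTHON = (3, 7)


def python_matches(version_info=sys.version_info, tested=TESTED_PYTHON):
    """Return True if the interpreter's major.minor equals the tested one."""
    return tuple(version_info[:2]) == tested


def find_cuda_tools():
    """Map common CUDA command-line tools to their path (None if not on PATH)."""
    return {tool: shutil.which(tool) for tool in ("nvcc", "nvidia-smi")}


if __name__ == "__main__":
    print("Python:", sys.version.split()[0],
          "(matches tested version)" if python_matches() else "(differs from tested 3.7)")
    for tool, path in find_cuda_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A mismatch does not necessarily mean the code will fail; it only indicates that the setup differs from the one the authors confirmed.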
Please install Nvdiffrast inside the Deep3DFaceRecon_pytorch folder.
cd Deep3DFaceRecon_pytorch
git clone https://github.com/NVlabs/nvdiffrast
cd nvdiffrast
pip install .
Download
mkdir pretrained_networks
Download SadTalker_V0.0.2_256.safetensors from https://github.com/OpenTalker/SadTalker/releases and place it in NeRFFaceSpeech_Code/pretrained_networks/sad_talker_pretrained
Download BFM09_model_info.mat from https://huggingface.co/wsj1995/sadTalker/blob/af80749f8c9af3702fbd0272df14ff086986a1de/BFM09_model_info.mat and place it in NeRFFaceSpeech_Code/pretrained_networks/BFM_for_3DMM-Fitting-Pytorch/BFM
Thanks to @nitinmukesh for the reports.
Place the Pretrained Weights in pretrained_networks/
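Before running the commands, it can help to verify that the downloaded files ended up where the scripts expect them. The sketch below checks only the three files named in this README (the two downloads above and the `ffhq_1024.pkl` network used in the commands); the list may be incomplete, and `missing_weights` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Files mentioned in this README, relative to pretrained_networks/.
# Assumption: other weights may also be required; this list is not exhaustive.
EXPECTED_FILES = [
    "ffhq_1024.pkl",
    "sad_talker_pretrained/SadTalker_V0.0.2_256.safetensors",
    "BFM_for_3DMM-Fitting-Pytorch/BFM/BFM09_model_info.mat",
]


def missing_weights(root="pretrained_networks"):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing files under pretrained_networks/:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All files listed in the README were found.")
```

Run it from the NeRFFaceSpeech_Code/ directory so that the relative path `pretrained_networks` resolves correctly.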
Command (Generated from Latent Space)
python StyleNeRF/main_NeRFFaceSpeech_audio_driven_from_z.py \
--outdir=out_test_z --trunc=0.7 \
--network=pretrained_networks/ffhq_1024.pkl \
--test_data="test_data/test_audio/AdamSchiff_0.wav" \
--seeds=6;
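To generate results for several latent seeds, the command above can be wrapped in a small script. This is a sketch under assumptions: it uses only the flags shown above, the per-seed output directory naming (`out_test_z_seed<N>`) is hypothetical, and `build_command` and `run_seeds` are illustrative helpers that do not exist in this repository. By default it only prints the commands (dry run).

```python
import subprocess
import sys

SCRIPT = "StyleNeRF/main_NeRFFaceSpeech_audio_driven_from_z.py"


def build_command(seed,
                  audio="test_data/test_audio/AdamSchiff_0.wav",
                  network="pretrained_networks/ffhq_1024.pkl",
                  trunc=0.7):
    """Build the argument list for one seed, mirroring the README command."""
    return [
        sys.executable, SCRIPT,
        f"--outdir=out_test_z_seed{seed}",  # hypothetical naming, one folder per seed
        f"--trunc={trunc}",
        f"--network={network}",
        f"--test_data={audio}",
        f"--seeds={seed}",
    ]


def run_seeds(seeds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for seed in seeds:
        cmd = build_command(seed)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_seeds([6, 7, 8])  # dry run: prints the commands without executing them
```

Call `run_seeds([6, 7, 8], dry_run=False)` from the NeRFFaceSpeech_Code/ directory to actually launch the runs one after another.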
Command (Generated from Real Image)
The inversion process for a real image takes some time.
python StyleNeRF/main_NeRFFaceSpeech_audio_driven_from_image.py \
--outdir=out_test_real --trunc=0.7 \
--network=pretrained_networks/ffhq_1024.pkl \
--test_data="test_data/test_audio/AdamSchiff_0.wav" \
--test_img="test_data/test_img/32.png";
Command (Pose Varying)
The first command varies the head pose only.
The second command varies both head pose and expression according to the video frames (in this case, the audio input is used only for the initial frame).
The video frames must be ones from which the head pose can be predicted.
python StyleNeRF/main_NeRFFaceSpeech_audio_driven_w_given_poses.py \
--outdir=out_test_given_pose --trunc=0.7 \
--network=pretrained_networks/ffhq_1024.pkl \
--test_data="test_data/test_audio/AdamSchiff_0.wav" \
--test_img="test_data/test_img/AustinScott0_0_cropped.jpg" \
--motion_guide_img_folder="driving_frames";
python StyleNeRF/main_NeRFFaceSpeech_video_driven.py \
--outdir=out_test_video_driven --trunc=0.7 \
--network=pretrained_networks/ffhq_1024.pkl \
--test_data="test_data/test_audio/AdamSchiff_0.wav" \
--test_img="test_data/test_img/DougJones_0_cropped.jpg" \
--motion_guide_img_folder="driving_frames";
Custom Data for Use
If you want to use new audio and image data, you must follow the formats of StyleNeRF for image data and Wav2Lip or SadTalker for audio data.
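One low-effort way to check new audio is to compare its basic properties with the bundled sample that is known to work. The sketch below reads a WAV header with Python's standard `wave` module (PCM WAV files only). It does not encode the actual requirements of Wav2Lip or SadTalker, which are defined by those projects; `describe_wav` is a hypothetical helper.

```python
import os
import wave


def describe_wav(path):
    """Return channel count, sample rate, sample width and duration of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    # Sample audio shipped with the repository (path taken from the commands above).
    sample = "test_data/test_audio/AdamSchiff_0.wav"
    if os.path.isfile(sample):
        print(sample, describe_wav(sample))
```

If a new file reports a different sample rate or channel count than the bundled sample, converting it to match is a reasonable first step before debugging anything else.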
Post-processing @ nitinmukesh
GFPGAN can be applied as a post-processing step. It is also used with other methods and can help produce better results. Please refer to the issue!
Caution: Error Accumulation
The proposed method may not work well when errors accumulate, such as landmark prediction errors and inversion (reconstruction) errors.
Ethical Use
This project is intended for research and educational purposes only. Misuse of this technology for deceptive practices is strictly discouraged.
Acknowledgement
We thank the authors of StyleNeRF, PTI, Wav2Lip, SadTalker, Deep3Drecon and 3DMM-Fitting for sharing their code and baselines.
Citation
@misc{kim2024nerffacespeech,
title={NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior},
author={Gihoon Kim and Kwanggyoon Seo and Sihun Cha and Junyong Noh},
year={2024},
eprint={2405.05749},
archivePrefix={arXiv},
primaryClass={cs.CV}}
@inproceedings{kim2024nerffacespeech,
title={NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior},
author={Gihoon Kim and Kwanggyoon Seo and Sihun Cha and Junyong Noh},
booktitle={IEEE Computer Vision and Pattern Recognition Workshops},
year={2024}}