ProbTalk3D

Official PyTorch implementation for the paper:

ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE. (Accepted at ACM SIGGRAPH MIG 2024)

<a href='https://uuembodiedsocialai.github.io/ProbTalk3D/'><img src='https://img.shields.io/badge/Project-Website-blue'></a> <a href='https://arxiv.org/pdf/2409.07966'><img src='https://img.shields.io/badge/arXiv-Paper-red'></a> <a href='https://uuembodiedsocialai.github.io/ProbTalk3D/#video-container'><img src='https://img.shields.io/badge/Project-Video-Green'></a>

We propose ProbTalk3D, a VQ-VAE-based probabilistic model for emotion-controllable, speech-driven 3D facial animation synthesis. ProbTalk3D first learns a motion prior using VQ-VAE codebook matching, then trains a speech- and emotion-conditioned network on top of this prior. During inference, probabilistic sampling of latent codebook embeddings enables non-deterministic outputs.
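For readers unfamiliar with this sampling step, the minimal PyTorch sketch below illustrates how non-determinism can arise from a VQ-VAE codebook: instead of always taking the nearest code, a codebook index is sampled per frame from a distribution derived from the distances. All names, shapes, and the softmax-over-distances choice are hypothetical illustrations, not the model implemented in this repository.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: 256 codebook entries, 128-dim latents, T frames.
num_codes, latent_dim, T = 256, 128, 100
codebook = torch.randn(num_codes, latent_dim)    # stand-in for a learned VQ-VAE codebook
predicted_latents = torch.randn(T, latent_dim)   # stand-in for speech/emotion-conditioned predictions

# Distance of each predicted latent to every codebook entry: (T, num_codes).
dists = torch.cdist(predicted_latents, codebook)

# Deterministic alternative: nearest-neighbour codebook lookup.
nearest = dists.argmin(dim=-1)

# Probabilistic sampling: treat negative distances as logits and sample one
# codebook index per frame, so repeated runs yield different motions.
probs = F.softmax(-dists, dim=-1)
sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)

quantized = codebook[sampled]                    # (T, latent_dim) latents for the motion decoder
print(nearest[:5], sampled[:5], quantized.shape)
```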

Environment

<details><summary>Click to expand</summary>

System Requirements

The setup below assumes Python 3.9 and the CUDA 12.1 build of PyTorch, i.e. a CUDA-capable NVIDIA GPU.

Virtual Environment

To run our program, first create a virtual environment. We recommend using Miniconda or Miniforge. Once Miniconda or Miniforge is installed, open a command prompt (on Windows, make sure to run it as Administrator) and run the following commands:

```
conda create --name probtalk3d python=3.9
conda activate probtalk3d
pip install torch==2.1.1+cu121 torchvision==0.16.1+cu121 torchaudio==2.1.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
```

Then, navigate to the project root folder and execute:

```
pip install -r requirements.txt
```
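To verify that the CUDA 12.1 build of PyTorch was installed correctly, a quick sanity check such as the following can be run inside the activated environment (this is only a check, not part of the repository's setup; the version pin comes from the install command above):

```python
import torch

# Should print 2.1.1+cu121 and True on a machine with a CUDA 12.1-capable GPU.
print(torch.__version__)
print(torch.cuda.is_available())
```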
</details>

Dataset

<details><summary>Click to expand</summary>

Download the 3DMEAD dataset following the instructions of EMOTE. This dataset represents facial animations using FLAME parameters.

Data Download and Preprocessing

</details>

Model Training

<details><summary>Click to expand</summary>

To train the model from scratch, follow the two-stage training approach outlined below.
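The repository's training scripts and configuration files define the actual procedure; purely as a conceptual sketch with hypothetical module and method names, the two stages can be pictured as follows:

```python
import torch
import torch.nn.functional as F

# Stage 1 (hypothetical sketch): learn a motion prior by training a VQ-VAE
# to reconstruct facial motion through a discrete codebook.
def train_stage1(vqvae, motion_loader, optimizer):
    for motion in motion_loader:
        recon, commit_loss = vqvae(motion)          # assumed API: encode -> quantize -> decode
        loss = F.mse_loss(recon, motion) + commit_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 2 (hypothetical sketch): freeze the prior and train a speech- and
# emotion-conditioned network to predict latents matching the codebook.
def train_stage2(predictor, vqvae, paired_loader, optimizer):
    vqvae.requires_grad_(False)
    for audio, emotion, motion in paired_loader:
        target_latents = vqvae.encode(motion)       # assumed encoder API
        pred_latents = predictor(audio, emotion)
        loss = F.mse_loss(pred_latents, target_latents)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```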

Stage 1

For the first stage of training, use the following commands:

Stage 2

After completing stage 1 training, execute the following command to proceed with stage 2 training. Set model.folder and model.version to the location where the motion prior checkpoint is stored:

</details>

Evaluation

<details><summary>Click to expand</summary>

Download the trained model weights from HERE and unzip them into the project root folder.

Quantitative Evaluation

We provide code to compute the evaluation metrics mentioned in our paper. To evaluate our trained model, run the following:
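The metrics themselves are defined in the paper and computed by the repository's evaluation code. Purely for intuition, the hypothetical sketch below shows the general shape of a distance-based error between a predicted and a ground-truth motion sequence; it is not the repository's evaluation script, and the array names and shapes are made up:

```python
import numpy as np

def mean_l2_error(pred, gt):
    """Mean per-frame L2 distance between two motion sequences.

    pred, gt: arrays of shape (frames, features), e.g. stacked motion
    parameters or mesh vertices; purely illustrative.
    """
    assert pred.shape == gt.shape
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Hypothetical usage with dummy data:
pred = np.random.randn(100, 53)
gt = np.random.randn(100, 53)
print(mean_l2_error(pred, gt))
```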

Qualitative Evaluation

For qualitative evaluation, refer to the script evaluation_quality.py.

</details>

Animation Generation

<details><summary>Click to expand</summary>

Download the trained model weights from HERE and unzip them into the project root folder.

Generate Prediction

Our model is trained to generate animations across 32 speaking styles (IDs), 8 emotions, and 3 intensity levels. All available conditions are listed below:

<details><summary>Click to expand</summary>

ID:
M003, M005, M007, M009, M011, M012, M013, M019,
M022, M023, M024, M025, M026, M027, M028, M029,
M030, M031, W009, W011, W014, W015, W016, W018,
W019, W021, W023, W024, W025, W026, W028, W029

emotion:

neutral, happy, sad, surprised, fear, disgusted, angry, contempt

intensity (0, 1, and 2 correspond to low, medium, and high intensity, respectively; an assumed index mapping is sketched after this list):

0, 1, 2
</details>
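The FaceDiffuser comparison command further below passes emotion and intensity as integers (e.g. --emotion 6 --intensity 1 for angry.wav), which suggests the integers index the lists above in order. The mapping below is an assumption for illustration, not taken from the code:

```python
# Assumed index mapping, following the order of the lists above.
EMOTIONS = ["neutral", "happy", "sad", "surprised",
            "fear", "disgusted", "angry", "contempt"]
INTENSITIES = ["low", "medium", "high"]

def describe(emotion_id: int, intensity_id: int) -> str:
    return f"{EMOTIONS[emotion_id]} ({INTENSITIES[intensity_id]} intensity)"

print(describe(6, 1))  # -> "angry (medium intensity)", matching the angry.wav example below
```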

We provide several test audio files. Run the following command to generate animations (with a random style) using the trained ProbTalk3D. This produces .npy files that can be rendered into videos.

Render

The generated .npy files contain FLAME parameters and can be rendered into videos following the instructions below.
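To inspect a generated file before rendering, it can be loaded with NumPy. The path below is a placeholder and the exact array layout depends on the export format used by this repository, so the snippet only shows how to peek at the data (the render_vert.py command in the Comparison section illustrates the key=value rendering syntax):

```python
import numpy as np

# Placeholder path to one generated sequence of FLAME parameters.
motion = np.load("results/generation/example.npy")

# Print the shape and dtype to see what the rendering script will consume.
print(motion.shape, motion.dtype)
```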

</details>

Comparison

<details><summary>Click to expand</summary>

To compare against the modified version of the diffusion model FaceDiffuser, navigate to the diffusion folder.

Model Training

To train the model from scratch, execute the following command:

```
python main.py
```

Evaluation

To quantitatively evaluate our trained FaceDiffuser model, run the following command:

```
python evaluation_facediff.py --save_path "../model_weights/FaceDiffuser" --max_epoch 50
```

Animation Generation

Generate Prediction

To generate animations using our trained model, execute the following command. Modify the path and style settings as needed.

```
python predict.py --save_path "../model_weights/FaceDiffuser" --epoch 50 --subject "M009" --id "M009" --emotion 6 --intensity 1 --wav_path "../results/generation/test_audio/angry.wav"
```

Render

Navigate back to the project root folder and run the following command:

```
python render_vert.py result_folder=diffusion/results/generation audio_folder=results/generation/test_audio
```
</details>

Citation

If you find the code useful for your work, please consider starring this repository and citing it:

```
@inproceedings{Probtalk3D_Wu_MIG24,
    author = {Wu, Sichun and Haque, Kazi Injamamul and Yumak, Zerrin},
    title = {ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE},
    booktitle = {The 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG '24), November 21--23, 2024, Arlington, VA, USA},
    year = {2024},
    location = {Arlington, VA, USA},
    numpages = {12},
    url = {https://doi.org/10.1145/3677388.3696320},
    doi = {10.1145/3677388.3696320},
    publisher = {ACM},
    address = {New York, NY, USA}
}
```

Acknowledgements

We borrow and adapt code from Learning to Listen, CodeTalker, TEMOS, FaceXHuBERT, and FaceDiffuser. We appreciate the authors for making their code available and facilitating future research. Additionally, we are grateful to the creators of the 3DMEAD dataset used in this project.

Any third-party packages are owned by their respective authors and must be used under their respective licenses.

License

This repository is released under the CC BY-NC 4.0 International license.