Home

Awesome

CodeTalker

Official PyTorch implementation for the paper:

CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior, CVPR 2023.

Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, Tien-Tsin Wong

<a href='https://arxiv.org/abs/2301.02379'><img src='https://img.shields.io/badge/arXiv-2301.02379-red'></a> <a href='https://doubiiu.github.io/projects/codetalker/'><img src='https://img.shields.io/badge/Project-Video-Green'></a> <a href='https://colab.research.google.com/github/Doubiiu/CodeTalker/blob/main/demo.ipynb'><img src='https://img.shields.io/badge/Demo-Open in Colab-blue'></a>

<p align="center"> <img src="figure.png" width="75%"/> </p>

We propose CodeTalker by casting speech-driven facial animation as a code query task in a finite proxy space of the learned codebook. Given the raw audio and a 3D neutral face template, our CodeTalker can produce vivid and realistic 3D facial motions with subtle expressions and accurate lip movements.

Changelog

<!-- ## **TODO** - [ ] Provide online demo. -->

Environment

Other necessary packages:

pip install -r requirements.txt

IMPORTANT: Please make sure to modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line to allow for replicated padding for ConvTranspose1d as shown here.

Dataset Preparation

VOCASET

Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl and subj_seq_to_idx.pkl in the folder vocaset/. Download "FLAME_sample.ply" from voca and put it in vocaset/. Read the vertices/audio data and convert them to .npy/.wav files stored in vocaset/vertices_npy and vocaset/wav:

cd vocaset
python process_voca_data.py

BIWI

Follow the BIWI/README.md to preprocess BIWI dataset and put .npy/.wav files into BIWI/vertices_npy and BIWI/wav, and the templates.pkl into BIWI/.

Demo

Download the pretrained models from biwi_stage1.pth.tar & biwi_stage2.pth.tar and vocaset_stage1.pth.tar & vocaset_stage2.pth.tar. Put the pretrained models under BIWI and VOCASET folders, respectively. Given the audio signal,

Training / Testing

The training/testing operation shares a similar command:

sh scripts/<train.sh|test.sh> <exp_name> config/<vocaset|BIWI>/<stage1|stage2>.yaml <vocaset|BIWI> <s1|s2>

Please replace <exp_name> with your own experiment name, <vocaset|BIWI> by the name of your target dataset, i.e., vocaset or BIWI. Change the exp_dir in both scripts/train.sh and scripts/test.sh if needed. We just take an example for default commands below.

Training for Discrete Motion Prior

sh scripts/train.sh CodeTalker_s1 config/vocaset/stage1.yaml vocaset s1

Training for Speech-Driven Motion Synthesis

Make sure the paths of pre-trained models are correct, i.e., vqvae_pretrained_path and wav2vec2model_path in config/<vocaset|BIWI>/stage2.yaml.

sh scripts/train.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2

Testing

sh scripts/test.sh CodeTalker_s2 config/vocaset/stage2.yaml vocaset s2

Visualization with Audio

Modify the paths in scripts/render.sh and run:

sh scripts/render.sh

Evaluation on BIWI

We provide the reference code for Lip Vertex Error & Upper-face Dynamics Deviation. Remember to change the paths in scripts/cal_metric.sh, and run:

sh scripts/cal_metric.sh

Play with Your Own Data

Data Preparation

Training, Testing & Visualization

Citation

If you find the code useful for your work, please star this repo and consider citing:

@inproceedings{xing2023codetalker,
  title={Codetalker: Speech-driven 3d facial animation with discrete motion prior},
  author={Xing, Jinbo and Xia, Menghan and Zhang, Yuechen and Cun, Xiaodong and Wang, Jue and Wong, Tien-Tsin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={12780--12790},
  year={2023}
}

Notes

  1. Although our codebase allows for training with multi-GPUs, we did not test it and just hardcode the training batch size as one. You may need to change the data_loader if needed.

Acknowledgement

We heavily borrow the code from FaceFormer, Learn2Listen, and VOCA. Thanks for sharing their code and huggingface-transformers for their wav2vec2 implementation. We also gratefully acknowledge the ETHZ-CVL for providing the B3D(AC)2 dataset and MPI-IS for releasing the VOCASET dataset. Any third-party packages are owned by their respective authors and must be used under their respective licenses.

Related Work