Home

Awesome

FaceFormer

PyTorch implementation for the paper:

FaceFormer: Speech-Driven 3D Facial Animation with Transformers, CVPR 2022.

Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura

[Paper] [Project Page]

<p align="center"> <img src="framework.jpg" width="70%" /> </p>

Given the raw audio input and a neutral 3D face mesh, our proposed end-to-end Transformer-based architecture, FaceFormer, can autoregressively synthesize a sequence of realistic 3D facial motions with accurate lip movements.

Environment

Dependencies

Data

VOCASET

Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl and subj_seq_to_idx.pkl in the folder VOCASET. Download "FLAME_sample.ply" from voca and put it in VOCASET/templates.

BIWI

Request the BIWI dataset from Biwi 3D Audiovisual Corpus of Affective Communication. The dataset contains the following subfolders:

Place the folders 'faces' and 'rigid_scans' in BIWI and place the wav files in BIWI/wav.

Demo

Download the pretrained models from biwi.pth and vocaset.pth. Put the pretrained models under BIWI and VOCASET folders, respectively. Given the audio signal,

Training and Testing on VOCASET

Data Preparation

Training and Testing

Visualization

Training and Testing on BIWI

Data Preparation

Training and Testing

Visualization

Using Your Own Dataset

Data Preparation

Training and Testing

Visualization

Citation

If you find this code useful for your work, please consider citing:

@inproceedings{faceformer2022,
title={FaceFormer: Speech-Driven 3D Facial Animation with Transformers},
author={Fan, Yingruo and Lin, Zhaojiang and Saito, Jun and Wang, Wenping and Komura, Taku},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}

Acknowledgement

We gratefully acknowledge ETHZ-CVL for providing the B3D(AC)2 database and MPI-IS for releasing the VOCASET dataset. The implementation of wav2vec2 is built upon huggingface-transformers, and the temporal bias is modified from ALiBi. We use MPI-IS/mesh for mesh processing and VOCA/rendering for rendering. We thank the authors for their excellent works. Any third-party packages are owned by their respective authors and must be used under their respective licenses.