Awesome Audio-driven Talking Face Generation

2D Encoder-Decoder Based

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN [F Yin 2022] [arXiv] demo project page
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [Hang Zhou 2021] [CVPR] demo project page
Talking Head Generation with Audio and Speech Related Facial Action Units [S Chen 2021] [BMVC]
Speech Driven Talking Face Generation from a Single Image and an Emotion Condition [SE Eskimez 2021] [arXiv] project page
HeadGAN: Video-and-Audio-Driven Talking Head Synthesis [MC Doukas 2021] [arXiv] demo project page
Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning [Hao Zhu 2020] [IJCAI]
A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild [K R Prajwal 2020] [ACMMM] demo project page
Talking Face Generation with Expression-Tailored Generative Adversarial Network [D Zeng 2020] [ACMMM]
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [KR Prajwal 2020] [CVPR] demo project page
Robust One Shot Audio to Video Generation [N Kumar 2020] [CVPRW] demo project page
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation [Hang Zhou 2019] [AAAI] demo project page
Talking face generation by conditional recurrent adversarial network [Yang Song 2019] [IJCAI] demo project page
Realistic Speech-Driven Facial Animation with GANs [Konstantinos Vougioukas 2019] [IJCV] demo project page
Animating Face using Disentangled Audio Representations [G Mittal 2019] [WACV]
Lip Movements Generation at a Glance [Lele Chen 2018] [ECCV] demo project page
X2Face: A network for controlling face generation using images, audio, and pose codes [Olivia Wiles 2018] [ECCV] demo project page
Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network [HX Pham 2018] [arXiv] demo
You said that? [Chung 2017] [BMVC] demo project page

Landmark Based

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation [YUANXUN LU 2021] [SIGGRAPH] demo project page
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis [H Wu 2021] [ACMMM] demo project page
MakeItTalk: Speaker-Aware Talking-Head Animation [YANG ZHOU 2020] [SIGGRAPH] demo project page
Speech-driven Facial Animation using Cascaded GANs for Learning of Motion and Texture [Dipanjan Das, Sandika Biswas 2020] [ECCV]
A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors [R Zheng 2020] [ICPR]
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss [Lele Chen 2019] [CVPR] demo project page
Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks [SA Jalalifar 2018] [arXiv]
Synthesizing Obama: learning lip sync from audio [SUPASORN SUWAJANAKORN 2017] [SIGGRAPH] demo

3D Model Based

Everybody’s Talkin’: Let Me Talk as You Want [Linsen Song 2022] [TIFS] demo
One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [Suzhen Wang 2022] [AAAI] demo project page
FaceFormer: Speech-Driven 3D Facial Animation with Transformers [Y Fan 2022] [CVPR] demo project page
Iterative Text-based Editing of Talking-heads Using Neural Retargeting [Xinwei Yao 2021] [ICML] demo
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [Yudong Guo 2021] [ICCV] demo project page
Audio-driven emotional video portraits [X Ji 2021] [CVPR] demo project page
FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning [C Zhang 2021] [ICCV] demo project page
Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset [Z Zhang 2021] [CVPR] demo project page
Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion [Suzhen Wang 2021] [IJCAI] demo project page
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [A Richard 2021] [ICCV] demo project page
3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head [Q Wang 2021] [arXiv]
Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [L Li 2021] [AAAI] demo project page
Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary [S Zhang 2021] [ICASSP] demo project page
Neural Voice Puppetry: Audio-driven Facial Reenactment [Justus Thies 2020] [ECCV] demo project page
Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [Ran Yi 2020] [arXiv] project page
Talking-head Generation with Rhythmic Head Motion [Lele Chen 2020] [ECCV] demo project page
Modality Dropout for Improved Performance-driven Talking Faces [Hussen Abdelaziz 2020] [ICMI]
Audio- and Gaze-driven Facial Animation of Codec Avatars [A Richard 2020] [arXiv] demo project page
Text-based editing of talking-head video [OHAD FRIED 2019] [arXiv] demo
Capture, Learning, and Synthesis of 3D Speaking Styles [D Cudeiro 2019] [CVPR] demo project page
Visemenet: audio-driven animator-centric speech animation [YANG ZHOU 2018] [TOG] demo
Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks [N Sadoughi 2018] [TAC]
Speech-driven 3D Facial Animation with Implicit Emotional Awareness: A Deep Learning Approach [Hai X. Pham 2017] [IEEE Trans. Syst. Man Cybern.: Syst.]
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion [TERO KARRAS 2017] [TOG] demo project page
A deep learning approach for generalized speech animation [SARAH TAYLOR 2017] [SIGGRAPH] demo
End-to-end Learning for 3D Facial Animation from Speech [HX Pham 2017] [ICMI]
JALI: An Animator-Centric Viseme Model for Expressive Lip Synchronization [Pif Edwards 2016] [SIGGRAPH] demo

Survey

What comprises a good talking-head video generation?: A Survey and Benchmark [Lele Chen 2020] paper

Deep Audio-Visual Learning: A Survey [Hao Zhu 2020] paper

Handbook of Digital Face Manipulation and Detection [Yuxin Wang 2022] paper

Deep Learning for Visual Speech Analysis: A Survey paper

Datasets

GRID 2006 project page
TCD-TIMIT 2015 project page
LRW 2016 project page
MODALITY 2017 project page
ObamaSet 2017
VoxCeleb1 2017 project page
VoxCeleb2 2018 project page
LRS2-BBC 2018 project page
LRS3-TED 2018 project page
HDTF 2020 project page
CREMA-D 2014 project page
MSP-IMPROV 2016 project page
RAVDESS 2018 project page
MELD 2018 project page
MEAD 2020 project page
CAVSR1.0 1998
HIT Bi-CAV 2005
LRW-1000 2018 project page

Metrics

Metric	Paper
PSNR (peak signal-to-noise ratio)	-
SSIM (structural similarity index measure)	Image quality assessment: from error visibility to structural similarity
CPBD (cumulative probability of blur detection)	A no-reference image blur metric based on the cumulative probability of blur detection
LPIPS (Learned Perceptual Image Patch Similarity)	The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
NIQE (Natural Image Quality Evaluator)	Making a ‘Completely Blind’ Image Quality Analyzer
FID (Fréchet inception distance)	GANs trained by a two time-scale update rule converge to a local Nash equilibrium
LMD (landmark distance error)	Lip Movements Generation at a Glance
LRA (lip-reading accuracy)	Talking Face Generation by Conditional Recurrent Adversarial Network
WER (word error rate)	LipNet: end-to-end sentence-level lipreading
LSE-D (Lip Sync Error - Distance)	Out of time: automated lip sync in the wild
LSE-C (Lip Sync Error - Confidence)	Out of time: automated lip sync in the wild
ACD (average content distance)	FaceNet: a unified embedding for face recognition and clustering
CSIM (cosine similarity)	ArcFace: additive angular margin loss for deep face recognition
EAR (eye aspect ratio)	Real-time eye blink detection using facial landmarks (Computer Vision Winter Workshop)
ESD (emotion similarity distance)	What comprises a good talking-head video generation?: A Survey and Benchmark
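
A few of the simpler metrics above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the evaluation code of any listed paper: the function names are made up for this example, images are assumed to be flat sequences of pixel values, landmarks are assumed to be (x, y) pairs already aligned between the reference and generated frames, and CSIM is assumed to be computed on identity embeddings produced elsewhere (for instance by a face recognition network). The EAR formula follows the common six-point eye landmark convention.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def landmark_distance(reference, generated):
    """LMD-style score: mean Euclidean distance between matching (x, y) landmarks."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("landmark sets must be non-empty and have the same size")
    total = sum(math.dist(r, g) for r, g in zip(reference, generated))
    return total / len(reference)


def cosine_similarity(a, b):
    """CSIM-style score: cosine similarity between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR from six eye landmarks: p1/p4 are the corners, p2/p3 upper lid, p6/p5 lower lid."""
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners must not coincide")
    return vertical / (2.0 * horizontal)
```

In practice these scores are usually averaged over all frames of a video, and LMD is often restricted to the mouth landmarks; both choices are left to the caller here.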