Awesome Audio-driven Talking Face Generation

2D Encoder-Decoder Based

Landmark Based

3D Model Based

Survey

What comprises a good talking-head video generation?: A Survey and Benchmark [Lele Chen 2020] paper

Deep Audio-Visual Learning: A Survey [Hao Zhu 2020] paper

Handbook of Digital Face Manipulation and Detection [Yuxin Wang 2022] paper

Deep Learning for Visual Speech Analysis: A Survey paper

Datasets

Metrics

| Metrics | Paper |
| --- | --- |
| PSNR (peak signal-to-noise ratio) | - |
| SSIM (structural similarity index measure) | Image quality assessment: from error visibility to structural similarity |
| CPBD (cumulative probability of blur detection) | A no-reference image blur metric based on the cumulative probability of blur detection |
| LPIPS (Learned Perceptual Image Patch Similarity) | The Unreasonable Effectiveness of Deep Features as a Perceptual Metric |
| NIQE (Natural Image Quality Evaluator) | Making a 'Completely Blind' Image Quality Analyzer |
| FID (Fréchet inception distance) | GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium |
| LMD (landmark distance error) | Lip Movements Generation at a Glance |
| LRA (lip-reading accuracy) | Talking Face Generation by Conditional Recurrent Adversarial Network |
| WER (word error rate) | LipNet: End-to-End Sentence-level Lipreading |
| LSE-D (Lip Sync Error - Distance) | Out of time: automated lip sync in the wild |
| LSE-C (Lip Sync Error - Confidence) | Out of time: automated lip sync in the wild |
| ACD (average content distance) | FaceNet: A Unified Embedding for Face Recognition and Clustering |
| CSIM (cosine similarity) | ArcFace: Additive Angular Margin Loss for Deep Face Recognition |
| EAR (eye aspect ratio) | Real-Time Eye Blink Detection using Facial Landmarks. In: Computer Vision Winter Workshop |
| ESD (emotion similarity distance) | What comprises a good talking-head video generation?: A Survey and Benchmark |
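
Several of the simpler metrics listed above reduce to short formulas once their inputs are available. The following is a minimal sketch, not a reference implementation from any of the listed papers. It assumes pixel values, identity embeddings, landmark coordinates, and transcripts have already been produced by other tools (for example a face-recognition network for CSIM or a lip-reading model for WER). The function names `psnr`, `cosine_similarity`, `landmark_distance`, and `word_error_rate` are hypothetical and chosen only for illustration.

```python
import math


def psnr(ref, test, max_value=255.0):
    """PSNR in dB between two images given as equal-length flat pixel sequences."""
    if len(ref) != len(test) or len(ref) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (the core of CSIM)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def landmark_distance(pred, target):
    """Mean Euclidean distance between corresponding (x, y) landmarks (LMD-style)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("landmark lists must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

In practice these values are usually averaged over all frames or utterances of a test set, and LMD is often restricted to the mouth landmarks. Learned metrics such as LPIPS, FID, LSE-D, and LSE-C depend on pretrained networks and are not covered by this sketch.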