A Survey on Facial Expression Recognition of Static and Dynamic Emotions

This is the official repository for our paper entitled “A Survey on Facial Expression Recognition of Static and Dynamic Emotions”.

Yan Wang <sup>1</sup>, Shaoqi Yan <sup>1</sup>, Yang Liu <sup>1</sup>, Wei Song <sup>3</sup>, Jing Liu <sup>1</sup>, Yang Chang <sup>1</sup>, Xinji Mai <sup>1</sup>, Xiping Hu <sup>4</sup>, Wenqiang Zhang <sup>1, 2</sup>, Zhongxue Gan <sup>1</sup>

<sup>1</sup> Academy for Engineering & Technology, Fudan University; <sup>2</sup> School of Computer Science, Fudan University; <sup>3</sup> Shanghai Ocean University; <sup>4</sup> Beijing Institute of Technology

Taxonomy Overview

images/taxonomy_overview.png

Taxonomy of FER for static and dynamic emotions. We present a hierarchical taxonomy that categorizes existing FER models by input type, task challenges, and network structures within a systematic framework, aiming to provide a comprehensive overview of the current FER research landscape. First, we introduce datasets, metrics, and the general workflow, with the related literature and code collected in this public GitHub repository. Then, image-based SFER and video-based DFER methods overcome different task challenges using various learning strategies and model designs. Next, we analyze recent advances of FER on benchmark datasets. Finally, we discuss open issues and potential trends in FER, highlighting directions for future development.

Comparisons with SOTA FER-related reviews

images/comparisons.png

Datasets

images/Datasets.jpg

Image-based static facial frames (Above) and video-based dynamic facial sequences (Below) of seven basic emotions in the lab and wild. Samples are from (a) JAFFE, (b) CK+, (c) SFEW, (d) ExpW, (e) RAF-DB, (f) AffectNet, (g) EmotioNet, (h) CK+, (i) Oulu-CASIA, (j) DFEW, (k) FERV39k, and (l) MAFW.

| Modality | Scene | Datasets | Year | ECT | Emotion | Training Numbers | Testing Numbers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Image-based SFER Datasets | Lab | JAFFE | 1998 | P | Sev | 213 | 213 |
| | | CK+ | 2010 | P/I | Sev | 241 | 241 |
| | | MMI | 2010 | P | Sev | 370 | 370 |
| | | Oulu-CASIA | 2011 | P | Sev | 720 | 240 |
| | | RaFD | 2010 | P | Sev, C | 1,448 | 160 |
| | Wild | FER-2013 | 2013 | P/I | Sev | 28,709 | 3,589 |
| | | SFEW 2.0 | 2011 | P/I | Sev | 958 | 436 |
| | | EmotioNet | 2016 | P/I | Sev, C | 80,000 | 20,000 |
| | | RAF-DB | 2017 | P/I | Sev, Com | 12,271 | 3,068 |
| | | AffectNet | 2017 | P/I | Sev, Con. | 283,901 | 3,500 |
| | | ExpW | 2017 | P/I | Sev | 75,048 | 16,745 |
| | Lab (3D) | BU-3DFE | 2006 | P | Sev | 2,000 | 500 |
| | | Bosphorus | 2008 | P | Sev | 2,326 | 2,326 |
| | | 4DFAB | 2018 | P/I | Sev | 1,440k | 360k |
| Video-based DFER Datasets | Lab | CK+ | 2010 | P/I | Sev | 241 | 241 |
| | | MMI | 2010 | P/I | Sev | 1,450 | 1,450 |
| | | Oulu-CASIA | 2011 | P | Six | 2,160 | 720 |
| | Wild | AFEW 8.0 | 2011 | P/I | Sev | 773 | 383 |
| | | CAER | 2019 | P/I | Sev | 9,240 | 2,640 |
| | | DFEW | 2020 | P/I | Sev | 12,000 | 3,000 |
| | | FERV39k | 2022 | P/I | Sev | 35,887 | 3,000 |
| | | MAFW | 2022 | P/I | Sev, C, A, D, H, Com | 8,036 | 2,009 |

Summary of the in-the-lab or in-the-wild datasets with static and dynamic emotions for FER training and evaluation. ECT: Elicitation; P: Posed; I: Instinctive; Sev: Seven Emotions (Happy, Angry, Surprise, Fear, Sad, Disgust, Neutral); C: Contempt; A: Anxiety; D: Disappointment; H: Helplessness; Com: Compound.

Workflow of Generic Facial Expression Recognition

images/Workflow.jpg

The workflow and main components of generic facial expression recognition.

Image-based Static FER

Image-based static facial expression recognition (SFER) involves extracting features from a single image, capturing complex spatial information related to facial expressions, such as landmarks and their geometric structures and relationships. In the following, we first introduce the general architecture of SFER and then elaborate on the specific designs of SFER methods from a challenge-solving perspective, including disturbance-invariant SFER, 3D SFER, uncertainty-aware SFER, compound SFER, cross-domain SFER, weak-supervised SFER, and cross-modal SFER.
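
To make the single-image pipeline concrete, the following is a minimal, illustrative PyTorch sketch rather than the code of any surveyed method; the `StaticFER` class name, the ResNet-18 backbone, and the 224×224 input size are assumptions chosen only for illustration.

```python
# Hypothetical minimal SFER pipeline: a shared CNN backbone extracts spatial
# features from one aligned face crop and a linear head predicts one of the
# seven basic emotions.
import torch
import torch.nn as nn
from torchvision import models

EMOTIONS = ["happy", "angry", "surprise", "fear", "sad", "disgust", "neutral"]

class StaticFER(nn.Module):
    def __init__(self, num_classes: int = len(EMOTIONS)):
        super().__init__()
        backbone = models.resnet18(weights=None)   # spatial feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(512, num_classes)    # emotion classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 224, 224) batch of aligned face crops
        return self.head(self.backbone(x))

model = StaticFER().eval()
face = torch.rand(1, 3, 224, 224)                  # stand-in for a preprocessed face image
probs = model(face).softmax(dim=-1)
print(EMOTIONS[probs.argmax(dim=-1).item()])
```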

General SFER

images/gernearl_based_SFER_00.jpg

The architecture of general SFER. Figure is reproduced based on (a) CNN-based model, (b) GCN-based model, and (c) Transformer-based model.

Disturbance-invariant SFER

images/Disturbance_invariant_SFER_00.jpg

The architecture of disturbance-invariant SFER. Figure is reproduced based on (a) Attention-based model (AMP-Net) and (b) Decomposition-based model.

3D SFER

images/3D_SFER_00.jpg

The architecture of 3D SFER. Figure is reproduced based on (a) GAN-based learning (GAN-Int) and (b) Multi-view learning (MV-CNN).

Uncertainty-aware SFER

images/Uncertainy_aware_SFER_00.jpg

The architecture of uncertainty-aware SFER. Figure is reproduced based on (a) the label uncertainty learning (LA-Net) and (b) data uncertainty learning (LNSU-Net).

Compound SFER

Compound emotions refer to complex emotional states formed by combining at least two basic emotions; they are not independent, discrete categories but lie on a continuous emotional spectrum composed of multiple dimensions. Compared with discrete "basic" emotions or a few dimensions, compound emotions provide a more accurate representation of the diversity and continuity of complex human emotions.
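
As a purely illustrative sketch (an assumption, not a specification of any dataset), one way to think about a compound label space is to pair basic emotions, in the spirit of compound categories such as "happily surprised"; real datasets keep only the pairs that actually occur.

```python
# Hypothetical compound-emotion label space built from pairs of basic emotions.
from itertools import combinations

BASIC = ["happy", "angry", "surprise", "fear", "sad", "disgust"]

# Every unordered pair of distinct basic emotions becomes one candidate compound
# class; compound datasets retain only the combinations observed in practice.
COMPOUND = [f"{a}+{b}" for a, b in combinations(BASIC, 2)]

print(len(COMPOUND))   # 15 candidate pairwise classes from 6 basic emotions
print(COMPOUND[:3])    # ['happy+angry', 'happy+surprise', 'happy+fear']
```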

Cross-domain SFER

images/crossdomain_sfer.jpg

The architecture of cross-domain SFER. Figure is reproduced based on (a) the transfer learning-based model (CSRL) and (b) the adaption learning-based model (AGRA).

Weak-supervised SFER

images/Weak-supervised.jpg

The architecture of weak-supervised SFER. Figure is reproduced based on the Ada-CM.

Cross-modal SFER

images/Cross-modal.png

The architecture of cross-modal SFER. Figure is reproduced based on CEPrompt.

Video-based Dynamic Facial Expression Recognition

Video-based DFER involves analyzing facial expressions that change over time, necessitating a framework that effectively integrates spatial and temporal information. The core objective of DFER is to extract and learn the features of expression changes from video or image sequences. Due to the complexity and diversity of input sequences, DFER faces various task challenges. Based on the different solution approaches, these challenges can be categorized into seven basic types: general DFER, sampling-based DFER, expression intensity-aware DFER, multi-modal DFER, static to dynamic FER, self-supervised DFER, and visual-language DFER.
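
As a minimal, illustrative sketch of the generic "spatial encoder + temporal module" pattern (an assumption for illustration, not a surveyed model), the hypothetical `DynamicFER` below extracts per-frame features with a shared ResNet-18 and aggregates them over time with a GRU.

```python
# Hypothetical CNN-RNN style DFER model: per-frame spatial features followed by
# temporal aggregation over the sampled clip.
import torch
import torch.nn as nn
from torchvision import models

class DynamicFER(nn.Module):
    def __init__(self, num_classes: int = 7, hidden: int = 256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()                     # 512-d spatial feature per frame
        self.spatial = cnn
        self.temporal = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W), a sampled sequence of aligned face frames
        b, t = clip.shape[:2]
        feats = self.spatial(clip.flatten(0, 1))   # (B*T, 512)
        feats = feats.view(b, t, -1)               # (B, T, 512)
        _, last = self.temporal(feats)             # final hidden state: (1, B, hidden)
        return self.head(last.squeeze(0))          # (B, num_classes)

logits = DynamicFER()(torch.rand(2, 16, 3, 112, 112))
print(logits.shape)   # torch.Size([2, 7])
```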

General DFER

images/generaldfer.png

The architecture of general DFER. Figure is reproduced based on (a) CNN-RNN based model (SAANet) and (b) the transformer-based model (EST).

Sampling-based DFER

images/fig9-Sampling-based_dfer_00.jpg

The architecture of sampling-based DFER. Figure is reproduced based on explainable sampling (Freq-HD).

Expression Intensity-aware DFER

Facial expressions are inherently dynamic, with intensity either gradually shifting from neutral to peak and back or abruptly transitioning from peak to neutral, making the accurate capture of these fluctuations essential for understanding expression dynamics.
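
One simple way to expose such intensity fluctuations, sketched below purely as an assumption (not a method from the survey), is to score each frame by how far its feature lies from a neutral reference and to weight frames by that score when pooling, so peak frames contribute more than near-neutral ones.

```python
# Hypothetical intensity-aware pooling: frame-to-neutral feature distance as an
# intensity proxy, used to weight the temporal average.
import torch

def intensity_weighted_pooling(frame_feats: torch.Tensor,
                               neutral_feat: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, D) per-frame features; neutral_feat: (D,) neutral reference."""
    intensity = (frame_feats - neutral_feat).norm(dim=-1)      # (T,) distance as intensity proxy
    weights = torch.softmax(intensity, dim=0)                   # emphasize high-intensity frames
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)     # (D,) intensity-aware clip feature

feats = torch.rand(16, 512)                            # e.g. 16 sampled frames, 512-d features
pooled = intensity_weighted_pooling(feats, feats[0])   # assume the first frame is near-neutral
print(pooled.shape)                                    # torch.Size([512])
```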

Static to Dynamic FER

Static-to-dynamic FER leverages the knowledge of high-performance SFER models to explore appearance features and dynamic dependencies.
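
A minimal sketch of this idea, assuming a ResNet-18 per-frame encoder (the checkpoint handling and freezing strategy are illustrative assumptions, not the procedure of any specific surveyed method): reuse the weights of an image-based SFER backbone to initialize the per-frame encoder of a video model, then train only the temporal part on dynamic data.

```python
# Hypothetical static-to-dynamic transfer: initialize a video model's per-frame
# encoder from weights learned on a static FER dataset.
import torch
import torch.nn as nn
from torchvision import models

# Per-frame encoder of the video model, same architecture as the static backbone.
frame_encoder = models.resnet18(weights=None)
frame_encoder.fc = nn.Identity()

# Stand-in for a high-performance SFER checkpoint (in practice, weights trained
# on a static dataset such as RAF-DB or AffectNet and loaded from disk).
static_sfer = models.resnet18(weights=None)
static_state = static_sfer.state_dict()

# Transfer the static appearance knowledge; the 'fc.*' keys are reported as
# unexpected only because the video encoder replaced the classification layer.
missing, unexpected = frame_encoder.load_state_dict(static_state, strict=False)
print(unexpected)   # ['fc.weight', 'fc.bias']

# Freeze the transferred encoder so that only a temporal module (e.g. a GRU or
# temporal transformer stacked on top) is learned from the dynamic data.
for p in frame_encoder.parameters():
    p.requires_grad = False
```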

Multi-modal DFER

images/multi_modal_fusion_dfer.png

The architecture of multi-modal DFER. Figure is reproduced based on the fusion-based model (T-MEP).

Self-supervised DFER

images/Self_supervised_DFER_00.jpg

The architecture of self-supervised DFER. Figure is reproduced based on the MAE-DFER.

Visual-Language DFER

images/DFER_CLIP.png

The architecture of vision-language DFER. Figure is reproduced based on DFER-CLIP.

Recent Advances of FER on Benchmark Datasets

Performance (WAR) of image-based SFER and video-based DFER methods on three in-the-lab datasets:

| Method | Year | Type | Backbone | MMI | CK+ | Oulu-CASIA |
| --- | --- | --- | --- | --- | --- | --- |
| IL-VGG | 2018 | Static | VGG-16 | 74.68 | 91.64 | 84.58 |
| FMPN | 2019 | Static | CNNs | 82.74 | 98.60 | - |
| LDL-ALSG | 2020 | Static | ResNet-50 | 70.03 | 93.08 | 63.94 |
| IE-DBN | 2021 | Static | VGG-16 | - | 96.02 | 85.21 |
| im-cGAN | 2023 | Static | GAN | - | 98.10 | 93.34 |
| Mul-DML | 2024 | Static | ResNet-18 | 81.57 | 98.47 | - |
| STC-NLSTM | 2018 | Dynamic | 3DCNN | 84.53 | 99.80 | 93.45 |
| SAANet | 2020 | Dynamic | VGG-16 | - | 97.38 | 82.41 |
| MGLN | 2020 | Dynamic | VGG-16 | - | 98.77 | 90.40 |
| MSDmodel | 2021 | Dynamic | CNN | 89.99 | 99.10 | 87.33 |
| DPCNet | 2022 | Dynamic | CNN | - | 99.70 | - |
| STACM | 2023 | Dynamic | CNN | 82.71 | 99.08 | 91.25 |

Performance (WAR) of image-based SFER methods on three in-the-wild datasets:

| Task Challenges | Method | Year | Backbone | SFEW | RAF-DB | AffectNet |
| --- | --- | --- | --- | --- | --- | --- |
| General SFER | IFSL | 2020 | VGG16 | 46.50 | 76.90 | - |
| | OAENet | 2021 | VGG16 | - | 86.50 | 58.70 |
| | MA-Net | 2021 | ResNet18 | - | 88.40 | 64.53 |
| | D³Net | 2021 | ResNet18 | 62.16 | 88.79 | - |
| | TransFER | 2021 | ResNet50 | - | 90.91 | 66.23 |
| | VTFF | 2023 | Transformer | - | 88.14 | 61.85 |
| | HASs | 2023 | ResNet50 | 65.14 | 91.04 | - |
| | APViT | 2023 | Transformer | 61.92 | 91.98 | 66.91 |
| | POSTER | 2023 | CNN-IR50 | - | 92.05 | 67.31 |
| | MGR³Net | 2024 | ResNet50 | - | 91.05 | 66.36 |
| Disturbance-invariant SFER | PG-Unit | 2018 | VGG16 | - | 83.27 | 55.33 |
| | IDFL | 2021 | ResNet50 | - | 86.96 | 59.20 |
| | FDRL | 2021 | ResNet18 | 62.16 | 89.47 | - |
| | AMP-Net | 2022 | ResNet50 | - | 88.06 | 63.23 |
| | PACVT | 2023 | ResNet18 | - | 88.21 | 60.68 |
| | IPD-FER | 2023 | ResNet18 | 58.43 | 88.89 | - |
| | Latent-OFER | 2023 | ResNet18 | - | 89.60 | - |
| | RAC+RSL | 2023 | ResNet18 | - | 89.77 | 62.16 |
| Uncertainty-aware SFER | SCN | 2020 | ResNet18 | - | 87.03 | 60.23 |
| | DMUE | 2021 | ResNet18 | 57.12 | 88.76 | 62.84 |
| | RUL | 2021 | ResNet18 | - | 88.98 | - |
| | EASE | 2022 | VGG16 | 60.12 | 89.56 | 61.82 |
| | EAC | 2022 | ResNet18 | - | 89.99 | 65.32 |
| | LA-Net | 2023 | ResNet18 | - | 91.56 | 64.54 |
| | LNSU-Net | 2024 | ResNet18 | - | 89.77 | 65.73 |
| Weak-supervised SFER | Ada-CM | 2022 | ResNet18 | 52.43 | 84.42 | 57.42 |
| | E2E-WS | 2022 | ResNet18 | 54.56 | 88.89 | 60.04 |
| | DR-FER | 2023 | ResNet50 | - | 90.53 | 66.85 |
| | WSCFER | 2023 | IResNet | - | 91.72 | 67.71 |
| Cross-modal SFER | CLEF | 2023 | CLIP | - | 90.09 | 65.66 |
| | VTA-Net | 2024 | ResNet-18 | - | 72.17 | - |
| | CEPrompt | 2024 | ViT-B/16 | - | 92.43 | 67.29 |

Performance (Accuracy) of 3D SFER methods on BU-3DFE and Bosphorus datasets:

| Method | Year | Backbone | Modality | BU-3DFE | Bosphorus |
| --- | --- | --- | --- | --- | --- |
| JPE-GAN | 2018 | CNN | 2D/- | 81.20/- | -/- |
| DA-CNN | 2019 | ResNet50 | -/3D | -/87.69 | -/- |
| GAN-Int | 2021 | VGGNet16 | 2D+3D/3D | 88.47/83.20 | -/- |
| FFNet-M | 2021 | VGGNet16 | 2D+3D/3D | 89.82/87.28 | 87.65/82.86 |
| CMANet | 2022 | VGGNet16 | 2D+3D/3D | 90.24/84.03 | 89.36/81.25 |
| DrFER | 2024 | ResNet18 | -/3D | -/89.15 | -/86.77 |

Performance (WAR) of cross-domain SFER methods on four widely-used datasets:

| Method | Year | Backbone | Source Dataset | JAFFE | CK+ | FER-2013 | AffectNet |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ECAN | 2022 | ResNet50 | RAF-DB | 57.28 | 79.77 | 56.46 | - |
| AGRA | 2022 | ResNet50 | RAF-DB | 61.5 | 85.27 | 58.95 | - |
| PASM | 2022 | VGGNet16 | RAF-DB | - | 79.65 | 54.78 | - |
| CWCST | 2023 | VGGNet16 | RAF-DB 2.0 | 69.01 | 89.64 | 57.44 | 52.66 |
| DMSRL | 2023 | VGGNet16 | RAF-DB 2.0 | 69.48 | 91.26 | 56.16 | 50.94 |
| CSRL | 2023 | ResNet18 | RAF-DB | 66.67 | 88.37 | 55.53 | - |

Performance (WAR/UAR) of video-based DFER methods on four widely-used datasets. TI: Time Interpolation; DS: Dynamic Sampling; GWS: Group-weighted Sampling. *: Tunable Param (M):

| Task Challenges | Method | Year | Sample Strategies | Backbone | Complexity (GFLOPs) | AFEW (WAR/UAR) | DFEW (WAR/UAR) | FERV39k (WAR/UAR) | MAFW (WAR/UAR) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General DFER | TFEN | 2021 | TI | ResNet-18 | - | - | 56.60/45.57 | - | - |
| | FormerDFER | 2021 | DS | Transformer | 9.1G | 50.92/47.42 | 65.70/53.69 | - | 43.27/31.16 |
| | EST | 2023 | DS | ResNet-18 | N/A | 54.26/49.57 | 65.85/53.94 | - | - |
| | LOGO-Former | 2023 | DS | ResNet-18 | 10.27G | - | 66.98/54.21 | 48.13/38.22 | - |
| | MSCM | 2023 | DS | ResNet-18 | 8.11G | 56.40/52.30 | 70.16/58.49 | - | - |
| | SFT | 2024 | DS | ResNet-18 | 17.52G | 55.00/50.14 | - | 47.80/35.16 | 47.44/33.39 |
| | CDGT | 2024 | DS | Transformer | 8.3G | 55.68/51.57 | 70.07/59.16 | 50.80/41.34 | - |
| | LSGTNet | 2024 | DS | ResNet-18 | - | - | 72.34/61.33 | 51.31/41.30 | - |
| Sampling-based DFER | EC-STFL | 2020 | TI | ResNet-18 | 8.32G | 53.26/- | 54.72/43.60 | - | - |
| | DPCNet | 2022 | GWS | ResNet-50 | 9.52G | 51.67/47.86 | 66.32/57.11 | - | - |
| | FreqHD | 2023 | FreqHD | ResNet-18 | - | - | 54.98/44.24 | 43.93/32.24 | - |
| | M3DFEL | 2023 | DS | R3D18 | 1.66G | - | 69.25/56.10 | 47.67/35.94 | - |
| Expression Intensity-aware DFER | CEFL-Net | 2022 | Clip-based | ResNet-18 | - | 53.98/- | 65.35/- | - | - |
| | NR-DFERnet | 2023 | DS | ResNet-18 | 6.33G | 53.54/48.37 | 68.19/54.21 | - | - |
| | GCA+IAL | 2023 | DS | ResNet-18 | 9.63G | - | 69.24/55.71 | 48.54/35.82 | - |
| Static to Dynamic FER | S2D | 2023 | DS | ViT-B/16 | - | - | 76.03/61.82 | 52.56/41.28 | 57.37/41.86 |
| | AEN | 2023 | DS | Transformer | - | 54.64/50.88 | 69.37/56.66 | 47.88/38.18 | - |
| Multi-modal DFER | T-ESFL | 2022 | DS | Transformer | - | - | - | - | 48.18/33.28 |
| | T-MEP | 2023 | DS | - | 6G | 52.96/50.22 | 68.85/57.16 | - | 52.85/39.37 |
| | OUS | 2024 | DS | CLIP | - | 52.96/50.22 | 68.85/57.16 | - | 52.85/39.37 |
| | MMA-DFER | 2024 | DS | Transformer | - | - | 77.51/67.01 | - | 58.52/44.11 |
| Self-supervised DFER | MAE-DFER | 2023 | DS | ResNet-18 | 50G | - | 74.43/63.41 | 52.07/43.12 | 54.31/41.62 |
| | HiCMAE | 2024 | DS | ResNet-18 | 32G | - | 73.10/61.92 | - | 54.84/42.10 |
| Visual-Language DFER | CLIPER | 2023 | DS | CLIP-ViT-B/16 | 88M* | 56.43/52.00 | 70.84/57.56 | 51.34/41.23 | - |
| | DFER-CLIP | 2023 | DS | CLIP-ViT-B/32 | 92G | - | 71.25/59.61 | 51.65/41.27 | 52.55/39.89 |
| | EmoCLIP | 2024 | DS | CLIP-ViT-B/32 | - | - | 62.12/58.04 | 36.18/31.41 | 41.46/34.24 |
| | A³lign-DFER | 2024 | DS | CLIP-ViT-L/14 | - | - | 74.20/64.09 | 51.77/41.87 | 53.22/42.07 |
| | UMBEnet | 2024 | DS | CLIP | - | - | 73.93/64.55 | 52.10/44.01 | 57.25/46.92 |
| | FineCLIPER | 2024 | DS | CLIP-ViT-B/16 | 20M* | - | 76.21/65.98 | 53.98/45.22 | 56.91/45.01 |
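
For reference, the WAR and UAR numbers reported in the tables above follow the standard definitions: WAR (weighted average recall) is the overall accuracy, i.e. per-class recall weighted by class frequency, while UAR (unweighted average recall) is the plain mean of per-class recalls, so rare emotions count as much as frequent ones. The short sketch below (function name and toy labels are our own, for illustration) computes both.

```python
# Illustrative WAR/UAR computation for a set of predictions.
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():                              # skip classes absent from the test set
            recalls.append((y_pred[mask] == c).mean())
    war = (y_true == y_pred).mean()                 # weighted average recall = overall accuracy
    uar = float(np.mean(recalls))                   # unweighted average recall = macro recall
    return war, uar

y_true = np.array([0, 0, 0, 1, 2, 2])               # toy ground-truth emotion labels
y_pred = np.array([0, 0, 1, 1, 2, 0])               # toy predictions
print(war_uar(y_true, y_pred, num_classes=3))       # (0.666..., 0.722...)
```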

Citation

If you find our work useful, please cite our paper:

@article{wang_surveyfer_2024,
       title = {A Survey on Facial Expression Recognition of Static and Dynamic Emotions},
       author = {Wang, Yan and Yan, Shaoqi and Liu, Yang and Song, Wei and Liu, Jing and Chang, Yang and Mai, Xinji and Hu, Xiping and Zhang, Wenqiang and Gan, Zhongxue},
       journal = {arXiv preprint arXiv:2408.15777},
       year = {2024}
}