Facial Affective Behavior Analysis with Instruction Tuning

[Project Page] [Data] [Paper] [Checkpoints-AU] [Checkpoints-Emotion]

Facial Affective Behavior Analysis with Instruction Tuning (ECCV 2024) [Paper] <br> Yifan Li, Anh Dao, Wentao Bao, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

<p align="center"> <a href="https://johnx69.github.io/FABA/"><img src="assets/emola.gif" width="90%"></a> <br> </p>

Abstract

Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images. However, traditional approaches primarily deploy models to discriminate among discrete emotion categories and lack the fine granularity and reasoning capability needed for complex facial behaviors. Multi-modal Large Language Models (MLLMs) have proven successful in general visual understanding tasks. However, directly harnessing MLLMs for FABA is challenging due to the scarcity of datasets and benchmarks, the neglect of facial prior knowledge, and low training efficiency. To address these challenges, we introduce (i) an instruction-following dataset for two FABA tasks, i.e., facial emotion and action unit recognition, (ii) a benchmark, FABA-Bench, with a new metric considering both recognition and generation ability, and (iii) a new MLLM, EmoLA, as a strong baseline for the community. Our initiative on the dataset and benchmarks reveals the nature and rationale of facial affective behaviors, i.e., fine-grained facial movement, interpretability, and reasoning. Moreover, to build an effective and efficient FABA MLLM, we introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into a pre-trained MLLM. We conduct extensive experiments on FABA-Bench and four commonly used FABA datasets. The results demonstrate that the proposed facial prior expert boosts performance and that EmoLA achieves the best results on our FABA-Bench. On commonly used FABA datasets, EmoLA is competitive, rivaling task-specific state-of-the-art models.

Contents

Installation

You can install all the required libraries by executing the bash script:

bash install.sh

The installation covers the main packages

conda create -n emola python=3.10 -y
conda activate emola
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

and the extra libraries

pip install flash-attn --no-build-isolation
pip install wandb==0.15.12
pip install deepspeed==0.12.6
pip install rouge
pip install ipdb
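
If the installation succeeds, the core dependencies should import cleanly. Below is a minimal sanity-check sketch; it assumes that `pip install -e .` pulls in torch and transformers (as in LLaVA-style setups), while only wandb and deepspeed are version-pinned above.

```python
# Sanity check that the main packages from install.sh are importable.
# Only wandb and deepspeed versions are pinned in the install commands above;
# torch/transformers versions depend on what `pip install -e .` resolves.
import importlib

for pkg in ["torch", "transformers", "deepspeed", "wandb", "rouge", "flash_attn"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'version unknown')}")
    except ImportError as err:
        print(f"{pkg}: not importable ({err})")
```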

Downloading

After that, the pretrained LLaVA-1.5-7b checkpoint (link) needs to be downloaded into the ./checkpoints/ directory.

The FABA-Instruct dataset (link) needs to be downloaded and unzipped into the ./data/ directory. Note that the landmark features have already been extracted with the pretrained InsightFace model and are included in the dataset. The preprocessing code is in ./feature_extraction/landmark. The overall directory tree of the FABA-Instruct dataset looks like the following (a quick layout check is sketched after the tree):

data/
└── FABAInstruct/
    ├── train/
    │   ├── train_au_anno.json
    │   ├── train_emotion_anno.json
    │   ├── images/
    │   └── landmarks/
    └── eval/
        ├── eval_au_anno.json
        ├── eval_emotion_anno.json
        ├── images/
        └── landmarks/
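
After unzipping, a quick way to confirm the layout is to check that the paths from the tree above exist. This is only a convenience sketch mirroring the tree; it assumes you run it from the repository root.

```python
# Check that the FABA-Instruct data was unzipped into the expected layout under ./data/.
import os

expected_paths = [
    "data/FABAInstruct/train/train_au_anno.json",
    "data/FABAInstruct/train/train_emotion_anno.json",
    "data/FABAInstruct/train/images",
    "data/FABAInstruct/train/landmarks",
    "data/FABAInstruct/eval/eval_au_anno.json",
    "data/FABAInstruct/eval/eval_emotion_anno.json",
    "data/FABAInstruct/eval/images",
    "data/FABAInstruct/eval/landmarks",
]

for path in expected_paths:
    print(("ok     " if os.path.exists(path) else "MISSING"), path)
```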

Dataset

The statistics of the FABA-Instruct dataset are as follows:

| Statistics | Value |
| --- | --- |
| Total images | 19877 |
| Emotion training samples | 19474 |
| Emotion testing samples | 403 |
| Emotion description average length | 50.47 |
| AU training samples | 15838 |
| AU testing samples | 325 |
| AU description average length | 207.35 |
<p align="center"> <a href="https://johnx69.github.io/FABA/"><img src="assets/dataset_statistic.png" width="90%"></a> <br> </p>

Each annotation entry includes several keys, described as follows:

{
    "source": "the dataset",
    "image": "the relative path of each image",
    "id": "the name of each image",
    "conversation": "the question and answer pair",
    "landmark": "the relative path of the landmark feature"
},
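
As a minimal sketch of how these keys can be consumed, the snippet below loads the AU training annotations and reads one entry's image and landmark feature. It assumes each annotation file is a JSON list of entries like the examples below, with image/landmark paths relative to ./data/.

```python
# Load the AU training annotations and inspect one entry.
# Assumption: the file is a JSON list of entries shaped like the examples below,
# with "image" and "landmark" paths relative to ./data/.
import json
import numpy as np
from PIL import Image

with open("data/FABAInstruct/train/train_au_anno.json") as f:
    annos = json.load(f)

entry = annos[0]
print("source:  ", entry["source"])
print("id:      ", entry["id"])
print("question:", entry["conversations"][0]["value"])

image = Image.open("data/" + entry["image"])
landmark = np.load("data/" + entry["landmark"])
print("image size:", image.size, "| landmark feature shape:", landmark.shape)
```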

A demo annotation from FABA-Instruct-AU is shown below (a sketch for parsing AU labels from the answer follows the example):

  {
    "source": "AffectNet",
    "image": "FABAInstruct/train/images/AffectNet/AffectNet_v2/1194/4dfc35e0be3712362e8795bf27fe332cbd809e08a19340d935d4ec51.jpg",
    "id": "AffectNet/AffectNet_v2/1194/4dfc35e0be3712362e8795bf27fe332cbd809e08a19340d935d4ec51",
    "conversations": [
      {
        "from": "human",
        "value": "In this face, what are the discernible action units?\n<image>"
      },
      {
        "from": "gpt",
        "value": "This image depicts a person displaying a range of facial muscle movements that align with specific Facial Action Coding System (FACS) action units (AUs). Here are the AUs that appear to be in play based on the facial expression:\n\n1. AU4 - Brow Lowerer: The eyebrows are drawn together and downward, which usually indicates a frown or a scowl.\n2. AU9 - Nose Wrinkler: The wrinkles on the nose suggest that the nose wrinkler muscle is engaged.\n3. AU15 - Lip Corner Depressor: The corners of the lips are pulled down, which could suggest sadness or disgust.\n4. AU17 - Chin Raiser: The chin appears to be pushed upward, creating a wrinkled or pouted appearance on the lower face.\n\nPlease note that interpreting AUs and the corresponding emotions can be subjective, and the context of the facial expression and other nonverbal cues are important for accurate interpretation."
      }
    ],
    "landmark": "FABAInstruct/train/landmarks/AffectNet/AffectNet_v2/1194/4dfc35e0be3712362e8795bf27fe332cbd809e08a19340d935d4ec51.npy"
  },
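
Since the answers name action units explicitly as "AU4", "AU9", and so on, discrete AU labels can be recovered from the generated text for recognition-style scoring. The snippet below is a simple regex-based illustration, not the repository's official evaluation code.

```python
# Extract the set of AU indices mentioned as "AU<number>" in an answer string.
# Illustrative only; the repository's evaluation scripts define the official parsing.
import re

def extract_aus(answer: str) -> set[int]:
    return {int(num) for num in re.findall(r"\bAU(\d+)\b", answer)}

answer = (
    "1. AU4 - Brow Lowerer ...\n"
    "2. AU9 - Nose Wrinkler ...\n"
    "3. AU15 - Lip Corner Depressor ...\n"
    "4. AU17 - Chin Raiser ..."
)
print(sorted(extract_aus(answer)))  # [4, 9, 15, 17]
```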

The demo annotation for FABA-Instruct-Emotion is:

  {
    "source": "AffectNet",
    "image": "FABAInstruct/train/images/AffectNet/AffectNet_v2/916/93ebd6d360782d3d42c0a7b381c120764baaa9f23c8c9c42062f1599.jpg",
    "id": "AffectNet/AffectNet_v2/916/93ebd6d360782d3d42c0a7b381c120764baaa9f23c8c9c42062f1599",
    "conversations": [
      {
        "from": "human",
        "value": "What is the emotion of this face?\n<image>"
      },
      {
        "from": "gpt",
        "value": "The person in the image appears to have a gentle expression. There is a softness in their eyes, and the mouth is neither turned downward nor upward significantly, which could be interpreted as a neutral or emotional calmness. The eyes also seem a bit reflective or thoughtful. Without full context, it is difficult to pinpoint the exact emotion, but there does not appear to be an overt display of strong emotions such as joy, sadness, anger, or surprise."
      }
    ],
    "landmark": "FABAInstruct/train/landmarks/AffectNet/AffectNet_v2/916/93ebd6d360782d3d42c0a7b381c120764baaa9f23c8c9c42062f1599.npy"
  },
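
The average description lengths reported in the statistics table above can be roughly reproduced from these annotation files. The sketch below counts whitespace-separated words in the model-side answers; the exact tokenization behind the reported numbers is an assumption here and may differ.

```python
# Roughly recompute the average answer length for an annotation file.
# Assumption: length means whitespace-separated words of the "gpt" answers;
# the tokenization used for the statistics table may differ.
import json

def average_answer_length(anno_path: str) -> float:
    with open(anno_path) as f:
        annos = json.load(f)
    lengths = [
        len(turn["value"].split())
        for entry in annos
        for turn in entry["conversations"]
        if turn["from"] == "gpt"
    ]
    return sum(lengths) / len(lengths)

print(average_answer_length("data/FABAInstruct/train/train_emotion_anno.json"))
```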

The illustration of FABA-Instruct-AU:

<p align="center"> <a href="https://johnx69.github.io/FABA/"><img src="assets/faba_au.png" width="90%"></a> <br> </p>

The illustration of FABA-Instruct-Emotion:

<p align="center"> <a href="https://johnx69.github.io/FABA/"><img src="assets/faba_emotion.png" width="90%"></a> <br> </p>

Training

Training EmoLA

# AU task
bash scripts/train_emola/train_emola_au.sh 
# Emotion task
bash scripts/train_emola/train_emola_emotion.sh

Evaluation

Testing EmoLA

# AU task
bash scripts/infer_emola/infer_emola_au.sh
# Emotion task
bash scripts/infer_emola/infer_emola_emotion.sh

The results for FABA-Instruct-AU should be close to:

| $S_{re}$ | $S_{ge}$ | $S_{rege}$ |
| --- | --- | --- |
| 56.3 | 35.2 | 91.5 |

The results for FABA-Instruct-Emotion should be close to:

| $S_{re}$ | $S_{ge}$ | $S_{rege}$ |
| --- | --- | --- |
| 64.5 | 31.7 | 96.2 |

You can refer to the ./output_test/ directory for the generation results.
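
The exact definitions of $S_{re}$, $S_{ge}$, and $S_{rege}$ are given in the paper and computed by the inference scripts; in both tables above $S_{rege}$ equals $S_{re} + S_{ge}$. As a hedged illustration of the generation side only, the sketch below computes a ROUGE score over generated descriptions using the rouge package installed earlier. The file name and the prediction/reference field names are assumptions for illustration, not the repository's actual output schema.

```python
# Hedged sketch: score generated descriptions against references with ROUGE.
# The file name and the "prediction"/"reference" fields are illustrative assumptions,
# NOT the actual schema of the files written to ./output_test/.
import json
from rouge import Rouge

with open("output_test/emola_au_predictions.json") as f:  # hypothetical file name
    records = json.load(f)

hypotheses = [r["prediction"] for r in records]  # generated descriptions
references = [r["reference"] for r in records]   # ground-truth descriptions

scores = Rouge().get_scores(hypotheses, references, avg=True)
print("ROUGE-L F1:", scores["rouge-l"]["f"])
```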

Citation

If you find our FABA-Instruct or EmoLA useful for your research, please cite using the following BibTeX:

@article{li2024facial,
  title={Facial Affective Behavior Analysis with Instruction Tuning},
  author={Li, Yifan and Dao, Anh and Bao, Wentao and Tan, Zhen and Chen, Tianlong and Liu, Huan and Kong, Yu},
  journal={arXiv preprint arXiv:2404.05052},
  year={2024}
}

Acknowledgement

The images in FABA-Instruct are from AffectNet, and our codebase is based on LLaVA. Thanks for these great projects!

NSF Acknowledgement

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. 1949694 and 2040209. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.