SERAB: Speech Emotion Recognition Adaptation Benchmark

This repo contains a "simplified" implementation of SERAB.

Update: BYOL-S was one of the strongest submissions to the HEAR NeurIPS 2021 Challenge! Leaderboard results: https://neuralaudio.ai/hear2021-results.html

Demo

Environment setup

Libraries to reproduce the environment are detailed in serab.yml.

To reproduce the environment, run:

conda env create -f serab.yml

To install the external source files from patches, run the following after cloning the repo:

cd SERAB/
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/config.yaml
patch --ignore-whitespace < config.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/train.py
patch < train.diff
cd byol_a/
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/augmentations.py
patch < augmentations.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/common.py
patch < common.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/dataset.py
patch < dataset.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/models.py
mv models.py models/audio_ntt.py

Evaluate a (pre-trained) model using SERAB

In this simplified version, only PyTorch models can be used.

Before running the evaluation, make sure that the config file config.yaml is correctly set up for your model.

To evaluate a pre-existing model, run:

python clf_benchmark.py --model_name {MODEL_NAME} --dataset_name {DATASET_NAME}

By default, grid-search-based classifier hyperparameter optimization is performed. To run a pre-existing model with the "default" classifiers instead, pass the --model_selection none option:

python clf_benchmark.py --model_name {MODEL_NAME} --dataset_name {DATASET_NAME} --model_selection none
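For intuition, grid-search model selection iterates over hyperparameter combinations and keeps the best-scoring one. Below is a minimal pure-Python sketch of the idea; the actual classifiers, metrics, and search space used in clf_benchmark.py may differ.

```python
from itertools import product

# Toy "classifier": predict 1 if a feature exceeds a threshold.
def predict(threshold, features):
    return [1 if f > threshold else 0 for f in features]

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Hypothetical hyperparameter grid (illustrative only).
grid = {"threshold": [0.0, 0.5, 1.0]}

features = [0.2, 0.7, 1.3, 0.1]
labels = [0, 1, 1, 0]

best_score, best_params = -1.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = accuracy(predict(params["threshold"], features), labels)
    if score > best_score:
        best_score, best_params = score, params

# best_params -> {"threshold": 0.5}, best_score -> 1.0
```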

To run a model on all the SERAB datasets, <a href="https://dvc.org/">DVC</a> can be used.

Make the appropriate modifications in dvc.yaml and run:

dvc repro
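For illustration, a stage in dvc.yaml might look like the following; the stage name is hypothetical, {MODEL_NAME} is a placeholder as above, and the dependencies should be adapted to the actual pipeline.

```yaml
stages:
  benchmark_crema_d:  # hypothetical stage name
    cmd: python clf_benchmark.py --model_name {MODEL_NAME} --dataset_name crema_d
    deps:
      - clf_benchmark.py
      - config.yaml
```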

Train a model "à la BYOL-A"

Models can be pre-trained on a subsample of AudioSet that only contains speech.

You might need to make changes to train.py and config.yaml before starting training.

To train a model, run:

python train.py {MODEL_NAME}  # or dvc repro

As training time is usually long (10-20 h, depending on the model), we recommend using tmux to attach to and detach from a given session.

SERAB datasets

While CREMA-D and SAVEE are already integrated into TFDS, the other datasets were added as custom <a href="https://www.tensorflow.org/datasets/add_dataset">TensorFlow datasets</a>.

The code to load these datasets can be found in tensorflow_datasets.

Here are the steps to download and load the SERAB datasets:

  1. In the tensorflow_datasets folder, create the folder download/manual
  2. Download the compressed datasets (.zip files) into tensorflow_datasets/download/manual/

Link to the SERAB Datasets:

  3. Ensure that all samples in a given dataset are either all mono or all stereo. You can use stereo_to_mono.py in serab.utils to convert stereo audio files to mono.

  4. Build each dataset using the TFDS CLI:

cd tensorflow_datasets/{DATASET_NAME}
tfds build  # Download and prepare the dataset to `~/tensorflow_datasets/`

The datasets are now ready to use!
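The stereo-to-mono conversion mentioned above amounts to averaging the two channels. A minimal NumPy sketch of the idea is shown below; the actual implementation in serab.utils may use a different signature or convention.

```python
import numpy as np

def stereo_to_mono(audio: np.ndarray) -> np.ndarray:
    """Average the channels of a (n_samples, 2) stereo signal.

    Illustrative sketch only; serab.utils.stereo_to_mono may differ.
    """
    if audio.ndim == 1:  # already mono
        return audio
    return audio.mean(axis=1)

stereo = np.array([[1.0, 3.0], [2.0, 4.0]])
mono = stereo_to_mono(stereo)  # -> array([2., 3.])
```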

Citation

If you use this code, please cite the paper:

@article{scheidwasser2021serab,
  title={SERAB: A multi-lingual benchmark for speech emotion recognition},
  author={Scheidwasser-Clow, Neil and Kegler, Mikolaj and Beckmann, Pierre and Cernak, Milos},
  journal={arXiv preprint arXiv:2110.03414},
  year={2021}
}