# EAT: Self-Supervised Pre-Training with Efficient Audio Transformer <!-- omit in toc -->
## Guides
- [Requirements and Installation](#requirements-and-installation)
- [Model Checkpoints](#model-checkpoints)
- [Feature Extraction](#feature-extraction)
- [Data Preparation](#data-preparation)
- [Pre-Training](#pre-training)
- [Fine-Tuning](#fine-tuning)
- [Inference and Evaluation](#inference-and-evaluation)
## News :fire:
- We have released EAT-large (20 epochs), which achieves SOTA performance on AS-2M, AS-20K, ESC-50, and SPC-2.
- We have updated the checkpoints and code; EAT now seamlessly supports variable-length audio throughout the training, feature extraction, inference, and evaluation phases.
## Introduction
EAT is an audio SSL model that achieves both high effectiveness and high efficiency during self-supervised pre-training. You can find the details in the paper [EAT: Self-Supervised Pre-Training with Efficient Audio Transformer](https://arxiv.org/abs/2401.03497).
## Requirements and Installation
The minimum environment requirements are `Python >= 3.8` and `PyTorch >= 1.13`. You can find the versions of the other dependencies we use in `requirements.txt`.
```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
git clone https://github.com/cwx-worst-one/EAT
```
## Model Checkpoints
You can download the EAT-base (10 epochs) checkpoints from Google Drive:
- AS-2M Pre-trained
- AS-2M Pre-trained+Fine-tuned (AS-2M)
- AS-2M Pre-trained+Fine-tuned (AS-20K)
:warning: Since the amount of AudioSet data we have access to is smaller than that used by other models, we highly recommend pre-training EAT on your own data; the resulting model will likely perform better than the released one.
### Update :new: (Recommended)
We have introduced two new variants of the EAT pre-training model, along with their fine-tuned versions; each is designed to improve performance through either extended pre-training epochs or a larger model size.
Model checkpoints:
- EAT-base_epoch30 (pre-training)
- EAT-base_epoch30 (fine-tuning on AS-2M)
- EAT-large_epoch20 (pre-training)
- EAT-large_epoch20 (fine-tuning on AS-2M)
Performance metrics:
| Model     | Backbone | Parameters | Pre-training Epochs | AS-20K mAP (%) | AS-2M mAP (%) |
|-----------|----------|------------|---------------------|----------------|---------------|
| EAT-base  | ViT-B    | 88M        | 10                  | 40.3           | 48.6          |
| EAT-base  | ViT-B    | 88M        | 30                  | 41.3           | 48.9          |
| EAT-large | ViT-L    | 309M       | 20                  | 42.0           | 49.5          |
## Feature Extraction
We provide a script for extracting audio features from the last layer of the EAT encoder. The features are stored in `.npy` format, and the extracted features have a frame rate of ~50 Hz. EAT provides both frame-level features and utterance-level features (represented by the CLS token).

To extract latent representations from audio clips, you can use our pre-trained checkpoint, a fine-tuned checkpoint, or your own; then run the script `feature_extract.sh`:
```bash
bash EAT/scripts/feature_extract.sh
```
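As a quick sanity check, the sketch below shows one way to load an extracted feature file and separate the utterance-level (CLS) vector from the frame-level features. The file path, the array layout, and the assumption that the CLS embedding is stored as the first row are illustrative only; please verify against the actual output of `feature_extract.sh`.

```python
import numpy as np

# Hypothetical path to a feature file produced by feature_extract.sh.
feat = np.load("features/example.npy")
print("feature array shape:", feat.shape)  # expected: (num_tokens, hidden_dim)

# Assumption: the utterance-level CLS embedding is stored as the first row,
# followed by the frame-level features (~50 Hz). Check the script's output format.
cls_embedding, frame_features = feat[0], feat[1:]
print("utterance-level embedding:", cls_embedding.shape)
print(f"frame-level features: {frame_features.shape} "
      f"(~{frame_features.shape[0] / 50:.1f} s of audio at ~50 Hz)")
```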
Data Preparation
The main dataset in our experiment is AudioSet. Regrettably, we are unable to release the data due to copyright restrictions. Data manifest is available at here. We follow the file format in wav2vec and data2vec, where .tsv
format file is for index while .lbl
and .csv
format files are specific for classification task. You could modify the files for your own database.
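For reference, wav2vec-style `.tsv` manifests typically list a root directory on the first line, followed by one relative audio path and its sample count per line. The sketch below builds such a manifest for a folder of `.wav` files; the directory layout and output names are placeholders, and the exact label-file format expected by EAT should be checked against the released manifest.

```python
import os
import soundfile as sf  # third-party package for reading audio metadata

# Placeholder paths; point these at your own dataset.
audio_root = "/data/my_audio"
manifest_path = "manifest/train.tsv"

os.makedirs(os.path.dirname(manifest_path), exist_ok=True)
with open(manifest_path, "w") as f:
    # First line of a wav2vec-style manifest is the root directory.
    f.write(audio_root + "\n")
    for name in sorted(os.listdir(audio_root)):
        if not name.endswith(".wav"):
            continue
        # Each subsequent line: relative path <tab> number of samples.
        frames = sf.info(os.path.join(audio_root, name)).frames
        f.write(f"{name}\t{frames}\n")
```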
## Pre-Training
Our code is adapted from Audio-MAE and data2vec. We use `pretraining_AS2M.yaml` as the default pre-training config. To pre-train the EAT model on AudioSet, run the script `pretraining_AS2M.sh`:
```bash
bash EAT/scripts/pretraining_AS2M.sh
```
If you need to pre-train EAT on other datasets whose audio lengths are not fixed at 10 seconds, refer to the instructions in `feature_extract/readme.md`.
## Fine-Tuning
We use `finetuning.yaml` as the default fine-tuning config. To fine-tune the EAT model on different downstream tasks, run the script `finetuning_{task}.sh`, where `{task}` is one of `AS20K`, `AS2M`, `ESC50`, and `SPCv2`. For example, you can fine-tune EAT on `AS20K` by executing:
```bash
bash EAT/scripts/finetuning_AS20K.sh
```
## Inference and Evaluation
To run inference on a single AudioSet audio clip with a fine-tuned model, you can use our EAT checkpoints fine-tuned on AS-2M (recommended) or AS-20K and run the script `inference.sh`:
```bash
bash EAT/scripts/inference.sh
```
An example output is as follows:
```
# top_k_prediction = 12
************ Acoustic Event Inference ************
LABEL                 PREDICTION
Percussion            0.523
Drum kit              0.437
Vibraphone            0.420
Drum                  0.316
Music                 0.303
Snare drum            0.277
Glockenspiel          0.225
Marimba, xylophone    0.223
Cymbal                0.213
Bass drum             0.207
Hi-hat                0.196
Mallet percussion     0.170
**************************************************
```
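The listing above is a multi-label prediction: the model emits one score per AudioSet class, and the script prints the top-k of them. The sketch below illustrates the general pattern of turning classification logits into such a ranked list via a per-class sigmoid and top-k selection; the tensor shapes and label names are placeholders, not the actual internals of `inference.sh`.

```python
import torch

# Placeholders: logits from a fine-tuned classification head and the
# corresponding AudioSet label names (527 classes in the full ontology).
logits = torch.randn(527)
label_names = [f"class_{i}" for i in range(527)]

top_k = 12
# Multi-label classification: independent sigmoid per class, then rank.
probs = torch.sigmoid(logits)
scores, indices = torch.topk(probs, k=top_k)

for score, idx in zip(scores.tolist(), indices.tolist()):
    print(f"{label_names[idx]:<20s} {score:.3f}")
```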
For a comprehensive evaluation on the entire AudioSet evaluation set with fine-tuned EAT models, run the evaluation script `eval.sh`:
```bash
bash EAT/scripts/eval.sh
```
This script reports the mAP on the AudioSet evaluation set. Per-class AP values are written to `./EAT/ap_log.txt`. You can also find the results of our fine-tuned EAT models on the AudioSet evaluation set under `./EAT/results`.
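For reference, the mAP reported here is the mean over classes of each class's average precision, computed from the multi-label targets and the predicted scores. Below is a minimal sketch of that computation using scikit-learn, with random placeholder arrays standing in for the real predictions and labels; the exact implementation in `eval.sh` may differ.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Placeholder data: multi-label targets and predicted scores for
# num_clips evaluation clips over num_classes AudioSet classes.
num_clips, num_classes = 1000, 527
targets = (np.random.rand(num_clips, num_classes) > 0.98).astype(int)
scores = np.random.rand(num_clips, num_classes)

# Per-class average precision, skipping classes with no positive clips.
ap_per_class = [
    average_precision_score(targets[:, c], scores[:, c])
    for c in range(num_classes)
    if targets[:, c].any()
]
print(f"mAP: {np.mean(ap_per_class) * 100:.2f}%")
```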
## Performance
Pre-trained on AS-2M, EAT achieves state-of-the-art (SOTA) performance on several audio and speech classification datasets, including AS-20K, AS-2M, ESC-50, and SPC-2.

## Efficiency
EAT reduces total pre-training time by roughly 15x compared to BEATs and roughly 10x compared to Audio-MAE, requiring only 10 epochs of pre-training on AS-2M.
## Experiment Logs
We log our experiments with wandb and have published a short WandB report detailing the training process and performance metrics of the EAT model. You can view it here.
## TODO <!-- omit in toc -->
- Release the final EAT-large
- Update the code and checkpoints for easier usage
- Release the Docker image
## Citation
If you find our EAT code and models useful, please cite the following paper:
```bibtex
@article{chen2024eat,
  title={EAT: Self-Supervised Pre-Training with Efficient Audio Transformer},
  author={Chen, Wenxi and Liang, Yuzhe and Ma, Ziyang and Zheng, Zhisheng and Chen, Xie},
  journal={arXiv preprint arXiv:2401.03497},
  year={2024}
}
```
## Reference and Acknowledgement <!-- omit in toc -->
Our codebase is based on the awesome Audio-MAE and data2vec repositories.