EVAR ~ Evaluation package for Audio Representations

This repository offers a comprehensive evaluation package for audio representations (ARs) as employed in our papers. Its key features include:

In early 2021, we lacked a cohesive codebase for evaluating models across various tasks under consistent test settings, which prompted the creation of this repository. By the end of 2021, similar options had emerged (such as SERAB, SUPERB, the HEAR 2021 NeurIPS Challenge, and HARES); however, this repository was developed independently for our specific study.

This evaluation package is intended for researchers who wish to compare ARs under the same test setup as employed in our study. The papers that use EVAR are:

Update History

Jun 20, 2024 -- Supported audio-to-text and text-to-audio retrieval (ATR).

Mar 25, 2024 -- Fixed minor issues.

Jan 25, 2024 -- Supported zero-shot evaluation for CLAP models.

<details><summary>Older history</summary>

Jan 12, 2024 -- Supported weighted CE loss with fine-tuning and added more models.

Jan 17, 2023 -- Supported evaluating multilayer features by stacking layer-wise features.

Jan 12, 2023 -- Supported Fine-tuning on AudioSet20K and additional models.

</details>

1. Quick start (Linear evaluation)

The following shows how to prepare the CREMA-D dataset and evaluate OpenL3 (music) features on it.

  1. Follow the steps in "2-1. Step 1: Install modules and download external source code"; in short:

    git clone https://github.com/nttcslab/eval-audio-repr.git evar
    cd evar
    curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py
    curl https://raw.githubusercontent.com/XinhaoMei/WavCaps/master/retrieval/tools/utils.py -o evar/utils/wavcaps_utils.py
    pip install -r requirements.txt
    
  2. Download the CREMA-D dataset. This downloads all the .wav files under the folder downloads/cremad.

    $ python evar/utils/download_cremad.py downloads/cremad
    
  3. Preprocess (resample) the data samples. This makes copies of all the .wav files under downloads/cremad, resampled to 48,000 Hz, under work/48k/cremad.

    $ python prepare_wav.py downloads/cremad work/48k/cremad 48000

  4. Prepare the OpenL3 code and weights. Our implementation (evar/ar_openl3.py) uses torchopenl3.

    $ pip install torchopenl3

  5. Evaluate. The 48,000 Hz .wav files from work/48k/cremad are encoded into embedding vectors by OpenL3, and the linear evaluation program takes the embeddings as input (a conceptual sketch of this encoding follows the list). The result will be appended to the file results/scores.csv.

    $ python lineareval.py config/openl3mus.yaml cremad
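
For reference, the encoding in step 5 is conceptually similar to the sketch below. This is not the EVAR code path (that lives in evar/ar_openl3.py and lineareval.py); it assumes torchopenl3 exposes an openl3-compatible get_audio_embedding interface, and the file path is illustrative only.

# Conceptual sketch only; argument names/values may differ from config/openl3mus.yaml.
import librosa
import torchopenl3

wav, sr = librosa.load('work/48k/cremad/<some_clip>.wav', sr=48000, mono=True)  # hypothetical file
emb, timestamps = torchopenl3.get_audio_embedding(wav, sr, content_type='music')
print(emb.shape)  # frame-level embeddings; EVAR's AR wrappers pool such features into a clip-level vector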

2. Setup

Warning: Setup takes a long time, especially downloading the datasets.

You will:

  1. Install modules and download external source code.
  2. Download datasets and create metadata files.
  3. Download model implementation and weights.

2-0. Step 0: Clone as evar.

To keep the following commands simple, we clone the repository as evar.

git clone https://github.com/nttcslab/eval-audio-repr.git evar

2-1. Step 1: Install modules and download external source code

Run the following once to download your copy of the external source code.

curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py

Install the modules listed in requirements.txt. If you use Anaconda, you can create an environment as in the following example:

conda create -n evar python=3.8
conda activate evar
pip install -r requirements.txt

2-2. Step 2: Setup datasets

See πŸ‘‰ Preparing-datasets.md.

2-3. Step 3: Setup models

See πŸ‘‰ Preparing-models.md.

3. Linear evaluation

The following describes the evaluation steps with an example command line:

$ python lineareval.py config/openl3mus.yaml cremad

The following shows the folder structure:

evar/
  evar           Evaluation codes.
  evar/utils     Helper utility codes.
  evar/metadata  <SOME CSVs TO BE CREATED IN SETUP STEPS> Metadata (file name/split/label) CSV files; see the illustration after this list.
  external       Folder to clone/store external resources such as codes and weights.
  logs           <CREATED RUNTIME> Folder to store logs.
  results        <CREATED RUNTIME> `scores.csv` will accumulate resulting scores.
  work           <TO BE CREATED IN SETUP> Folder to serve .wav samples.
  work/16k         for 16,000 Hz samples.
  work/22k         for 22,000 Hz samples -- not 22,050 Hz, for COALA.
  work/32k         for 32,000 Hz samples.
  work/44k         for 44,100 Hz samples.
  work/48k         for 48,000 Hz samples.
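
The metadata CSVs under evar/metadata are generated in the setup steps. As a rough illustration only (the actual column names and values are defined by the setup scripts and Preparing-datasets.md), each row maps a sample file to its label and train/valid/test split:

file_name,label,split
cremad/XXX.wav,anger,train
cremad/YYY.wav,happy,valid
cremad/ZZZ.wav,sad,test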

3-1. Example

The following is an example of evaluating BYOL-A on GTZAN. (See Evaluation-examples.md for example command lines.)

$ python 2pass_lineareval.py config/byola.yaml gtzan batch_size=64
>>> python lineareval.py config/byola.yaml gtzan --options=batch_size=64 --lr=None --hidden=() --standard_scaler=True --mixup=False --early_stop_epochs=None --seed=42 --step=2pass_1_precompute_only
   :

Train:443, valid:197, test:290, multi label:False
 using network pretrained weight: AudioNTT2020-BYOLA-64x96d2048.pth
<All keys matched successfully>
Logging to logs/gtzan_ar_byola.AR_BYOLA_6bd7e19e/log.txt
['features.0.weight', 'features.0.bias', 'features.1.weight', 'features.1.bias', 'features.1.running_mean', 'features.1.running_var', 'features.1.num_batches_tracked', 'features.4.weight', 'features.4.bias', 'features.5.weight', 'features
.5.bias', 'features.5.running_mean', 'features.5.running_var', 'features.5.num_batches_tracked', 'features.8.weight', 'features.8.bias', 'features.9.weight', 'features.9.bias', 'features.9.running_mean', 'features.9.running_var', 'features.9.num_batches_tracked', 'fc.0.weight', 'fc.0.bias', 'fc.3.weight', 'fc.3.bias']                                                                                                                                                              
using spectrogram norimalization stats: [-3.7112076  3.5103734]
  (module): AR_BYOLA(
    (to_feature): ToLogMelSpec(
      (to_spec): MelSpectrogram(
        Mel filter banks size = (64, 513), trainable_mel=False
        (stft): STFT(n_fft=1024, Fourier Kernel size=(513, 1, 1024), iSTFT=False, trainable=False)
  :

Getting gtzan_ar_byola.AR_BYOLA_6bd7e19e train embeddings...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:03<00:00,  2.28it/s]
Getting gtzan_ar_byola.AR_BYOLA_6bd7e19e valid embeddings...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:01<00:00,  2.30it/s]
Getting gtzan_ar_byola.AR_BYOLA_6bd7e19e test embeddings... 
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:02<00:00,  2.23it/s]
>>> python lineareval.py config/byola.yaml gtzan --options=batch_size=64 --lr=None --hidden=() --standard_scaler=True --mixup=False --early_stop_epochs=None --seed=42 --step=2pass_2_train_test
  :

Train:443, valid:197, test:290, multi label:False
 using cached embeddings: embs-gtzan_ar_byola.AR_BYOLA_6bd7e19e-train-1
 using cached embeddings: embs-gtzan_ar_byola.AR_BYOLA_6bd7e19e-valid-1
 using cached embeddings: embs-gtzan_ar_byola.AR_BYOLA_6bd7e19e-test-1
πŸš€ Started Linear evaluation:
 stats|train: mean=-0.0000, std=0.9079
 stats|valid: mean=-0.0333, std=1.0472
Training model: MLP(
  (mlp): Sequential(
    (0): Linear(in_features=2048, out_features=10, bias=True)
  )
)
Details - metric: acc, loss: <function loss_nll_with_logits at 0x7f7a1a2a0160>, optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.0003
    weight_decay: 1e-08
), n_class: 10
epoch 0001/200: lr: 0.0003000: loss=33.254899 val_acc=0.2436548 val_loss=40.7875748
epoch 0002/200: lr: 0.0003000: loss=25.966087 val_acc=0.3959391 val_loss=35.5625954
epoch 0003/200: lr: 0.0003000: loss=21.259017 val_acc=0.4517766 val_loss=32.1851768
  :
epoch 0103/200: lr: 0.0003000: loss=0.646740 val_acc=0.6751269 val_loss=21.1744614
epoch 0104/200: lr: 0.0003000: loss=0.635991 val_acc=0.6751269 val_loss=21.1834354
Training complete in 0m 1s
Best val_acc@84 = 0.6852791878172588
Best val_loss@84 = 20.660442352294922
 stats|test: mean=-0.0388, std=0.9933
Linear evaluation: gtzan_ar_byola.AR_BYOLA_39f1b473 gtzan -> 0.75862

results/scores.csv example:

BYOLA,gtzan,0.7586206896551724,39f1b473,"Linear evaluation: gtzan_ar_byola.AR_BYOLA_39f1b473 gtzan -> 0.75862
{'audio_repr': 'ar_byola.AR_BYOLA', 'weight_file': 'external/byol_a/pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth', 'feature_d': 2048, 'sample_rate': 16000, 'n_fft': 1024, 'window_size': 1024, 'hop_size': 160, 'n_mels': 64, 'f_min': 60, 'f_max': 7800, 'temporal_pooling_type': 'mean_max', 'batch_size': 64, 'lr_lineareval': 0.0003, 'lr_finetune_frozen': 0.001, 'lr_finetune_finetune': 0.001, 'report_per_epochs': 20, 'early_stop_epochs': 20, 'task_metadata': 'evar/metadata/gtzan.csv', 'task_data': 'work/16k/gtzan', 'unit_samples': 480000, 'id': 'gtzan_ar_byola.AR_BYOLA_6bd7e19e', 'runtime_cfg': {'lr': 0.0003, 'seed': 44, 'hidden': [], 'standard_scaler': True, 'mixup': False, 'epochs': 200, 'early_stop_epochs': 20, 'id': 'fd0d06e8'}}
logs/gtzan_ar_byola.AR_BYOLA_6bd7e19e/gtzan-ar-byola.BYOLA-LE_39f1b473_0.75862.csv"
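
Because scores accumulate in results/scores.csv across runs, it is handy to summarize them with pandas. A minimal sketch, assuming the file has no header row and the columns shown in the example above (model, task, score, run id, and report):

# Minimal sketch; the column names here are assumptions based on the scores.csv example above.
import pandas as pd

scores = pd.read_csv('results/scores.csv', header=None,
                     names=['model', 'task', 'score', 'run_id', 'report'])
print(scores.groupby(['model', 'task'])['score'].mean())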

4. Fine-tuning

The fine-tuning command line is analogous to that of the linear evaluation; use the script finetune.py as demonstrated in the following example:

$ python finetune.py config/byola.yaml as20k --lr=1.0 --freq_mask 30 --time_mask 100 --mixup 0.3 --rrc True

The following parameters are configurable within the .yaml file:
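
As a rough sketch, the fine-tuning related keys look like the following; the names and values are taken from the runtime config dump shown in the example below, and the actual keys and defaults vary per config file. Options on the finetune.py command line (e.g., --lr, --freq_mask, --time_mask, --mixup, and --rrc in the example above) set or override these at run time.

# Illustrative excerpt only (not a verbatim config file).
ft_bs: 256                 # fine-tuning batch size
ft_lr: 0.001               # fine-tuning learning rate
ft_epochs: 200
ft_early_stop_epochs: -1
ft_freq_mask: 30           # SpecAugment frequency masking
ft_time_mask: 100          # SpecAugment time masking
ft_rrc: True               # RandomResizeCrop augmentation
mixup: 0.3
warmup_epochs: 5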

4-1. Fine-tuning example

The following is an example of evaluating BYOL-A on AudioSet20K.

/lab/eval$ python finetune.py config/byola.yaml as20k --lr=1.0 --freq_mask 30 --time_mask 100 --mixup 0.3 --rrc True
+task_metadata=evar/metadata/as20k.csv,+task_data=work/16k/as,+unit_samples=160000
Logging to logs/as20k_ar_byola.AR_BYOLA_bd42a61e/log.txt
  :
πŸš€ Start fine-tuning  with logging in logs/as20k_ar_byola.AR_BYOLA_bd42a61e
  :
 ** Fine-tuning using Evaluation set result as test result **
 using mixup with alpha=0.3
 using SpecAugmentation with 30, 100.
 using RandomResizeCrop(virtual_crop_size=(1.0, 1.5), time_scale=(0.6, 1.5), freq_scale=(0.6, 1.5))
Epoch [0] iter: 0/86, elapsed: 4.085s, lr: 0.00000000 loss: 0.71351832
Epoch [0] iter: 10/86, elapsed: 4.724s, lr: 0.02325581 loss: 0.71286535
Epoch [0] iter: 20/86, elapsed: 4.377s, lr: 0.04651163 loss: 0.70928347
Epoch [0] iter: 30/86, elapsed: 4.481s, lr: 0.06976744 loss: 0.70343441
Epoch [0] iter: 40/86, elapsed: 4.372s, lr: 0.09302326 loss: 0.70040292
Epoch [0] iter: 50/86, elapsed: 4.412s, lr: 0.11627907 loss: 0.69242024
Epoch [0] iter: 60/86, elapsed: 4.175s, lr: 0.13953488 loss: 0.68464863
Epoch [0] iter: 70/86, elapsed: 4.103s, lr: 0.16279070 loss: 0.67849201
Epoch [0] iter: 80/86, elapsed: 3.967s, lr: 0.18604651 loss: 0.66996628
validating
Saved weight as logs/as20k_ar_byola.AR_BYOLA_bd42a61e/weights_ep0it85-0.00786_loss0.6650.pth
as20k_ar_byola.AR_BYOLA_bd42a61e-lr1.0mu3fm30tm100tx5R | epoch/iter 0/85: val mAP: 0.00786, loss: 0.66500, best: 0.00786@0
Epoch [1] iter: 0/86, elapsed: 37.298s, lr: 0.20000000 loss: 0.66475827
Epoch [1] iter: 10/86, elapsed: 5.657s, lr: 0.22325581 loss: 0.65429634
  :
Epoch [199] iter: 70/86, elapsed: 4.784s, lr: 0.00000224 loss: 0.02135683
Epoch [199] iter: 80/86, elapsed: 4.399s, lr: 0.00000040 loss: 0.02403579
validating
as20k_ar_byola.AR_BYOLA_bd42a61e-lr1.0mu3fm30tm100tx5R | epoch/iter 199/85: val mAP: 0.22109, loss: 0.02174, best: 0.22579@159
Best mAP: 0.22579
Finetuning as20k_ar_byola.AR_BYOLA_bd42a61e-lr1.0mu3fm30tm100tx5R on as20k -> mean score: 0.22579, best weight: logs/as20k_ar_byola.AR_BYOLA_bd42a61e/weights_ep159it85-0.22579_loss0.0214.pth, score file: logs/as20k_ar_byola.AR_BYOLA_bd42a61e/as20k_ar-byola.BYOLA-FT_bd42a61e_0.22579.csv, config: {'audio_repr': 'ar_byola.AR_BYOLA', 'weight_file': 'external/byol_a/pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth', 'feature_d': 2048, 'sample_rate': 16000, 'n_fft': 1024, 'window_size': 1024, 'hop_size': 160, 'n_mels': 64, 'f_min': 60, 'f_max': 7800, 'temporal_pooling_type': 'mean_max', 'batch_size': 256, 'lr_lineareval': 0.0003, 'report_per_epochs': 20, 'early_stop_epochs': 20, 'warmup_epochs': 5, 'mixup': 0.3, 'ft_bs': 256, 'ft_lr': 0.001, 'ft_early_stop_epochs': -1, 'ft_epochs': 200, 'ft_freq_mask': 30, 'ft_time_mask': 100, 'ft_rrc': True, 'task_metadata': 'evar/metadata/as20k.csv', 'task_data': 'work/16k/as', 'unit_samples': 160000, 'id': 'as20k_ar_byola.AR_BYOLA_bd42a61e', 'training_mask': 0.5, 'optim': 'sgd', 'unit_sec': None, 'runtime_cfg': {'lr': 1.0, 'seed': 42, 'hidden': [], 'mixup': 0.3, 'bs': 256, 'freq_mask': 30, 'time_mask': 100, 'rrc': True, 'epochs': 200, 'early_stop_epochs': -1, 'n_class': 527, 'id': '1f5f3070'}}

The fine-tuning results will be stored in results/ft-scores.csv.

5. Zero-shot

You can run zero-shot (ZS) classification evaluation using the evaluator script zeroshot.py.

Prepare data for ZS

ZS uses the original, intact task data to ensure the best performance, so you need to prepare data specifically for ZS. Please see Zero-shot evaluation data.

Be sure to download the AudioSet class label definition if you evaluate models on AudioSet.

wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
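
class_labels_indices.csv maps the 527 AudioSet classes to machine IDs and human-readable names. A minimal sketch of inspecting it with pandas:

# Minimal sketch: the standard AudioSet label file has index, mid, and display_name columns.
import pandas as pd

labels = pd.read_csv('class_labels_indices.csv')
print(labels[['index', 'mid', 'display_name']].head())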

NOTE about Captions:

ZS requires converting a label into caption text; this is implemented in the class_to_caption function in zeroshot.py. You can edit the conversion rule in this function for your purposes.
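
As a rough sketch of the kind of conversion class_to_caption performs (the actual rule in zeroshot.py may differ), a label such as 'brushing_teeth' becomes a caption like the ones shown in the ESC-50 example below:

# Hypothetical sketch only; see class_to_caption in zeroshot.py for the real rule.
def class_to_caption_sketch(label: str) -> str:
    return label.replace('_', ' ') + ' can be heard'

print(class_to_caption_sketch('brushing_teeth'))  # -> 'brushing teeth can be heard'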

5-1. ZS example

The ESC-50 example follows:

$ python zeroshot.py config/wavcaps.yaml esc50

+task_metadata=evar/metadata/esc50.csv,+task_data=work/original/ESC-50-master,+unit_samples=160000
Logging to logs/esc50_ar_wavcaps.AR_WavCaps_be6742a7/log.txt
{'audio_repr': 'ar_wavcaps.AR_WavCaps', 'weight_file': 'external/WavCaps/HTSAT-BERT-PT.pt', 'feature_d': 768, 'sample_rate': 32000, 'n_fft': 1024, 'window_size': 1024, 'hop_size': 320, 'n_mels': 64, 'f_min': 50, 'f_max': 14000, 'window': 'hanning', 'training_mask': 0.0, 'flat_f
eatures': False, 'batch_size': 128, 'lr_lineareval': 0.0003, 'report_per_epochs': 50, 'early_stop_epochs': 20, 'warmup_epochs': 5, 'mixup': 0.5, 'ft_bs': 128, 'ft_lr': 2.0, 'ft_early_stop_epochs': -1, 'ft_epochs': 200, 'ft_freq_mask': 8, 'ft_time_mask': 64, 'ft_noise': 0.0, 'ft
_rrc': True, 'name': '', 'task_metadata': 'evar/metadata/esc50.csv', 'task_data': 'work/32k/esc50', 'unit_samples': 160000, 'id': 'esc50_ar_wavcaps.AR_WavCaps_d7371b11', 'task_name': 'esc50', 'return_filename': False, 'runtime_cfg': {'id': '468067f3'}}
Train:1600, valid:0, test:400, multi label:False  

Captions: ['airplane can be heard', 'breathing can be heard', 'brushing teeth can be heard'] ...
Getting esc50_ar_wavcaps.AR_WavCaps_d7371b11 test embeddings...      
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:04<00:00,  1.07s/it]
Train:1600, valid:0, test:400, multi label:False   
Getting esc50_ar_wavcaps.AR_WavCaps_d7371b11 test embeddings...
  :
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:02<00:00,  1.51it/s]
esc50 result: 0.9485
Zero-shot evaluation: esc50_ar_wavcaps.AR_WavCaps_221affa2 zs_esc50 -> 0.94850
{'audio_repr': 'ar_wavcaps.AR_WavCaps', 'weight_file': 'external/WavCaps/HTSAT-BERT-PT.pt', 'feature_d': 768, 'sample_rate': 32000, 'n_fft': 1024, 'window_size': 1024, 'hop_size': 320, 'n_mels': 64, 'f_min': 50, 'f_max': 14000, 'window': 'hanning', 'batch_size': 128, 'lr_lineareval': 0.0003, 'report_per_epochs': 50, 'early_stop_epochs': 20, 'warmup_epochs': 5, 'mixup': 0.5, 'ft_bs': 128, 'ft_lr': 2.0, 'ft_early_stop_epochs': -1, 'ft_epochs': 200, 'ft_freq_mask': 8, 'ft_time_mask': 64, 'ft_noise': 0.0, 'ft_rrc': True, 'name': '', 'task_metadata': 'evar/metadata/esc50.csv', 'task_data': 'work/original/ESC-50-master', 'unit_samples': 160000, 'id': 'esc50_ar_wavcaps.AR_WavCaps_be6742a7', 'task_name': 'esc50', 'return_filename': False, 'mean': None, 'std': None, 'runtime_cfg': {'id': '468067f3'}}
 -> results/scores.csv

6. Audio-to-text and text-to-audio retrieval (ATR)

ATR is available with retr_a2t_t2a.py.

Refer to the data setup instructions.

NOTE: Our current implementation supports evaluation only (for testing CLAP models).

Tasks are AudioCaps (audiocaps, ja_audiocaps) and Clotho (clotho); see the examples below.

6-1. ATR example

WavCaps examples:

python retr_a2t_t2a.py config/wavcaps.yaml clotho
python retr_a2t_t2a.py config/wavcaps.yaml audiocaps
python retr_a2t_t2a.py config/wavcaps.yaml ja_audiocaps

The following is the WavCaps example evaluated on AudioCaps. We confirmed results close to those reported in the paper.

$ python retr_a2t_t2a.py config/wavcaps.yaml audiocaps
+task_metadata=evar/metadata/audiocaps.csv,+task_data=work/original/audiocaps,+unit_samples=320000

Logging to logs/WavCaps-HTSAT-BERT-PT_audiocaps_0afe26da/log.txt
{'audio_repr': 'ar_wavcaps.AR_WavCaps', 'weight_file': 'external/WavCaps/HTSAT-BERT-PT.pt', 'feature_d': 768, 'sample_rate': 32000, 'n_fft': 1024, 'window_size': 1024, 'hop_size': 320, 'n_mels': 64, 'f_min': 50, 'f_max': 14000, 'window': 'hanning', 'batch_size': 128, 'lr_lineareval': 0.0003, 'report_per_epochs': 50, 'early_stop_epochs': 20, 'warmup_epochs': 5, 'mixup': 0.5, 'ft_bs': 128, 'ft_lr': 2.0, 'ft_early_stop_epochs': -1, 'ft_epochs': 200, 'ft_freq_mask': 8, 'ft_time_mask': 64, 'ft_noise': 0.0, 'ft_rrc': True, 'name': '', 'task_metadata': 'evar/metadata/audiocaps.csv', 'task_data': 'work/original/audiocaps', 'unit_samples': 320000, 'id': 'WavCaps-HTSAT-BERT-PT_audiocaps_0afe26da', 'task_name': 'audiocaps', 'return_filename': False, 'mean': None, 'std': None, 'runtime_cfg': {'id': '468067f3'}}
AR_WavCaps(
  (backbone): ASE(
    (audio_encoder): AudioEncoder(
      (audio_enc): HTSAT_Swin_Transformer(
        (audio_feats_extractor): AudioFeature(
  :
)
Getting WavCaps-HTSAT-BERT-PT_audiocaps_0afe26da embeddings for 957 samples from test split ...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 957/957 [00:16<00:00, 58.52it/s]
Embedding dimensions = audio:torch.Size([4785, 1024]), caption:torch.Size([4785, 1024])
test: Caption to audio: r1: 50.99, r5: 82.24, r10: 88.82, r50: 98.75, medr: 1.00, meanr: 5.48, mAP10: 36.491
test: Audio to caption: r1: 37.43, r5: 72.12, r10: 84.74, r50: 97.66, medr: 2.00, meanr: 7.32, mAP10: 52.044
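
The r1/r5/r10 figures are recall@k over the similarity ranking between caption and audio embeddings (medr/meanr are the median and mean rank of the correct item, and mAP10 is mean average precision over the top 10 results). A conceptual sketch of recall@k, not the retr_a2t_t2a.py implementation, assuming L2-normalized embeddings and one ground-truth audio index per caption:

# Conceptual sketch: caption_emb and audio_emb are L2-normalized torch tensors of
# shape (num_captions, D) and (num_audios, D); gt_audio_index[i] is the index of
# the audio clip that caption i describes.
import torch

def recall_at_k(caption_emb, audio_emb, gt_audio_index, k=1):
    sim = caption_emb @ audio_emb.T                  # cosine similarity matrix
    topk = sim.topk(k, dim=1).indices                # top-k audio indices per caption
    hits = (topk == gt_audio_index[:, None]).any(dim=1)
    return 100.0 * hits.float().mean().item()        # reported as a percentage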

7. Other information

7-1. Supported datasets

The following datasets are supported, with their short names and subdomains in parentheses:

  1. AudioSet20K (as20k, SER)
  2. AudioSet (as, SER) * experimental
  3. ESC-50 (esc50, SER)
  4. US8K (us8k, SER)
  5. FSD50K (fsd50k, SER)
  6. SPCV1/V2 (spcv1 or spcv2, NOSS)
  7. VoxForge (voxforge, NOSS)
  8. VoxCeleb1 (vc1, NOSS)
  9. CREMA-D (cremad, NOSS)
  10. GTZAN (gtzan, Music)
  11. NSynth instrument family (nsynth, Music)
  12. Pitch Audio Dataset (Surge synthesizer) (surge, Music)
  13. (ML-)AudioCaps (ATR)
  14. Clotho (ATR)

7-2. Supported pre-trained models

The following pre-trained models are supported:

License

See LICENSE for details.

Acknowledgements / References