Kaggle Freesound Audio Tagging 2019 Competition

spectrogram of https://freesound.org/people/envirOmaniac2/sounds/376452

This is Eric BOUTEILLON's proposed solution for the Kaggle Freesound Audio Tagging 2019 competition and DCASE 2019 Task 2.

Table of Contents

Indicators :+1: were added to sections containing major contributions from the author.

Motivation of this repository

This repository presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging, named SpecMix by the author.

These new techniques were applied to our audio tagging system submitted to the Kaggle Freesound Audio Tagging 2019 challenge, carried out within the DCASE 2019 Task 2 challenge [3]. The purpose of this challenge is to predict the audio labels for every test clip using machine learning techniques trained on a small amount of reliable, manually-labeled data and a larger quantity of noisy web audio data, in a multi-label audio tagging task with a large vocabulary setting.

TL;DR - give me code!

The provided Jupyter notebooks result in a lwlrap of 0.738 on the public leaderboard, that is to say 12th position in this competition.

You can also find the resulting weights of the CNN-model-1 and VGG-16 training in a public Kaggle dataset. Note that I am no longer using git-lfs to store weights due to quota issues.

Installation

This competition required inference to be performed in a Kaggle kernel without any change to its configuration. It was therefore important to use the same versions of pytorch and fastai as the Kaggle kernel configuration during the competition, in order to be able to load locally generated CNN weights: pytorch 1.0.1 and fastai 1.0.51.

Installation method 1 - Identical to author

To get the same configuration as my local system, here are the steps, tested on GNU/Linux Ubuntu 18.04.2 LTS:

  1. Clone this repository:

     git clone https://github.com/ebouteillon/freesound-audio-tagging-2019.git

  2. Install anaconda3.

  3. Type in a Linux terminal:

     conda create --name freesound --file spec-file.txt

You are ready to go!

Note: My configuration has CUDA 10 installed, so you may have to adapt the versions of pytorch and cudatoolkit in spec-file.txt to your own configuration.

Installation method 2 - Use conda recommended packages

This method does not guarantee the exact same configuration as the author's, as newer packages may be installed by conda.

  1. Clone this repository:

     git clone https://github.com/ebouteillon/freesound-audio-tagging-2019.git

  2. Install anaconda3.

  3. Type in a Linux terminal:

     conda update conda
     conda create -n freesound python=3.7 anaconda
     conda activate freesound
     conda install numpy pandas scipy scikit-learn matplotlib tqdm seaborn pytorch==1.0.1 torchvision cudatoolkit=10.0 fastai==1.0.51 -c pytorch -c fastai
     conda uninstall --force jpeg libtiff -y
     conda install -c conda-forge libjpeg-turbo
     CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd
     conda install -c conda-forge librosa

Notes:

Hardware / Software

During the competition I used the following:

Reproduce results

  1. Download the dataset from Kaggle

  2. (optional) Download my weights dataset from Kaggle

  3. Unpack the datasets into the input folder so your environment looks like:

├── code
│   ├── inference-kernel.ipynb
│   ├── training-cnn-model1.ipynb
│   └── training-vgg16.ipynb
├── images
│   ├── all_augmentations.png
│   └── model-explained.png
├── input
│   ├── test
│   │   └── ...
│   ├── train_curated
│   │   └── ...
│   ├── train_noisy
│   │   └── ...
│   ├── sample_submission.csv
│   ├── train_curated.csv
│   ├── train_noisy.csv
│   └── keep.txt
├── LICENSE
├── README.md
├── requirements.txt
├── spec-file.txt
└── weights
    ├── cnn-model-1
    │   └── work
    │       ├── models
    │       │   └── keep.txt
    │       ├── stage-10_fold-0.pkl
    │       ├── ...
    │       └── stage-2_fold-9.pkl
    └── vgg16
        └── work
            ├── models
            │   └── keep.txt
            ├── stage-10_fold-0.pkl
            ├── ...
            └── stage-2_fold-9.pkl
  4. Type in a command line:

     conda activate freesound
     jupyter notebook

Your web browser should open; then select the notebook you want to execute. Recommended order:

Enjoy!

Notes:

Solution Description

Audio Data Preprocessing

Audio clips were first trimmed of leading and trailing silence (threshold of 60 dB), then converted into 128-band mel-spectrograms using a 44.1 kHz sampling rate, a hop length of 347 samples between successive frames, 2560 FFT components and frequencies kept in the range 20 Hz – 22,050 Hz. The last preprocessing step consisted in normalizing (mean=0, variance=1) the resulting images and duplicating them to 3 channels.
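Below is a minimal sketch of this preprocessing using librosa. The conversion to a dB scale before normalization is an assumption (the exact scaling is not stated above), and the clip path in the usage comment is a placeholder.

```python
# Hypothetical preprocessing sketch matching the description above.
import numpy as np
import librosa

def clip_to_melspectrogram(path, sr=44100, n_mels=128, hop_length=347,
                           n_fft=2560, fmin=20, fmax=22050, top_db=60):
    y, _ = librosa.load(path, sr=sr)                  # resample to 44.1 kHz
    y, _ = librosa.effects.trim(y, top_db=top_db)     # trim leading/trailing silence
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length, n_fft=n_fft,
                                         fmin=fmin, fmax=fmax)
    mel = librosa.power_to_db(mel)                    # assumption: log/dB scaling
    mel = (mel - mel.mean()) / (mel.std() + 1e-6)     # normalize: mean 0, variance 1
    return np.stack([mel, mel, mel])                  # duplicate to 3 channels

# spec = clip_to_melspectrogram('input/train_curated/<some_clip>.wav')  # placeholder path
```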

Models Summary

In this section, we describe the neural network architectures used:

Version 1 consists of an ensemble of a custom CNN, "CNN-model-1", defined in Table 1, and a VGG-16 with batch normalization. Both are trained in the same manner.

Version 2 consists of only our custom CNN "CNN-model-1", defined in Table 1.

Version 3 is evaluated for the Judge award and is the same model as version 2.

| CNN-model-1 |
|---|
| Input 128 × 128 × 3 |
| 3 × 3 Conv(stride=1, pad=1)−64−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−64−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−128−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−128−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−256−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−256−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−512−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−512−BN−ReLU |
| concat(AdaptiveAvgPool2d + AdaptiveMaxPool2d) |
| Flatten−1024−BN−Dropout 25% |
| Dense−512−ReLU−BN−Dropout 50% |
| Dense−80 |

Table 1: CNN-model-1. BN: Batch Normalisation, ReLU: Rectified Linear Unit.
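For illustration, here is a minimal PyTorch sketch following Table 1 literally. The table does not show any pooling between convolutional blocks, so none is added here, and the actual fastai head used by the author may differ slightly.

```python
# Hypothetical sketch of CNN-model-1 as listed in Table 1.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 Conv(stride=1, pad=1) - BN - ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class CNNModel1(nn.Module):
    def __init__(self, n_classes=80):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64), conv_block(64, 64),
            conv_block(64, 128), conv_block(128, 128),
            conv_block(128, 256), conv_block(256, 256),
            conv_block(256, 512), conv_block(512, 512),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.head = nn.Sequential(
            nn.BatchNorm1d(1024), nn.Dropout(0.25),       # Flatten-1024-BN-Dropout 25%
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.BatchNorm1d(512), nn.Dropout(0.5),         # Dense-512-ReLU-BN-Dropout 50%
            nn.Linear(512, n_classes),                    # Dense-80
        )

    def forward(self, x):
        x = self.features(x)                                         # (B, 512, H, W)
        x = torch.cat([self.avg_pool(x), self.max_pool(x)], dim=1)   # (B, 1024, 1, 1)
        x = x.view(x.size(0), -1)                                    # flatten
        return self.head(x)                                          # multi-label logits

# logits = CNNModel1()(torch.randn(4, 3, 128, 128))
```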

Data Augmentation

One important technique to leverage a small training set is to augment it using data augmentation. For this purpose we created a new augmentation named SpecMix. This new augmentation is an extension of SpecAugment [1], inspired by mixup [2].

SpecAugment applies 3 transformations to augment a training sample: time warping, frequency masking and time masking on mel-spectrograms.

mixup creates a virtual training example by computing a weighted average of two samples' inputs and targets.

SpecMix :+1:

SpecMix is inspired by the two most effective transformations from SpecAugment and extends them to create virtual multi-label training examples:

  1. Frequency replacement is applied so that f consecutive mel-frequency channels [f0, f0+f) are replaced with the corresponding channels from another training sample, where f is first drawn from a uniform distribution bounded by the frequency mask parameter F, and f0 is chosen from [0, ν−f). ν is the number of mel-frequency channels.
  2. Time replacement is applied so that t consecutive time steps [t0, t0+t) are replaced with the corresponding time steps from another training sample, where t is first drawn from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ−t). τ is the number of time steps.
  3. The target of the new training sample is computed as the weighted average of the original samples' targets. The weight of each original sample is proportional to the number of pixels coming from that sample. Our implementation uses the same replacement sample for frequency replacement and time replacement, which gives a new target computed as:

target = λ · target_A + (1 − λ) · target_B,   with   λ = 1 − (f·τ + t·ν − f·t) / (ν·τ)

where A is the original sample, B is the replacement sample, and f·τ + t·ν − f·t is the number of replaced pixels (the union of the frequency band and the time column).

Figure 1: Comparison of mixup, SpecAugment and SpecMix augmentations
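To make the procedure concrete, here is a minimal sketch of SpecMix on a batch of mel-spectrogram tensors. It is not the author's exact implementation; the default values for F and T are placeholders.

```python
# Hypothetical SpecMix sketch. x: (B, C, n_mels, n_steps) spectrograms,
# y: (B, n_classes) multi-hot targets, F and T: mask parameters.
import torch

def specmix(x, y, F=24, T=32):
    B, _, nu, tau = x.shape                       # nu: mel channels, tau: time steps
    perm = torch.randperm(B)                      # replacement partner for each sample
    x_new = x.clone()

    f = int(torch.randint(0, F + 1, (1,)))        # band height
    f0 = int(torch.randint(0, nu - f + 1, (1,)))
    t = int(torch.randint(0, T + 1, (1,)))        # column width
    t0 = int(torch.randint(0, tau - t + 1, (1,)))

    # frequency and time replacement, taken from the same partner sample
    x_new[:, :, f0:f0 + f, :] = x[perm, :, f0:f0 + f, :]
    x_new[:, :, :, t0:t0 + t] = x[perm, :, :, t0:t0 + t]

    # partner weight = fraction of replaced pixels (union of band and column)
    lam_b = (f * tau + t * nu - f * t) / (nu * tau)
    y_new = (1 - lam_b) * y + lam_b * y[perm]
    return x_new, y_new
```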

Other data augmentations

We added other data augmentation techniques:

Training - warm-up pipeline :+1:

At training time, we feed the network batches of 128 augmented excerpts of randomly selected sample mel-spectrograms. We use a 10-fold cross-validation setup and the fastai library [4].
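As an illustration of the cross-validation setup, here is a minimal sketch using scikit-learn's KFold; the split parameters and the per-fold training code to plug in are placeholders, not the author's exact setup.

```python
# Hypothetical 10-fold split over the curated training set.
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv('input/train_curated.csv')
kf = KFold(n_splits=10, shuffle=True, random_state=42)   # seed is a placeholder

for fold, (train_idx, valid_idx) in enumerate(kf.split(df)):
    train_df, valid_df = df.iloc[train_idx], df.iloc[valid_idx]
    # build the data loaders / fastai DataBunch from train_df and valid_df here,
    # train one model per fold and save it, e.g. as stage-<X>_fold-<fold>.pkl
    print(fold, len(train_df), len(valid_df))
```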

Training is done in 4 stages, each stage generating a model which is used for 3 things:

An important point of this competition is that we were not allowed to use external data or pretrained models. So the pipeline presented below only uses the curated and noisy sets from the competition:

Figure 2: Warm-up pipeline

Inference

For inference, we split the test audio clips into overlapping windows of 128 time samples (2 seconds). These windows are then fed into our models to obtain predictions, and all predictions belonging to the same audio clip are averaged to get the final predictions to submit.
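A minimal sketch of this windowing and averaging is shown below. The window hop (amount of overlap) and the padding of short clips are assumptions, as the exact values are not stated above.

```python
# Hypothetical test-time windowing: split a clip spectrogram (C, n_mels, n_steps)
# into overlapping 128-frame windows and average the model's predictions.
import torch
import torch.nn.functional as F

def predict_clip(model, spec, window=128, hop=64):     # hop is an assumption
    C, n_mels, n_steps = spec.shape
    if n_steps < window:                               # pad clips shorter than one window
        spec = F.pad(spec, (0, window - n_steps))
        n_steps = window
    starts = list(range(0, n_steps - window + 1, hop))
    if starts[-1] != n_steps - window:
        starts.append(n_steps - window)                # make sure the clip tail is covered
    batch = torch.stack([spec[:, :, s:s + window] for s in starts])
    with torch.no_grad():
        probs = torch.sigmoid(model(batch))            # multi-label probabilities per window
    return probs.mean(dim=0)                           # average over windows
```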

This competition had a major constraint for test inference: submissions must be made through a Kaggle kernel with time limits. As our solution requires a GPU, inference on the whole unseen test set had to complete in less than an hour.

In order to meet this hard constraint, we took the following decisions:

Results

To assess the performance of our system, we provide results in Table 2. Performance on the noisy set and the curated set was cross-validated using 10 folds. Evaluation on test set predictions uses the values reported by the public leaderboard. The metric used is lwlrap (label-weighted label-ranking average precision).

| Model    | lwlrap noisy | lwlrap curated | leaderboard |
|----------|--------------|----------------|-------------|
| model1   | 0.65057      | 0.41096        | N/A         |
| model2   | 0.38142      | 0.86222        | 0.723       |
| model3   | 0.56716      | 0.87930        | 0.724       |
| model4   | 0.57590      | 0.87718        | 0.724       |
| ensemble | N/A          | N/A            | 0.733       |

Table 2: Empirical results of CNN-model-1 using proposed warm-up pipeline
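For reference, here is a compact sketch of the lwlrap metric (not the official reference implementation): for each true label we take the precision of the labels ranked at or above it, average these precisions per class, and weight each class by its share of positive labels.

```python
# Compact lwlrap sketch. truth and scores: arrays of shape (n_samples, n_classes).
import numpy as np

def lwlrap(truth, scores):
    n_samples, n_classes = truth.shape
    precisions = np.zeros((n_samples, n_classes))
    for i in range(n_samples):
        pos = np.flatnonzero(truth[i] > 0)
        if len(pos) == 0:
            continue
        ranking = np.argsort(-scores[i])             # classes sorted by score, best first
        rank_of = np.empty(n_classes, dtype=int)
        rank_of[ranking] = np.arange(n_classes)      # 0-based rank of each class
        hit = np.zeros(n_classes, dtype=bool)
        hit[rank_of[pos]] = True                     # ranking positions holding true labels
        cum_hits = np.cumsum(hit)
        precisions[i, pos] = cum_hits[rank_of[pos]] / (rank_of[pos] + 1)
    per_class = precisions.sum(axis=0) / np.maximum(truth.sum(axis=0), 1)
    weights = truth.sum(axis=0) / truth.sum()        # class share of positive labels
    return float((per_class * weights).sum())
```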

Each stage of the warm-up pipeline generates a model with excellent prediction performance on the test set. As one can see in Figure 3, each model on its own would give us a silver medal with the 25th position on the public leaderboard. Moreover, these warm-up models bring sufficient diversity, as a simple averaging of their predictions (lwlrap 0.733) gives 16th position on the public leaderboard.

The author's final 12th position was obtained with version 1, which averages the predictions of CNN-model-1 and VGG-16, both trained in the same way.

Figure 3: Public leaderboard

Conclusion

This git repository presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging, named SpecMix by the author. These techniques leverage both the curated and noisy sets and were shown to give excellent results.

These results are reproducible: the requirements, the steps to reproduce and the source code are available on GitHub. The source code is released under an open source license (MIT).

Acknowledgments

These results were possible thanks to the infinite support of my 5-year-old boy, who said while I was watching the public leaderboard: “Dad, you are the best and you will be at the very top”. ❤️

I also thank the whole Kaggle community for sharing knowledge, ideas and code. In particular, daisuke for his kernels during the competition, mhiro2 for his simple CNN model, and all the competition organizers.

References

[1] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", arXiv:1904.08779, 2019.

[2] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond empirical risk minimization". arXiv preprint arXiv:1710.09412, 2017.

[3] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, and Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Submitted to DCASE2019 Workshop, 2019. URL: https://arxiv.org/abs/1906.02975

[4] Jeremy Howard and others. "fastai", 2018. URL: https://github.com/fastai/fastai