Kaggle Freesound Audio Tagging 2019 Competition

spectrogram of https://freesound.org/people/envirOmaniac2/sounds/376452

This is Eric BOUTEILLON's proposed solution for the Kaggle Freesound Audio Tagging 2019 competition and DCASE 2019 Task 2.

Table of Contents

Indicators :+1: were added to sections containing major contributions from the author.

Motivation of this repository

This repository presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging, named SpecMix by the author.

These new techniques were applied to our audio tagging system submitted to the Kaggle Freesound Audio Tagging 2019 challenge, carried out within the DCASE 2019 Task 2 challenge [3]. The purpose of this challenge is to predict the audio labels for every test clip using machine learning techniques trained on a small amount of reliable, manually-labeled data and a larger quantity of noisy web audio data, in a multi-label audio tagging task with a large vocabulary setting.

TL;DR - give me code!

The provided Jupyter notebooks result in a lwlrap of 0.738 on the public leaderboard, that is to say 12th position in this competition.

You can also find the resulting weights of the CNN-model-1 and VGG-16 training in a public Kaggle dataset. Note that I am no longer using git-lfs to store weights due to quota issues.

Installation

This competition required inference to be performed in a Kaggle kernel without any change to its configuration. It was therefore important to use the same versions of pytorch and fastai as the Kaggle kernel configuration during the competition, in order to be able to load locally generated CNN weights: pytorch 1.0.1 and fastai 1.0.51.

Installation method 1 - Identical to author

To get the same configuration as my local system, here are the steps, tested on GNU/Linux Ubuntu 18.04.2 LTS:

  1. Clone this repository:

     git clone https://github.com/ebouteillon/freesound-audio-tagging-2019.git

  2. Install anaconda3.

  3. Type in a Linux terminal:

     conda create --name freesound --file spec-file.txt

You are ready to go!

Note: My configuration has CUDA 10 installed, so you may have to adapt the versions of pytorch and cudatoolkit in spec-file.txt to your own configuration.

Installation method 2 - Use conda recommended packages

This method does not guarantee the exact same configuration as the author's, as newer packages may be installed by conda.

  1. Clone this repository:

     git clone https://github.com/ebouteillon/freesound-audio-tagging-2019.git

  2. Install anaconda3.

  3. Type in a Linux terminal:

     conda update conda
     conda create -n freesound python=3.7 anaconda
     conda activate freesound
     conda install numpy pandas scipy scikit-learn matplotlib tqdm seaborn pytorch==1.0.1 torchvision cudatoolkit=10.0 fastai==1.0.51 -c pytorch -c fastai
     conda uninstall --force jpeg libtiff -y
     conda install -c conda-forge libjpeg-turbo
     CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd
     conda install -c conda-forge librosa

Notes:

Hardware / Software

During the competition I used the following:

Reproduce results

  1. Download the dataset from Kaggle

  2. (optional) Download my weights dataset from Kaggle

  3. Unpack the datasets into the input folder so your environment looks like:

├── code
│   ├── inference-kernel.ipynb
│   ├── training-cnn-model1.ipynb
│   └── training-vgg16.ipynb
├── images
│   ├── all_augmentations.png
│   └── model-explained.png
├── input
│   ├── test
│   │   └── ...
│   ├── train_curated
│   │   └── ...
│   ├── train_noisy
│   │   └── ...
│   ├── sample_submission.csv
│   ├── train_curated.csv
│   ├── train_noisy.csv
│   └── keep.txt
├── LICENSE
├── README.md
├── requirements.txt
├── spec-file.txt
└── weights
    ├── cnn-model-1
    │   └── work
    │       ├── models
    │       │   └── keep.txt
    │       ├── stage-10_fold-0.pkl
    │       ├── ...
    │       └── stage-2_fold-9.pkl
    └── vgg16
        └── work
            ├── models
            │   └── keep.txt
            ├── stage-10_fold-0.pkl
            ├── ...
            └── stage-2_fold-9.pkl
  4. Type in a command line:

     conda activate freesound
     jupyter notebook

Your web browser should open; then select the notebook you want to execute. Recommended order:

Enjoy!

Notes:

Solution Description

Audio Data Preprocessing

Audio clips were first trimmed of leading and trailing silence (threshold of 60 dB), then converted into 128-band mel-spectrograms using a 44.1 kHz sampling rate, a hop length of 347 samples between successive frames, 2560 FFT components and frequencies kept in the range 20 Hz – 22,050 Hz. The last preprocessing step consisted in normalizing (mean=0, variance=1) the resulting images and duplicating them to 3 channels.
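Below is a minimal sketch of this preprocessing using librosa. The conversion to a dB scale before normalization is an assumption (the exact scaling is not stated above), and the clip path in the usage comment is a placeholder.

```python
# Hypothetical preprocessing sketch matching the description above.
import numpy as np
import librosa

def clip_to_melspectrogram(path, sr=44100, n_mels=128, hop_length=347,
                           n_fft=2560, fmin=20, fmax=22050, top_db=60):
    y, _ = librosa.load(path, sr=sr)                  # resample to 44.1 kHz
    y, _ = librosa.effects.trim(y, top_db=top_db)     # trim leading/trailing silence
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length, n_fft=n_fft,
                                         fmin=fmin, fmax=fmax)
    mel = librosa.power_to_db(mel)                    # assumption: log/dB scaling
    mel = (mel - mel.mean()) / (mel.std() + 1e-6)     # normalize: mean 0, variance 1
    return np.stack([mel, mel, mel])                  # duplicate to 3 channels

# spec = clip_to_melspectrogram('input/train_curated/<some_clip>.wav')  # placeholder path
```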

Models Summary

In this section, we describe the neural network architectures used:

Version 1 consists of an ensemble of a custom CNN, "CNN-model-1", defined in Table 1, and a VGG-16 with batch normalization. Both are trained in the same manner.

Version 2 consists of only our custom CNN "CNN-model-1", defined in Table 1.

Version 3 is evaluated for the Judge award and is the same model as version 2.

| CNN-model-1 |
|---|
| Input 128 × 128 × 3 |
| 3 × 3 Conv(stride=1, pad=1)−64−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−64−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−128−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−128−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−256−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−256−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−512−BN−ReLU |
| 3 × 3 Conv(stride=1, pad=1)−512−BN−ReLU |
| concat(AdaptiveAvgPool2d + AdaptiveMaxPool2d) |
| Flatten−1024−BN−Dropout 25% |
| Dense−512−ReLU−BN−Dropout 50% |
| Dense−80 |

Table 1: CNN-model-1. BN: Batch Normalisation, ReLU: Rectified Linear Unit.
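For illustration, here is a minimal PyTorch sketch following Table 1 literally. The table does not show any pooling between convolutional blocks, so none is added here, and the actual fastai head used by the author may differ slightly.

```python
# Hypothetical sketch of CNN-model-1 as listed in Table 1.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 Conv(stride=1, pad=1) - BN - ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class CNNModel1(nn.Module):
    def __init__(self, n_classes=80):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64), conv_block(64, 64),
            conv_block(64, 128), conv_block(128, 128),
            conv_block(128, 256), conv_block(256, 256),
            conv_block(256, 512), conv_block(512, 512),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.head = nn.Sequential(
            nn.BatchNorm1d(1024), nn.Dropout(0.25),       # Flatten-1024-BN-Dropout 25%
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.BatchNorm1d(512), nn.Dropout(0.5),         # Dense-512-ReLU-BN-Dropout 50%
            nn.Linear(512, n_classes),                    # Dense-80
        )

    def forward(self, x):
        x = self.features(x)                                         # (B, 512, H, W)
        x = torch.cat([self.avg_pool(x), self.max_pool(x)], dim=1)   # (B, 1024, 1, 1)
        x = x.view(x.size(0), -1)                                    # flatten
        return self.head(x)                                          # multi-label logits

# logits = CNNModel1()(torch.randn(4, 3, 128, 128))
```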

Data Augmentation

One important technique to leverage a small training set is to augment it using data augmentation. For this purpose we created a new augmentation named SpecMix. This new augmentation is an extension of SpecAugment [1], inspired by mixup [2].

SpecAugment applies 3 transformations to augment a training sample: time warping, frequency masking and time masking on mel-spectrograms.

mixup creates a virtual training example by computing a weighted average of two samples' inputs and targets.

SpecMix :+1:

SpecMix is inspired by the two most effective transformations from SpecAugment and extends them to create virtual multi-label training examples:

  1. Frequency replacement is applied so that f consecutive mel-frequency channels [f0, f0+f) are replaced with the corresponding channels from another training sample, where f is first drawn from a uniform distribution bounded by the frequency mask parameter F, and f0 is chosen from [0, ν−f). ν is the number of mel-frequency channels.
  2. Time replacement is applied so that t consecutive time steps [t0, t0+t) are replaced with the corresponding time steps from another training sample, where t is first drawn from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ−t). τ is the number of time steps.
  3. The target of the new training sample is computed as the weighted average of the original samples' targets. The weight of each original sample is proportional to the number of pixels coming from that sample. Our implementation uses the same replacement sample for frequency replacement and time replacement, which gives a new target computed as:

target = λ · target_A + (1 − λ) · target_B,   with   λ = 1 − (f·τ + t·ν − f·t) / (ν·τ)

where A is the original sample, B is the replacement sample, and f·τ + t·ν − f·t is the number of replaced pixels (the union of the frequency band and the time column).

Figure 1: Comparison of mixup, SpecAugment and SpecMix augmentations
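To make the procedure concrete, here is a minimal sketch of SpecMix on a batch of mel-spectrogram tensors. It is not the author's exact implementation; the default values for F and T are placeholders.

```python
# Hypothetical SpecMix sketch. x: (B, C, n_mels, n_steps) spectrograms,
# y: (B, n_classes) multi-hot targets, F and T: mask parameters.
import torch

def specmix(x, y, F=24, T=32):
    B, _, nu, tau = x.shape                       # nu: mel channels, tau: time steps
    perm = torch.randperm(B)                      # replacement partner for each sample
    x_new = x.clone()

    f = int(torch.randint(0, F + 1, (1,)))        # band height
    f0 = int(torch.randint(0, nu - f + 1, (1,)))
    t = int(torch.randint(0, T + 1, (1,)))        # column width
    t0 = int(torch.randint(0, tau - t + 1, (1,)))

    # frequency and time replacement, taken from the same partner sample
    x_new[:, :, f0:f0 + f, :] = x[perm, :, f0:f0 + f, :]
    x_new[:, :, :, t0:t0 + t] = x[perm, :, :, t0:t0 + t]

    # partner weight = fraction of replaced pixels (union of band and column)
    lam_b = (f * tau + t * nu - f * t) / (nu * tau)
    y_new = (1 - lam_b) * y + lam_b * y[perm]
    return x_new, y_new
```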

Other data augmentations

We added other data augmentation techniques:

Training - warm-up pipeline :+1:

At training time, we feed the network batches of 128 augmented excerpts of randomly selected sample mel-spectrograms. We use a 10-fold cross-validation setup and the fastai library [4].
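As an illustration of the cross-validation setup, here is a minimal sketch using scikit-learn's KFold; the split parameters and the per-fold training code to plug in are placeholders, not the author's exact setup.

```python
# Hypothetical 10-fold split over the curated training set.
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv('input/train_curated.csv')
kf = KFold(n_splits=10, shuffle=True, random_state=42)   # seed is a placeholder

for fold, (train_idx, valid_idx) in enumerate(kf.split(df)):
    train_df, valid_df = df.iloc[train_idx], df.iloc[valid_idx]
    # build the data loaders / fastai DataBunch from train_df and valid_df here,
    # train one model per fold and save it, e.g. as stage-<X>_fold-<fold>.pkl
    print(fold, len(train_df), len(valid_df))
```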

Training is done in 4 stages, each stage generating a model which is used for 3 things:

An important point of this competition is that we were not allowed to use external data or pretrained models. So the pipeline presented below only uses the curated and noisy sets from the competition:

Figure 2: Warm-up pipeline

Inference

For inference, we split the test audio clips into overlapping windows of 128 time samples (2 seconds). These windows are then fed into our models to obtain predictions, and all predictions belonging to the same audio clip are averaged to get the final predictions to submit.
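A minimal sketch of this windowing and averaging is shown below. The window hop (amount of overlap) and the padding of short clips are assumptions, as the exact values are not stated above.

```python
# Hypothetical test-time windowing: split a clip spectrogram (C, n_mels, n_steps)
# into overlapping 128-frame windows and average the model's predictions.
import torch
import torch.nn.functional as F

def predict_clip(model, spec, window=128, hop=64):     # hop is an assumption
    C, n_mels, n_steps = spec.shape
    if n_steps < window:                               # pad clips shorter than one window
        spec = F.pad(spec, (0, window - n_steps))
        n_steps = window
    starts = list(range(0, n_steps - window + 1, hop))
    if starts[-1] != n_steps - window:
        starts.append(n_steps - window)                # make sure the clip tail is covered
    batch = torch.stack([spec[:, :, s:s + window] for s in starts])
    with torch.no_grad():
        probs = torch.sigmoid(model(batch))            # multi-label probabilities per window
    return probs.mean(dim=0)                           # average over windows
```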

This competition had a major constraint for test inference: submissions must be made through a Kaggle kernel with time limits. As our solution requires a GPU, inference on the whole unseen test set had to complete in less than an hour.

In order to meet this hard constraint, we took the following decisions:

Results

To assess the performance of our system, we provide results in Table 2. Performance on the noisy set and the curated set was cross-validated using 10 folds. Evaluation on test set predictions uses the values reported by the public leaderboard. The metric used is lwlrap (label-weighted label-ranking average precision).

| Model    | lwlrap noisy | lwlrap curated | leaderboard |
|----------|--------------|----------------|-------------|
| model1   | 0.65057      | 0.41096        | N/A         |
| model2   | 0.38142      | 0.86222        | 0.723       |
| model3   | 0.56716      | 0.87930        | 0.724       |
| model4   | 0.57590      | 0.87718        | 0.724       |
| ensemble | N/A          | N/A            | 0.733       |

Table 2: Empirical results of CNN-model-1 using proposed warm-up pipeline
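For reference, here is a compact sketch of the lwlrap metric (not the official reference implementation): for each true label we take the precision of the labels ranked at or above it, average these precisions per class, and weight each class by its share of positive labels.

```python
# Compact lwlrap sketch. truth and scores: arrays of shape (n_samples, n_classes).
import numpy as np

def lwlrap(truth, scores):
    n_samples, n_classes = truth.shape
    precisions = np.zeros((n_samples, n_classes))
    for i in range(n_samples):
        pos = np.flatnonzero(truth[i] > 0)
        if len(pos) == 0:
            continue
        ranking = np.argsort(-scores[i])             # classes sorted by score, best first
        rank_of = np.empty(n_classes, dtype=int)
        rank_of[ranking] = np.arange(n_classes)      # 0-based rank of each class
        hit = np.zeros(n_classes, dtype=bool)
        hit[rank_of[pos]] = True                     # ranking positions holding true labels
        cum_hits = np.cumsum(hit)
        precisions[i, pos] = cum_hits[rank_of[pos]] / (rank_of[pos] + 1)
    per_class = precisions.sum(axis=0) / np.maximum(truth.sum(axis=0), 1)
    weights = truth.sum(axis=0) / truth.sum()        # class share of positive labels
    return float((per_class * weights).sum())
```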

Each stage of the warm-up pipeline generates a model with excellent prediction performance on the test set. As one can see in Figure 3, each model on its own would give us a silver medal with the 25th position on the public leaderboard. Moreover, these warm-up models bring sufficient diversity, as a simple averaging of their predictions (lwlrap 0.733) gives 16th position on the public leaderboard.

The author's final 12th position was obtained with version 1, which averages the predictions of CNN-model-1 and VGG-16, both trained in the same way.

Figure 3: Public leaderboard

Conclusion

This git repository presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging, named SpecMix by the author. These techniques leverage both the curated and noisy sets and were shown to give excellent results.

These results are reproducible: the requirements, the steps to reproduce and the source code are available on GitHub. The source code is released under an open source license (MIT).

Acknowledgments

These results were possible thanks to the infinite support of my 5-year-old boy, who said while I was watching the public leaderboard: “Dad, you are the best and you will be at the very top”. ❤️

I also thank the whole Kaggle community for sharing knowledge, ideas and code. In particular, daisuke for his kernels during the competition, mhiro2 for his simple CNN model, and all the competition organizers.

References

[1] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", arXiv:1904.08779, 2019.

[2] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond empirical risk minimization". arXiv preprint arXiv:1710.09412, 2017.

[3] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, and Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Submitted to DCASE2019 Workshop, 2019. URL: https://arxiv.org/abs/1906.02975

[4] Jeremy Howard and others. "fastai", 2018. URL: https://github.com/fastai/fastai