Home

Awesome

WavAugment

WavAugment performs data augmentation on audio data. The audio data is represented as pytorch tensors.

It is particularly useful for speech data. Among others, it implements the augmentations that we found to be most useful for self-supervised learning (Data Augmenting Contrastive Learning of Speech Representations in the Time Domain, E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, E. Dupoux. [arxiv]):

Internally, WavAugment uses libsox and allows interleaving of libsox- and pytorch-based effects.

Requirements

Installation

To install WavAugment, run the following command:

git clone git@github.com:facebookresearch/WavAugment.git && cd WavAugment && python setup.py develop

Testing

Requires pytest (pip install pytest)

python -m pytest -v --doctest-modules

Usage

First of all, we provide thouroughly documented examples, where we demonstrate how a data-augmented dataset interface works. We also provide a Jupyter-based tutorial (open in colab) that illlustrates how one can apply various useful effects to a piece of speech (recorded over the mic or pre-recorded).

The EffectChain

The central object is the chain of effects, EffectChain, that are applied on a torch.Tensor to produce another torch.Tensor. This chain can have multiple effects composed:

import augment
effect_chain = augment.EffectChain().pitch(100).rate(16_000)

Parameters of the effect coincide with those of libsox (http://sox.sourceforge.net/libsox.html); however, you can also randomize the parameters by providing a python Callable and mix them with standard parameters:

import numpy as np
random_pitch_shift = lambda: np.random.randint(-100, +100)
# the pitch will be changed by a shift somewhere between (-100, +100)
effect_chain = augment.EffectChain().pitch("-q", random_pitch_shift).rate(16_000)

Here, the flag-q makes pitch run faster at some expense of the quality. If some parameters are provided by a Callable, this Callable will be invoked every time EffectChain is applied (eg. to generate random parameters).

Applying the chain

To apply a chain of effects on a torch.Tensor, we code the following:

output_tensor = augment.EffectChain().pitch(100).rate(16_000).apply(input_tensor, \
    src_info=src_info, target_info=target_info)

WavAugment expects input_tensor to have a shape of (channels, length). As input_tensor does not contain important meta-information, such as sampling rate, we need to provide it manually. This is done by passing two dictionaries, src_info (meta-information about the input format) and target_info (our expectated format for the output).

At minimum, we need to set the sampling rate for the input tensor: {'rate': 16_000}.

Example usage

Below is a small gist of a potential usage:

import augment
import numpy as np

x, sr = torchaudio.load(test_wav)

# input signal properties
src_info = {'rate': sr}

# output signal properties
target_info = {'channels': 1, 
               'length': 0, # not known beforehand
               'rate': 16_000}
# write down the chain of effects with their string parameters and call .apply()
# effects are specified as a chain of method calls with parameters that can be 
# strings, numbers, or callables. The latter case is used for generating randomized
# transformations
random_pitch = lambda: np.random.randint(-400, -200)
y = augment.EffectChain().pitch(random_pitch).rate(16_000).apply(x, \
    src_info=src_info, target_info=target_info)

Important notes

It often happens that a command-line invocation of sox would change effect chain. To get a better idea of what sox executes internally, you can launch it with a -V flag, eg by running:

sox -V tests/test.wav out.wav reverb 0 50 100

we will see something like:

sox INFO sox: effects chain: input        16000Hz  1 channels
sox INFO sox: effects chain: reverb       16000Hz  2 channels
sox INFO sox: effects chain: channels     16000Hz  1 channels
sox INFO sox: effects chain: dither       16000Hz  1 channels
sox INFO sox: effects chain: output       16000Hz  1 channels

This output tells us that the reverb effect changes the number of channels, which are squashed into 1 channel by the channel effect. Sox also added dither effect to hide processing artifacts.

WavAugment remains explicit and doesn't add effects under the hood. If you want to emulate a Sox command that decomposes into several effects, we advise to consult sox -V and apply the effects manually. Try it out on some files before running a heavy machine-learning job.

Citation

If you find WavAugment useful in your research, please consider citing:

@article{wavaugment2020,
  title={Data Augmenting Contrastive Learning of Speech Representations in the Time Domain},
  author={Kharitonov, Eugene and Rivi{\`e}re, Morgane and Synnaeve, Gabriel and Wolf, Lior and Mazar{\'e}, Pierre-Emmanuel and Douze, Matthijs and Dupoux, Emmanuel},
  journal={arXiv preprint arXiv:2007.00991},
  year={2020}
}

Contributing

See the CONTRIBUTING file for how to help out.

License

WavAugment is MIT licensed, as found in the LICENSE file.