Home

Awesome

Open-Unmix for NNabla

Build Status

This repository contains the NNabla implementation of Open-Unmix, a deep neural network reference implementation for music source separation, applicable for researchers, audio engineers and artists. Open-Unmix provides ready-to-use models that allow users to separate pop music into four stems: vocals, drums, bass and the remaining other instruments.

⚠️ Please note that this implementation was not tested to yield the same results as the pytorch version. For that reason, we do not provide an inference method in the nnabla version for now. If you want to use umx or umx-hq in your work, use the pytorch version for now. See the current implementation status here

Related Projects: open-unmix-pytorch | open-unmix-nnabla | musdb | museval | norbert

The Model

Open-Unmix is based on a three-layer bidirectional deep LSTM. The model learns to predict the magnitude spectrogram of a target, like vocals, from the magnitude spectrogram of a mixture input. Internally, the prediction is obtained by applying a mask on the input. The model is optimized in the magnitude domain using mean squared error and the actual separation is done in a post-processing step involving a multichannel wiener filter implemented using norbert. To perform separation into multiple sources, multiple models are trained for each particular target. While this makes the training less comfortable, it allows great flexibility to customize the training data for each target source.

Input Stage

Open-Unmix operates in the time-frequency domain to perform its prediction. The input of the model is either:

In that case, the model computes spectrograms with nnabla STFT on the fly.

In that case, the input is of shape (nb_frames, nb_samples, nb_channels, nb_bins), where nb_frames and nb_bins are the time and frequency-dimensions of a Short-Time-Fourier-Transform.

The input spectrogram is standardized using the global mean and standard deviation for every frequency bin across all frames. Furthermore, we apply batch normalization in multiple stages of the model to make the training more robust against gain variation.

Dimensionality reduction

The LSTM is not operating on the original input spectrogram resolution. Instead, in the first step after the normalization, the network learns to compresses the frequency and channel axis of the model to reduce redundancy and make the model converge faster.

Bidirectional-LSTM

The core of open-unmix is a three layer bidirectional LSTM network. Due to its recurrent nature, the model can be trained and evaluated on arbitrary length of audio signals. Since the model takes information from past and future simultaneously, the model cannot be used in an online/real-time manner. An uni-directional model can easily be trained as described here.

Output Stage

After applying the LSTM, the signal is decoded back to its original input dimensionality. In the last steps the output is multiplied with the input magnitude spectrogram, so that the models is asked to learn a mask.

Separation

Inference is not supported right now and will be added later.

Getting started

Installation

For installation we recommend to use the Anaconda python distribution. To create a conda environment for open-unmix, simply run:

conda env create -f environment-X.yml where X is either [cpu, gpu], depending on your system.

Design Choices / Contributions

Authors

Fabian-Robert Stöter, Antoine Liutkus, Inria and LIRMM, Montpellier, France

References

<details><summary>If you use open-unmix for your research – Cite Open-Unmix</summary>
@article{stoter19,  
  author={F.-R. St\\"oter and S. Uhlich and A. Liutkus and Y. Mitsufuji},  
  title={Open-unmix: a reference implementation for source separation},  
  journal={Journal of Open Source Software},  
  year=2019,  
  note={submitted}}
}
</p> </details> <details><summary>If you use the MUSDB dataset for your research - Cite the MUSDB18 Dataset</summary> <p>
@misc{MUSDB18,
  author       = {Rafii, Zafar and
                  Liutkus, Antoine and
                  Fabian-Robert St{\"o}ter and
                  Mimilakis, Stylianos Ioannis and
                  Bittner, Rachel},
  title        = {The {MUSDB18} corpus for music separation},
  month        = dec,
  year         = 2017,
  doi          = {10.5281/zenodo.1117372},
  url          = {https://doi.org/10.5281/zenodo.1117372}
}
</p> </details> <details><summary>If compare your results with SiSEC 2018 Participants - Cite the SiSEC 2018 LVA/ICA Paper</summary> <p>
@inproceedings{SiSEC18,
  author="St{\"o}ter, Fabian-Robert and Liutkus, Antoine and Ito, Nobutaka",
  title="The 2018 Signal Separation Evaluation Campaign",
  booktitle="Latent Variable Analysis and Signal Separation:
  14th International Conference, LVA/ICA 2018, Surrey, UK",
  year="2018",
  pages="293--305"
}
</p> </details>

⚠️ Please note that the official acronym for open-unmix is UMX.

License

MIT

Acknowledgements

<p align="center"> <img src="https://raw.githubusercontent.com/sigsep/website/master/content/open-unmix/logo_INRIA.svg?sanitize=true" width="200" title="inria"> <img src="https://raw.githubusercontent.com/sigsep/website/master/content/open-unmix/anr.jpg" width="100" alt="anr"> </p>