# Learning the Beauty in Songs: Neural Singing Voice Beautifier
<div align="center"> <a href="https://neuralsvb.github.io" target="_blank">Demo Page</a> </div>
This repository is the official PyTorch implementation of our ACL-2022 paper.
## 0. Dataset (PopBuTFy) Acquisition

### Audio samples
- You can download the dataset from here. Please send us an email for registration (see apply_form).
- Dataset preview.
### Text labels

NeuralSVB does not take text as input, but the ASR model used to extract PPGs does. Thus, we also provide the text labels of PopBuTFy.
<!-- We recommend mixing [LibriTTS](https://www.openslr.org/60/) with PopBuTFy to train the ASR model. -->

## 1. Preparation

### Environment Preparation
Most of the required packages are listed in https://github.com/NATSpeech/NATSpeech/blob/main/requirements.txt.
Alternatively, you can prepare the environment with the `Requirements.txt` file in the repository directory:

```bash
pip install -r Requirements.txt
```
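If you prefer an isolated environment, here is a minimal sketch (the environment name and Python version below are our assumptions, not pinned by this repo):

```bash
# Hypothetical setup: the environment name "nsvb" and Python 3.8 are assumptions.
conda create -n nsvb python=3.8 -y
conda activate nsvb
pip install -r Requirements.txt
```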
### Data Preparation

- Extract embeddings of vocal timbre:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/save_emb.yaml
  ```

- Pack the dataset:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/para_bin.yaml
  ```
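After both steps, a quick sanity check (a sketch; it assumes the default binary output location shown in the directory tree below):

```bash
# The packed dataset should appear under data/binary/ (see the directory tree below).
ls data/binary/PopBuTFyENSpkEM
```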
### Vocoder Preparation

We provide a pre-trained HifiGAN-Singing model, which is specially designed for SVS with the NSF mechanism.
Please unzip the pre-trained vocoder into the `checkpoints` directory before training your acoustic model.
This singing vocoder is trained on 100+ hours of singing data (including Chinese and English songs).
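For reference, a hedged example of unzipping into `checkpoints` (the archive name below is an assumption; use the actual name of your download):

```bash
# The archive name is hypothetical; adjust it to your download.
unzip 1012_hifigan_all_songs_nsf.zip -d checkpoints/
```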
### PPG Extractor Preparation

We provide a pre-trained PPG extractor.
Please unzip it into the `checkpoints` directory (in the same way as the vocoder above) before training your acoustic model.
After following the instructions above, the directory structure should be as follows:
```
.
|--data
    |--processed
        |--PopBuTFy (unzip PopBuTFy.zip)
            |--data
                |--directories containing wavs
    |--binary
        |--PopBuTFyENSpkEM
|--checkpoints
    |--1009_pretrain_asr_english
        |--
        |--config.yaml
    |--1012_hifigan_all_songs_nsf
        |--
        |--config.yaml
```
## 2. Training Example

```bash
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset
```
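Training progress can be monitored with TensorBoard. This is an assumption on our side (repos built on PyTorch Lightning typically log under the experiment's checkpoint directory); adjust the path if yours differs:

```bash
# Assumes logs are written under checkpoints/<exp_name>/ by the training task.
tensorboard --logdir checkpoints/exp_name
```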
## 3. Inference

### Inference from packed test set

```bash
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset --infer
```

Inference results will be saved in `./checkpoints/EXP_NAME/generated_` by default.
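To inspect the outputs, a simple sketch (the output folder name is truncated above, so we glob; `EXP_NAME` is a placeholder):

```bash
# EXP_NAME is a placeholder for the --exp_name used at training time.
ls checkpoints/EXP_NAME/generated_*
```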
We provide:
- the pre-trained model of NSVB (English version).

Remember to put the pre-trained models in the `checkpoints` directory.
### Inference from raw inputs
WIP.
## Limitations
See Appendix D "Limitations and Solutions" in our paper.
## Citation
If this repository helps your research, please cite:
```bib
@inproceedings{liu-etal-2022-learning-beauty,
    title = "Learning the Beauty in Songs: Neural Singing Voice Beautifier",
    author = "Liu, Jinglin and
      Li, Chengxi and
      Ren, Yi and
      Zhu, Zhiying and
      Zhao, Zhou",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.549",
    pages = "7970--7983",
}
```
## Issues

- Before raising an issue, please check our README and existing issues for possible solutions.
- We will try to address your problem promptly, but we cannot guarantee a satisfying solution.
- Please be friendly.
## Acknowledgements
- r9y9's wavenet_vocoder
- Po-Hsun-Su's ssim
- descriptinc's melgan
- Official espnet
- Official PyTorch Lightning
The framework of this repository is based on DiffSinger, and is a predecessor of NATSpeech.