ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

<a href='https://gongchenghhu.github.io/TASLP-demo/'><img src='https://img.shields.io/badge/Demo-Page-blue'></a> <a href='https://arxiv.org/abs/2312.14398'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>

Introduction

This is the code for ZMM-TTS, submitted to IEEE TASLP. The paper presents ZMM-TTS, a multilingual and multispeaker framework that uses quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper is the first to incorporate representations from both text-based and speech-based self-supervised learning models into multilingual speech synthesis. In comprehensive subjective and objective evaluations, the model proved effective in terms of speech naturalness and speaker similarity for both seen and unseen speakers in six high-resource languages. We also tested the method on two hypothetical low-resource languages. The results are promising: the proposed approach can synthesize audio that is intelligible and highly similar to the target speaker's voice, even without any training data for the new, unseen language.

<br> <p align="center"> <img src="img/overview.png" width="95%"> <br> Overview </p> <br>

You are welcome to try our code and pre-trained models on different languages!

Release

Samples

Samples are provided on our demo page.

Installation

ZMM-TTS requires Python>=3.8 and a reasonably recent version of PyTorch. To install ZMM-TTS and try a quick synthesis, run:

git clone https://github.com/nii-yamagishilab-visitors/ZMM-TTS.git

cd ZMM-TTS
pip3 install -r requirements.txt
# In addition, you may need to install these libraries for full functionality.
pip install transformers  # Supports the XLSR-53 and XPhoneBERT models.
pip install speechbrain   # Used to extract speaker embeddings.

If you want to try IPA representations, you need to install Epitran.
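
The requirements above can be checked before running any script. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it assumes the packages are imported under the names `torch`, `transformers`, `speechbrain`, and `epitran`.

```python
import importlib.util
import sys

# Import names of the packages mentioned above (assumed names);
# "epitran" is only needed for IPA input.
PACKAGES = ("torch", "transformers", "speechbrain", "epitran")


def check_environment(min_python=(3, 8), packages=PACKAGES):
    """Return a dict mapping each requirement to True (satisfied) or False."""
    status = {"python": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement:<14}{'ok' if ok else 'MISSING'}")
```

This only checks that the packages can be found. It does not verify their versions.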

Pre-trained self-supervised model

| Model | Modality | Languages | Training data |
|---|---|---|---|
| XLSR-53 | Audio | 53 | 56K hours |
| ECAPA-TDNN | Audio | > 5 | 2794 hours |
| XPhoneBERT | Text | 94 | 330M sentences |

Usage

Multilingual multispeaker dataset MM6

In our paper, the training data included GlobalPhone, which unfortunately is not open-source. Considering the scarcity of publicly available multilingual and multispeaker databases for speech synthesis, we designed the following training database, called MM6, based on the MLS and NST Swedish databases. (NST Swedish no longer seems to be open for download, in which case you should request the data from The Norwegian Language Bank.) If you have the GlobalPhone dataset, you can use the same training data as our paper, listed in Dataset/train_paper.txt.

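| Language | Gender | Speakers | Sentences | Duration (h) | Database |
|---|---|---|---|---|---|
| English | Female | 20 | 4000 | 13.9 | MLS |
| English | Male | 20 | 4000 | 13.9 | MLS |
| French | Female | 20 | 4000 | 13.9 | MLS |
| French | Male | 20 | 4000 | 13.9 | MLS |
| German | Female | 20 | 4000 | 13.9 | MLS |
| German | Male | 20 | 4000 | 13.9 | MLS |
| Portuguese | Female | 16 | 3741 | 13.0 | MLS |
| Portuguese | Male | 20 | 4175 | 14.5 | MLS |
| Spanish | Female | 20 | 3519 | 12.2 | MLS |
| Spanish | Male | 20 | 3786 | 13.1 | MLS |
| Swedish | Female | 0 | 0 | 0 | |
| Swedish | Male | 20 | 4000 | 13.9 | NST |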
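
To get a feel for the overall size and balance of MM6, the rows of the table above can be summed. This is a small illustrative sketch. The helper names `totals` and `hours_by_language` are hypothetical, and the numbers are copied from the table, not computed from the audio.

```python
# Rows copied from the MM6 table: (language, gender, speakers, sentences, hours).
MM6 = [
    ("English", "Female", 20, 4000, 13.9),
    ("English", "Male", 20, 4000, 13.9),
    ("French", "Female", 20, 4000, 13.9),
    ("French", "Male", 20, 4000, 13.9),
    ("German", "Female", 20, 4000, 13.9),
    ("German", "Male", 20, 4000, 13.9),
    ("Portuguese", "Female", 16, 3741, 13.0),
    ("Portuguese", "Male", 20, 4175, 14.5),
    ("Spanish", "Female", 20, 3519, 12.2),
    ("Spanish", "Male", 20, 3786, 13.1),
    ("Swedish", "Female", 0, 0, 0.0),
    ("Swedish", "Male", 20, 4000, 13.9),
]


def totals(rows):
    """Return (speakers, sentences, hours) summed over all rows."""
    speakers = sum(row[2] for row in rows)
    sentences = sum(row[3] for row in rows)
    hours = round(sum(row[4] for row in rows), 1)
    return speakers, sentences, hours


def hours_by_language(rows):
    """Return the total duration in hours for each language."""
    hours = {}
    for language, _gender, _speakers, _sentences, duration in rows:
        hours[language] = hours.get(language, 0.0) + duration
    return {language: round(total, 1) for language, total in hours.items()}


if __name__ == "__main__":
    print(totals(MM6))
    print(hours_by_language(MM6))
```

For the table as listed, this gives 216 speakers, 43,221 sentences, and about 150.1 hours in total. Every language except Swedish has between 25.3 and 27.8 hours. Swedish has 13.9 hours because it has no female speakers.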

Download and normalize data

You can generate the MM6 dataset with the following download and normalization scripts:

bash scripts/download.sh   # Download the MLS data.
python prepare_data/creat_meta_data_mls.py # Generate speaker-, gender-, and language-balanced data.
# We recommend using sv56 to normalize the MLS audio.
bash scripts/norm_wav.sh

Please contact The Norwegian Language Bank if you want to obtain the NST Swedish data, and extract it to Dataset/origin_data/. Alternatively, you can simply exclude Swedish.

# The Swedish audio is already normalized.
python prepare_data/creat_meta_data_swe.py

MM6 is a multilingual dataset with a largely balanced mix of speakers and genders, and we encourage you to use it for other tasks as well.

Preprocess

After you download and normalize the audio, the Dataset folder should look like this:

|--Dataset
     |--MM6
         |--wavs          #Store audio files
     |--preprocessed_data #Store preprocessed data: text, features,...
         |--MM6
             |--train.txt      
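
Before preprocessing, it can help to confirm that the folder really has this layout. The check below is a minimal sketch under the assumption that only the two paths in the tree above matter. The function name `check_layout` is hypothetical and not part of the repository.

```python
from pathlib import Path


def check_layout(root="Dataset", dataset="MM6"):
    """Return the expected paths that are missing under `root`."""
    root = Path(root)
    expected = [
        root / dataset / "wavs",                             # audio files
        root / "preprocessed_data" / dataset / "train.txt",  # metadata file
    ]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = check_layout()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Dataset layout looks complete.")
```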

You can find the audio in Dataset/MM6/wavs/ and the metadata file in Dataset/preprocessed_data/MM6/train.txt. The train.txt file looks like this:

Name|Database|Language|Speaker|text
7756_9025_000004|MM6|English|7756|on tiptoe also i followed him and just as his hands were on the wardrobe door my hands were on his throat he was a little man and no match for me
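
Then run the preprocessing scripts:

bash scripts/extract_discrete.sh                        # Extract the discrete speech representations.
bash scripts/extract_spk.sh                             # Extract speaker embeddings.
python prepare_data/extract_text_seq_from_raw_text.py   # Convert raw text to input sequences.
python prepare_data/compute_mel.py                      # Compute mel spectrograms.
python prepare_data/compute_attention_prior.py          # Compute attention priors.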
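
The train.txt format shown earlier in this section (five fields separated by `|`) is easy to read in your own scripts. The parser below is an illustrative sketch, not the repository's loader. The names `Utterance`, `parse_line`, and `read_metadata` are hypothetical, and it assumes that the header row may or may not be present in the file.

```python
from collections import Counter
from typing import NamedTuple

HEADER = "Name|Database|Language|Speaker|text"


class Utterance(NamedTuple):
    name: str
    database: str
    language: str
    speaker: str
    text: str


def parse_line(line):
    """Parse one 'Name|Database|Language|Speaker|text' line."""
    # Split on the first four separators only, so a "|" in the text is kept.
    fields = line.rstrip("\n").split("|", 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return Utterance(*fields)


def read_metadata(lines):
    """Parse all lines, skipping blank lines and the header row if present."""
    return [
        parse_line(line)
        for line in lines
        if line.strip() and line.strip() != HEADER
    ]


if __name__ == "__main__":
    sample = [
        HEADER,
        "7756_9025_000004|MM6|English|7756|on tiptoe also i followed him",
    ]
    rows = read_metadata(sample)
    print(Counter(row.language for row in rows))
    # Example filter: drop Swedish if you do not have the NST data.
    rows = [row for row in rows if row.language != "Swedish"]
```

To read the real file, pass an open file handle, for example `read_metadata(open("Dataset/preprocessed_data/MM6/train.txt", encoding="utf-8"))`.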

Train model

#Using XphoneBERT:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT
#Using Characters (Letters):
python txt2vec/train.py --dataset MM6 --config MM6_Letters
#Using IPA:
python txt2vec/train.py --dataset MM6 --config MM6_IPA
# To train a model without a language layer, use the corresponding xxx_wo config, for example:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT_wo
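
The config names above follow a simple pattern: one name per text representation, plus a `_wo` suffix for the variant without a language layer. The sketch below builds the command line from those two choices. The function name `txt2vec_command` is hypothetical, and it assumes that a `_wo` config exists for every text representation, which the commands above only show for XphoneBERT.

```python
import shlex

# Config names taken from the commands above.
CONFIGS = {
    "xphonebert": "MM6_XphoneBERT",
    "letters": "MM6_Letters",
    "ipa": "MM6_IPA",
}


def txt2vec_command(text_input, language_layer=True, dataset="MM6"):
    """Return the txt2vec training command as an argument list."""
    config = CONFIGS[text_input.lower()]
    if not language_layer:
        # Assumed naming pattern: "<config>_wo" means no language layer.
        config += "_wo"
    return ["python", "txt2vec/train.py", "--dataset", dataset, "--config", config]


if __name__ == "__main__":
    print(shlex.join(txt2vec_command("xphonebert", language_layer=False)))
```

The returned list can be passed directly to `subprocess.run` if you want to launch training from a script.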

NOTE: When you use XPhoneBERT, please set needUpdate: True in model.yaml after 1/4 of the training iterations.

python vec2mel/train.py --dataset MM6 --config MM6

For training the txt2vec and vec2mel models, we used a batch_size of 16 and trained for 1.2M steps. It took about 3 days on 1 Tesla A100 GPU.
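
As a rough guide for planning your own runs, the figures above imply a training throughput. The arithmetic below is only a back-of-the-envelope sketch. It assumes that the 3 days refer to one 1.2M-step run and that every step processes a full batch of 16 utterances. The function name `training_throughput` is hypothetical.

```python
def training_throughput(steps, batch_size, days):
    """Return average steps per second and total samples processed."""
    seconds = days * 24 * 60 * 60
    return {
        "steps_per_second": steps / seconds,
        "samples_seen": steps * batch_size,
    }


if __name__ == "__main__":
    stats = training_throughput(steps=1_200_000, batch_size=16, days=3)
    print(f"{stats['steps_per_second']:.2f} steps/s")
    print(f"{stats['samples_seen']:,} samples seen")
```

Under these assumptions, the run averages about 4.6 steps per second and processes 19.2 million utterances in total.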

python prepare_data/creat_lists.py
python vec2wav/train.py -c Config/vec2wav/vec2wav.yaml
#If you want to train a model without a language layer:
python vec2wav/train.py -c Config/vec2wav/vec2wav_wo.yaml

For the training of vec2wav, we used a batch_size of 16 and trained for 1M steps. It took about 3 days on 1 Tesla A100 GPU.

python Vocoder_HifiGAN_Model/train.py --config Config/config_16k_mel.json

For the training of HifiGAN, we used a batch_size of 16 and trained for 1M steps. It took about 3 days on 1 Tesla A100 GPU.

Test model

To do

Citation

If you use this code, these results, or the MM6 dataset in your paper, please cite our work as:

@article{gong2023zmm,
  title={ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations},
  author={Gong, Cheng and Wang, Xin and Cooper, Erica and Wells, Dan and Wang, Longbiao and Dang, Jianwu and Richmond, Korin and Yamagishi, Junichi},
  journal={arXiv preprint arXiv:2312.14398},
  year={2023}
}

References

License

The code in this repository is released under the BSD-3-Clause license as found in the LICENSE file. The txt2vec, vec2mel, and vec2wav subfolders are under the MIT License. The sv56scripts subfolder is under the GPL License.