ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Introduction

This is the code for ZMM-TTS, submitted to IEEE TASLP. The paper presents ZMM-TTS, a multilingual and multispeaker framework that uses quantized latent speech representations from a large-scale, pre-trained, self-supervised model. It is the first to incorporate representations from both text-based and speech-based self-supervised learning models into multilingual speech synthesis tasks. We conducted comprehensive subjective and objective evaluations through a series of experiments. The model proved effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetical low-resource languages. The results are promising: the proposed approach can synthesize audio that is intelligible and highly similar to the target speaker's voice, even without any training data for the new, unseen language.

<br> <p align="center"> <img src="img/overview.png" width="95%"> <br> Overview </p> <br>

You are welcome to try our code and pre-trained model on different languages!

Release

[20/01] 🔥 We released the code and a model pre-trained on public datasets in 6 languages (English, French, German, Portuguese, Spanish, and Swedish).

Samples

Samples are provided on our demo page.

Installation

ZMM-TTS requires Python>=3.8 and a reasonably recent version of PyTorch. To install ZMM-TTS and make a quick synthesis, clone this repository and install its requirements:

git clone https://github.com/nii-yamagishilab-visitors/ZMM-TTS.git

cd ZMM-TTS
pip3 install -r requirements.txt
# In addition, you may need to install these libraries to support full functionality.
pip install transformers  # Supports the XLSR-53 and XPhoneBERT models.
pip install speechbrain   # For extracting speaker embeddings.

If you want to try IPA representations, you need to install Epitran.
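Before running anything, it can help to confirm that the Python version and the packages above are in place. The snippet below is a minimal sketch and not part of this repository: `missing_packages` and `check_environment` are hypothetical helper names, and the import names (`torch`, `transformers`, `speechbrain`, `epitran`) are assumed to match the installed packages.

```python
import importlib.util
import sys

# Minimum Python version stated in this README.
MIN_PYTHON = (3, 8)

# Assumed import names for the packages mentioned above.
REQUIRED = ["torch", "transformers", "speechbrain"]
OPTIONAL = ["epitran"]  # only needed for IPA representations


def missing_packages(names):
    """Return the names from `names` that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment():
    """Collect every problem found, so they can all be reported in one pass."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    for name in missing_packages(REQUIRED):
        problems.append("missing required package: " + name)
    for name in missing_packages(OPTIONAL):
        problems.append("missing optional package: " + name)
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks complete"]:
        print(line)
```

The check only looks up whether each package can be found; it does not import the packages or verify their versions.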

Pre-trained self-supervised model

Model	Modality	Lang	Training data
XLSR-53	Audio	53	56K hours
ECAPA-TDNN	Audio	> 5	2794 hours
XPhoneBERT	Text	94	330M sentences

Usage

Multilingual multispeaker dataset MM6

In our paper, the training data included GlobalPhone, which unfortunately is not an open-source dataset. Given the scarcity of public multilingual, multispeaker databases for speech synthesis, we designed the following training database, called MM6, based on the MLS and NST Swedish databases. (NST Swedish no longer seems to be open for download, in which case you should request the data from The Norwegian Language Bank.) If you have the GlobalPhone dataset, you can use the same training data as in our paper, listed in Dataset/train_paper.txt.

Language	Gender	Speakers	Sentences	Durations (h)	Database
English	Female	20	4000	13.9	MLS
English	Male	20	4000	13.9	MLS
French	Female	20	4000	13.9	MLS
French	Male	20	4000	13.9	MLS
German	Female	20	4000	13.9	MLS
German	Male	20	4000	13.9	MLS
Portuguese	Female	16	3741	13.0	MLS
Portuguese	Male	20	4175	14.5	MLS
Spanish	Female	20	3519	12.2	MLS
Spanish	Male	20	3786	13.1	MLS
Swedish	Female	0	0	0
Swedish	Male	20	4000	13.9	NST
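To get a feel for the overall size of MM6, the rows of the table can be summed. The snippet below is only a sketch: the numbers are copied from the table above, and `MM6_ROWS`, `totals`, and `hours_by_language` are hypothetical names, not part of this repository.

```python
# Rows copied from the MM6 table: (language, gender, speakers, sentences, hours).
MM6_ROWS = [
    ("English", "Female", 20, 4000, 13.9),
    ("English", "Male", 20, 4000, 13.9),
    ("French", "Female", 20, 4000, 13.9),
    ("French", "Male", 20, 4000, 13.9),
    ("German", "Female", 20, 4000, 13.9),
    ("German", "Male", 20, 4000, 13.9),
    ("Portuguese", "Female", 16, 3741, 13.0),
    ("Portuguese", "Male", 20, 4175, 14.5),
    ("Spanish", "Female", 20, 3519, 12.2),
    ("Spanish", "Male", 20, 3786, 13.1),
    ("Swedish", "Female", 0, 0, 0.0),
    ("Swedish", "Male", 20, 4000, 13.9),
]


def totals(rows):
    """Return (speakers, sentences, hours) summed over all rows."""
    speakers = sum(row[2] for row in rows)
    sentences = sum(row[3] for row in rows)
    hours = round(sum(row[4] for row in rows), 1)
    return speakers, sentences, hours


def hours_by_language(rows):
    """Return a dict mapping each language to its total duration in hours."""
    per_language = {}
    for language, _gender, _speakers, _sentences, hours in rows:
        per_language[language] = round(per_language.get(language, 0.0) + hours, 1)
    return per_language


if __name__ == "__main__":
    print(totals(MM6_ROWS))
    print(hours_by_language(MM6_ROWS))
```

Summed this way, the table gives 216 speakers, 43,221 sentences, and about 150.1 hours of audio, with Swedish contributing male speakers only.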

Download and normalize data

You can generate the MM6 dataset with the following download and normalization scripts:

bash scripts/download.sh   # Download the MLS data.
python prepare_data/creat_meta_data_mls.py # Generate data balanced across speakers, genders, and languages.
# We recommend using sv56 to normalize the MLS audio.
bash scripts/norm_wav.sh

Please contact The Norwegian Language Bank if you want to get the NST Swedish data, and extract it to Dataset/origin_data/. Alternatively, you can simply exclude Swedish.

# The Swedish audio is already normalized.
python prepare_data/creat_meta_data_swe.py

MM6 is a multilingual dataset with a largely balanced mix of speakers and genders, and we encourage you to use it for other tasks as well.

Preprocess

After you download and normalize the wav files, the Dataset folder should look like this:

|--Dataset
     |--MM6
         |--wavs          #Store audio files
     |--preprocessed_data #Store preprocessed data: text, features,...
         |--MM6
             |--train.txt

You can find the wav files in Dataset/MM6/wavs/ and the meta file in Dataset/preprocessed_data/MM6/train.txt. The train.txt file looks like:

Name|Database|Language|Speaker|text
7756_9025_000004|MM6|English|7756|on tiptoe also i followed him and just as his hands were on the wardrobe door my hands were on his throat he was a little man and no match for me
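Each entry of the meta file is a single pipe-separated line, so it can be read with a few lines of Python. The following is a minimal sketch, assuming the five fields shown above and assuming the `Name|Database|...` line may or may not be present as a header in the file; `parse_meta_line` and `read_meta` are hypothetical helpers, not functions from this repository.

```python
from typing import Dict, Iterable, Iterator

# Field names taken from the first line of the example above.
FIELDS = ("Name", "Database", "Language", "Speaker", "text")


def parse_meta_line(line: str) -> Dict[str, str]:
    """Split one meta line into a dict keyed by FIELDS."""
    # Limit the number of splits so a "|" inside the transcript stays in "text".
    parts = line.rstrip("\n").split("|", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d: %r" % (len(FIELDS), len(parts), line))
    return dict(zip(FIELDS, parts))


def read_meta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per utterance, skipping blank lines and a header line if present."""
    header = "|".join(FIELDS)
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped == header:
            continue
        yield parse_meta_line(stripped)


if __name__ == "__main__":
    sample = "7756_9025_000004|MM6|English|7756|on tiptoe also i followed him"
    print(parse_meta_line(sample))
```

To read a real file, pass an open file object, for example `read_meta(open("Dataset/preprocessed_data/MM6/train.txt", encoding="utf-8"))`.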

1. Extract discrete code index and representations:

bash scripts/extract_discrete.sh

2. Extract speaker embeddings:

bash scripts/extract_spk.sh

3. Extract text sequences:

python prepare_data/extract_text_seq_from_raw_text.py

4. Extract mel spectrograms:

python prepare_data/compute_mel.py

5. Compute a priori alignment probabilities:

python prepare_data/compute_attention_prior.py
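The five preprocessing commands above can also be run from one small driver script. This is a sketch under assumptions, not part of the repository: it assumes the commands are run from the repository root in the order listed (the README does not state whether later steps depend on earlier ones), and `PREPROCESS_STEPS` and `run_steps` are hypothetical names.

```python
import shlex
import subprocess

# Commands copied from the preprocessing steps above, in the listed order.
PREPROCESS_STEPS = [
    ("discrete codes", "bash scripts/extract_discrete.sh"),
    ("speaker embeddings", "bash scripts/extract_spk.sh"),
    ("text sequences", "python prepare_data/extract_text_seq_from_raw_text.py"),
    ("mel spectrograms", "python prepare_data/compute_mel.py"),
    ("attention priors", "python prepare_data/compute_attention_prior.py"),
]


def run_steps(steps=PREPROCESS_STEPS, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    Returns the argument lists in the order they were processed.
    """
    processed = []
    for label, command in steps:
        argv = shlex.split(command)
        print("[%s] %s" % (label, command))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure.
            subprocess.run(argv, check=True)
        processed.append(argv)
    return processed


if __name__ == "__main__":
    run_steps(dry_run=True)  # set dry_run=False to actually run the commands
```

The dry run is the default so the script can be inspected safely before launching the feature extraction.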

Train model

1. Train txt2vec model:

#Using XphoneBERT:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT
#Using Characters (Letters):
python txt2vec/train.py --dataset MM6 --config MM6_Letters
#Using IPA:
python txt2vec/train.py --dataset MM6 --config MM6_IPA
#If you want to train a model without a language layer, you could use xxx_wo config like:
python txt2vec/train.py --dataset MM6 --config MM6_XphoneBERT_wo

NOTE: When you use XPhoneBERT, please set needUpdate: True in model.yaml after one quarter of the training iterations.

2. Train vec2mel model:

python vec2mel/train.py --dataset MM6 --config MM6

For the training of txt2vec and vec2mel model, we used a batch_size of 16 and trained for 1.2M steps. It took about 3 days on 1 Tesla A100 GPU.

3. Train vec2wav model:

python prepare_data/creat_lists.py
python vec2wav/train.py -c Config/vec2wav/vec2wav.yaml
#If you want to train a model without a language layer:
python vec2wav/train.py -c Config/vec2wav/vec2wav_wo.yaml

For the training of vec2wav, we used a batch_size of 16 and trained for 1M steps. It took about 3 days on 1 Tesla A100 GPU.

4. Train HifiGAN model:

python Vocoder_HifiGAN_Model/train.py --config Config/config_16k_mel.json

For the training of HifiGAN, we used a batch_size of 16 and trained for 1M steps. It took about 3 days on 1 Tesla A100 GPU.
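The step counts and durations quoted above allow a rough planning calculation. The sketch below assumes that "after 1/4 iteration" in the XphoneBERT note means one quarter of the total training steps, and that "about 3 days" can be treated as exactly 3 days; `unfreeze_step` and `steps_per_second` are hypothetical helpers, and the results are estimates rather than measured values.

```python
def unfreeze_step(total_steps: int, fraction: float = 0.25) -> int:
    """Step at which needUpdate would be switched to True (assumed: a fraction of all steps)."""
    return int(total_steps * fraction)


def steps_per_second(total_steps: int, days: float) -> float:
    """Average training throughput implied by a step count and a wall-clock duration."""
    return total_steps / (days * 24 * 60 * 60)


if __name__ == "__main__":
    # txt2vec / vec2mel: 1.2M steps in about 3 days.
    print(unfreeze_step(1_200_000))
    print(round(steps_per_second(1_200_000, 3), 2))
    # vec2wav / HifiGAN: 1M steps in about 3 days.
    print(round(steps_per_second(1_000_000, 3), 2))
```

Under these assumptions, the switch falls at step 300,000 for a 1.2M-step run, and the quoted timings correspond to roughly 4.6 and 3.9 steps per second.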

Test model

1. Prepare test data:
- a. Test meta file: Dataset/MM6/test.txt.
- b. Reference speaker embeddings in Dataset/MM6/test_spk_emb/.
2. Generate samples:
```
bash test_scripts/quick_test.sh
```
You can also download our pre-trained model from Google Drive and put it in the corresponding Train_log directory. The training log can be found in the corresponding Train_log files.
3. The results can be found in the test_result files.
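Before running the test script, a quick check that the two inputs are in place can save a failed run. The following is a minimal sketch: it only uses the paths named above, makes no assumption about the contents of test.txt or the embedding file format, and `check_test_inputs` is a hypothetical helper rather than part of this repository.

```python
from pathlib import Path


def check_test_inputs(dataset_root="Dataset/MM6"):
    """Return a list of problems found with the test inputs (empty if none)."""
    root = Path(dataset_root)
    meta = root / "test.txt"
    emb_dir = root / "test_spk_emb"
    problems = []

    if not meta.is_file():
        problems.append("missing test meta file: %s" % meta)
    elif meta.stat().st_size == 0:
        problems.append("test meta file is empty: %s" % meta)

    if not emb_dir.is_dir():
        problems.append("missing speaker embedding directory: %s" % emb_dir)
    elif not any(emb_dir.iterdir()):
        problems.append("speaker embedding directory is empty: %s" % emb_dir)

    return problems


if __name__ == "__main__":
    for line in check_test_inputs() or ["test inputs found"]:
        print(line)
```

Run it from the repository root so that the relative path `Dataset/MM6` resolves correctly.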

To do

Scripts for few-shot training.
Scripts for zero-shot inference on any language.

Citation

If you use this code, result, or MM6 dataset in your paper, please cite our work as:

@article{gong2023zmm,
  title={ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations},
  author={Gong, Cheng and Wang, Xin and Cooper, Erica and Wells, Dan and Wang, Longbiao and Dang, Jianwu and Richmond, Korin and Yamagishi, Junichi},
  journal={arXiv preprint arXiv:2312.14398},
  year={2023}
}

References

Comprehensive-Transformer-TTS, the txt2vec and vec2mel model were built on this project.
XPhoneBERT, a Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech.
MSMC-TTS, the vec2wav model was built on this project.
HifiGAN, Vocoder.
wav2vec2-codebook-indices, the scripts for extracting the discrete code index and representations.

License

The code in this repository is released under the BSD-3-Clause license as found in the LICENSE file. The txt2vec, vec2mel and vec2wav subfolders are under the MIT License. The sv56scripts folder is under the GPL License.