X-E-Speech: Joint Training Framework of Non-Autoregressive Cross-lingual Emotional Text-to-Speech and Voice Conversion

In this paper, we propose X-E-Speech, a cross-lingual emotional speech generation model that disentangles speaker style from cross-lingual content features by jointly training non-autoregressive (NAR) voice conversion (VC) and text-to-speech (TTS) models. For TTS, we freeze the style-related model components and fine-tune the content-related structures to enable cross-lingual emotional speech synthesis without foreign accent. For VC, we improve the emotion similarity between the generated speech and the reference speech by introducing a similarity loss between the content features used for VC and the text features used for TTS.
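To illustrate the idea behind that similarity loss, here is a minimal sketch only, assuming the VC branch yields frame-level content features and the TTS branch yields text-derived features of matching shape after alignment; the actual loss used in this repo may be defined differently.

import torch
import torch.nn.functional as F

def content_similarity_loss(vc_content, tts_content):
    """Hypothetical sketch: pull the VC content features (from speech) towards the
    TTS content features (from text) so both branches share one content space.
    Shapes are assumed to be (batch, channels, frames) after length alignment."""
    return F.l1_loss(vc_content, tts_content)

# toy usage with random tensors standing in for the two encoder outputs
vc_feat = torch.randn(2, 192, 100)
tts_feat = torch.randn(2, 192, 100)
loss = content_similarity_loss(vc_feat, tts_feat)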

Visit our demo page for audio samples.

We also provide the pretrained models.

<table style="width:100%"> <tr> <td><img src="x-speech-biger.png" height="300"></td> </tr> </table>

Todo list

Pre-requisites

  1. Clone this repo: git clone https://github.com/X-E-Speech/X-E-Speech-code.git

  2. CD into this repo: cd X-E-Speech-code

  3. Use Python 3.7 and install the Python requirements: pip install -r requirements.txt

    You may need to install:

    1. espeak for English: apt-get install espeak
    2. pypinyin and jieba for Chinese: pip install pypinyin jieba
    3. pyopenjtalk for Japanese: pip install pyopenjtalk
  4. Download Whisper-large-v2 and put it under the whisper-pretrain/ directory

  5. Download the VCTK, Aishell3, and JVS datasets (for training cross-lingual TTS and VC)

  6. Download the ESD dataset (for training cross-lingual emotional TTS and VC)

  7. Build Monotonic Alignment Search

# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
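To check that the Cython extension built correctly, a quick hedged sanity check is below; it assumes the monotonic_align package here exposes the usual VITS-style maximum_path(neg_cent, mask) helper, so adapt it if this repo's signature differs.

import torch
import monotonic_align  # run from the repo root after building the extension

# random negative-centroid scores for one utterance: 10 text tokens x 50 frames
neg_cent = torch.randn(1, 10, 50)
mask = torch.ones(1, 10, 50)

path = monotonic_align.maximum_path(neg_cent, mask)  # hard monotonic alignment
print(path.shape)  # expected: torch.Size([1, 10, 50])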

Inference Example

Download the pretrained checkpoints and run:

# For cross-lingual Chinese TTS
python inference-cross-lingual-TTS-cn.py
# For cross-lingual English TTS
python inference-cross-lingual-TTS-en.py
# For cross-lingual emotional English TTS
python inference-cross-lingual-emotional-TTS-en.py
# For cross-lingual emotional VC, refer to preprocess_weo.py to generate the npy files first
python inference-cross-lingual-emotional-VC.py

Training Example

  1. Preprocess-resample to 16 kHz

Copy the datasets to the dataset folder and then resample the audio to 16 kHz with dataset/downsample.py. This rewrites the original wav files, so copy (do not move) your original dataset!
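For reference, a minimal sketch of what a 16 kHz resampling pass looks like; the actual dataset/downsample.py may differ in layout and options.

import os
import librosa
import soundfile as sf

def resample_dir_to_16k(wav_dir, target_sr=16000):
    """Resample every .wav under wav_dir to 16 kHz in place (back up your data first!)."""
    for root, _, files in os.walk(wav_dir):
        for name in files:
            if not name.endswith(".wav"):
                continue
            path = os.path.join(root, name)
            audio, _ = librosa.load(path, sr=target_sr)  # load and resample
            sf.write(path, audio, target_sr)             # overwrite with 16 kHz audio

# example: resample_dir_to_16k("dataset/vctk")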

  2. Preprocess-whisper

Generate the whisper encoder output.

python preprocess_weo.py  -w dataset/vctk/ -p dataset/vctk_largev2
python preprocess_weo.py  -w dataset/aishell3/ -p dataset/aishell3_largev2
python preprocess_weo.py  -w dataset/jvs/ -p dataset/jvs_largev2
python preprocess_weo.py  -w dataset/ESD/ -p dataset/ESD_largev2
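Conceptually, this step extracts whisper large-v2 encoder outputs and stores them as .npy files. Below is a hedged sketch of that idea using the openai-whisper package; the real preprocess_weo.py may handle padding, trimming, and file layout differently.

import numpy as np
import torch
import whisper  # pip install openai-whisper

# assumes the large-v2 checkpoint sits under whisper-pretrain/
model = whisper.load_model("large-v2", download_root="whisper-pretrain/")

def extract_whisper_encoder(wav_path, out_path):
    """Save the whisper encoder output for one utterance as a .npy file."""
    audio = whisper.load_audio(wav_path)                  # 16 kHz mono float32
    audio = whisper.pad_or_trim(audio)                    # pad/trim to 30 s
    mel = whisper.log_mel_spectrogram(audio).to(model.device)  # (80, 3000)
    with torch.no_grad():
        feats = model.encoder(mel.unsqueeze(0))           # (1, 1500, 1280) for large-v2
    np.save(out_path, feats.squeeze(0).cpu().numpy())

# example: extract_whisper_encoder("dataset/vctk/p225/p225_001.wav",
#                                  "dataset/vctk_largev2/p225/p225_001.npy")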
  3. Preprocess-g2p

We provide the g2p results for our datasets in filelist. If you want to run g2p on your own datasets:
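A hedged sketch of per-language g2p with the tools installed in the pre-requisites is shown below; the repo's own text cleaners and phoneme sets may differ, so treat this only as an illustration.

from phonemizer import phonemize          # English, espeak backend
from pypinyin import lazy_pinyin, Style   # Chinese
import jieba
import pyopenjtalk                        # Japanese

# English: phoneme sequence via espeak
en_phones = phonemize("How are you?", language="en-us", backend="espeak", strip=True)

# Chinese: segment with jieba, then convert to pinyin with tone numbers
zh_words = list(jieba.cut("你好世界"))
zh_phones = " ".join(lazy_pinyin(zh_words, style=Style.TONE3))

# Japanese: phoneme sequence from pyopenjtalk
ja_phones = pyopenjtalk.g2p("こんにちは")

print(en_phones, zh_phones, ja_phones, sep="\n")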

Refer to filelist/train_test_split.py to split the dataset into training and test sets.
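If you need to roll your own split, here is a minimal sketch of randomly splitting a filelist, assuming one utterance per line as in the provided filelists; filelist/train_test_split.py may use a different scheme.

import random

def split_filelist(filelist_path, train_path, test_path, test_ratio=0.05, seed=1234):
    """Shuffle a filelist and write train/test subsets."""
    with open(filelist_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    n_test = max(1, int(len(lines) * test_ratio))
    with open(test_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_test])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_test:])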

  4. Train cross-lingual TTS and VC

Train the whole model on the cross-lingual datasets:

python train_whisper_hier_multi_pure_3.py  -c configs/cross-lingual.json -m cross-lingual-TTS

Freeze the speaker-related part and fine-tune the content-related part on a monolingual dataset:

python train_whisper_hier_multi_pure_3_freeze.py  -c configs/cross-lingual-emotional-freezefinetune-en.json -m cross-lingual-TTS-en
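Conceptually, the freeze-and-finetune stage keeps the speaker/style parameters fixed and updates only the content-related ones. A hedged sketch of that pattern is below; the module name keywords here are hypothetical, and the training script selects the real ones.

import torch

def freeze_speaker_related(net_g, frozen_keywords=("emb_g", "flow", "dec")):
    """Hypothetical sketch: freeze parameters whose names match speaker/style-related
    keywords and return the remaining (content-related) parameters for the optimizer."""
    trainable = []
    for name, param in net_g.named_parameters():
        if any(k in name for k in frozen_keywords):
            param.requires_grad = False   # keep speaker/style parts fixed
        else:
            trainable.append(param)
    return trainable

# usage sketch: optimizer = torch.optim.AdamW(freeze_speaker_related(net_g), lr=2e-4)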
  5. Train cross-lingual emotional TTS and VC

Train the whole model on the cross-lingual emotional datasets:

python train_whisper_hier_multi_pure_esd.py  -c configs/cross-lingual-emotional.json -m cross-lingual-emotional-TTS

Freeze the speaker-related part and fine-tune the content-related part on a monolingual dataset:

python train_whisper_hier_multi_pure_esd_freeze.py  -c configs/cross-lingual-emotional-freezefinetune-en.json -m cross-lingual-emotional-TTS-en

Change the model structure

There are two SynthesizerTrn classes in models_whisper_hier_multi_pure.py. The only difference between them is n_langs.

So if you want to train this model on more than three languages, change n_langs accordingly.
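For intuition, n_langs presumably sizes a per-language embedding table, so it must be at least the number of language IDs in your filelists. A hypothetical sketch of that dependency:

import torch
import torch.nn as nn

class LanguageEmbedding(nn.Module):
    """Hypothetical stand-in for the n_langs-dependent part of SynthesizerTrn."""
    def __init__(self, n_langs: int = 3, channels: int = 192):
        super().__init__()
        self.emb_l = nn.Embedding(n_langs, channels)  # one vector per language

    def forward(self, lang_id):
        return self.emb_l(lang_id)

# with n_langs=3, only language IDs 0-2 are valid; use n_langs=4 (and retrain) for a fourth language
emb = LanguageEmbedding(n_langs=4)
print(emb(torch.tensor([3])).shape)  # torch.Size([1, 192])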

References