TransferTTS (Zero-shot VITS) - PyTorch Implementation (-Ongoing-)

Note!!(09.23.)

Currently, this is only an implementation of the zero-shot system, not of the paper's first contribution: the transfer learning framework using wav2vec 2.0. As future work, a model with complete implementations of both contributions (zero-shot and transfer learning) will be implemented in the following repository. Congratulations to the authors on being awarded the best paper at INTERSPEECH 2022.

Overview

Unofficial PyTorch implementation of Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus. Most of the code is based on VITS.

  1. MelStyleEncoder from StyleSpeech is used instead of the reference encoder.
  2. Implementation of untranscribed data training is omitted.
  3. The LibriTTS dataset (train-clean-100 and train-clean-360) is used. The sampling rate is set to 22050 Hz.
<p align="center"> <img src="img/Overview.jpg" width="80%"> </p>

Pre-requisites (from VITS)

  1. Python >= 3.6
  2. Clone this repository
  3. Install the Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
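
For readers unfamiliar with what the compiled extension computes, the following is a minimal pure-Python sketch of the monotonic alignment search idea: a dynamic program that finds the best-scoring alignment in which every frame maps to one text token, tokens are visited in order, and no token is skipped. It is an illustration only, not the repository's Cython code. The function name `monotonic_alignment_search` and the list-of-lists input are assumptions made for this example.

```python
def monotonic_alignment_search(log_prob):
    """Sketch of monotonic alignment search (hypothetical helper, not the repo's API).

    log_prob[t][j] is the score of aligning frame t to text token j.
    Returns a list giving the token index assigned to each frame.
    """
    n_frames = len(log_prob)
    n_tokens = len(log_prob[0])
    if n_frames < n_tokens:
        raise ValueError("need at least as many frames as tokens")

    neg_inf = float("-inf")
    # q[t][j]: best cumulative score of a monotonic path ending at (t, j)
    q = [[neg_inf] * n_tokens for _ in range(n_frames)]
    q[0][0] = log_prob[0][0]  # the path must start at the first token

    for t in range(1, n_frames):
        # token j is reachable at frame t only if j <= t
        for j in range(min(t + 1, n_tokens)):
            stay = q[t - 1][j]  # keep the same token
            advance = q[t - 1][j - 1] if j > 0 else neg_inf  # move to next token
            q[t][j] = max(stay, advance) + log_prob[t][j]

    # Backtrack from the last frame and last token to the first frame.
    path = [0] * n_frames
    j = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = j
        if t > 0 and j > 0 and q[t - 1][j - 1] >= q[t - 1][j]:
            j -= 1
    return path


if __name__ == "__main__":
    scores = [[0, -5], [0, -5], [-5, 0], [-5, 0]]
    print(monotonic_alignment_search(scores))
```

The nested Python loops are slow for real utterances, which is why the search is built as a Cython extension in the step above.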

Preprocessing

Run

python prepare_wav.py --data_path [LibriTTS DATAPATH]

for some preparations.
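
LibriTTS is distributed at 24 kHz, while the Overview states a 22050 Hz sampling rate, so it can be useful to confirm the rate of the prepared files before training. The sketch below uses only the standard library and assumes the files are plain PCM WAV. The function name `find_wrong_sample_rate` and the idea of scanning a directory are assumptions for illustration; what `prepare_wav.py` writes, and where, is not described here.

```python
import wave
from pathlib import Path

TARGET_SR = 22050  # sampling rate stated in the Overview


def find_wrong_sample_rate(wav_dir, target_sr=TARGET_SR):
    """Return (path, rate) pairs for .wav files whose rate differs from target_sr.

    Hypothetical helper: reads only the WAV header, so it is fast,
    but it supports PCM WAV files only (a limitation of the wave module).
    """
    mismatches = []
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            rate = wf.getframerate()
        if rate != target_sr:
            mismatches.append((str(path), rate))
    return mismatches
```

An empty result means every file found under the directory already has the expected rate.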

Training

Train your model with

python train_ms.py -c configs/libritts.json -m libritts_base

Inference

python inference.py --ref_audio [REF AUDIO PATH] --text [INPUT TEXT]
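
To synthesize the same text with several reference speakers, the command above can be assembled programmatically. The sketch below builds the argument list using only the two flags shown above (`--ref_audio` and `--text`). The helper name `build_inference_command` and the example file names are hypothetical. Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.

```python
import shlex
import subprocess
import sys


def build_inference_command(ref_audio, text, python=sys.executable):
    """Build the argv list for inference.py (hypothetical helper)."""
    return [python, "inference.py", "--ref_audio", str(ref_audio), "--text", text]


if __name__ == "__main__":
    # Placeholder reference files; replace with real paths.
    refs = ["speaker_a.wav", "speaker_b.wav"]
    for ref in refs:
        cmd = build_inference_command(ref, "Hello, it's a test.")
        print(" ".join(shlex.quote(part) for part in cmd))
        # Uncomment to actually run, from the repository root:
        # subprocess.run(cmd, check=True)
```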

References