
<p align="center"> <br> <img src="https://raw.githubusercontent.com/as-ideas/TransformerTTS/master/docs/transformer_logo.png" width="400"/> <br> </p> <h2 align="center"> <p>A Text-to-Speech Transformer in TensorFlow 2</p> </h2>

Implementation of a non-autoregressive Transformer-based neural network for Text-to-Speech (TTS). <br> This repo is based on, among others, the following papers:

Our pre-trained LJSpeech model is compatible with the pre-trained MelGAN and WaveRNN vocoders (older versions are also available for WaveRNN).

For quick inference with these vocoders, check out the Vocoding branch.

Non-Autoregressive

Being non-autoregressive, this Transformer model generates the whole spectrogram in a single forward pass instead of one frame at a time, which makes inference fast and robust to the attention failures typical of autoregressive TTS.
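The key idea behind non-autoregressive TTS is a length regulator: each character (or phoneme) encoding is repeated according to its predicted duration, so the text-level sequence is stretched to the spectrogram's frame-level length. A minimal NumPy sketch of the idea (not the repo's actual implementation):

```python
import numpy as np

def length_regulate(encodings, durations):
    # Repeat each character encoding durations[i] times along the time
    # axis, so the text-level sequence matches the frame-level length.
    return np.repeat(encodings, durations, axis=0)

# 2 characters with dim-2 encodings, durations 2 and 3 -> 5 frames
chars = np.array([[0.1, 0.2], [0.3, 0.4]])
frames = length_regulate(chars, np.array([2, 3]))
```

Because all frames are produced in one pass from these expanded encodings, there is no frame-by-frame feedback loop at inference time.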

🔈 Samples

Samples can be found here.

The spectrograms for these samples were converted to audio with the pre-trained MelGAN vocoder.<br>

Try it out on Colab:

Open In Colab

Updates

📖 Contents

Installation

Make sure you have:

Install espeak as the phonemizer backend (on macOS, use brew):

```bash
sudo apt-get install espeak
```

Then install the rest with pip:

```bash
pip install -r requirements.txt
```

Read the individual scripts for more command line arguments.

Pre-Trained LJSpeech API

Use our pre-trained model (with Griffin-Lim) from the command line with:

```bash
python predict_tts.py -t "Please, say something."
```

Or in a Python script:

```python
from data.audio import Audio
from model.factory import tts_ljspeech

model = tts_ljspeech()
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```
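To write the resulting waveform to disk, something like the following works (a sketch; the use of `scipy` and the 22050 Hz LJSpeech sample rate are assumptions, check `model.config` for the actual rate):

```python
import numpy as np
from scipy.io import wavfile

def save_wav(wav, path, sample_rate=22050):
    # Clip the float waveform to [-1, 1] and scale to 16-bit PCM
    # before writing, since wavfile.write maps dtype to format.
    wav = np.clip(np.asarray(wav), -1.0, 1.0)
    wavfile.write(path, sample_rate, (wav * 32767).astype(np.int16))
```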

You can specify the model step with the `--step` flag (command line) or the `step` parameter (in a script).<br> Steps from 60000 to 100000 are available at intervals of 5K steps (60000, 65000, ..., 95000, 100000).

<b>IMPORTANT:</b> make sure to check out the correct repository version to use the API.<br> Currently 493be6345341af0df3ae829de79c2793c9afd0ec

Dataset

You can directly use LJSpeech to create the training dataset.

Configuration

Custom dataset

Prepare a folder containing your metadata and wav files, for instance:

```
|- dataset_folder/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
```
If metadata.csv has the format `wav_file_name|transcription`, you can use the ljspeech preprocessor in data/metadata_readers.py; otherwise, add your own reader to the same file.
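A custom reader just needs to map each wav file name to its transcription. A minimal sketch for a pipe-separated file like the one above (the exact signature the repo expects may differ, so check the existing readers in data/metadata_readers.py):

```python
def pipe_separated_reader(metadata_path, column_sep='|'):
    # Build a dict mapping wav file name (without extension) to its
    # transcription, one entry per metadata line.
    text_dict = {}
    with open(metadata_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split(column_sep)
            if len(parts) >= 2:
                text_dict[parts[0]] = parts[-1]
    return text_dict
```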

Make sure that:

Training

Change the --config argument based on the configuration of your choice.

Train Aligner Model

Create training dataset

```bash
python create_training_data.py --config config/training_config.yaml
```

This will populate the training data directory (default transformer_tts_data.ljspeech).

Training

```bash
python train_aligner.py --config config/training_config.yaml
```

Train TTS Model

Compute alignment dataset

First, use the aligner model to create the durations dataset:

```bash
python extract_durations.py --config config/training_config.yaml
```

This will add the durations.<session name> folder, as well as the char-wise pitch folders, to the training data directory.
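One common way to turn an aligner's attention weights into per-character durations (shown here as a hedged sketch, not necessarily this repo's exact extraction) is to assign each spectrogram frame to its most-attended character and count the frames per character:

```python
import numpy as np

def durations_from_attention(attn):
    # attn: (n_frames, n_chars) attention weights from the aligner.
    # Each frame is assigned to its argmax character; a character's
    # duration is the number of frames assigned to it.
    assigned = attn.argmax(axis=1)
    return np.bincount(assigned, minlength=attn.shape[1])
```

By construction the durations sum to the total number of frames, which is exactly the property the length regulator needs at TTS training time.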

Training

```bash
python train_tts.py --config config/training_config.yaml
```

Training & Model configuration

Resume or restart training

Monitor training

```bash
tensorboard --logdir /logs/directory/
```

Tensorboard Demo

Prediction

With model weights

From the command line with:

```bash
python predict_tts.py -t "Please, say something." -p /path/to/weights/
```

Or in a Python script:

```python
from model.models import ForwardTransformer
from data.audio import Audio

model = ForwardTransformer.load_model('/path/to/weights/')
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with Griffin-Lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
```

Model Weights

Access the pre-trained models with the API call.

Old weights

| Model URL | Commit | Vocoder Commit |
|---|---|---|
| ljspeech_tts_model | 0cd7d33 | aca5990 |
| ljspeech_melgan_forward_model | 1c1cb03 | aca5990 |
| ljspeech_melgan_autoregressive_model_v2 | 1c1cb03 | aca5990 |
| ljspeech_wavernn_forward_model | 1c1cb03 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v2 | 1c1cb03 | 3595219 |
| ljspeech_wavernn_forward_model | d9ccee6 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v2 | d9ccee6 | 3595219 |
| ljspeech_wavernn_autoregressive_model_v1 | 2f3a1b5 | 3595219 |

Maintainers

Special thanks

MelGAN and WaveRNN: the data normalization and the vocoders for the samples are from these repos.

Erogol and the Mozilla TTS team for the lively exchange on the topic.

Copyright

See LICENSE for details.