A TensorFlow Implementation of DC-TTS: yet another text-to-speech model

I implement yet another text-to-speech model, dc-tts, introduced in Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. My goal, however, is not just to replicate the paper. Rather, I'd like to gain insights into various sound projects.

Requirements

Data

<img src="https://image.shutterstock.com/z/stock-vector-korean-alphabet-korean-hangul-pattern-693680611.jpg" height="200" align="right"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Kate_Winslet_March_18%2C_2014_%28headshot%29.jpg/890px-Kate_Winslet_March_18%2C_2014_%28headshot%29.jpg" height="200" align="right"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Nick_Offerman_at_UMBC_%28cropped%29.jpg/440px-Nick_Offerman_at_UMBC_%28cropped%29.jpg" height="200" align="right"> <img src="https://image.shutterstock.com/z/stock-vector-lj-letters-four-colors-in-abstract-background-logo-design-identity-in-circle-alphabet-letter-418687846.jpg" height="200" align="right">

I train English models and a Korean model on four different speech datasets. <p> 1. LJ Speech Dataset <br/> 2. Nick Offerman's Audiobooks <br/> 3. Kate Winslet's Audiobook <br/> 4. KSS Dataset </p>

The LJ Speech Dataset has recently become a widely used benchmark for the TTS task because it is publicly available and has 24 hours of reasonable-quality samples. Nick's and Kate's audiobooks are additionally used to see whether the model can learn even from less data and more variable speech samples. They are 18 hours and 5 hours long, respectively. Finally, the KSS Dataset is a Korean single-speaker speech dataset that lasts more than 12 hours.
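To check how many hours of audio a dataset contains, the durations of its audio files can be summed. The following is a minimal sketch, not part of this repo. It assumes the audio is stored as uncompressed PCM `.wav` files that Python's standard `wave` module can read, and the function name `total_hours` is hypothetical.

```python
import wave
from pathlib import Path


def total_hours(wav_dir):
    """Sum the duration of every .wav file under wav_dir, in hours.

    Assumption: every file is uncompressed PCM, which is the only
    format the standard-library wave module can read.
    """
    seconds = 0.0
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration of one file = number of frames / frames per second.
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0
```

Calling `total_hours` on the directory that holds a dataset's wav files (the path depends on where the dataset was unpacked) returns a figure that can be compared with the durations quoted above.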

Training

If you have more than one GPU, you can run STEP 2 and STEP 3 at the same time.
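One common way to run two training steps side by side is to start each as its own process and pin it to a different GPU through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below illustrates that idea only. The helper names `gpu_env` and `launch_parallel` are hypothetical, and the commands passed in must be replaced with the actual commands for STEP 2 and STEP 3.

```python
import os
import subprocess


def gpu_env(gpu_id, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES limits which GPUs a CUDA process can see.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def launch_parallel(commands):
    """Start command i pinned to GPU i, then wait for all of them.

    commands: list of argv lists, one per training step.
    Returns the list of exit codes in the same order.
    """
    procs = [
        subprocess.Popen(cmd, env=gpu_env(i))
        for i, cmd in enumerate(commands)
    ]
    return [p.wait() for p in procs]
```

For example, `launch_parallel([step2_cmd, step3_cmd])` would run the first command on GPU 0 and the second on GPU 1, where `step2_cmd` and `step3_cmd` are placeholder argv lists for the two steps.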

Training Curves

<img src="fig/training_curves.png">

Attention Plot

<img src="fig/attention.gif">
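As background for reading an attention plot: the guided attention loss described in the paper penalizes attention weights that fall far from the diagonal, using the matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)) with g = 0.2. The following is a minimal pure-Python sketch of that formula as the paper states it, not the code used in this repo. The function names are hypothetical.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    n indexes text positions (N = n_text) and t indexes audio frames
    (T = n_frames). Entries near the diagonal are close to 0.
    """
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
            for t in range(n_frames)
        ]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over all positions."""
    n_text, n_frames = len(attention), len(attention[0])
    w = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        attention[n][t] * w[n][t]
        for n in range(n_text)
        for t in range(n_frames)
    )
    return total / (n_text * n_frames)
```

An attention matrix concentrated on the diagonal gives a loss of zero under this sketch, while weight placed far from the diagonal increases it.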

Sample Synthesis

I generate speech samples based on the Harvard Sentences, as the original paper does. The sentences are already included in the repo.

Generated Samples

| Dataset | Samples |
| :----- | :----- |
| LJ | 50k 200k 310k 800k |
| Nick | 40k 170k 300k 800k |
| Kate | 40k 160k 300k 800k |
| KSS | 400k |

Pretrained Model for LJ

Download this.

Notes