# Parallel Tacotron2

PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

<p align="center">
    <img src="img/parallel_tacotron.png" width="80%">
</p>
<p align="center">
    <img src="img/parallel_tacotron2.png" width="40%">
</p>

# Updates
- **2021.05.25**: Only the soft-DTW remains as the last hurdle! Following the author's advice on the implementation, I ran several tests on each module, one by one, under a supervised duration signal with L1 loss (as in FastSpeech2). So far, I can confirm that all modules except the soft-DTW are working well, as shown below (synthesized spectrogram, GT spectrogram, residual alignment, and W from the LearnedUpsampling, from top to bottom).

    <p align="center">
        <img src="img/debugging.png" width="80%">
    </p>

    For details, please check the latest commit log and the updated Implementation Issues section. You can also find the ongoing experiments at https://github.com/keonlee9420/FastSpeech2/commits/ptaco2.
- **2021.05.15**: Implementation done. Sanity checks on training and inference pass, but the model still cannot converge.

    I'm waiting for your contribution! Please let me know if you find any mistakes in my implementation or have any advice for training the model successfully. See the Implementation Issues section.
# Training

## Requirements

- You can install the Python dependencies with

    ```
    pip3 install -r requirements.txt
    ```

- Install fairseq (see its official documentation and GitHub repository) to utilize `LConvBlock`; a minimal install command is sketched below. Please check issue #5 to resolve any problems with the installation.
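A minimal sketch of the fairseq install from PyPI, assuming the release there is compatible with this repo (the exact version to pin is not stated here):

```
pip3 install fairseq
```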
## Datasets

The supported datasets:

- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- (more to be added)
## Preprocessing

After downloading the datasets, set `corpus_path` in `preprocess.yaml` and run the preparation script:

```
python3 prepare_data.py config/LJSpeech/preprocess.yaml
```

Then, run the preprocessing script:

```
python3 preprocess.py config/LJSpeech/preprocess.yaml
```
## Training

Train your model with

```
python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

The model cannot converge yet. I'm debugging, but progress would be boosted if your awesome contribution is ready!
# Inference

For a single inference, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

The generated utterances will be saved in `output/result/`.
## Batch Inference

Batch inference is also supported. Try

```
python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
```

to synthesize all utterances in `preprocessed_data/LJSpeech/val.txt`.
# TensorBoard

Use

```
tensorboard --logdir output/log/LJSpeech
```

to serve TensorBoard on your localhost.
# Implementation Issues

Overall, normalization or activation not suggested in the original paper is arranged where needed to prevent NaN values (gradients) in the forward and backward passes. (A NaN indicates that something is wrong in the network.)
Text Encoder
- Use the
FFTBlock
of FastSpeech2 for the transformer block of the text encoder. - Use dropout
0.2
for theConvBlock
of the text encoder. - To restore "proprietary normalization engine",
- Apply the same text normalization as in FastSpeech2.
- Implement
grapheme_to_phoneme
function. (See ./text/init).
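The sketch below shows one way such a helper could look, built on the `g2p_en` package; it is an illustration only, and the actual function in `./text/__init__.py` may differ in detail.

```python
# Hypothetical sketch of a grapheme_to_phoneme helper built on the g2p_en
# package; the actual function in ./text/__init__.py may differ in detail.
import re
from string import punctuation

from g2p_en import G2p


def grapheme_to_phoneme(text, g2p):
    """Convert raw text to a flat list of ARPAbet phoneme symbols."""
    text = re.sub(f"[{re.escape(punctuation)}]", " ", text)
    phones = g2p(text)                       # e.g. ['HH', 'AH0', 'L', 'OW1', ' ', ...]
    return [p for p in phones if p.strip()]  # drop word-boundary spaces


if __name__ == "__main__":
    print(grapheme_to_phoneme("Hello, world!", G2p()))
```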
## Residual Encoder

- Use an `80`-channel mel-spectrogram instead of `128`-bin.
- A regular sinusoidal positional embedding is used at frame level instead of the combination of three positional embeddings in Parallel Tacotron (see the sketch after this list). Since the model depends entirely on unsupervised learning for the positions, this choice may be one reason the model fails to converge.
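For reference, a standard Transformer-style sinusoidal table is sketched below; the function name and layout are illustrative, not the repo's exact code, and an even `d_model` is assumed.

```python
# Sketch of the regular (Transformer-style) sinusoidal positional embedding
# used at frame level; illustrative only, assumes an even d_model.
import math

import torch


def sinusoid_encoding_table(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )                                                                  # (d_model / 2,)
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table                                                       # added to frame-level features
```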
## Duration Predictor & Learned Upsampling

- Use `nn.SiLU()` for the Swish activation.
- When obtaining `W` and `C`, the concatenation is applied among `S`, `E`, and `V` after broadcasting `V` to the frame domain (`T` domain), as sketched after this list.
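The sketch below illustrates that broadcast-and-concatenate step for the attention weights `W`, assuming `S` and `E` are start/end distance grids of shape `(B, T, K)` and `V` is the token representation of shape `(B, K, d_model)`. The module name, projection sizes, and the omission of the auxiliary context `C` are simplifications, not the repo's exact code.

```python
# Illustrative sketch of the learned upsampling's attention weights W.
# Assumptions (not the repo's exact code): S, E are start/end distance grids of
# shape (B, T, K); V is the token representation of shape (B, K, d_model); a
# single Swish MLP produces W, and the auxiliary context C is omitted.
import torch
import torch.nn as nn


class LearnedUpsamplingSketch(nn.Module):
    def __init__(self, d_model, d_hidden=16):
        super().__init__()
        self.mlp_w = nn.Sequential(
            nn.Linear(2 + d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, 1)
        )

    def forward(self, S, E, V):
        B, T, K = S.shape
        V_b = V.unsqueeze(1).expand(B, T, K, V.size(-1))                     # broadcast V over the T domain
        feats = torch.cat([S.unsqueeze(-1), E.unsqueeze(-1), V_b], dim=-1)   # (B, T, K, 2 + d_model)
        W = torch.softmax(self.mlp_w(feats).squeeze(-1), dim=-1)             # (B, T, K), normalized over tokens
        O = torch.einsum("btk,bkd->btd", W, V)                               # upsampled frame-level features
        return W, O
```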
## Decoder

- Use `LConvBlock` and a regular sinusoidal positional embedding.
- Iterative mel-spectrogram predictions are projected by a linear layer.
- Apply `nn.Tanh()` to each `LConvBlock` output (following the activation pattern of the decoder part in FastSpeech2); see the sketch after this list.
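A minimal sketch of that iterative pattern is given below: each block output passes through `nn.Tanh()` and a per-iteration linear layer projects it to a mel prediction. The `LConvBlock` (fairseq's lightweight convolution) is passed in as a generic `nn.Module` here, and the class name is hypothetical.

```python
# Hypothetical sketch of the iterative decoder; not the repo's exact code.
import torch.nn as nn


class IterativeDecoderSketch(nn.Module):
    def __init__(self, blocks, d_model, n_mel_channels=80):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)                 # e.g. a stack of LConvBlocks
        self.act = nn.Tanh()
        self.mel_projs = nn.ModuleList(
            nn.Linear(d_model, n_mel_channels) for _ in range(len(self.blocks))
        )

    def forward(self, x):
        mel_iters = []
        for block, proj in zip(self.blocks, self.mel_projs):
            x = self.act(block(x))           # Tanh after each block output
            mel_iters.append(proj(x))        # iterative mel-spectrogram prediction
        return mel_iters                     # one (B, T, n_mel_channels) tensor per layer
```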
## Loss

- Use the optimizer & scheduler of FastSpeech2 (which is from Attention Is All You Need, as described in the original paper).
- Base the soft-DTW on pytorch-softdtw-cuda (post).
    - Implement a customized soft-DTW in `model/soft_dtw_cuda.py`, reflecting the recursion suggested in the original paper; a reference sketch of the basic recursion is given after this list.
    - In the original soft-DTW, no final loss is assumed, so only `E` is computed. When employed as a loss function, a Jacobian product is added to return the target derivative of `R` w.r.t. the input `X`.
    - Currently, the maximum batch size is `8` on a 24GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
        - In the original paper, a custom differentiable diagonal band operation was implemented and used to reduce the O(T^2) complexity, but this part has not been explored in the current implementation yet.
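For orientation, here is a minimal CPU reference sketch of the standard soft-DTW recursion (Cuturi & Blondel, 2017). The repo's `model/soft_dtw_cuda.py` implements a CUDA version based on pytorch-softdtw-cuda with a custom backward and the paper's modifications, none of which are reproduced here.

```python
# Minimal CPU reference sketch of the standard soft-DTW recursion; not the
# repo's CUDA implementation in model/soft_dtw_cuda.py.
import torch


def soft_min(a, b, c, gamma):
    """Differentiable soft-minimum of three scalars."""
    vals = torch.stack([a, b, c])
    return -gamma * torch.logsumexp(-vals / gamma, dim=0)


def soft_dtw(D, gamma=0.1):
    """D: (N, M) pairwise distance matrix between predicted and target frames."""
    N, M = D.shape
    R = torch.full((N + 1, M + 1), float("inf"), dtype=D.dtype)
    R[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            R[i, j] = D[i - 1, j - 1] + soft_min(
                R[i - 1, j], R[i, j - 1], R[i - 1, j - 1], gamma
            )
    return R[N, M]  # the soft-DTW alignment cost
```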
# Citation

```
@misc{lee2021parallel_tacotron2,
  author = {Lee, Keon},
  title = {Parallel-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}
```