$\text{A}^3\text{T}$: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing
Code for the paper "$\text{A}^3\text{T}$: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing".
Checkpoints: HuggingFace Model Hub.
Demo: listen to our samples
:fire: This work has been implemented in PaddleSpeech, where $\text{A}^3\text{T}$ is extended to a multilingual version.
Note: If you just want to learn how we implemented the pre-training model, please take a look at the ESPnetMLMEncAsDecoderModel class.
0. Setup
This repo is forked from ESPnet; please set up your environment according to ESPnet's instructions.
An alternative solution is to use our docker image:
docker pull richardbaihe/pytorch:a3t
Inside the container, activate the environment with
conda activate espnet
Our forced aligner and phoneme tokenizer are from HTK and are included in the tools folder.
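Putting the container steps together, a minimal workflow might look like the following sketch. The `--gpus all` flag assumes the NVIDIA container toolkit is installed, and the volume mount path is only an example:

```shell
# Pull the prebuilt image with the ESPnet environment preinstalled.
docker pull richardbaihe/pytorch:a3t

# Start an interactive container. "--gpus all" assumes the NVIDIA
# container toolkit; mounting $PWD into /workspace is an example path.
docker run -it --gpus all -v "$PWD":/workspace richardbaihe/pytorch:a3t bash

# Inside the container, activate the ESPnet conda environment.
conda activate espnet
```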
Our codebase supports training and evaluation on LJSpeech, VCTK, and LibriTTS. Here, we take VCTK as the example in this README.
Our vocoder is from https://github.com/kan-bayashi/ParallelWaveGAN
The FastSpeech2 checkpoint is downloaded from https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv
1. Data preprocessing
After setting up the ESPnet environment, please follow egs2/vctk/sedit/README.md.
2. Inference for speech editing or new-speaker TTS
We provide a Python script, bin/sedit_inference.py, for VCTK speech editing and prompt-based TTS decoding; you can find an example in its main function.
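Since the example lives in the script's main function, one way to try it (assuming the espnet environment is active, the checkpoints are downloaded, and you launch from the repo root) is simply:

```shell
# Run the bundled example in bin/sedit_inference.py's main function.
# Assumes: "conda activate espnet" has been run, and the checkpoint
# and data paths referenced inside the script are available locally.
python bin/sedit_inference.py
```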
3. Train your own model
Please follow egs2/vctk/sedit/README.md.
To cite our work:
@InProceedings{pmlr-v162-bai22d,
title = {{A}$^3${T}: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing},
author = {Bai, He and Zheng, Renjie and Chen, Junkun and Ma, Mingbo and Li, Xintong and Huang, Liang},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {1399--1411},
year = {2022},
volume = {162},
series = {Proceedings of Machine Learning Research},
month = {17--23 Jul},
publisher = {PMLR},
pdf = {https://proceedings.mlr.press/v162/bai22d/bai22d.pdf},
url = {https://proceedings.mlr.press/v162/bai22d.html},
}
@inproceedings{bai2021segatron,
title={Segatron: Segment-aware transformer for language modeling and understanding},
author={Bai, He and Shi, Peng and Lin, Jimmy and Xie, Yuqing and Tan, Luchen and Xiong, Kun and Gao, Wen and Li, Ming},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={35},
number={14},
pages={12526--12534},
year={2021}
}