CoMoSpeech
Implementation of CoMoSpeech. For full details, see our paper accepted to ACM MM 2023: CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model.
Authors: Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo.
Update
2024-04-26
- We propose FlashSpeech, an efficient zero-shot speech synthesizer based on the latent consistency model and adversarial training. (Paper).
2023-12-01
- We also release a well-designed Singing Voice Conversion (SVC) version based on the consistency model (Code).
2023-11-30
- We find that using zero-mean Gaussian noise instead of the Grad-TTS prior achieves similar performance. We also release the new code and checkpoints.
2023-10-21
- We add support for Heun's 2nd-order method for the teacher model (it can be used for teacher-model sampling and yields a better ODE trajectory for consistency distillation); a sketch of the solver step follows this list.
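For intuition, a Heun (second-order) solver step looks roughly like the following. This is a framework-agnostic sketch assuming an EDM-style denoiser `denoise(x, sigma)`; the function names and parameterization are illustrative, not this repository's actual sampler.

```python
def heun_step(x, sigma_cur, sigma_next, denoise):
    """One Heun (2nd-order) step of the probability-flow ODE (illustrative sketch).

    `denoise(x, sigma)` is assumed to return the denoised sample D(x, sigma)
    in an EDM-style parameterization; this is not the repository's actual code.
    """
    # First-order (Euler) proposal.
    d_cur = (x - denoise(x, sigma_cur)) / sigma_cur
    x_next = x + (sigma_next - sigma_cur) * d_cur
    # Second-order correction, skipped at the final step where sigma_next == 0.
    if sigma_next > 0:
        d_next = (x_next - denoise(x_next, sigma_next)) / sigma_next
        x_next = x + (sigma_next - sigma_cur) * 0.5 * (d_cur + d_next)
    return x_next
```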
Abstract
Demo page: link.
Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a Consistency Model-based Speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines.
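Conceptually, the distilled CoMoSpeech maps a single draw of Gaussian noise directly to a mel-spectrogram, whereas the teacher needs many solver steps. A minimal sketch of one-step consistency sampling, assuming a trained consistency function `f_theta(x, sigma, cond)` that maps a noisy input straight to a clean sample; the function name, conditioning, and sigma_max value are assumptions, not this repository's API:

```python
import torch

@torch.no_grad()
def one_step_synthesis(f_theta, cond, shape, sigma_max=80.0):
    """One-step consistency sampling (illustrative sketch).

    f_theta(x, sigma, cond) is assumed to map a noisy input at noise level
    `sigma` directly to a clean mel-spectrogram; `cond` is the text/score
    conditioning. Names and the sigma_max value are assumptions, not the
    repository's actual interface.
    """
    # Start from pure Gaussian noise at the largest noise level ...
    x = torch.randn(shape) * sigma_max
    # ... and map it to data with a single network evaluation.
    return f_theta(x, sigma_max, cond)
```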
Prepare
Build the monotonic_align code (Cython):
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..
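After building, a quick import can confirm the extension compiled. This is a hypothetical smoke test; the module path and function name follow the Grad-TTS layout this repo is based on and may differ here.

```python
# Hypothetical smoke test: import the compiled alignment routine from the repo root.
# The module path follows the Grad-TTS layout; adjust if this repository differs.
from model.monotonic_align import maximum_path
print(maximum_path)
```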
Inference
Run the script inference.py by providing the path to the text file, the path to the checkpoint, and the number of sampling steps:
python inference.py -f <text file> -c <checkpoint> -t <sampling steps>
Check the folder called out for the generated audio. Note that in the params file, Teacher = True selects our teacher model and Teacher = False selects our CoMoSpeech. In addition, we use the same vocoder as Grad-TTS; you can download it and put it into the checkpts folder.
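For reference, the flag switch described above could look like this in the params file (a hedged excerpt; the exact variable names and layout follow the README's description and may differ in the repository):

```python
# params file -- illustrative excerpt; names follow the README's description.
Teacher = True     # True: sample with the diffusion-based teacher model
# Teacher = False  # False: sample with the one-step CoMoSpeech (consistency) model
```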
Training
We use the LJSpeech dataset and follow the train/test/val split from FastSpeech2; you can change the split in the fs2_txt folder. Then run the script train.py:
python train.py
Note that in the params file, Teacher = True is for our teacher model and Teacher = False is for our CoMoSpeech. When training CoMoSpeech, the teacher checkpoint directory must be provided (see the sketch below).
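A hedged sketch of what that distillation configuration could look like; the variable names and path are illustrative assumptions based on the README, not the repository's exact settings:

```python
# Illustrative training configuration for consistency distillation.
# Names and the checkpoint path are hypothetical; check the repo's params file.
Teacher = False                         # train the one-step CoMoSpeech student
teacher_checkpoint = './checkpts/teacher.pt'  # path to a trained teacher checkpoint
```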
Checkpoints trained on LJSpeech can be downloaded from here.
Acknowledgement
I would like to extend special thanks to the authors of Grad-TTS, since our codebase is largely borrowed from Grad-TTS.
Contact
You are welcome to send pull requests or share ideas with me. Contact: Zhen Ye (zhenye312@gmail.com).