StyleTTS 2 API

[!CAUTION] The Streaming API is not fully implemented yet.

Original Repo - CLI Tool - Streaming API

(GPL licensed due to Phonemizer. Should I switch to OpenPhonemizer and make it MIT-licensed?)

StyleTTS 2 is by Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. I am not affiliated with them.

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/

Online demo: Hugging Face (thanks to @fakerybakery for the wonderful online demo)

TODO

Pre-requisites

  1. Python >= 3.7
  2. Clone this repository:
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
  3. Install python requirements:
pip install -r requirements.txt

On Windows, also run:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U

Also install phonemizer and espeak if you want to run the demo:

pip install phonemizer
sudo apt-get install espeak-ng
  4. Download and extract the LJSpeech dataset, unzip it to the data folder, and upsample the data to 24 kHz (see the sketch below). The text aligner and pitch extractor are pre-trained on 24 kHz data, but you can easily change the preprocessing and re-train them using your own preprocessing. For LibriTTS, you will need to combine train-clean-360 with train-clean-100 and rename the folder train-clean-460 (see val_list_libritts.txt as an example).
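
If your copy of LJSpeech is at the original 22.05 kHz, a minimal resampling sketch might look like the following. It assumes librosa and soundfile are installed, and data/wavs is a placeholder path for your extracted audio:

# Upsample every wav under data/wavs to 24 kHz in place.
# 'data/wavs' is a placeholder; point it at your extracted LJSpeech audio.
import glob
import librosa
import soundfile as sf

for path in glob.glob('data/wavs/*.wav'):
    audio, _ = librosa.load(path, sr=24000)  # load and resample to 24 kHz
    sf.write(path, audio, 24000)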

Streaming API

You can use StyleTTS 2 in your projects by launching the HTTP API with streaming support, then synthesizing speech from your frontend apps, scripts, or other clients by making HTTP calls to the API server. The server uses Flask. It has not been extensively tested and should not be used in production.

API documentation may be found in the API_DOCS.md file.

Launch server:

python api.py
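
A client can then call the server over HTTP, for example as sketched below. The endpoint path, port, and parameter names here are assumptions for illustration only; see API_DOCS.md for the actual routes:

# Hypothetical client sketch -- the endpoint and parameters are illustrative,
# not the server's documented API. Check API_DOCS.md for the real routes.
import requests

resp = requests.post(
    'http://localhost:5000/synthesize',                    # hypothetical endpoint and port
    json={'text': 'Hello world!', 'voice': 'voice.wav'},   # hypothetical parameters
)
resp.raise_for_status()
with open('result.wav', 'wb') as f:
    f.write(resp.content)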

Python API

You can now use StyleTTS 2 directly in your programs! A pip-compatible package is coming soon.

Multi-Speaker Inference:

from scipy.io.wavfile import write
import msinference
text = 'Hello world!'
voice = msinference.compute_style('voice.wav')
wav = msinference.inference(text, voice, alpha=0.3, beta=0.7, diffusion_steps=7, embedding_scale=1)
write('result.wav', 24000, wav)

LJSpeech Inference:

from scipy.io.wavfile import write
import torch
import ljinference
text = 'Hello world!'
noise = torch.randn(1, 1, 256).to('cuda' if torch.cuda.is_available() else 'cpu')
wav = ljinference.inference(text, noise, diffusion_steps=7, embedding_scale=1)
write('result.wav', 24000, wav)

For longer text, you can help implement #54 or use Tortoise TTS for splitting:

from tortoise.utils.text import split_and_recombine_text
import numpy as np
from scipy.io.wavfile import write
import msinference
text = 'Long text here...'
texts = split_and_recombine_text(text)
audios = []
voice = msinference.compute_style('voice.wav')
for t in texts:
    audios.append(msinference.inference(t, voice, alpha=0.3, beta=0.7, diffusion_steps=7, embedding_scale=1))
write('result.wav', 24000, np.concatenate(audios))

GUI

You can run inference (finetuning coming soon) through a GUI based on the online demo, powered by Gradio.

python app.py

NOTE: Only the multi-speaker tab currently supports long text.

Note: the online demo will be updated more frequently as changes are pushed directly to it (rather than through PRs). If you would like to use the latest (potentially unstable) version, use Docker:

docker run -it -p 7860:7860 --platform=linux/amd64 --gpus all registry.hf.space/styletts2-styletts2:latest python app.py

Training

First stage training:

accelerate launch train_first.py --config_path ./Configs/config.yml

Second stage training (the DDP version does not work, so the current version uses DP; see #7 if you want to help):

python train_second.py --config_path ./Configs/config.yml

You can run both commands consecutively to train the first and second stages. The model will be saved in the format "epoch_1st_%05d.pth" and "epoch_2nd_%05d.pth". Checkpoints and Tensorboard logs will be saved at log_dir.

The data list format needs to be filename.wav|transcription|speaker; see val_list.txt as an example. The speaker labels are needed for multi-speaker models because we need to sample reference audio for style diffusion model training.
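
For illustration, a minimal sketch of reading such a list (the path and line contents below are placeholders; follow val_list.txt for the real format):

# Parse a data list in the filename.wav|transcription|speaker format.
# Example line (placeholder content): wavs/0001.wav|Hello world.|0
entries = []
with open('val_list.txt', encoding='utf-8') as f:  # placeholder path
    for line in f:
        filename, transcription, speaker = line.rstrip('\n').split('|')
        entries.append((filename, transcription, speaker))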

Important Configurations

In config.yml, there are a few important configurations to take care of:

Pre-trained modules

In the Utils folder, there are three pre-trained models: the text aligner, the pitch extractor, and PL-BERT.

Common Issues

Finetuning

The script is modified from train_second.py, which uses DP because DDP does not work for train_second.py. Please see #7 above if you are willing to help with this problem.

python train_finetune.py --config_path ./Configs/config_ft.yml

Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data (around 1k samples) for 50 epochs. This took about 4 hours to finish on four NVIDIA A100 GPUs. The quality is slightly worse (similar to NaturalSpeech on LJSpeech) than the LJSpeech model trained from scratch on 24 hours of speech data, which took around 2.5 days to finish on four A100s. The samples can be found at #65 (comment).

If you are using a single GPU (because the script doesn't work with DDP) and want to save training time and VRAM, you can run the following (thanks to @korakoe for making the script at #100):

accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml

Common Issues

@Kreevoz has made detailed notes on common issues in finetuning, with suggestions for maximizing audio quality: #81. Some of these also apply to training from scratch. @IIEleven11 has also made a guideline for fine-tuning: #128.

Inference

Please refer to Inference_LJSpeech.ipynb (single-speaker) and Inference_LibriTTS.ipynb (multi-speaker) for details. For LibriTTS, you will also need to download reference_audio.zip and unzip it under the demo folder before running the demo.

You can import StyleTTS 2 and run it in your own code. However, inference depends on a GPL-licensed package, so it is not included directly in this repository. A GPL-licensed fork provides an importable script as well as an experimental streaming API. A fully MIT-licensed package that uses gruut is also available, albeit with lower quality due to the mismatch between phonemizer and gruut.

Before using these pre-trained models, you agree to inform listeners that the speech samples are synthesized by the pre-trained models, unless you have permission to use the voice you synthesize. That is, you agree to use only voices whose speakers have granted permission to have their voice cloned, either directly or by license, before making synthesized voices public; otherwise, you must publicly announce that the voices are synthesized.

Common Issues

References

License

NOTE: By contributing to this software you agree that the license may be changed in the future once I find a phonemizer replacement.

This package depends on phonemizer, which is GPL-licensed. Check out the original repository for an MIT-licensed version without the API! I'm working on a permissively licensed phonemizer - coming soon!

NOTE: By contributing to this project you agree that the authors may change the license in the future

Copyright (C) 2023 Aaron (Yinghao) Li (under the MIT license). Modifications copyright (C) 2023-2024 mrfakename.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

This software was previously licensed under the MIT license:

MIT License

Copyright (c) 2023 Aaron (Yinghao) Li

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.