StyleTTS 2: The Python Package

This package makes StyleTTS2, an approach to human-level text-to-speech, accessible through an inference module that depends only on MIT-licensed libraries. See Conditions and Terms of Use, Common Issues, and Notes below.

Quick Start

  1. Ensure you are running Python 3.9 or 3.10; other versions are not currently supported because of library dependencies.
  2. [Optional] Download the StyleTTS2 LibriTTS checkpoint and the corresponding config file. Both are available at https://huggingface.co/yl4579/StyleTTS2-LibriTTS. You can also provide paths to your own checkpoint and config file, as long as they use the same format as the originals.
  3. Install the package using pip:
pip install styletts2
  4. Try it out either in a Python shell or in your code:
from styletts2 import tts

# No paths provided means default checkpoints/configs will be downloaded/cached.
my_tts = tts.StyleTTS2()

# Optionally create/write an output WAV file.
out = my_tts.inference("Hello there, I am now a python package.", output_wav_file="test.wav")

# Specific paths to a checkpoint and config can also be provided.
other_tts = tts.StyleTTS2(model_checkpoint_path='/PATH/TO/epochs_2nd_00020.pth', config_path='/PATH/TO/config.yml')

# Specify target voice to clone. When no target voice is provided, a default voice will be used.
other_tts.inference("Hello there, I am now a python package.", target_voice_path="/PATH/TO/some_voice.wav", output_wav_file="another_test.wav")
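To synthesize several sentences in one go, the calls above can be wrapped in a small loop. The following is a minimal sketch, not part of the package: `synthesize_lines` and the `line_001.wav` naming scheme are hypothetical, and the helper relies only on the `inference` arguments shown above (`text`, `target_voice_path`, `output_wav_file`).

```python
from pathlib import Path


def synthesize_lines(engine, lines, out_dir, target_voice_path=None):
    """Call engine.inference once per non-empty line and return the WAV paths.

    `engine` is assumed to expose the inference() method documented in this
    README, for example an instance of tts.StyleTTS2. The helper name and the
    output file naming are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Skip blank lines so that no empty text is sent to the model.
    texts = [line.strip() for line in lines if line.strip()]

    written = []
    for index, text in enumerate(texts, start=1):
        wav_path = out_dir / f"line_{index:03d}.wav"
        engine.inference(
            text,
            target_voice_path=target_voice_path,
            output_wav_file=str(wav_path),
        )
        written.append(wav_path)
    return written


# Example usage (requires the styletts2 package to be installed):
# from styletts2 import tts
# paths = synthesize_lines(tts.StyleTTS2(), ["First line.", "Second line."], "out")
```

Passing the engine in as an argument keeps model loading outside the loop, so the checkpoint is loaded once and reused for every line.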

Inference function reference

def inference(self,
              text: str,
              target_voice_path=None,
              output_wav_file=None,
              output_sample_rate=24000,
              alpha=0.3,
              beta=0.7,
              diffusion_steps=5,
              embedding_scale=1,
              ref_s=None)

text: Input text to turn into speech.

target_voice_path: Path to audio file of target voice to clone.

output_wav_file: Name of output audio file (if output WAV file is desired).

output_sample_rate: Output sample rate (default 24000).

alpha: Determines the timbre of the speech. A higher value means the style follows the text more than the target voice.

beta: Determines the prosody of the speech. A higher value means the style follows the text more than the target voice.

diffusion_steps: More steps produce more diverse samples, at the cost of speed.

embedding_scale: A higher scale conditions the style more strongly on the input text, which makes the speech more emotional.

ref_s: Pre-computed style vector to pass directly.

return: audio data as a NumPy array (the WAV file is also written if output_wav_file was set).
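One way to read the alpha and beta descriptions above is as weights in a linear blend between a style predicted from the text and a style extracted from the target voice. The sketch below illustrates that reading with toy vectors in plain Python. It assumes the blend is a simple linear interpolation, as in the upstream StyleTTS 2 inference demo; this package's internals may differ, and `blend_style` and all the numbers are hypothetical.

```python
def blend_style(text_style, voice_style, weight):
    """Linearly interpolate between two style vectors.

    weight = 0 keeps the target-voice style, weight = 1 keeps the text-derived
    style. This is an illustrative assumption, not the package's actual code.
    """
    if len(text_style) != len(voice_style):
        raise ValueError("style vectors must have the same length")
    return [weight * t + (1.0 - weight) * v
            for t, v in zip(text_style, voice_style)]


# Toy 2-dimensional vectors (made-up values; real style vectors are far longer).
text_timbre, voice_timbre = [1.0, 0.0], [0.0, 1.0]
text_prosody, voice_prosody = [2.0, 2.0], [0.0, 4.0]

# With the defaults, timbre stays closer to the target voice (alpha=0.3),
# while prosody leans toward what the text suggests (beta=0.7).
timbre = blend_style(text_timbre, voice_timbre, weight=0.3)
prosody = blend_style(text_prosody, voice_prosody, weight=0.7)
print(timbre, prosody)
```

Under this reading, lowering both values pulls the output toward the reference speaker, and raising them lets the text drive the style.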

Note: I'm not affiliated with the original authors. StyleTTS2 is a neat, open source, state-of-the-art approach to TTS. Pass your kudos to the authors at the model repo:

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Original authors: Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

Paper: https://arxiv.org/abs/2306.07691

Audio samples: https://styletts2.github.io/

Online demo: Hugging Face (thanks to @fakerybakery for the wonderful online demo)

Conditions and Terms of Use

Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.

Common Issues

TODO

Notes