HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation by Hierarchical Variational Inference for Zero-shot Speech Synthesis <br><sub>The official implementation of HierSpeech++</sub>

<a href="http://arxiv.org/abs/2311.12454"><img src="https://img.shields.io/badge/cs.CV-2311.12454-b31b1b?logo=arxiv&logoColor=red"></a> | Hugging Face Spaces | Demo page | Checkpoint

Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee

Department of Artificial Intelligence, Korea University, Seoul, Korea

Abstract

Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, these models require large-scale data and possess the same limitations as previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer, given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level quality in zero-shot speech synthesis.

Figure 1: Overall pipeline of HierSpeech++.
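For orientation, the three-stage inference flow described in the abstract can be sketched as follows. This is a conceptual sketch only; all function names below are illustrative placeholders, not the repository's actual API (the real entry points are `inference.py` and `inference_vc.py`).

```python
# Conceptual sketch of the HierSpeech++ inference flow (placeholder names).

def text_to_vec(text, prosody_prompt):
    """Stage 1 (TTV): generate a self-supervised speech representation
    (w2v-style vectors) and an F0 contour from text and a prosody prompt."""
    ...

def hierarchical_synthesizer(vec, f0, voice_prompt):
    """Stage 2 (HierSpeech++): synthesize 16 kHz speech from the TTV outputs
    and a voice (speaker) prompt via hierarchical variational inference."""
    ...

def speech_sr(wav_16k, target_sr=48000):
    """Stage 3 (SpeechSR): upsample the 16 kHz output to 24 or 48 kHz."""
    ...
```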

This repository contains:

<!-- - 💥 A Colab notebook for running pre-trained HierSpeech++ models (Soon..) 🛸 A HierSpeech++ training script (Will be released soon) -->

Our Previous Works

This paper is an extended version of the above papers.

Update

24.02.20

24.01.19

Todo

- Hierarchical Speech Synthesizer
<!-- - [ ] HierSpeech-Lite (Fast and Efficient Zero-shot Speech Synthesizer) - [ ] HierSinger (Zero-shot Singing Voice Synthesizer) - [ ] HierSpeech2-24k-Large-Full (For High-resolutional and High-quality Speech Synthesizer) - [ ] HierSpeech2-48k-Large-Full (For Industrial-level High-resolution and High-quality Speech Synthesizer) -->
- Text-to-Vec (TTV)
<!-- - [ ] Hierarchical Text-to-Vec (For Much More Expressive Text-to-Speech) -->
- Speech Super-resolution (16 kHz --> 24 kHz or 48 kHz)
- Cleaning Up the Source Code
- Training code (Will be released after paper acceptance)

Getting Started

Pre-requisites

  1. PyTorch >= 1.13 and torchaudio >= 0.13
  2. Install requirements
pip install -r requirements.txt
  3. Install Phonemizer
pip install phonemizer
sudo apt-get install espeak-ng
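A quick sanity check for the prerequisites above (a minimal sketch; the exact versions printed depend on your environment):

```python
# Verify the prerequisites: PyTorch/torchaudio versions and a working
# phonemizer + espeak-ng installation.
import torch
import torchaudio
from phonemizer import phonemize

print("torch:", torch.__version__)            # expect >= 1.13
print("torchaudio:", torchaudio.__version__)  # expect >= 0.13
print("CUDA available:", torch.cuda.is_available())

# Phonemizing a short sentence fails loudly if espeak-ng is not installed.
print(phonemize("Hello world", language="en-us", backend="espeak"))
```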

Checkpoint [Download]

Hierarchical Speech Synthesizer

| Model | Sampling Rate | Params | Dataset | Hours | Speakers | Checkpoint |
|-------|---------------|--------|---------|-------|----------|------------|
| HierSpeech2 | 16 kHz | 97M | LibriTTS (train-460) | 245 | 1,151 | [Download] |
| HierSpeech2 | 16 kHz | 97M | LibriTTS (train-960) | 555 | 2,311 | [Download] |
| HierSpeech2 | 16 kHz | 97M | LibriTTS (train-960), Libri-light (Small, Medium), Expresso, MSSS (Kor), NIKL (Kor) | 2,796 | 7,299 | [Download] |
<!-- | HierSpeech2-Lite|16 kHz|-| LibriTTS (train-960)) |-| | HierSpeech2-Lite|16 kHz|-| LibriTTS (train-960) NIKL, AudioBook-Korean) |-| | HierSpeech2-Large-CL|16 kHz|200M| LibriTTS (train-960), Libri-Light, NIKL, AudioBook-Korean, Japanese, Chinese, CSS, MLS) |-| -->

TTV

| Model | Language | Params | Dataset | Hours | Speakers | Checkpoint |
|-------|----------|--------|---------|-------|----------|------------|
| TTV | Eng | 107M | LibriTTS (train-960) | 555 | 2,311 | [Download] |
<!-- | TTV |Kor|100M| NIKL |114|118|-| | TTV |Eng|50M| LibriTTS (train-960) |555|2,311|-| | TTV-Large |Eng|100M| LibriTTS (train-960) |555|2,311|-| | TTV-Lite |Eng|10M| LibriTTS (train-960) |555|2,311|-| | TTV |Kor|50M| NIKL |114|118|-| -->

SpeechSR

| Model | Sampling Rate | Params | Dataset | Checkpoint |
|-------|---------------|--------|---------|------------|
| SpeechSR-24k | 16 kHz --> 24 kHz | 0.13M | LibriTTS (train-960), MSSS (Kor) | speechsr24k |
| SpeechSR-48k | 16 kHz --> 48 kHz | 0.13M | MSSS (Kor), Expresso (Eng), VCTK (Eng) | speechsr48k |

Text-to-Speech

sh inference.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference.py \
                --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
                --ckpt_text2w2v "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth" \
                --output_dir "tts_results_eng_kor_v2" \
                --noise_scale_vc "0.333" \
                --noise_scale_ttv "0.333" \
                --denoise_ratio "0"

Noise Control

# without denoiser
--denoise_ratio "0"
# with denoiser
--denoise_ratio "1"
# Mixup (0.6~0.8 recommended)
--denoise_ratio "0.8"
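Conceptually, the mix-up setting blends the original and denoised reference used as the speaker prompt. The sketch below illustrates the idea only; it is not the repository's actual code, the tensors are random placeholders, and whether the blend happens on the waveform or on an extracted speaker embedding is an implementation detail of the repository.

```python
import torch

# Placeholder reference audio (1 second at 16 kHz); in practice these would be
# the raw prompt and its denoised version.
original_ref = torch.randn(1, 16000)
denoised_ref = torch.randn(1, 16000)

denoise_ratio = 0.8  # 0 = original only, 1 = fully denoised; 0.6~0.8 recommended
prompt = denoise_ratio * denoised_ref + (1 - denoise_ratio) * original_ref
```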

Voice Conversion

sh inference_vc.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py \
                --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
                --output_dir "vc_results_eng_kor_v2" \
                --noise_scale_vc "0.333" \
                --noise_scale_ttv "0.333" \
                --denoise_ratio "0"

Speech Super-resolution

--output_sr "48000" # Default
--output_sr "24000" # 
--output_sr "16000" # without super-resolution.

Speech Denoising for Noise-free Speech Synthesis (Only used in Speaker Encoder during Inference)

TTV-v2 (WIP)

GAN VS Diffusion

<details> <summary> [Read More] </summary> We think it is not yet possible to confirm which is better. Each model has many advantages, so you can utilize either one for your own purposes, and both lines of research should be actively pursued in parallel.

GAN (Specifically, GAN-based End-to-End Speech Synthesis Models)

Diffusion (Diffusion-based Mel-spectrogram Generation Models)

(In this work) Our Approaches for GAN-based End-to-End Speech Synthesis Models

(Our other works) Diffusion-based Mel-spectrogram Generation Models

Our Goals

</details>

LLM-based Models

We hope to compare LLM-based models as zero-shot TTS baselines. However, there is no publicly available official implementation of LLM-based TTS models. Unfortunately, the unofficial implementations perform poorly in zero-shot TTS, so we hope the authors will release their models for fair comparison, reproducibility, and the benefit of our speech community. To be honest, I could not stand the inference speed, which is almost 1,000 times slower than end-to-end models; it takes 5 days to synthesize the full sentences of the LibriTTS test subsets, and even then the audio quality is quite poor. I hope they will release their official source code soon.

In my very personal opinion, VITS is still the best TTS model I have ever seen. However, I acknowledge that LLM-based models have much more powerful potential thanks to their creative generative performance on large-scale datasets, but not yet.

Limitation of our work

TTV v2 may reduce this issue significantly...!

Results [Download]

We have attached all samples from LibriTTS test-clean and test-other.

Reference

Our repository is heavily based on VITS and BigVGAN.

<details> <summary> [Read More] </summary>

Our Previous Works

Baseline Model

Waveform Generator for High-quality Audio Generation

Self-supervised Speech Model

Other Large Language Model based Speech Synthesis Model

Diffusion-based Model

AdaLN-zero

Thanks for all these nice works.

</details>