OverFlow: Putting flows on top of neural transducers for better TTS

Shivam Mehta, Éva Székely, Jonas Beskow, and Gustav Eje Henter

This is the official code repository for the paper "OverFlow: Putting flows on top of neural transducers for better TTS". For audio examples, visit our demo page. Pre-trained models (female and male) are also available.

OverFlow is now also available in Coqui TTS, making it easier to use and experiment with. You can find the training recipe under recipes/ljspeech/overflow; more recipes are rolling out soon!

```bash
# Install TTS
pip install tts
# Change --text to the desired text prompt
# Change --out_path to the desired output path
tts --text "Hello world!" --model_name tts_models/en/ljspeech/overflow --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --out_path output.wav
```
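
Coqui TTS also exposes a Python API. Below is a minimal sketch reusing the model name from the CLI call above; the exact API surface can vary between TTS releases, so check the documentation of your installed version:

```python
# Sketch of the Coqui TTS Python API (verify against your installed TTS version)
from TTS.api import TTS

# Load the OverFlow model; a matching default vocoder is fetched automatically
tts = TTS(model_name="tts_models/en/ljspeech/overflow")

# Synthesize a sentence straight to a wav file
tts.tts_to_file(text="Hello world!", file_path="output.wav")
```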

The current plan is to maintain both repositories.

<img src="docs/images/model_architecture.png" alt="Architecture of OverFlow" width="650"/>

Setup and training using LJ Speech

  1. Download and extract the LJ Speech dataset. Place it in the data folder so that the directory becomes data/LJSpeech-1.1; otherwise, update the filelists in data/filelists accordingly.
  2. Clone this repository: git clone https://github.com/shivammehta25/OverFlow.git
    • If using multiple GPUs, set gradient_checkpoint=False in src/hparams.py.
  3. Initialise the submodules: git submodule init; git submodule update
  4. Make sure you have Docker installed and running.
    • It is recommended to use Docker, as it manages the CUDA runtime libraries and the Python dependencies specified in the Dockerfile.
    • Alternatively, if you do not intend to use Docker, install the dependencies with pip install -r requirements.txt
  5. Run bash start.sh; it will install all the dependencies and start the container.
  6. Check src/hparams.py for hyperparameters and set GPUs (see the sketch after this list).
    1. For multi-GPU training, set GPUs to [0, 1 ..]
    2. For CPU training (not recommended), set GPUs to an empty list []
    3. Check the location of transcriptions
  7. Once your filelists and hparams are updated, run python generate_data_properties.py to generate data_parameters.pt for your dataset (a default data_parameters.pt for LJ Speech is included in the repository).
  8. Run python train.py to train the model.
    1. Checkpoints will be saved in hparams.checkpoint_dir.
    2. Tensorboard logs will be saved in hparams.tensorboard_log_dir.
  9. To resume training, run python train.py -c <CHECKPOINT_PATH>
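
As a concrete illustration of step 6, the GPU-related settings in src/hparams.py might look like the sketch below. The attribute names here are assumptions based on the steps above, not verified against the file, so check src/hparams.py itself:

```python
# src/hparams.py (sketch only; attribute names are assumptions, verify in the file)
gpus = [0, 1]                # multi-GPU training on GPUs 0 and 1
# gpus = [0]                 # single-GPU training
# gpus = []                  # CPU training (not recommended)

gradient_checkpoint = False  # disable gradient checkpointing for multi-GPU runs (step 2)
```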

Synthesis

  1. Download our pre-trained LJ Speech model.
  2. Download a pre-trained HiFi-GAN model.
    • We recommend the one fine-tuned on Tacotron 2 if you cannot fine-tune on OverFlow.
  3. Run jupyter notebook and open synthesis.ipynb, or use the overflow_speak.py script.

For one sentence:

```bash
python overflow_speak.py -t "Hello world" --checkpoint_path <CHECKPOINT_PATH> --hifigan_checkpoint_path <HIFIGAN_PATH> --hifigan_config <HIFIGAN_CONFIG_PATH>
```

For multiple sentences, put them in a file with one sentence per line:

```bash
python overflow_speak.py -f <FILENAME> --checkpoint_path <CHECKPOINT_PATH> --hifigan_checkpoint_path <HIFIGAN_PATH> --hifigan_config <HIFIGAN_CONFIG_PATH>
```
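
If you are generating many files, you can also drive the documented CLI from Python. The sketch below relies only on the flags shown above; the checkpoint and config paths are placeholder examples, not real files:

```python
# Batch synthesis by calling the overflow_speak.py CLI documented above
# (the three paths below are placeholder examples)
import subprocess
from pathlib import Path

sentences = ["Hello world!", "OverFlow puts flows on top of neural transducers."]
Path("sentences.txt").write_text("\n".join(sentences))

subprocess.run(
    [
        "python", "overflow_speak.py",
        "-f", "sentences.txt",
        "--checkpoint_path", "checkpoints/overflow_ljspeech.ckpt",
        "--hifigan_checkpoint_path", "hifigan/generator_v1",
        "--hifigan_config", "hifigan/config.json",
    ],
    check=True,
)
```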

Miscellaneous

OverFlow supports both mixed-precision and full-precision training, as well as multi-GPU and single-GPU training.
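
Both options are standard PyTorch Lightning features. As a generic illustration only (this is not the repository's actual trainer setup, and Trainer argument names vary across Lightning versions):

```python
# Generic PyTorch Lightning example (not this repo's code); arguments such as
# `precision` and `gpus` are valid for older Lightning releases.
from pytorch_lightning import Trainer

trainer = Trainer(precision=16, gpus=[0, 1])  # mixed precision, two GPUs
# trainer = Trainer(precision=32, gpus=[0])   # full precision, single GPU
```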

Known issues/warnings

TorchMetrics error on RTX 3090: if you hit this, pin the following versions in requirements.txt:

```
torch==1.11.0a0+b6df043
--extra-index-url https://download.pytorch.org/whl/cu113
torchmetrics==0.6.0
```
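
A quick way to confirm that the pinned versions are the ones actually active in your environment:

```python
# Print the installed versions to confirm they match the pins above
import torch
import torchmetrics

print("torch:", torch.__version__)                # expect 1.11.0a0+b6df043
print("torchmetrics:", torchmetrics.__version__)  # expect 0.6.0
```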

Support

If you have any questions or comments, please open an issue on our GitHub repository.

Citation information

If you use or build on our method or code for your research, please cite our paper:

```bibtex
@inproceedings{mehta2023overflow,
  title={{O}ver{F}low: {P}utting flows on top of neural transducers for better {TTS}},
  author={Mehta, Shivam and Kirkland, Ambika and Lameris, Harm and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. Interspeech},
  pages={4279--4283},
  doi={10.21437/Interspeech.2023-1996},
  year={2023}
}
```

Acknowledgements

The code implementation is based on NVIDIA's implementation of Tacotron 2 and on Glow-TTS, and uses PyTorch Lightning for boilerplate-free code.