<p align="center"> <br> <img src="assets/logo.png" width="200"/> <br> </p> <h2 align="center"> <p> Speech-Editing-Toolkit</p> </h2>

This repo contains the official PyTorch implementation of:

- FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

<p align="center"> <br> <img src="assets/spec_denoiser.gif" width="400" height="180"/> <br> </p>

This repo also contains unofficial PyTorch implementations of the following baselines:

- CampNet
- A3T
- EditSpeech

## Supported Datasets

Our framework supports the following datasets:

- VCTK
- LibriTTS

## Install Dependencies

Please install the latest numpy, torch, and tensorboard first. Then run the following commands:

```bash
export PYTHONPATH=.
# Install the Python requirements.
pip install -U pip
pip install -r requirements.txt
# sox is used for audio processing during data preparation.
sudo apt install -y sox libsox-fmt-mp3
```

Finally, install the Montreal Forced Aligner (MFA) following the documentation below:

https://montreal-forced-aligner.readthedocs.io/en/latest/
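
A minimal install sketch; it assumes the 2.0.0rc3 release (pinned in the Tips section below) is available on PyPI, so fall back to the official documentation above if this fails:

```bash
# Pin MFA to 2.0.0rc3 (see Tips below); the official docs describe
# the recommended conda-based install if the PyPI route fails.
pip install montreal-forced-aligner==2.0.0rc3
```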

## Download the pre-trained vocoder

```bash
mkdir -p pretrained/hifigan_hifitts
```

Download model_ckpt_steps_2168000.ckpt and config.yaml from https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN?usp=sharing and place them in pretrained/hifigan_hifitts.
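
One way to fetch them from the command line is with gdown; this is just a sketch (any Google Drive client, or a manual browser download, works equally well):

```bash
# Download the shared Google Drive folder into pretrained/hifigan_hifitts.
pip install gdown
gdown --folder "https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN" -O pretrained/hifigan_hifitts
```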

## Data Preprocess

```bash
# Set 'self.dataset_name' in base_preprocess.py and base_binarizer.py to 'vctk'
# or 'libritts' to choose the dataset (the default is 'vctk'). Also set the
# ``BASE_DIR`` value in ``run_mfa_train_align.sh`` to the corresponding directory.
python data_gen/tts/base_preprocess.py
bash data_gen/tts/run_mfa_train_align.sh
python data_gen/tts/base_binarizer.py
```
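
For example, switching the whole pipeline to LibriTTS could look like the sketch below; the exact assignment strings inside the scripts are assumptions, so check the files before running:

```bash
# Point the preprocess/binarize scripts at libritts instead of the default vctk
# (assumes the scripts contain a literal self.dataset_name = 'vctk' assignment).
sed -i "s/self.dataset_name = 'vctk'/self.dataset_name = 'libritts'/" \
    data_gen/tts/base_preprocess.py data_gen/tts/base_binarizer.py
# Point MFA at the matching processed-data directory (the path is an assumption).
sed -i "s|^BASE_DIR=.*|BASE_DIR=data/processed/libritts|" data_gen/tts/run_mfa_train_align.sh
```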

## Train (FluentSpeech)

```bash
# Example run for FluentSpeech.
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/spec_denoiser.yaml --exp_name spec_denoiser --reset
```
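
To monitor training, you can point TensorBoard at the experiment directory. A sketch, assuming logs are written under checkpoints/<exp_name> as implied by the command above:

```bash
# Training curves are logged under the experiment folder.
tensorboard --logdir checkpoints/spec_denoiser
```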

## Train (Baselines)

```bash
# Example run for CampNet.
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/campnet.yaml --exp_name campnet --reset
# Example run for A3T.
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/a3t.yaml --exp_name a3t --reset
# Example run for EditSpeech.
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/editspeech.yaml --exp_name editspeech --reset
```

## Pretrained Checkpoint

Here, we provide the pretrained checkpoint of FluentSpeech. To start, put config.yaml and the checkpoint file listed below at ./checkpoints/spec_denoiser/.

| Model | Dataset | URL | Checkpoint name |
|-------|---------|-----|-----------------|
| FluentSpeech | libritts-clean | https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing | model_ckpt_steps_568000.ckpt |
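
A minimal sketch of the expected layout, using the checkpoint name from the table above (the source paths are placeholders for wherever you downloaded the files):

```bash
mkdir -p checkpoints/spec_denoiser
# Copy the downloaded config and checkpoint into place (adjust the source paths).
cp /path/to/downloads/config.yaml /path/to/downloads/model_ckpt_steps_568000.ckpt checkpoints/spec_denoiser/
```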

## Inference

We provide the data structure for inference in inference/example.csv. text and edited_text refer to the original text and the target text. region refers to the word-index range (starting from 1) that you want to edit, and edited_region refers to the corresponding word-index range in edited_text.

| id | item_name | text | edited_text | wav_fn_orig | edited_region | region |
|----|-----------|------|-------------|-------------|---------------|--------|
| 0 | 1 | "this is a libri vox recording" | "this is a funny joke shows." | inference/audio_backup/1.wav | [3,6] | [3,6] |
```bash
# Run inference on one example.
python inference/tts/spec_denoiser.py --exp_name spec_denoiser
```
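
To run your own edit, append a row with the same columns to the CSV. A minimal sketch; the text, regions, and wav path below are hypothetical placeholders:

```bash
# Columns: id,item_name,text,edited_text,wav_fn_orig,edited_region,region
cat >> inference/example.csv <<'EOF'
1,2,"this is the original sentence","this is the modified sentence",inference/audio_backup/2.wav,"[4,5]","[4,5]"
EOF
```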

## Citation

If you find this useful for your research, please star our repo.

## License and Agreement

Any organization or individual is prohibited from using any technology in this repository to generate anyone's speech without their consent, including but not limited to the speech of government leaders, political figures, and celebrities. If you do not comply with this term, you could be in violation of copyright laws.

## Tips

  1. If mfa_dict.txt, mfa_model.zip, phone_set.json, or word_set.json are missing at inference time, run the preprocessing scripts in this repo to generate them. Alternatively, download all of the files needed for inference with the pre-trained model from https://drive.google.com/drive/folders/1BOFQ0j2j6nsPqfUlG8ot9I-xvNGmwgPK?usp=sharing and put them in data/processed/libritts.
  2. Please pin the Montreal Forced Aligner (MFA) version to 2.0.0rc3.

If you find any other problems, please contact me.