🚧 I no longer actively update this repo, but I continue to push this technology forward in open source and toward positive uses. Join me at MaskGCT. I'm also building an optimized, cloud-hosted version: https://noiz.ai/

mockingbird

MIT License

English | 中文| 中文Linux

Features

🌍 Chinese: supports Mandarin and has been tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, etc.

🤩 PyTorch: works with PyTorch, tested with version 1.9.0 (the latest as of August 2021) on Tesla T4 and GTX 2060 GPUs

🌍 Windows + Linux: runs on both Windows and Linux (and even on M1 macOS)

🤩 Easy & Awesome: good results with only a newly trained synthesizer, by reusing the pretrained encoder/vocoder

🌍 Webserver Ready: serve your results via remote calls

DEMO VIDEO

Quick Start

1. Install Requirements

1.1 General Setup

Follow the original repo to check that your environment is ready. **Python 3.7 or higher** is needed to run the toolbox.

If you get "ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu102 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)", the cause is probably a Python version that is too low. Try Python 3.9 and the install should succeed.
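Before installing the requirements, it can help to confirm the interpreter version. The following is a minimal sketch, not part of the project; the helper name `python_is_supported` is hypothetical, and the minimum of 3.7 is taken from the requirement stated above.

```python
import sys

# Minimum version stated in this README (assumption: only major.minor matters).
MIN_PYTHON = (3, 7)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so that patch releases do not matter.
    return tuple(version_info[:2]) >= tuple(minimum)


if python_is_supported():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python %d.%d is too old; 3.7 or higher is needed." % sys.version_info[:2])
```

If the check passes but pip still cannot find the pinned torch build, switching to Python 3.9 as suggested above is the next thing to try.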

The recommended environment is Repo Tag 0.0.1, PyTorch 1.9.0 with Torchvision 0.10.0 and cudatoolkit 10.2, requirements.txt, and webrtcvad-wheels, because requirements.txt was exported a few months ago and does not work with newer versions.

1.2 Setup with a M1 Mac

The following steps are a workaround to use the original demo_toolbox.py directly, without changing any code.

The main issue is that the PyQt5 packages used by demo_toolbox.py are not compatible with M1 chips. To train models on an M1 chip, you can either forgo demo_toolbox.py or try web.py in the project.

1.2.1 Install PyQt5, with ref here.
1.2.2 Install pyworld and ctc-segmentation

Both packages seem to be unique to this project and are not used in the original Real-Time Voice Cloning project. Neither package ships wheels, so pip install tries to compile them from C code and fails because it cannot find Python.h.

1.2.3 Other dependencies
1.2.4 Run the Inference Time (with Toolbox)

To run the project on x86 architecture, see the ref.

2. Prepare your models

Note that we use the pretrained encoder/vocoder but not the pretrained synthesizer, since the original model is incompatible with Chinese symbols. This means demo_cli does not work at the moment, so additional synthesizer models are required.

You can either train your models or use existing ones:

2.1 Train encoder with your dataset (Optional)

For training, the encoder uses visdom. You can disable it with --no_visdom, but it's nice to have. Run "visdom" in a separate CLI/process to start your visdom server.

2.2 Train synthesizer with your dataset

2.3 Use pretrained model of synthesizer

Thanks to the community, some models will be shared:

| Author | Download link | Preview Video | Info |
| --- | --- | --- | --- |
| @author | https://pan.baidu.com/s/1iONvRxmkI-t1nHqxKytY3g Baidu code: 4j5d | | 75k steps trained by multiple datasets |
| @author | https://pan.baidu.com/s/1fMh9IlgKJlL2PIiRTYDUvw Baidu code: om7f | | 25k steps trained by multiple datasets, only works under version 0.0.1 |
| @FawenYo | https://yisiou-my.sharepoint.com/:u:/g/personal/lawrence_cheng_fawenyo_onmicrosoft_com/EWFWDHzee-NNg9TWdKckCc4BC7bK2j9cCbOWn0-_tK0nOg?e=n0gGgC | input output | 200k steps with local accent of Taiwan, only works under version 0.0.1 |
| @miven | https://pan.baidu.com/s/1PI-hM3sn5wbeChRryX-RCQ code: 2021 https://www.aliyundrive.com/s/AwPsbo8mcSP code: z2m0 | https://www.bilibili.com/video/BV1uh411B7AD/ | only works under version 0.0.1 |

2.4 Train vocoder (Optional)

Note: the vocoder makes little difference to the result, so you may not need to train a new one.

Replace <datasets_root> with your dataset root and <synthesizer_model_path> with the directory of your best trained synthesizer models, e.g. synthesizer\saved_models\xxx

3. Launch

3.1 Using the web server

You can then run: python web.py and open it in a browser; the default address is http://localhost:8080

3.2 Using the Toolbox

You can then try the toolbox: python demo_toolbox.py -d <datasets_root>

3.3 Using the command line

You can then try the command: python gen_voice.py <text_file.txt> your_wav_file.wav

You may need to install cn2an with "pip install cn2an" for better results with numbers.

Reference

This repository is forked from Real-Time-Voice-Cloning, which only supports English.

| URL | Designation | Title | Implementation source |
| --- | --- | --- | --- |
| 1803.09017 | GlobalStyleToken (synthesizer) | Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis | This repo |
| 2010.05646 | HiFi-GAN (vocoder) | Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | This repo |
| 2106.02297 | Fre-GAN (vocoder) | Fre-GAN: Adversarial Frequency-consistent Audio Synthesis | This repo |
| 1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | This repo |
| 1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
| 1703.10135 | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis | fatchord/WaveRNN |
| 1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | This repo |

FAQ

1. Where can I download the dataset?

| Dataset | Original Source | Alternative Sources |
| --- | --- | --- |
| aidatatang_200zh | OpenSLR | Google Drive |
| magicdata | OpenSLR | Google Drive (Dev set) |
| aishell3 | OpenSLR | Google Drive |
| data_aishell | OpenSLR | |

After unzipping aidatatang_200zh, you also need to unzip all the files under aidatatang_200zh\corpus\train

2. What is <datasets_root>?

If the dataset path is D:\data\aidatatang_200zh, then <datasets_root> is D:\data
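In other words, <datasets_root> is the parent directory of the dataset folder. A small sketch of that rule follows; the helper name `datasets_root_for` is hypothetical, and `PureWindowsPath` is used so the Windows-style example path is parsed the same way on any operating system.

```python
from pathlib import PureWindowsPath


def datasets_root_for(dataset_path):
    """Return the parent directory of a dataset folder as a string."""
    return str(PureWindowsPath(dataset_path).parent)


# The example from the answer above.
print(datasets_root_for(r"D:\data\aidatatang_200zh"))
```

The returned value is what you would pass wherever a command expects <datasets_root>, for example after -d when launching the toolbox.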

3. Not enough VRAM

Train the synthesizer: adjust the batch_size in synthesizer/hparams.py

# Before
tts_schedule = [(2,  1e-3,  20_000,  12),   # Progressive training schedule
                (2,  5e-4,  40_000,  12),   # (r, lr, step, batch_size)
                (2,  2e-4,  80_000,  12),   #
                (2,  1e-4, 160_000,  12),   # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000,  12),   #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000,  12)],  # lr = learning rate
# After (batch_size lowered from 12 to 8)
tts_schedule = [(2,  1e-3,  20_000,  8),    # Progressive training schedule
                (2,  5e-4,  40_000,  8),    # (r, lr, step, batch_size)
                (2,  2e-4,  80_000,  8),    #
                (2,  1e-4, 160_000,  8),    # r = reduction factor (# of mel frames
                (2,  3e-5, 320_000,  8),    #     synthesized for each decoder iteration)
                (2,  1e-5, 640_000,  8)],   # lr = learning rate
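The change above only touches the last element of each (r, lr, step, batch_size) tuple. As an illustration, here is a sketch that derives the smaller schedule from the original one instead of editing each row by hand. The helper `with_batch_size` is hypothetical and is not part of synthesizer/hparams.py.

```python
def with_batch_size(schedule, batch_size):
    """Return a copy of a (r, lr, step, batch_size) schedule with a new batch size."""
    return [(r, lr, step, batch_size) for (r, lr, step, _old) in schedule]


# Schedule values copied from the "Before" listing.
tts_schedule = [
    (2, 1e-3, 20_000, 12),
    (2, 5e-4, 40_000, 12),
    (2, 2e-4, 80_000, 12),
    (2, 1e-4, 160_000, 12),
    (2, 3e-5, 320_000, 12),
    (2, 1e-5, 640_000, 12),
]

smaller_schedule = with_batch_size(tts_schedule, 8)
for row in smaller_schedule:
    print(row)
```

If 8 still does not fit in VRAM, the same helper can be called with a lower value; the reduction factor, learning rates, and step boundaries stay unchanged.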

Train the vocoder (preprocess the data): adjust synthesis_batch_size in synthesizer/hparams.py

# Before
### Data Preprocessing
        max_mel_frames = 900,
        rescale = True,
        rescaling_max = 0.9,
        synthesis_batch_size = 16,                  # For vocoder preprocessing and inference.
# After (synthesis_batch_size lowered from 16 to 8)
### Data Preprocessing
        max_mel_frames = 900,
        rescale = True,
        rescaling_max = 0.9,
        synthesis_batch_size = 8,                   # For vocoder preprocessing and inference.

Train the vocoder (train the vocoder): adjust voc_batch_size in vocoder/wavernn/hparams.py

# Before
# Training
voc_batch_size = 100
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad = 2

# After (voc_batch_size lowered from 100 to 6)
# Training
voc_batch_size = 6
voc_lr = 1e-4
voc_gen_at_checkpoint = 5
voc_pad = 2

4. What if I get RuntimeError: Error(s) in loading state_dict for Tacotron: size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([70, 512]) from checkpoint, the shape in current model is torch.Size([75, 512])?

Please refer to issue #37
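The error message says that the checkpoint stores encoder.embedding.weight with 70 rows while the current model expects 75. To see every parameter that differs, the shapes on both sides can be compared. The sketch below works on plain dictionaries of shapes so that it runs without PyTorch; the helper `find_shape_mismatches` is hypothetical. With PyTorch, such dictionaries could be built as {name: tuple(tensor.shape) for name, tensor in state_dict.items()}, assuming the checkpoint gives you access to a state dict.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for parameters whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters missing from the model are skipped; only size conflicts are reported.
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes taken from the error message above.
checkpoint_shapes = {"encoder.embedding.weight": (70, 512)}
model_shapes = {"encoder.embedding.weight": (75, 512)}

for name, ckpt, model in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt} vs current model {model}")
```

This only diagnoses the mismatch; it does not replace the steps described in the referenced issue.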

5. How can I improve CPU and GPU utilization?

Adjust the batch_size as appropriate to improve utilization.

6. What if I get the error "the page file is too small to complete the operation"?

Please refer to this video and change the virtual memory to 100 GB (102400 MB). For example, if the files are on the D drive, change the virtual memory of the D drive.

7. When should I stop during training?

FYI, my attention came after 18k steps and the loss became lower than 0.4 after 50k steps.

[Figures: attention_step_20500_sample_1, step-135500-mel-spectrogram_sample_1]
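If you log the loss during training, a rough way to spot the point described above is to look for the first step at which a short moving average of the loss falls below a threshold. This is only a sketch: the helper `first_step_below`, the window size, and the sample history are made up for illustration, and the 0.4 threshold is just the figure from the run mentioned above, not a general rule.

```python
def first_step_below(history, threshold=0.4, window=3):
    """Return the first step where the mean of the last `window` losses is below threshold."""
    recent = []
    for step, loss in history:
        recent.append(loss)
        if len(recent) > window:
            recent.pop(0)  # keep only the most recent `window` values
        if len(recent) == window and sum(recent) / window < threshold:
            return step
    return None  # the threshold was never reached


# Hypothetical (step, loss) pairs, not measurements from this project.
history = [
    (10_000, 0.90),
    (20_000, 0.60),
    (30_000, 0.45),
    (40_000, 0.41),
    (50_000, 0.38),
    (60_000, 0.36),
    (70_000, 0.35),
]

print(first_step_below(history))
```

Listening to generated samples and checking the attention plots remains the more reliable signal; the loss value alone is only a hint.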