
<div align="center"> <h1> Grad-SVC based on Grad-TTS from HUAWEI Noah's Ark Lab </h1>

<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/PlayVoice/Grad-SVC"> <img alt="GitHub forks" src="https://img.shields.io/github/forks/PlayVoice/Grad-SVC"> <img alt="GitHub issues" src="https://img.shields.io/github/issues/PlayVoice/Grad-SVC"> <img alt="GitHub" src="https://img.shields.io/github/license/PlayVoice/Grad-SVC">

This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs from other diffusion-based SVC models. The code is adapted from Grad-TTS and whisper-vits-svc, so this project inherits the features of whisper-vits-svc. Incidentally, Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up to Grad-TTS.

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

[Figure: the Grad-TTS framework]

[Figure: the framework of grad-svc-v1]

[Figure: the framework of grad-svc-v2 & v3; encoder: 768->512, diffusion: 64->96]

https://github.com/PlayVoice/Grad-SVC/assets/16432329/f9b66af7-b5b5-4efb-b73d-adb0dc84a0ae

</div>

Features

  1. Clean, easy-to-read code inherited from Grad-TTS

  2. Multi-speaker support based on a speaker encoder

  3. No speaker leakage, thanks to perturbation, instance normalization, and a gradient reversal layer (GRL); see the sketch after this list

    One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

  4. No electronic-sounding artifacts

  5. Integrated DPM Solver-k for fewer sampling steps

  6. Integrated the Fast Maximum Likelihood Sampling Scheme, also for fewer sampling steps

  7. Conditional Flow Matching (V3), first used in SVC

  8. Rectified Flow Matching (TODO)
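
To make feature 3 concrete, here is a minimal PyTorch-style sketch (illustrative only, not the project's actual modules) of instance normalization stripping per-utterance speaker statistics from content features, plus a gradient reversal layer that turns an auxiliary speaker classifier into a speaker-information remover:

    import torch

    class GradientReversal(torch.autograd.Function):
        """Identity in the forward pass; negates gradients in the backward
        pass, so a speaker classifier on top trains the encoder to REMOVE
        speaker information (the GRL trick)."""
        @staticmethod
        def forward(ctx, x, alpha=1.0):
            ctx.alpha = alpha
            return x

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.alpha * grad_output, None

    def instance_norm(content, eps=1e-5):
        """Normalize each channel over time (B, C, T): the per-utterance
        mean/variance removed here carries much of the speaker's timbre."""
        mean = content.mean(dim=-1, keepdim=True)
        std = content.std(dim=-1, keepdim=True)
        return (content - mean) / (std + eps)

    # Hypothetical content features, e.g. hubert vectors: (batch, channels, frames)
    content = torch.randn(2, 256, 100)
    normalized = instance_norm(content)                       # timbre statistics stripped
    to_classifier = GradientReversal.apply(normalized, 1.0)   # feed a speaker classifier here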

Setup Environment

  1. Install project dependencies

    pip install -r requirements.txt
    
  2. Download the timbre encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/.

  3. Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.

  4. Download pretrained nsf_bigvgan_pretrain_32K.pth, and put it into bigvgan_pretrain/.

    Performance bottleneck: the generator and discriminator together are 116 MB, while the generator alone is only 22 MB.

  5. Download the pretrained model gvc.pretrain.pth and put it into grad_pretrain/.

    python gvc_inference.py --model ./grad_pretrain/gvc.pretrain.pth --spk ./assets/singers/singer0001.npy --wave test.wav
    

    For this pretrained model, the temperature is set to temperature=1.015 in gvc_inference.py to get good results.
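
Optionally, a short Python check (paths copied from steps 2-5 above; the script itself is not part of the repo) can confirm that every pretrained file is in place:

    from pathlib import Path

    # Expected locations from setup steps 2-5.
    required = [
        "speaker_pretrain/best_model.pth.tar",
        "hubert_pretrain/hubert-soft-0d54a1f4.pt",
        "bigvgan_pretrain/nsf_bigvgan_pretrain_32K.pth",
        "grad_pretrain/gvc.pretrain.pth",
    ]

    for rel in required:
        status = "ok" if Path(rel).is_file() else "MISSING"
        print(f"{status:8s} {rel}")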

Dataset preparation

Put the dataset into the data_raw directory following the structure below.

data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
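
Before preprocessing, an optional Python snippet (not part of the repo) can verify this layout by counting the wav files under each speaker folder:

    from pathlib import Path

    # Count wav files per speaker folder under data_raw (structure shown above).
    data_raw = Path("data_raw")
    for speaker_dir in sorted(p for p in data_raw.iterdir() if p.is_dir()):
        n_waves = len(list(speaker_dir.glob("*.wav")))
        print(f"{speaker_dir.name}: {n_waves} wav files")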

Data preprocessing

After preprocessing, you will get output with the following structure.

data_gvc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── mel
│   ├── speaker0
│   │   ├── 000001.mel.pt
│   │   └── 000xxx.mel.pt
│   └── speaker1
│       ├── 000001.mel.pt
│       └── 000xxx.mel.pt
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy

  1. Re-sampling
    • Generate audio with a sampling rate of 16000Hz in ./data_gvc/waves-16k
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
    
    • Generate audio with a sampling rate of 32000Hz in ./data_gvc/waves-32k
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
    
  2. Use 16k audio to extract pitch
    python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
    
  3. Use 32k audio to extract mel spectrograms
    python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
    
  4. Use 16k audio to extract hubert content vectors
    python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
    
  5. Use 16k audio to extract timbre code
    python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
    
  6. Extract the average value of the timbre code for inference
    python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
    
  7. Use 32k audio to generate training index
    python prepare/preprocess_train.py
    
  8. Training file debugging (see the optional driver script after this list)
    python prepare/preprocess_zzz.py
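
If you prefer to run all the steps unattended, a small Python driver (commands copied verbatim from the list above; not part of the repo) can chain them and stop at the first failure:

    import subprocess

    # Preprocessing steps 1-8, run in order; check=True aborts on failure.
    commands = [
        "python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000",
        "python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000",
        "python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch",
        "python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel",
        "python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert",
        "python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker",
        "python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer",
        "python prepare/preprocess_train.py",
        "python prepare/preprocess_zzz.py",
    ]

    for cmd in commands:
        print(f"==> {cmd}")
        subprocess.run(cmd, shell=True, check=True)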
    

Train

  1. Start training
    python gvc_trainer.py
    
  2. Resume training
    python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
    
  3. Log visualization
    tensorboard --logdir logs/
    

Train Loss

[Figure: training loss curve (loss_96_v2)]

[Figure: generated mel spectrogram (grad_svc_mel)]

Inference

  1. Export inference model

    python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pth
    
  2. Inference

python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --temperature 1.015 --shift 0
    

    The temperature (default temperature=1.015) needs to be adjusted to get good results; the recommended range is (1.001, 1.035). See the sweep sketch at the end of this section.

  3. Inference step by step

    • Extract hubert content vector
      python hubert/inference.py -w test.wav -v test.vec.npy
      
    • Extract pitch to the csv text format
      python pitch/inference.py -w test.wav -p test.csv
      
    • Convert hubert & pitch to wave
      python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
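
Because the temperature usually needs tuning, a small Python sweep over the recommended range can help. This sketch is not part of the repo, and since the README does not show how gvc_inference.py names its output, you may need to rename the result between runs:

    import subprocess

    # Recommended temperature range from the note above: (1.001, 1.035).
    temperatures = [1.001, 1.010, 1.015, 1.025, 1.035]

    for t in temperatures:
        cmd = (
            "python gvc_inference.py --model gvc.pth "
            "--spk ./data_gvc/singer/your_singer.spk.npy "
            f"--wave test.wav --temperature {t} --shift 0"
        )
        print(f"==> {cmd}")
        subprocess.run(cmd, shell=True, check=True)
        # NOTE: rename/move the output here if successive runs would overwrite it.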
      

Data

| Name | URL |
| ---- | --- |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |

Code sources and references

https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS

https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC

https://github.com/facebookresearch/speech-resynthesis

https://github.com/cantabile-kwok/VoiceFlow-TTS

https://github.com/shivammehta25/Matcha-TTS

https://github.com/shivammehta25/Diff-TTSG

https://github.com/majidAdibian77/ResGrad

https://github.com/LuChengTHU/dpm-solver

https://github.com/gmltmd789/UnitSpeech

https://github.com/zhenye234/CoMoSpeech

https://github.com/seahore/PPG-GradVC

https://github.com/thuhcsi/LightGrad

https://github.com/lmnt-com/wavegrad

https://github.com/naver-ai/facetts

https://github.com/jaywalnut310/vits

https://github.com/NVIDIA/BigVGAN

https://github.com/bshall/soft-vc

https://github.com/mozilla/TTS

https://github.com/ubisoft/ubisoft-laforge-daft-exprt

https://github.com/yl4579/StyleTTS-VC

https://github.com/MingjieChen/DYGANVC

https://github.com/sony/ai-research-code/tree/master/nvcnet