# ConsistencyVC-voice-conversion

Joint training of a speaker encoder with a consistency loss to achieve cross-lingual voice conversion and expressive voice conversion.
Demo page: https://consistencyvc.github.io/ConsistencyVC-demo-page

The Whisper medium model can be downloaded here: https://drive.google.com/file/d/1PZsfQg3PUZuu1k6nHvavd6OcOB_8m1Aa/view?usp=drive_link

The pre-trained models are available here: https://drive.google.com/drive/folders/1KvMN1V8BWCzJd-N8hfyP283rLQBKIbig?usp=sharing

Note: audio must be sampled at 16 kHz for both training and inference.
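Since the models expect 16 kHz input, you may need to resample your audio first. A minimal sketch using SciPy's polyphase resampler (the helper `to_16k` is illustrative and not part of this repo):

```python
# Resample a mono waveform to 16 kHz before training or inference.
# Illustrative helper; this repo does not ship it.
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(wav: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Polyphase-resample `wav` from `orig_sr` to `target_sr`."""
    if orig_sr == target_sr:
        return wav
    g = gcd(orig_sr, target_sr)
    return resample_poly(wav, target_sr // g, orig_sr // g)

# One second of a 440 Hz tone at 44.1 kHz becomes exactly 16000 samples at 16 kHz.
sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
wav16 = to_16k(np.sin(2 * np.pi * 440 * t).astype(np.float32), sr)
```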
<img src="cvc627.png" alt="cvc" width="100%">

## Inference with the pre-trained models (using WEO as an example)
1. Generate the WEO of the source speech in `src` with `preprocess_ppg.py`.
2. Copy the path of the reference speech to `tgt`.
3. Run `whisperconvert_exp.py` to perform voice conversion with WEO as the content feature.
4. For ConsistencyEVC, run `ppgemoconvert_exp.py` to perform voice conversion with PPG as the content feature.
## Inference for long audio

A new Python script handles inference for long audio. You no longer need to run Whisper in a separate file; just change this part and run the script.
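If you prefer to handle long recordings yourself, a common approach is to split the waveform into fixed-length chunks, convert each, and concatenate the results. A sketch of the chunking step (the chunk length and function name are assumptions, not taken from the repo's script):

```python
import numpy as np

def chunk_audio(wav: np.ndarray, sr: int = 16000, chunk_seconds: float = 10.0):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds each."""
    size = int(sr * chunk_seconds)
    return [wav[i:i + size] for i in range(0, len(wav), size)]

wav = np.zeros(16000 * 25, dtype=np.float32)  # 25 s of audio at 16 kHz
chunks = chunk_audio(wav)                     # three chunks: 10 s, 10 s, 5 s
```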
## Train models on your own dataset

1. Use `ppg.py` to generate the PPG.
2. Use `preprocess_ppg.py` to generate the WEO.
If you want to use WEO to train a cross-lingual voice conversion model:

First, train the model without the speaker consistency loss for 100k steps by changing this line to

```python
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl# + loss_emo
```

and running:

```shell
python train_whisper_emo.py -c configs/cvc-whispers-multi.json -m cvc-whispers-three
```

Then change the line back to fine-tune the model with the speaker consistency loss:

```shell
python train_whisper_emo.py -c configs/cvc-whispers-three-emo.json -m cvc-whispers-three
```
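To make the two stages concrete, here are both variants of the loss line side by side (the scalar values below are dummies for illustration only; in training these are tensors):

```python
# Dummy stand-ins for the real loss terms in train_whisper_emo.py.
loss_gen, loss_fm, loss_mel, loss_kl, loss_emo = 1.0, 0.5, 2.0, 0.1, 0.3

# Stage 1 (first 100k steps): speaker consistency loss commented out.
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl  # + loss_emo

# Stage 2 (fine-tuning): the line restored with the speaker consistency loss.
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl + loss_emo
```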
If you want to use PPG to train an expressive voice conversion model:

First, train the model without the speaker consistency loss for 100k steps by changing this line to

```python
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl# + loss_emo
```

and running:

```shell
python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo.json -m cvc-eng-ppgs-three-emo
```

Then change the line back to fine-tune the model with the speaker consistency loss:

```shell
python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo-cycleloss.json -m cvc-eng-ppgs-three-emo
```
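For intuition, a speaker consistency loss typically penalizes the distance between the speaker embedding of the converted speech and that of the reference speaker. A minimal cosine-distance sketch (an assumption for illustration; the actual loss in this repo may be defined differently):

```python
import numpy as np

def speaker_consistency_loss(emb_converted: np.ndarray, emb_reference: np.ndarray) -> float:
    """1 - cosine similarity between L2-normalized speaker embeddings."""
    a = emb_converted / np.linalg.norm(emb_converted)
    b = emb_reference / np.linalg.norm(emb_reference)
    return 1.0 - float(np.dot(a, b))

# Parallel embeddings give ~0 loss; orthogonal embeddings give loss 1.
same = speaker_consistency_loss(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
diff = speaker_consistency_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```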
## References

- The code structure is based on FreeVC-s. We recommend following the FreeVC instructions to install the Python requirements.
- The WEO content feature is based on LoraSVC.
- The PPG comes from the phoneme recognition model.