# ConsistencyVC-voice-conversion

Joint training of a speaker encoder with a consistency loss to achieve cross-lingual voice conversion and expressive voice conversion.
Demo page: https://consistencyvc.github.io/ConsistencyVC-demo-page

The Whisper medium model can be downloaded here: https://drive.google.com/file/d/1PZsfQg3PUZuu1k6nHvavd6OcOB_8m1Aa/view?usp=drive_link

The pre-trained models are available here: https://drive.google.com/drive/folders/1KvMN1V8BWCzJd-N8hfyP283rLQBKIbig?usp=sharing

Note: audio must be sampled at 16 kHz for both training and inference.
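Since the models expect 16 kHz input, you may need to resample your audio first. A minimal sketch using SciPy's polyphase resampler (the helper `to_16k` is illustrative and not part of this repo):

```python
# Resample a mono waveform to 16 kHz before training or inference.
# Illustrative helper; this repo does not ship it.
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(wav: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Polyphase-resample `wav` from `orig_sr` to `target_sr`."""
    if orig_sr == target_sr:
        return wav
    g = gcd(orig_sr, target_sr)
    return resample_poly(wav, target_sr // g, orig_sr // g)

# One second of a 440 Hz tone at 44.1 kHz becomes exactly 16000 samples at 16 kHz.
sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
wav16 = to_16k(np.sin(2 * np.pi * 440 * t).astype(np.float32), sr)
```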
<img src="cvc627.png" alt="cvc" width="100%">

## Inference with the pre-trained models (using WEO as an example)
1. Generate the WEO of the source speech in `src` with `preprocess_ppg.py`.
2. Copy the path of the reference speech to `tgt`.
3. Run `whisperconvert_exp.py` to perform voice conversion with WEO as the content feature.
4. For ConsistencyEVC, run `ppgemoconvert_exp.py` to perform voice conversion with PPG as the content feature.
## Inference for long audio

A new Python script handles inference for long audio. You no longer need to run Whisper in a separate file; just change this part and run the script.
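If you prefer to handle long recordings yourself, a common approach is to split the waveform into fixed-length chunks, convert each, and concatenate the results. A sketch of the chunking step (the chunk length and function name are assumptions, not taken from the repo's script):

```python
import numpy as np

def chunk_audio(wav: np.ndarray, sr: int = 16000, chunk_seconds: float = 10.0):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds each."""
    size = int(sr * chunk_seconds)
    return [wav[i:i + size] for i in range(0, len(wav), size)]

wav = np.zeros(16000 * 25, dtype=np.float32)  # 25 s of audio at 16 kHz
chunks = chunk_audio(wav)                     # three chunks: 10 s, 10 s, 5 s
```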
## Train models on your own dataset

1. Use `ppg.py` to generate the PPG.
2. Use `preprocess_ppg.py` to generate the WEO.
If you want to use WEO to train a cross-lingual voice conversion model:

First, train the model without the speaker consistency loss for 100k steps by changing this line to

```python
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl# + loss_emo
```

and running:

```shell
python train_whisper_emo.py -c configs/cvc-whispers-multi.json -m cvc-whispers-three
```

Then change the line back to fine-tune the model with the speaker consistency loss:

```shell
python train_whisper_emo.py -c configs/cvc-whispers-three-emo.json -m cvc-whispers-three
```
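To make the two stages concrete, here are both variants of the loss line side by side (the scalar values below are dummies for illustration only; in training these are tensors):

```python
# Dummy stand-ins for the real loss terms in train_whisper_emo.py.
loss_gen, loss_fm, loss_mel, loss_kl, loss_emo = 1.0, 0.5, 2.0, 0.1, 0.3

# Stage 1 (first 100k steps): speaker consistency loss commented out.
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl  # + loss_emo

# Stage 2 (fine-tuning): the line restored with the speaker consistency loss.
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl + loss_emo
```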
If you want to use PPG to train an expressive voice conversion model:

First, train the model without the speaker consistency loss for 100k steps by changing this line to

```python
loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl# + loss_emo
```

and running:

```shell
python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo.json -m cvc-eng-ppgs-three-emo
```

Then change the line back to fine-tune the model with the speaker consistency loss:

```shell
python train_eng_ppg_emo_loss.py -c configs/cvc-eng-ppgs-three-emo-cycleloss.json -m cvc-eng-ppgs-three-emo
```
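For intuition, a speaker consistency loss typically penalizes the distance between the speaker embedding of the converted speech and that of the reference speaker. A minimal cosine-distance sketch (an assumption for illustration; the actual loss in this repo may be defined differently):

```python
import numpy as np

def speaker_consistency_loss(emb_converted: np.ndarray, emb_reference: np.ndarray) -> float:
    """1 - cosine similarity between L2-normalized speaker embeddings."""
    a = emb_converted / np.linalg.norm(emb_converted)
    b = emb_reference / np.linalg.norm(emb_reference)
    return 1.0 - float(np.dot(a, b))

# Parallel embeddings give ~0 loss; orthogonal embeddings give loss 1.
same = speaker_consistency_loss(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
diff = speaker_consistency_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```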
## References

- The code structure is based on FreeVC-s. We recommend following the FreeVC instructions to install the Python requirements.
- The WEO content feature is based on LoraSVC.
- The PPG comes from the phoneme recognition model.