FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction and propose strategies for extracting clean content information without text annotation. We disentangle content information by imposing an information bottleneck on WavLM features, and we propose spectrogram-resize (SR) based data augmentation to improve the purity of the extracted content information.
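
To give a feel for the spectrogram-resize idea, here is a minimal sketch. It assumes the operation is a vertical (frequency-axis) resize of a mel-spectrogram followed by padding or cropping back to the original number of bins. The function name `vertical_resize`, the linear interpolation, and the noise scale are illustrative assumptions, not the code in this repo, and the vocoder step that would turn the modified spectrogram back into audio is not shown.

```python
import numpy as np


def vertical_resize(mel, new_bins, rng=None):
    """Resize a mel-spectrogram along the frequency axis, then restore its shape.

    mel: array of shape (n_bins, n_frames), row 0 being the lowest frequency.
    new_bins: hypothetical target height before padding or cropping.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_bins, n_frames = mel.shape

    # Positions on the original bin axis that each resized bin samples from.
    src_pos = np.linspace(0.0, n_bins - 1, new_bins)
    orig_pos = np.arange(n_bins)

    # Linear interpolation along frequency, one frame at a time.
    resized = np.stack(
        [np.interp(src_pos, orig_pos, mel[:, t]) for t in range(n_frames)],
        axis=1,
    )

    if new_bins >= n_bins:
        # Stretched: drop the extra high-frequency bins.
        return resized[:n_bins]

    # Squeezed: fill the missing top bins with the highest bin plus small noise
    # (the noise scale here is an arbitrary illustrative value).
    pad_rows = n_bins - new_bins
    top = np.repeat(resized[-1:], pad_rows, axis=0)
    noise = rng.normal(0.0, 1e-3, size=top.shape)
    return np.concatenate([resized, top + noise], axis=0)
```

For example, with an assumed 80-bin mel-spectrogram, `vertical_resize(mel, 68)` squeezes the spectral content downward and `vertical_resize(mel, 92)` stretches it upward; in both cases the output keeps the input shape.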

🤗 Play online at HuggingFace Spaces.

Visit our demo page for audio samples.

We also provide the pretrained models.

<table style="width:100%"> <tr> <td><img src="./resources/train.png" alt="training" height="200"></td> <td><img src="./resources/infer.png" alt="inference" height="200"></td> </tr> <tr> <th>(a) Training</th> <th>(b) Inference</th> </tr> </table>

Updates

Code release. (Nov 27, 2022)
Online demo at HuggingFace Spaces. (Dec 14, 2022)
Support for 24 kHz outputs. See here for details. (Dec 15, 2022)
Fixed a data loading bug. (Jan 10, 2023)

Pre-requisites

Clone this repo: git clone https://github.com/OlaWod/FreeVC.git
Change into the repo directory: cd FreeVC
Install the Python requirements: pip install -r requirements.txt
Download WavLM-Large and place it in the 'wavlm/' directory
Download the VCTK dataset (for training only)
Download the HiFi-GAN model and place it in the 'hifigan/' directory (for training with SR only)

Inference Example

Download the pretrained checkpoints and run:

# inference with FreeVC
CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile logs/freevc.json --ptfile checkpoints/freevc.pth --txtpath convert.txt --outdir outputs/freevc

# inference with FreeVC-s
CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile logs/freevc-s.json --ptfile checkpoints/freevc-s.pth --txtpath convert.txt --outdir outputs/freevc-s
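
The commands above read their conversion pairs from the file passed to `--txtpath`. This README does not describe that file's layout, so the sketch below rests on an assumption: one pair per line, written as `title|source_wav|target_wav`. Check `convert.py` before relying on it. The helper name `build_convert_lines` and the file names in the usage note are hypothetical.

```python
def build_convert_lines(pairs):
    """Format (title, source_wav, target_wav) tuples as pipe-separated lines.

    The 'title|source|target' layout is an assumption about convert.txt,
    not something this README states.
    """
    lines = []
    for title, src, tgt in pairs:
        for field in (title, src, tgt):
            # A pipe or newline inside a field would break the assumed layout.
            if "|" in field or "\n" in field:
                raise ValueError(f"field contains a separator: {field!r}")
        lines.append(f"{title}|{src}|{tgt}")
    return lines
```

Joining the result of `build_convert_lines([("demo", "src.wav", "tgt.wav")])` with newlines and writing it to `convert.txt` would give a one-line file under that assumption.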

Training Example

Preprocess

python downsample.py --in_dir </path/to/VCTK/wavs>
ln -s dataset/vctk-16k DUMMY

# run this if you want a different train-val-test split
python preprocess_flist.py

# run this if you want to use a pretrained speaker encoder
CUDA_VISIBLE_DEVICES=0 python preprocess_spk.py

# run this if you want to train without SR-based augmentation
CUDA_VISIBLE_DEVICES=0 python preprocess_ssl.py

# run these if you want to train with SR-based augmentation
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 68 --max 72
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 73 --max 76
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 77 --max 80
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 81 --max 84
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 85 --max 88
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 89 --max 92
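
The six `preprocess_sr.py` commands above split the range 68 to 92 into contiguous chunks and spread them over GPUs 1 to 3. If your GPU count differs, a small helper can generate an equivalent set of commands. This is a sketch that assumes `--min` and `--max` are inclusive bounds (as the non-overlapping ranges above suggest); the function name `sr_commands` and its parameters are hypothetical and not part of this repo.

```python
def sr_commands(lo, hi, gpus, jobs_per_gpu=2):
    """Split the inclusive range [lo, hi] into contiguous chunks, one per job."""
    n_jobs = len(gpus) * jobs_per_gpu
    total = hi - lo + 1
    # Spread the remainder over the first few jobs so chunk sizes differ by at most one.
    base, extra = divmod(total, n_jobs)

    commands = []
    start = lo
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)
        if size == 0:
            continue  # more jobs than values: skip the empty ones
        end = start + size - 1
        gpu = gpus[job // jobs_per_gpu]
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} python preprocess_sr.py "
            f"--min {start} --max {end}"
        )
        start = end + 1
    return commands
```

Calling `sr_commands(68, 92, gpus=[1, 2, 3])` reproduces the six commands listed above; passing a different GPU list, such as `gpus=[0]`, yields the same coverage on fewer devices.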

Train