Home

Awesome

Voice Conversion with Non-Parallel Data

Subtitle: Speaking like Kate Winslet

Authors: Dabi Ahn(andabi412@gmail.com), Kyubyong Park(kbpark.linguist@gmail.com)

Samples

https://soundcloud.com/andabi/sets/voice-style-transfer-to-kate-winslet-with-deep-neural-networks

Intro

What if you could imitate a famous celebrity's voice or sing like a famous singer? This project started with a goal to convert someone's voice to a specific target voice. So called, it's voice style transfer. We worked on this project that aims to convert someone's voice to a famous English actress Kate Winslet's voice. We implemented a deep neural networks to achieve that and more than 2 hours of audio book sentences read by Kate Winslet are used as a dataset.

<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/title.png" width="50%"></p>

Model Architecture

This is a many-to-one voice conversion system. The main significance of this work is that we could generate a target speaker's utterances without parallel data like <source's wav, target's wav>, <wav, text> or <wav, phone>, but only waveforms of the target speaker. (To make these parallel datasets needs a lot of effort.) All we need in this project is a number of waveforms of the target speaker's utterances and only a small set of <wav, phone> pairs from a number of anonymous speakers.

<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/architecture.png" width="85%"></p>

The model architecture consists of two modules:

  1. Net1(phoneme classification) classify someone's utterances to one of phoneme classes at every timestep.
    • Phonemes are speaker-independent while waveforms are speaker-dependent.
  2. Net2(speech synthesis) synthesize speeches of the target speaker from the phones.

We applied CBHG(1-D convolution bank + highway network + bidirectional GRU) modules that are mentioned in Tacotron. CBHG is known to be good for capturing features from sequential data.

Net1 is a classifier.

Net2 is a synthesizer.

Net2 contains Net1 as a sub-network.

Implementations

Requirements

Settings

Procedure

<p align="center"><img src="https://raw.githubusercontent.com/andabi/deep-voice-conversion/master/materials/phoneme_dist.png" width="30%"></p>

Tips (Lessons We've learned from this project)

References