V2C

PyTorch implementation of “V2C: Visual Voice Cloning”

Get MCD Metrics

The pymcd package computes the Mel Cepstral Distortion (MCD) in Python. It is used to assess the quality of generated speech by measuring the discrepancy between the generated and ground-truth speech.

Overview

Mel Cepstral Distortion (MCD) measures how different two sequences of mel cepstra are and is widely used to evaluate the performance of speech synthesis models. The metric compares K-dimensional (by default K=13) Mel-Frequency Cepstral Coefficient (MFCC) vectors derived from the generated speech and the ground truth, respectively.
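For reference, the textbook per-frame form of the MCD between a reference MFCC vector $c$ and a synthesized MFCC vector $\hat{c}$ with $K$ coefficients is given below; the final score is averaged over (aligned) frames. Treat this as the standard definition rather than the exact pymcd implementation, whose constants and normalization may differ slightly.

$$
\mathrm{MCD}(c, \hat{c}) \;=\; \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{K}\left(c_k - \hat{c}_k\right)^{2}}
$$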

The pymcd package provides scripts to compute several forms of the MCD score: plain MCD, MCD-DTW, and MCD-DTW-SL.

Installation

Requires Python 3. The package can be installed and updated using pip:

pip install -U pymcd

Example

from pymcd.mcd import Calculate_MCD

# create an instance of the MCD calculator
# the three modes "plain", "dtw" and "dtw_sl" correspond to the three MCD metrics above
mcd_toolbox = Calculate_MCD(MCD_mode="plain")

# two inputs w.r.t. reference (ground-truth) and synthesized speeches, respectively
mcd_value = mcd_toolbox.calculate_mcd("001.wav", "002.wav")
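To compute MCD-DTW or MCD-DTW-SL instead, construct the calculator with MCD_mode="dtw" or MCD_mode="dtw_sl", respectively; the call to calculate_mcd stays the same.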

Get Dataset

1. V2C-Animation Dataset Construction

(1) Overall processes of dataset

The overall construction process of the V2C-Animation dataset can be divided into three parts: 1) data pre-processing, 2) data collection, and 3) data annotation & organization.

1) data pre-processing

<p align="center"> <img src="./DataConstruction/images/data_preprocessing.png" alt="example" width="60%"> </p> <p align="center"> Figure: The process of data pre-processing. </p>

To alleviate the impact of the background music, we extract only the sound channel of the center speaker, which mainly carries the voice of the speaking character. In practice, we use Adobe Premiere Pro (Pr) to extract the voice of the center speaker.

<p align="center"> <img src="https://static.diffen.com/uploadz/6/6e/5.1-surround-sound.png" alt="example" width="50%"> </p> <p align="center"> Figure: 5.1 surround sound. </p>

As shown in the image above, 5.1 surround sound has six sound channels and therefore six speakers: a center speaker, a subwoofer (for low-frequency effects such as explosions), left and right front speakers, and left and right rear speakers. (The image and description are from https://www.diffen.com.)
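In practice this step is done in Adobe Premiere Pro; purely as a sketch of a command-line alternative (assumed, not part of the released pipeline), the front-center (FC) channel of a 5.1 track can also be extracted with FFmpeg's channelsplit filter:

```python
# Sketch only: extract the front-center (FC) channel of a 5.1 audio track with
# FFmpeg's channelsplit filter. Assumes FFmpeg is on the PATH and the input
# actually carries 5.1 audio; file names below are hypothetical.
import subprocess

def extract_center_channel(movie_path, output_wav):
    subprocess.run([
        "ffmpeg", "-y", "-i", movie_path,
        "-filter_complex", "channelsplit=channel_layout=5.1:channels=FC[center]",
        "-map", "[center]", output_wav,
    ], check=True)

extract_center_channel("zootopia.mkv", "zootopia_center.wav")  # hypothetical file names
```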

2) data collection

<p align="center"> <img src="./DataConstruction/images/data_collection.png" alt="example" width="80%"> </p> <p align="center"> Figure: The process of data collection. </p>

We search for animated movies with corresponding subtitles and select a set of 26 movies of diverse genres. Specifically, we first cut the movies into a series of video clips according to the subtitle files. Here, we use SRT subtitle files. In addition to the subtitle text, an SRT file contains the starting and ending time-stamps that keep the subtitles in sync with the video and audio, as well as a sequential subtitle number (e.g., No. 726 and No. 1340 in the figure), which serves as the index of each video clip. Based on the SRT file, we cut the movie into a series of video clips using the FFmpeg toolkit (an automatic audio and video processing toolkit) and then extract the audio from each video clip with FFmpeg as well; a minimal sketch of this step is given after the figure below.

<p align="center"> <img src="./DataConstruction/images/movie_clip_with_subtitle.png" alt="example" width="80%"> </p> <p align="center"> Figure: Examples of how to cut a movie into a series of video clips according to subtitle files. Note that the subtitle files contain both starting and ending time-stamps for each video clip. </p>

3) data annotation & organization

<p align="center"> <img src="./DataConstruction/images/data_organization.png" alt="example" width="70%"> </p> <p align="center"> Figure: The processes of data annotation and organization. </p>

Inspired by the organization of the LibriSpeech dataset, we categorize the obtained video clips, audios and subtitles by their corresponding characters (i.e., speakers) via a crowd-sourced service. To ensure that the character appearing in a video clip is the one who is speaking, we manually remove the examples that do not satisfy this requirement. Then, following the categories of FER-2013 (a dataset for human facial expression recognition), we divide the collected video/audio clips into 8 types, including angry, happy, sad, etc. In this way, we collect a dataset of 10,217 video clips in total, each with paired audio and subtitles. All of the annotations, the time-stamps of the mined movie clips and a tool to extract the triplet data will be released.

<p align="center"> <img src="./DataConstruction/images/emotion_distribution.jpg" alt="example" width="50%"> </p> <p align="center"> Figure: Distribution of emotion labels on V2C-Animation. </p>

We divide the collected video/audio clips into 8 types (i.e., 0: angry, 1: disgust, 2: fear, 3: happy, 4: neutral, 5: sad, 6: surprise, and 7: others). The corresponding emotion labels for the video clips are in emotions.json.

<p align="center"> <img src="./DataConstruction/images/character_emotion.png" alt="example" width="50%"> </p> <p align="center"> Figure: Samples of the character's emotion (e.g., happy and sad) involved in the reference video. Here, we take Elsa (a character in movie Frozen) as an example. </p>

(2) Organization of V2C-Animation dataset

Run the following command, which produces and organizes the data automatically. The movie file names in movie_path must match the names of the corresponding SRT files in SRT_path.

python toolkit_data.py --SRT_path (path_of_SRT_files) --movie_path (path_of_movies) --output_path (path_of_output_data)

Note that this command covers processes 2 and 3 only. You therefore need to pre-process the movies according to process 1 above, i.e., remove the background music and keep only the voice of the center speaker.

The organization of the V2C-Animation dataset:

<p align="center"> <img src="./DataConstruction/images/V2C-Speaker-v1.jpg" alt="example" width="100%"> </p> <p align="center"> Figure: Movies with the corresponding speakers/characters on the V2C-Animation dataset. </p>
<root>
    |
    .- movie_dataset/
           |
           .- zootopia/
           |   |
           |   .- zootopia_speeches/
           |   |   |
           |   |   .- Daddy/
           |   |   |   |
           |   |   |   .- 00/
           |   |   |       |
           |   |   |       .- Daddy-00.trans.txt
           |   |   |       |
           |   |   |       .- Daddy-00-0034.wav
           |   |   |       |
           |   |   |       .- Daddy-00-0034.normalized.txt
           |   |   |       |
           |   |   |       .- Daddy-00-0036.wav
           |   |   |       |
           |   |   |       .- Daddy-00-0036.normalized.txt
           |   |   |       |
           |   |   |       ...
           |   |   |
           |   |   .- Judy/
           |   |       | ...
           |   |
           |   .- zootopia_videos/
           |       |
           |       .- Daddy/
           |       |   |
           |       |   .- 0034.mp4
           |       |   |
           |       |   .- 0036.mp4
           |       |   |
           |       |   ...
           |       .- Judy/
           |           | ...
           | ...
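
A minimal sketch (assumed, not an official loader) of how to walk this layout and pair each audio clip with its normalized transcript and the corresponding video clip:

```python
# Sketch only: collect (video, audio, text) triplets from the layout above.
# The directory structure is taken from this README; helper names are assumed.
from pathlib import Path

def collect_triplets(root, movie="zootopia"):
    base = Path(root) / "movie_dataset" / movie
    speech_dir = base / f"{movie}_speeches"
    video_dir = base / f"{movie}_videos"
    triplets = []
    for wav in sorted(speech_dir.glob("*/*/*.wav")):      # <speaker>/<chapter>/<utt>.wav
        speaker = wav.parent.parent.name                   # e.g. "Daddy"
        clip_id = wav.stem.split("-")[-1]                  # e.g. "0034"
        text = wav.parent / f"{wav.stem}.normalized.txt"   # normalized subtitle text
        video = video_dir / speaker / f"{clip_id}.mp4"     # corresponding video clip
        if text.exists() and video.exists():
            triplets.append((video, wav, text))
    return triplets
```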

(3) Links of animated movies

We provide a hyperlink for each animated movie in the V2C-Animation dataset.

Bossbaby, Brave, Cloudy, CloudyII, COCO, Croods, Dragon, DragonII, Frozen, FrozenII, Incredibles, IncrediblesII, Inside, Meet, Moana, Ralph, Tangled, Tinker, TinkerII, TinkerIII, Toy, ToyII, ToyIII, Up, Wreck, Zootopia

Experimental Results

To investigate the performance of the proposed method, we conduct experiments in two different settings.

Setting 1: we compare our method with baselines using the ground-truth intermediate duration, pitch and energy values.

| Method | MCD | MCD-DTW | MCD-DTW-SL | Id. Acc. (%) | Emo. Acc. (%) | MOS-naturalness | MOS-similarity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | 00.00 | 00.00 | 00.00 | 90.62 | 84.38 | 4.61 ± 0.15 | 4.74 ± 0.12 |
| Fastspeech2 | 12.08 | 10.29 | 10.31 | 59.38 | 53.13 | 3.86 ± 0.07 | 3.75 ± 0.06 |
| V2C-Net (Ours) | 11.79 | 10.09 | 10.05 | 62.50 | 56.25 | 3.97 ± 0.06 | 3.90 ± 0.06 |

Setting 2: we compare our method with baselines using the predicted intermediate duration, pitch and energy values.

| Method | MCD | MCD-DTW | MCD-DTW-SL | Id. Acc. (%) | Emo. Acc. (%) | MOS-naturalness | MOS-similarity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | 00.00 | 00.00 | 00.00 | 90.62 | 84.38 | 4.61 ± 0.15 | 4.74 ± 0.12 |
| SV2TTS | 21.08 | 12.87 | 49.56 | 33.62 | 37.19 | 2.03 ± 0.22 | 1.92 ± 0.15 |
| Fastspeech2 | 20.78 | 14.39 | 19.41 | 21.72 | 46.82 | 2.79 ± 0.10 | 2.79 ± 0.10 |
| V2C-Net (Ours) | 20.61 | 14.23 | 19.15 | 26.84 | 48.41 | 3.19 ± 0.04 | 3.06 ± 0.06 |