Speech Drives Templates

The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates". <br> project / paper / video

<p align="center"> <img src="./iccv2021_sdt.jpg" width=500px/> </p>

Our paper and this repo focus on upper-body pose generation from audio. To synthesize images from poses, please refer to this Pose2Img repo.

Directory hierarchy

|-- configs
|     |-- default.py
|     |-- voice2pose_s2g.yaml        # baseline: speech2gesture
|     |-- voice2pose_sdt_bp.yaml     # ours (Backprop)
|     |-- voice2pose_sdt_vae.yaml    # ours (VAE)
|     \-- pose2pose.yaml             # gesture reconstruction  
|
|-- core
|     |-- datasets
|     |-- networks
|     |-- pipelines
|     \-- utils
|
|-- datasets
|     \-- speakers
|           |-- oliver
|           |-- kubinec
|           \-- ...
|
|-- output
|     \-- <date-config-tag>  # A directory for each experiment
|
`-- main.py

Installation

To generate videos, you need ffmpeg installed on your system.

sudo apt install ffmpeg
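To confirm that ffmpeg is reachable from Python before starting a long run, a small check like the one below can help. It is only a sketch: the helper name `ffmpeg_version` is hypothetical and not part of this repo.

```python
import shutil
import subprocess


def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is absent."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version if version else "ffmpeg not found on PATH")
```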

Install Python packages

pip install -r requirements.txt

Dataset

We use a subset (Oliver and Kubinec) of the Speech2Gesture dataset and remove frames with bad human poses. We also collected data from two Mandarin speakers (Luo and Xing).

To ease later research, we have packed our processed data, including 2D human pose sequences and the corresponding audio clips. Please download it from this link and organize the data under datasets/speakers as shown in the directory hierarchy above.
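After unpacking, a quick sanity check can confirm that each speaker has a directory under datasets/speakers. The sketch below only checks that the directories exist; the helper name `missing_speakers` is hypothetical, and the files inside each speaker directory are not inspected.

```python
from pathlib import Path


def missing_speakers(root, speakers):
    """Return the speaker names that have no directory under `root`."""
    root = Path(root)
    return [name for name in speakers if not (root / name).is_dir()]


if __name__ == "__main__":
    # Speaker directory names are taken from the directory hierarchy.
    missing = missing_speakers("datasets/speakers", ["oliver", "kubinec"])
    if missing:
        print("Missing speaker directories:", ", ".join(missing))
    else:
        print("All speaker directories found.")
```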

Note that you do NOT need the source video frames to run this repo. In case you still want them for your own usage:

Since our method addresses the entire upper body, including the face and hands, the number of keypoints in our data is 137. For more details, please refer to this document.
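As a rough illustration of where 137 comes from: OpenPose outputs 25 body keypoints (BODY_25), 70 face keypoints, and 21 keypoints per hand, and 25 + 70 + 21 + 21 = 137. The sketch below splits one frame into those groups. The concatenation order and the names `KEYPOINT_GROUPS` and `split_keypoints` are assumptions for illustration; the linked document is the authoritative description of the layout.

```python
# Assumed layout: OpenPose BODY_25 body, 70 face, and 21 keypoints per hand.
# The concatenation order used here is an assumption, not taken from this repo.
KEYPOINT_GROUPS = {"body": 25, "face": 70, "left_hand": 21, "right_hand": 21}


def split_keypoints(pose):
    """Split a sequence of 137 keypoints into named groups."""
    total = sum(KEYPOINT_GROUPS.values())
    if len(pose) != total:
        raise ValueError(f"expected {total} keypoints, got {len(pose)}")
    parts, start = {}, 0
    for name, count in KEYPOINT_GROUPS.items():
        parts[name] = pose[start:start + count]
        start += count
    return parts


if __name__ == "__main__":
    dummy_frame = [(0.0, 0.0)] * 137  # one frame of 2D keypoints
    for name, group in split_keypoints(dummy_frame).items():
        print(name, len(group))
```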

Custom dataset

To build a dataset from custom videos, we provide reference scripts in data_preprocess/:

# ==== video processing ====
1_1_change_fps.py           # we use fps=15 by default
1_2_video2frames.py         # save each video as images

# ==== keypoint processing ====
2_1_gen_kpts.py             # use openpose to obtain keypoints
2_2_remove_outlier.py       # remove frames with badly predicted keypoints
2_3_rescale_shoulder_width.py  # (optional) rescale the keypoints

# ==== npz processing ====
3_1_generate_clips.py       # generate a csv file as an index and npz files for clips
3_2_split_train_val_test.py # edit the csv file for dataset division

# ==== speakers_stat processing ====
4_1_calculate_mean_std.py   # save the mean and std of each keypoint (137 points) into a npy file
4_2_parse_mean_std_npz.py   # parse the above npy and print out for `speakers_stat.py`
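The statistics in step 4_1 can be pictured with the sketch below. It is written in plain Python and assumes that each frame is a list of (x, y) pairs and that the mean and the population standard deviation are taken per coordinate over all frames. The actual script, its file format, and the function name `keypoint_mean_std` may differ.

```python
from statistics import fmean, pstdev


def keypoint_mean_std(frames):
    """Per-keypoint mean and std over frames.

    frames: list of frames, each a list of (x, y) keypoints.
    Returns (mean, std), each a list with one (x, y) pair per keypoint.
    """
    num_kpts = len(frames[0])
    mean, std = [], []
    for k in range(num_kpts):
        xs = [frame[k][0] for frame in frames]
        ys = [frame[k][1] for frame in frames]
        mean.append((fmean(xs), fmean(ys)))
        # Population std (divide by N); an assumption about the repo's choice.
        std.append((pstdev(xs), pstdev(ys)))
    return mean, std


if __name__ == "__main__":
    toy_frames = [[(0, 0), (2, 4)], [(2, 2), (4, 8)]]  # 2 frames, 2 keypoints
    mean, std = keypoint_mean_std(toy_frames)
    print("mean:", mean)
    print("std:", std)
```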

Step 2_3 is optional. It rescales the keypoints so that a new speaker has the same shoulder width as Oliver, which lets you simply copy Oliver's scale_factor for the new speaker in speakers_stat.py.
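The idea behind the shoulder-width rescaling can be sketched as follows. This is not the repo's script: it assumes BODY_25 indexing (right shoulder at index 2, left shoulder at index 5), uses the median shoulder width over all frames, and scales coordinates about the origin. The function names and the `reference_width` argument are hypothetical.

```python
import math
from statistics import median

# BODY_25 shoulder indices; assumes the body keypoints come first in each frame.
R_SHOULDER, L_SHOULDER = 2, 5


def shoulder_width(frame):
    """Euclidean distance between the two shoulder keypoints of one frame."""
    (rx, ry), (lx, ly) = frame[R_SHOULDER], frame[L_SHOULDER]
    return math.hypot(rx - lx, ry - ly)


def rescale_to_reference(frames, reference_width):
    """Scale all keypoints so the median shoulder width equals reference_width."""
    width = median(shoulder_width(frame) for frame in frames)
    if width == 0:
        raise ValueError("degenerate shoulder width")
    factor = reference_width / width
    return [[(x * factor, y * factor) for x, y in frame] for frame in frames]


if __name__ == "__main__":
    frame = [(0.0, 0.0)] * 6
    frame[L_SHOULDER] = (2.0, 0.0)  # shoulders are 2 units apart
    scaled = rescale_to_reference([frame], reference_width=4.0)
    print(shoulder_width(scaled[0]))
```

Here `reference_width` would be the shoulder width measured for the reference speaker (Oliver) in the same way.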

Training SDT-BP

Training from scratch

python main.py --config_file configs/voice2pose_sdt_bp.yaml \
    --tag oliver \
    DATASET.SPEAKER oliver \
    SYS.NUM_WORKERS 32
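The trailing `KEY VALUE` pairs in the command above (for example `DATASET.SPEAKER oliver`) override entries of the config file. The sketch below shows one common way such overrides are handled, using a nested dict and `argparse`. It is an illustration only: the repo's own parsing in main.py and config/default.py may differ (the pattern resembles yacs `merge_from_list`), and the default values shown are hypothetical.

```python
import argparse
import ast


def apply_overrides(cfg, opts):
    """Apply ['A.B', 'value', ...] pairs to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # raises KeyError for unknown sections
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = ast.literal_eval(raw)  # e.g. '32' becomes the int 32
        except (ValueError, SyntaxError):
            value = raw  # keep plain strings such as 'oliver'
        node[leaf] = value
    return cfg


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_file")
    parser.add_argument("--tag", default="")
    # Everything after the known flags is collected as override pairs.
    parser.add_argument("opts", nargs=argparse.REMAINDER)
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Hypothetical defaults, standing in for the values of the real config.
    cfg = {"DATASET": {"SPEAKER": "unset"}, "SYS": {"NUM_WORKERS": 4}}
    args = parse_args([
        "--config_file", "configs/voice2pose_sdt_bp.yaml",
        "--tag", "oliver",
        "DATASET.SPEAKER", "oliver",
        "SYS.NUM_WORKERS", "32",
    ])
    print(apply_overrides(cfg, args.opts))
```

Rejecting unknown keys, as done here, catches typos such as `DATASET.SPEKER` early instead of silently ignoring them.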

Resume training from an interrupted experiment

python main.py --config_file configs/voice2pose_sdt_bp.yaml \
    --resume_from <checkpoint-to-continue-from> \
    DATASET.SPEAKER oliver

Training from a pretrained model

python main.py --config_file configs/voice2pose_sdt_bp.yaml \
    --pretrain_from <checkpoint-to-pretrain-from> \
    --tag oliver \
    DATASET.SPEAKER oliver

Evaluation

To evaluate a model, use --test_only and --checkpoint as follows:

python main.py --config_file configs/voice2pose_sdt_bp.yaml \
    --tag oliver \
    --test_only \
    --checkpoint <path-to-checkpoint> \
    DATASET.SPEAKER oliver

Demo

To run a model on an audio file, use --demo_input and --checkpoint as follows:

python main.py --config_file configs/voice2pose_sdt_bp.yaml \
    --tag oliver \
    --demo_input demo_audio.wav \
    --checkpoint <path-to-checkpoint> \
    DATASET.SPEAKER oliver

You can find our checkpoint here.

FTD computation and template vector extraction

Pose sequence reconstruction with VAE

First, you need to train the VAE by pose sequence reconstruction:

python main.py --config_file configs/pose2pose.yaml \
    --tag oliver \
    DATASET.SPEAKER oliver

Compute FTD while training SDT-BP

Once the VAE is trained, you can compute FTD while training our SDT-BP model by specifying VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT as follows:

python main.py --config_file configs/voice2pose_sdt_bp.yaml \
    --tag oliver \
    DATASET.SPEAKER oliver \
    VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT <path-to-VAE-checkpoint>

Training SDT-VAE

By changing the config file and specifying VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT, you can train our SDT-VAE model, and the FTD metric will also be computed:

python main.py --config_file configs/voice2pose_sdt_vae.yaml \
    --tag oliver \
    DATASET.SPEAKER oliver \
    VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT <path-to-VAE-checkpoint>

For evaluation and demo with our SDT-VAE model, don't forget to always specify the VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT parameter.

Misc

Cite

@inproceedings{qian2021speech,
  title={Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates},
  author={Qian, Shenhan and Tu, Zhi and Zhi, Yihao and Liu, Wen and Gao, Shenghua},
  booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages={11057--11066},
  year={2021},
  organization={IEEE}
}