Speech Drives Templates
The official repo for the ICCV-2021 paper "Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates". <br> project / paper / video
<p align="center"> <img src="./iccv2021_sdt.jpg" width=500px/> </p>
Our paper and this repo focus on upper-body pose generation from audio. To synthesize images from poses, please refer to this Pose2Img repo.
🔔 Update:
- 2022-04-29: Upload checkpoints for all subjects.
- 2022-04-26: Change POSE2POSE.LAMBDA_KL in configs/default.py from 1.0 to 0.1.
Directory hierarchy
|-- configs
| |-- default.py
| |-- voice2pose_s2g.yaml # baseline: speech2gesture
| |-- voice2pose_sdt_bp.yaml # ours (Backprop)
| |-- voice2pose_sdt_vae.yaml # ours (VAE)
| \-- pose2pose.yaml # gesture reconstruction
|
|-- core
| |-- datasets
| |-- networks
| |-- pipelines
| \-- utils
|
|-- datasets
| \-- speakers
| |-- oliver
| |-- kubinec
| \-- ...
|
|-- output
| \-- <date-config-tag> # A directory for each experiment
|
`-- main.py
Installation
To generate videos, you need ffmpeg installed on your system.
sudo apt install ffmpeg
Install Python packages
pip install -r requirements.txt
Dataset
We use a subset (Oliver and Kubinec) of the Speech2Gesture dataset and remove frames with bad human poses. We also collected data from two Mandarin speakers (Luo and Xing).
To ease later research, we have packed our processed data, including 2D human pose sequences and the corresponding audio clips.
Please download it from this link and organize the data under datasets/speakers as in the directory hierarchy above.
Note that you do NOT need the source video frames to run this repo. In case you still want them for your own usage:
- For Luo and Xing, we provide the links to the source videos as text files alongside the above data packs.
- For Oliver and Kubinec, please refer to the Speech2Gesture dataset.
Since our method addresses the entire upper body, including the face and hands, the number of keypoints in our data is 137. For more details, please refer to this document.
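As a rough illustration of where 137 can come from, the sketch below assumes an OpenPose-style output with 25 body, 70 face, and 21 keypoints per hand. The part order and the names PART_SIZES and part_slices are assumptions made for this example; the linked document is the authoritative description of the layout.

```python
# Assumed OpenPose-style layout. The part order is an assumption for
# illustration and is not taken from this repo's keypoint document.
PART_SIZES = {
    "body": 25,        # OpenPose BODY_25 model
    "face": 70,
    "left_hand": 21,
    "right_hand": 21,
}


def part_slices(part_sizes):
    """Map each part name to the slice of keypoint indices it occupies."""
    slices, start = {}, 0
    for name, size in part_sizes.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


SLICES, NUM_KEYPOINTS = part_slices(PART_SIZES)

# Example: split one frame of (x, y) keypoints into parts.
frame = [(float(i), float(i)) for i in range(NUM_KEYPOINTS)]
hands = frame[SLICES["left_hand"]] + frame[SLICES["right_hand"]]
print(NUM_KEYPOINTS, len(hands))  # 137 42
```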
Custom dataset
To build a dataset from custom videos, we provide reference scripts in data_preprocess/:
# ==== video processing ====
1_1_change_fps.py # we use fps=15 by default
1_2_video2frames.py # save each video as images
# ==== keypoint processing ====
2_1_gen_kpts.py # use openpose to obtain keypoints
2_2_remove_outlier.py # remove frames with badly predicted keypoints
(2_3_rescale_shoulder_width.py # rescale the keypoints)
# ==== npz processing ====
3_1_generate_clips.py # generate a csv file as an index and npz files for clips
3_2_split_train_val_test.py # edit the csv file for dataset division
# ==== speakers_stat processing ====
4_1_calculate_mean_std.py # save the mean and std of each keypoint (137 points) into a npy file
4_2_parse_mean_std_npz.py # parse the above npy and print out for `speakers_stat.py`
Step 2_3 is optional. It rescales the keypoints so that a new speaker has the same shoulder width as Oliver; you can then simply copy Oliver's scale_factor for the new speaker in speakers_stat.py.
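The idea behind the optional rescaling step can be sketched as follows. This is not the code of 2_3_rescale_shoulder_width.py; it is a minimal illustration that assumes OpenPose BODY_25 shoulder indices (2 and 5), scales about the image origin, and uses a hypothetical reference width in place of Oliver's.

```python
import math

# Shoulder indices in the OpenPose BODY_25 layout (assumed to apply here).
R_SHOULDER, L_SHOULDER = 2, 5


def shoulder_width(frame):
    """Euclidean distance between the two shoulder keypoints of one frame."""
    (rx, ry), (lx, ly) = frame[R_SHOULDER], frame[L_SHOULDER]
    return math.hypot(lx - rx, ly - ry)


def rescale_to_reference(frames, reference_width):
    """Scale all keypoints so the mean shoulder width equals reference_width."""
    mean_width = sum(shoulder_width(f) for f in frames) / len(frames)
    factor = reference_width / mean_width
    scaled = [[(x * factor, y * factor) for x, y in f] for f in frames]
    return scaled, factor


# Toy frame with only the first six keypoints; the shoulders are 50 px apart.
toy = [(0.0, 0.0)] * 6
toy[R_SHOULDER], toy[L_SHOULDER] = (100.0, 200.0), (150.0, 200.0)

# 100.0 is a hypothetical reference width standing in for Oliver's.
scaled, factor = rescale_to_reference([toy], reference_width=100.0)
print(factor)  # 2.0
```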
Training SDT-BP
Training from scratch
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
DATASET.SPEAKER oliver \
SYS.NUM_WORKERS 32
- --tag sets the name of the experiment, which will be displayed in the output file.
- You can overwrite any parameter defined in configs/default.py by simply adding it at the end of the command. The example above sets SYS.NUM_WORKERS to 32 temporarily.
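To make the override mechanism concrete, here is a minimal sketch of how trailing KEY VALUE pairs could be merged into nested defaults. It uses plain dicts instead of the repo's config objects, apply_overrides is a hypothetical helper, and the default values shown are placeholders rather than the ones in configs/default.py.

```python
def apply_overrides(cfg, overrides):
    """Apply ['A.B', value, 'C.D', value, ...] to a nested dict in place."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must come as KEY VALUE pairs")
    for key, raw in zip(overrides[0::2], overrides[1::2]):
        node = cfg
        *parents, leaf = key.split(".")
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        # Cast the command-line string to the type of the default value.
        default = node[leaf]
        if isinstance(default, bool):
            node[leaf] = raw.lower() == "true"
        else:
            node[leaf] = type(default)(raw)
    return cfg


# Placeholder defaults for illustration only.
defaults = {
    "SYS": {"NUM_WORKERS": 4, "DISTRIBUTED": False},
    "DATASET": {"SPEAKER": "oliver"},
}
cfg = apply_overrides(defaults, ["DATASET.SPEAKER", "kubinec", "SYS.NUM_WORKERS", "32"])
print(cfg["SYS"]["NUM_WORKERS"], cfg["DATASET"]["SPEAKER"])  # 32 kubinec
```

Rejecting unknown keys, as this sketch does, is one way to catch a mistyped parameter name on the command line.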
Resume training from an interrupted experiment
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--resume_from <checkpoint-to-continue-from> \
DATASET.SPEAKER oliver
- With --resume_from, the program will load the state_dict from the checkpoint for both the model and the optimizer, and write results to the original directory containing the checkpoint.
Training from a pretrained model
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--pretrain_from <checkpoint-to-pretrain-from> \
--tag oliver \
DATASET.SPEAKER oliver
- With --pretrain_from, the program will only load the state_dict for the model, and write results to a new base directory.
Evaluation
To evaluate a model, use --test_only and --checkpoint as follows:
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
--test_only \
--checkpoint <path-to-checkpoint> \
DATASET.SPEAKER oliver
Demo
To evaluate a model on an audio file, use --demo_input and --checkpoint as follows:
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
--demo_input demo_audio.wav \
--checkpoint <path-to-checkpoint> \
DATASET.SPEAKER oliver
You can find our checkpoint here.
FTD computation and template vector extraction
Pose sequence reconstruction with VAE
First, you need to train the VAE by pose sequence reconstruction:
python main.py --config_file configs/pose2pose.yaml \
--tag oliver \
DATASET.SPEAKER oliver
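For readers unfamiliar with how a KL weight such as POSE2POSE.LAMBDA_KL typically enters a VAE objective, the following is a minimal sketch. The L1 reconstruction term and the function names are assumptions made for illustration; the actual loss is defined in the repo's code.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def vae_loss(recon, target, mu, logvar, lambda_kl=0.1):
    """Reconstruction error plus the KL term weighted by lambda_kl."""
    # L1 reconstruction error is an assumption; the repo may use another one.
    rec = sum(abs(r - t) for r, t in zip(recon, target)) / len(target)
    return rec + lambda_kl * kl_to_standard_normal(mu, logvar)


# A latent that already matches N(0, I) contributes no KL penalty.
print(vae_loss([1.0, 2.0], [1.0, 3.0], mu=[0.0, 0.0], logvar=[0.0, 0.0]))  # 0.5
```

A smaller lambda_kl lets the latent distribution deviate more from the prior in exchange for better reconstruction.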
Compute FTD while training SDT-BP
Once the VAE is trained, you can compute FTD while training our SDT-BP model by specifying VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT as follows:
python main.py --config_file configs/voice2pose_sdt_bp.yaml \
--tag oliver \
DATASET.SPEAKER oliver \
VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT <path-to-VAE-checkpoint>
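As background, the sketch below assumes that FTD is a Fréchet-style distance between Gaussians fitted to the VAE features of real and generated gestures; the paper gives the exact definition. To stay dependency-free it uses a diagonal-covariance simplification, whereas a full implementation would need full covariance matrices and a matrix square root. All function names are hypothetical.

```python
import math


def mean_and_var(vectors):
    """Per-dimension mean and (population) variance of a list of vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    return mean, var


def frechet_distance_diag(real, generated):
    """Frechet distance between two Gaussians with diagonal covariances."""
    mu_r, var_r = mean_and_var(real)
    mu_g, var_g = mean_and_var(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term


# Toy feature vectors: the generated set is the real set shifted by 1 along
# the first dimension, so only the mean term contributes.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]
print(frechet_distance_diag(real, generated))
```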
Training SDT-VAE
By changing the config file and specifying VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT, you can train our SDT-VAE model, and the FTD metric will also be computed:
python main.py --config_file configs/voice2pose_sdt_vae.yaml \
--tag oliver \
DATASET.SPEAKER oliver \
VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT <path-to-VAE-checkpoint>
For evaluation and demo with our SDT-VAE model, don't forget to always specify the VOICE2POSE.POSE_ENCODER.AE_CHECKPOINT parameter.
Misc
- We save a checkpoint and conduct validation after each epoch. You can change the interval in the config file.
- We generate and save 2 videos in each epoch during training. During validation, we sample 8 videos in each epoch. These videos can be saved to TensorBoard (without sound) and as mp4 files (with sound). You can change the SYS.VIDEO_FORMAT parameter to select one or both of them.
- For multi-GPU training, we recommend using DistributedDataParallel (DDP) because it provides SyncBN across GPU cards. To enable DDP, set SYS.DISTRIBUTED to True and set SYS.WORLD_SIZE according to the number of GPUs. When using DDP, make sure that the batch_size is exactly divisible by SYS.WORLD_SIZE.
- We usually set NUM_WORKERS to 32 for best performance. If you encounter any memory errors, try lowering NUM_WORKERS.
- We also support dataset caching (DATASET.CACHING) to further speed up data loading. If you encounter errors in the dataloader like RuntimeError: received 0 items of ancdata, please increase ulimit by running the command ulimit -n 262144. (refer to this issue)
- To run any module other than the main files in the root directory, for example the core\datasets\gesture_dataset.py file, you should run python -m core.datasets.gesture_dataset rather than python core\datasets\gesture_dataset.py. This is a consequence of how Python resolves relative imports.
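The divisibility requirement for DDP mentioned above can be checked before launching a run. The helper below is a hypothetical sketch that assumes the global batch is split evenly across SYS.WORLD_SIZE processes; it is not part of this repo.

```python
def per_gpu_batch_size(batch_size, world_size):
    """Return the batch size each process gets, or raise if it is uneven."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if batch_size % world_size != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by world_size={world_size}"
        )
    return batch_size // world_size


print(per_gpu_batch_size(32, 4))  # 8
```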
Cite
@inproceedings{qian2021speech,
title={Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates},
author={Qian, Shenhan and Tu, Zhi and Zhi, Yihao and Liu, Wen and Gao, Shenghua},
booktitle={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={11057--11066},
year={2021},
organization={IEEE}
}