Evaluation of Text-to-Gesture Generation Model Using Convolutional Neural Network

This repository contains the code for a text-to-gesture generation model using a convolutional neural network (CNN).

A demonstration video of the generated gestures is available at https://youtu.be/JX4Gqy-Rmso.

Eiichi Asakawa, Naoshi Kaneko, Dai Hasegawa, and Shinichi Shirakawa: Evaluation of text-to-gesture generation model using convolutional neural network, Neural Networks, Elsevier, Vol. 151, pp. 365-375, Jul. 2022. [DOI]

If you use this code for your research, please cite our paper:

@article{AsakawaNEUNET2022,
    author = {Eiichi Asakawa and Naoshi Kaneko and Dai Hasegawa and Shinichi Shirakawa},
    title = {Evaluation of text-to-gesture generation model using convolutional neural network},
    journal = {Neural Networks},
    volume = {151},
    pages = {365--375},
    year = {2022},
    doi = {https://doi.org/10.1016/j.neunet.2022.03.041}
}

Requirements

We used PyTorch version 1.7.1 for the neural network implementation. We tested the code in the following environment:
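
As a quick sanity check before running the scripts, the following minimal sketch (not part of the repository) prints the installed PyTorch version and whether a CUDA GPU is visible:

# Illustrative environment check only.
import torch

print(torch.__version__)          # the experiments in the paper used 1.7.1
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible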

Preparation

  1. Our code uses the speech and gesture dataset provided by Ginosar et al. Download the Speech2Gesture dataset by following the instructions under "Download specific speaker data" at https://github.com/amirbar/speech2gesture/blob/master/data/dataset.md.
Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik, "Learning Individual Styles of Conversational Gesture," CVPR 2019.

After downloading the Speech2Gesture dataset, your dataset folder should look like this:

Gestures
├── frames_df_10_19_19.csv
├── almaram
│   ├── frames
│   ├── keypoints_all
│   ├── keypoints_simple
│   └── videos
...
└── shelly
    ├── frames
    ├── keypoints_all
    ├── keypoints_simple
    └── videos
  2. Download the text dataset from HERE and unarchive the zip file.
  3. Move the words directory in each speaker's directory of the unarchived data to the corresponding speaker's directory in your dataset folder.

After this step, your dataset folder should look like this:

Gestures
├── frames_df_10_19_19.csv
├── almaram
│   ├── frames
│   ├── keypoints_all
│   ├── keypoints_simple
│   ├── videos
│   └── words
...
└── shelly
    ├── frames
    ├── keypoints_all
    ├── keypoints_simple
    ├── videos
    └── words

Note that the amount of word data for speaker Jon is very small; therefore, it should not be used for model training.

  4. Set up fastText by following the instructions HERE, and download the pre-trained fastText model file (wiki-news-300d-1M-subword.bin) from HERE. A minimal loading sketch is shown below.
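
The following sketch (not part of the repository) shows how the downloaded model can be loaded with the official fasttext Python package to obtain a 300-dimensional word vector; it assumes the package is installed and the .bin file is in the working directory:

# Illustrative only: load the pre-trained subword model and look up a word vector.
import fasttext

model = fasttext.load_model("wiki-news-300d-1M-subword.bin")
vec = model.get_word_vector("gesture")  # 300-dimensional numpy array
print(vec.shape)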

Create training and test data

python dataset.py --base_path <BASE_PATH> --speaker <SPEAKER_NAME> --wordvec_file <W2V_FILE> --dataset_type <DATASET_TYPE> --frames <FRAMES>
# training data
python dataset.py --base_path <BASE_PATH> --speaker oliver --wordvec_file <W2V_FILE> --dataset_type train --frames 64

# test data
python dataset.py --base_path <BASE_PATH> --speaker oliver --wordvec_file <W2V_FILE> --dataset_type test --frames 192
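
To prepare several speakers at once, a small driver script can call dataset.py with the flags shown above. The sketch below is a hypothetical helper (not part of the repository); adjust BASE_PATH, W2V_FILE, and the speaker list to your setup:

# Hypothetical helper: create training and test data for several speakers.
import subprocess

BASE_PATH = "/path/to/Gestures"            # your dataset folder
W2V_FILE = "wiki-news-300d-1M-subword.bin"

for speaker in ["oliver", "rock"]:         # any speakers you downloaded
    for dataset_type, frames in [("train", "64"), ("test", "192")]:
        subprocess.run([
            "python", "dataset.py",
            "--base_path", BASE_PATH,
            "--speaker", speaker,
            "--wordvec_file", W2V_FILE,
            "--dataset_type", dataset_type,
            "--frames", frames,
        ], check=True)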

After running the script, the directories containing the training or test data are created in your dataset folder. After this step, your dataset folder should look like this:

Gestures
├── frames_df_10_19_19.csv
├── almaram
│   ├── frames
│   ├── keypoints_all
│   ├── keypoints_simple
│   ├── test-192
│   ├── train-64
│   ├── videos
│   └── words
...
└── shelly
    ├── frames
    ├── keypoints_all
    ├── keypoints_simple
    ├── test-192
    ├── train-64
    ├── videos
    └── words

Model training

python train.py --outdir_path <OUT_DIR> --speaker <SPEAKER_NAME> --gpu_num <GPU> --base_path <BASE_PATH> --train_dir <TRAIN_DIR>

The experimental settings (e.g., the number of epochs and the loss function) can be changed via command-line arguments. Please see train.py for the details.

python train.py --outdir_path ./out_training/ --speaker oliver --gpu_num 0 --base_path <BASE_PATH> --train_dir train-64

The resulting files will be created in ./out_training/oliver_YYYYMMDD-AAAAAA/.
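
The run directory name is needed later as the --model_dir argument of test.py. The hypothetical snippet below (not part of the repository) picks the most recent oliver run under ./out_training/:

# Hypothetical helper: find the most recent training run directory.
from pathlib import Path

runs = [d for d in Path("./out_training").iterdir()
        if d.is_dir() and d.name.startswith("oliver_")]
latest = max(runs, key=lambda d: d.stat().st_mtime)
print(latest.name)  # e.g., oliver_YYYYMMDD-AAAAAA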

Evaluation

python test.py --base_path <BASE_PATH> --test_speaker <TEST_SPEAKER> --test_dir <TEST_DIR> --model_dir <MODEL_DIR> --model_path <MODEL_PATH> --outdir_path <OUT_DIR>
python test.py --base_path <BASE_PATH> --test_speaker oliver --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/

The resulting files (.npy files for predicted motion) will be created in ./out_test/oliver_by_oliver_YYYYMMDD-AAAAAA_test-192/.

python test.py --base_path <BASE_PATH> --test_speaker rock --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/

The resulting files (.npy files for predicted motion) will be created in ./out_test/rock_by_oliver_YYYYMMDD-AAAAAA_test-192/.
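
The individual .npy file names inside the output directory are not documented here, so the following sketch (not part of the repository) simply globs for them and prints each array's shape:

# Illustrative only: inspect the predicted motion arrays saved by test.py.
import glob
import numpy as np

out_dir = "./out_test/oliver_by_oliver_YYYYMMDD-AAAAAA_test-192"
for path in sorted(glob.glob(f"{out_dir}/*.npy"))[:3]:  # first few files
    motion = np.load(path)
    print(path, motion.shape)  # predicted keypoint coordinates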

Visualization

python make_gesture_video.py --base_path <BASE_PATH> --test_out_path <TEST_OUT_PATH> --test_out_dir <TEST_OUT_DIR> --video_out_path <VIDEO_OUT_PATH>
python make_gesture_video.py --base_path <BASE_PATH> --test_out_path ./out_test/ --test_out_dir oliver_by_oliver_YYYYMMDD-AAAAAA_test-192 --video_out_path ./out_video/

The gesture videos (side-by-side videos of the ground truth and the text-to-gesture output) will be created in ./out_video. The left-side gesture is the ground truth, and the right-side gesture is the one generated by the text-to-gesture generation model. The original videos of the test intervals will also be created in ./out_video/original/oliver/.


For the Transformer model

If you want to use the Transformer model, please use ./transformer/dataset.py, train_transformer.py, and test_transformer.py instead of ./dataset.py, train.py, and test.py.

Create training and test data

The dataset creation script ./transformer/dataset.py creates training data that includes both text and audio information. The created files are saved in train-64-text-audio instead of train-64; use this directory for Transformer model training.

python dataset.py --base_path <BASE_PATH> --speaker oliver --wordvec_file <W2V_FILE> --frames 64

Model training

# Text2Gesture
python train_transformer.py --outdir_path ./out_training/ --speaker oliver --gpu_num 0 --base_path <BASE_PATH> --train_dir train-64-text-audio --modality text

# Speech2Gesture
python train_transformer.py --outdir_path ./out_training/ --speaker oliver --gpu_num 0 --base_path <BASE_PATH> --train_dir train-64-text-audio --modality audio

Evaluation

# Text2Gesture
python test_transformer.py --modality text --base_path <BASE_PATH> --test_speaker oliver --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/

# Speech2Gesture
python test_transformer.py --modality audio --base_path <BASE_PATH> --test_speaker oliver --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/