QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation
Conference archive | arXiv | Demo | Presentation Video | Poster
<div align=center> <img src="overview.png" width="750px"> </div>

Further Work
📢 DiffuseStyleGesture/DiffuseStyleGesture+ - Based on the diffusion model; generates full-body gestures.
📢 UnifiedGesture - Trained on multiple gesture datasets; refines the gestures.
1. Environment Settings
This code was tested on an NVIDIA GeForce RTX 2080 Ti and requires conda or miniconda.
```
conda create -n QPGesture python=3.7
conda activate QPGesture
pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```
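If you want to confirm that the GPU build of PyTorch is active before moving on, a minimal check (our suggestion, not part of the original setup) is:

```python
# Quick sanity check: confirm that the CUDA build of PyTorch is active.
import torch

print(torch.__version__)          # expected: 1.8.0+cu111
print(torch.cuda.is_available())  # should print True on a working GPU setup
```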
2. Quick Start
Download our processed database and pre-trained models from Tsinghua Cloud or Google Cloud and place them in the `data` and `pretrained_model` folders in the project path.
```
cd ./codebook/Speech2GestureMatching/
bash GestureKNN.sh
```
The example is an audio clip about 24 seconds long; matching takes about 5 minutes.
You will get the results in `./codebook/Speech2GestureMatching/output/result.npz`.
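If you want to inspect the matched motion before rendering, the `.npz` file can be opened with NumPy. A minimal sketch that simply lists whatever arrays the repository saves, since the key names are not documented here:

```python
# Minimal sketch: list the arrays stored in the matching result.
import numpy as np

data = np.load("./codebook/Speech2GestureMatching/output/result.npz", allow_pickle=True)
for key in data.files:
    arr = data[key]
    print(key, arr.shape if hasattr(arr, "shape") else type(arr))
```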
```
cd ..
python VisualizeCodebook.py --config=./configs/codebook.yml --gpu 0 --code_path "./Speech2GestureMatching/output/result.npz" --VQVAE_model_path "../pretrained_model/codebook_checkpoint_best.bin" --stage inference
```
Then you will get `.bvh`, `.mp4`, and other intermediate files in `./codebook/Speech2GestureMatching/output/knn_pred_wavvq/`.
https://github.com/YoungSeng/QPGesture/assets/37477030/0ee4bab2-056c-4dd1-b35f-6c125efaaf4e
You can use Blender to visualize the `.bvh` file.
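For example, in Blender's scripting tab you can load the motion with the built-in BVH importer (a sketch; the file path is a placeholder):

```python
# Run inside Blender (scripting tab or Python console).
# bpy.ops.import_anim.bvh is Blender's built-in BVH importer.
import bpy

bpy.ops.import_anim.bvh(filepath="/path/to/knn_pred_wavvq/result.bvh")  # placeholder path
```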
https://github.com/YoungSeng/QPGesture/assets/37477030/d554f634-04e6-4f7e-8cb8-5f382af282a1
We also provide a processed database for speaker ID `1`, available for download from Tsinghua Cloud and Baidu Cloud. Using this database is optional, but we recommend trying speaker `1`, which has a larger database and better performance.
3. Test your own audio
Here, we need to build the test set. We use `./data/Example3/4.wav` as an example. Note that no text is used here.
Download vq-wav2vec Gumbel from fairseq and put it in `./process/`.
Modify the fairseq code in your conda or miniconda environment according to this issue.
Then run:
```
cd ./process/
python make_test_data.py --audio_path "../data/Example3/4.wav" --save_path "../data/Example3/4"
```
You will get `./data/Example3/4/wavvq_240.npz`.
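For reference, the discrete wavvq indices are the kind of output vq-wav2vec produces; a sketch following the fairseq wav2vec example (the checkpoint filename and audio preprocessing are assumptions here, and `make_test_data.py` may differ in detail):

```python
# Sketch of vq-wav2vec index extraction, adapted from the fairseq example.
import soundfile as sf
import torch
import fairseq

models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["./vq-wav2vec.pt"]  # assumed filename of the downloaded Gumbel checkpoint
)
model = models[0]
model.eval()

wav, sr = sf.read("../data/Example3/4.wav")          # assumed 16 kHz mono audio
wav = torch.from_numpy(wav).float().unsqueeze(0)     # (1, n_samples)

with torch.no_grad():
    z = model.feature_extractor(wav)                 # continuous features
    _, idxs = model.vector_quantizer.forward_idx(z)  # discrete code indices
print(idxs.shape)                                    # (1, n_frames, n_groups)
```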
Then, as in the previous step, run the following:
```
cd ../codebook/Speech2GestureMatching/
bash GestureKNN.sh "../../data/Example3/4/wavvq_240.npz" 0 "./output/result_Example3.npz"
```
4. Constructing the database
Install gentle, as in Trimodal, to align the text and audio; this will take a few minutes:
```
cd ./process/
git clone https://github.com/lowerquality/gentle.git
cd gentle
./install.sh
```
You can verify whether gentle is installed successfully with the following command:
```
python align.py './examples/data/lucier.mp3' './examples/data/lucier.txt'
```
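gentle emits the alignment as JSON (use `-o` to save it to a file); a sketch of reading the word timings, assuming gentle's usual `words`/`start`/`end` fields:

```python
# Sketch: read a gentle alignment JSON and print per-word timings.
import json

with open("lucier_align.json") as f:   # JSON saved from gentle's align.py
    alignment = json.load(f)

for word in alignment["words"]:
    if word.get("case") == "success":  # skip words gentle failed to align
        print(word["alignedWord"], word["start"], word["end"])
```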
Download the WavLM Large model and put it into `./pretrained_model/`.
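For reference, frame-level WavLM features can be extracted as in the official WavLM example below (a sketch; the repository's own extraction script may pick different layers or hop settings):

```python
# Sketch of WavLM feature extraction, following the official WavLM example.
import torch
from WavLM import WavLM, WavLMConfig  # from the microsoft/unilm WavLM repo

checkpoint = torch.load("./pretrained_model/WavLM-Large.pt")
cfg = WavLMConfig(checkpoint["cfg"])
model = WavLM(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()

wav = torch.randn(1, 16000)  # placeholder: one second of 16 kHz audio
with torch.no_grad():
    rep = model.extract_features(wav)[0]  # (1, n_frames, feature_dim)
print(rep.shape)
```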
Download the data of the speaker you want to build from BEAT; you can put it in `./dataset/orig_BEAT/` or elsewhere.
Here is an example for speaker ID `10`:
```
python make_beat_dataset.py --BEAT_path "../dataset/orig_BEAT/speakers/" --save_dir "../dataset/BEAT" --prefix "speaker_10_state_0" --step 1
cd ../codebook/Speech2GestureMatching/
python normalize_audio.py
python mfcc.py
cd ../../process/
python make_beat_dataset.py --BEAT_path "../dataset/orig_BEAT/speakers/" --save_dir "../dataset/BEAT" --prefix "speaker_10_state_0" --step 2
```
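Conceptually, `mfcc.py` computes MFCC features from the normalized audio; a generic librosa sketch of that step (the coefficient count and hop length here are illustrative, not necessarily what the script uses):

```python
# Generic MFCC extraction sketch; parameter values are illustrative.
import librosa

wav, sr = librosa.load("normalized_speech.wav", sr=16000)  # placeholder file
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=160)
print(mfcc.shape)  # (n_mfcc, n_frames)
```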
Now we have a basic database; next we compute the phase, WavLM, and wavvq features:
```
cd ../codebook/
python PAE.py --config=./configs/codebook.yml --gpu 0 --stage inference
cd ../process/
python make_beat_dataset.py --config "../codebook/configs/codebook.yml" --BEAT_path "../dataset/orig_BEAT/speakers/" --save_dir "../dataset/BEAT" --prefix "speaker_10_state_0" --gpu 0 --step 3
python make_beat_dataset.py --config "../codebook/configs/codebook.yml" --BEAT_path "../dataset/orig_BEAT/speakers/" --save_dir "../dataset/BEAT" --prefix "speaker_10_state_0" --gpu 0 --step 4
```
Then you will have all the databases used in Quick Start.
5. Train your own model
Data preparation
This is just an example for speaker ID `10`; in fact, we use all speakers to train these models.
```
pip install numpy==1.19.5  # unfortunately, we have had trouble with the numpy version (together with pyarrow)
python beat_data_to_lmdb.py --config=../codebook/configs/codebook.yml --gpu 0
```
Then you will get the data `mean/std`, which you may copy into `./codebook/configs/codebook.yml`.
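The mean/std are ordinary per-dimension statistics over the training poses; a sketch of the computation (the input file name is a placeholder, so match the output against what `beat_data_to_lmdb.py` actually prints):

```python
# Sketch: per-dimension pose statistics of the kind copied into codebook.yml.
import numpy as np

poses = np.load("all_training_poses.npy")  # placeholder: (n_frames, pose_dim)
data_mean = poses.mean(axis=0)
data_std = poses.std(axis=0)
print(data_mean.tolist())  # copy into codebook.yml as the data mean
print(data_std.tolist())   # copy into codebook.yml as the data std
```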
Gesture VQ-VAE
```
cd ../codebook/
python train.py --config=./configs/codebook.yml --gpu 0
```
The gesture VQ-VAE will be saved to `./codebook/output/train_codebook/codebook_checkpoint_best.bin`.
To further calculate the distance between each code, run:
```
python VisualizeCodebook.py --config=./configs/codebook.yml --gpu 0 --code_path "./Speech2GestureMatching/output/result.npz" --VQVAE_model_path "./output/train_codebook/codebook_checkpoint_best.bin" --stage train
```
Then you will get the absolute pose of each code in `./codebook/output/code.npz`, as used in Quick Start.
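The distance between codes reduces to a pairwise distance over the per-code pose arrays; a minimal sketch (the key `"code"` is hypothetical; inspect `code.npz` for the real name):

```python
# Sketch: pairwise Euclidean distances between codebook entries.
# The key "code" is hypothetical; inspect code.npz for the actual key.
import numpy as np

codes = np.load("./codebook/output/code.npz")["code"]
flat = codes.reshape(len(codes), -1)      # (n_codes, dim), one row per code
diff = flat[:, None, :] - flat[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))  # (n_codes, n_codes)
print(dist.shape)
```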
PAE
```
python PAE.py --config=./configs/codebook.yml --gpu 0 --stage train
```
The PAE will be saved to `./codebook/output/train_PAE/PAE_checkpoint_best.bin`.
Reference
This work is highly inspired by Bailando, KNN and DeepPhase.
Citation
If you find this work useful, please consider citing our work with the following BibTeX:
```
@inproceedings{yang2023QPGesture,
  author = {Sicheng Yang and Zhiyong Wu and Minglei Li and Zhensong Zhang and Lei Hao and Weihong Bao and Haolin Zhuang},
  title = {QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation},
  booktitle = {{IEEE/CVF} Conference on Computer Vision and Pattern Recognition, {CVPR}},
  publisher = {{IEEE}},
  month = {June},
  year = {2023},
  pages = {2321--2330}
}
```
Please feel free to contact us at yangsc21@mails.tsinghua.edu.cn with any questions or concerns.