ParCo: Part-Coordinating Text-to-Motion Synthesis

<p align="center"> <img src="docs/imgs/teaser1.png" width="30%" /> </p>

PyTorch implementation of the paper ParCo: Part-Coordinating Text-to-Motion Synthesis [ECCV 2024].


<p align="center"> <table> <tr> <th colspan="4">Text: "a person is having a hearty laugh and makes a jovial motion with their left hand."</th> </tr> <tr> <th>πŸ”₯ParCo (Ours)πŸ”₯</th> <th><u><a href="https://mingyuan-zhang.github.io/projects/ReMoDiffuse.html"><nobr>ReMoDiffuse</nobr> </a></u></th> <th><u><a href="https://mael-zys.github.io/T2M-GPT/"><nobr>T2M-GPT</nobr> </a></u></th> <th><u><a href="https://mingyuan-zhang.github.io/projects/MotionDiffuse.html"><nobr>MotionDiffuse</nobr> </a></u></th> </tr> <tr> <td><img src="docs/imgs/parco/parco_5.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/remodiffuse/remodiff_5.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/t2mgpt/t2mgpt_5.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/motiondiffuse/motiondiffuse_5.gif" width="160px" alt="gif"></td> </tr> <tr> <th colspan="4">Text: "standing on one leg and hopping."</th> </tr> <tr> <th>πŸ”₯ParCo (Ours)πŸ”₯</th> <th><u><a href="https://mingyuan-zhang.github.io/projects/ReMoDiffuse.html"><nobr>ReMoDiffuse</nobr> </a></u></th> <th><u><a href="https://mael-zys.github.io/T2M-GPT/"><nobr>T2M-GPT</nobr> </a></u></th> <th><u><a href="https://mingyuan-zhang.github.io/projects/MotionDiffuse.html"><nobr>MotionDiffuse</nobr> </a></u></th> </tr> <tr> <td><img src="docs/imgs/parco/parco_9.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/remodiffuse/remodiff_9.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/t2mgpt/t2mgpt_9.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/motiondiffuse/motiondiffuse_9.gif" width="160px" alt="gif"></td> </tr> <tr> <th colspan="4">Text: "a man steps back, picks something up and put it to his head and then puts it back."</th> </tr> <tr> <th>πŸ”₯ParCo (Ours)πŸ”₯</th> <th><u><a href="https://mingyuan-zhang.github.io/projects/ReMoDiffuse.html"><nobr>ReMoDiffuse</nobr> </a></u></th> <th><u><a href="https://mael-zys.github.io/T2M-GPT/"><nobr>T2M-GPT</nobr> </a></u></th> <th><u><a href="https://mingyuan-zhang.github.io/projects/MotionDiffuse.html"><nobr>MotionDiffuse</nobr> </a></u></th> </tr> <tr> <td><img src="docs/imgs/parco/parco_2.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/remodiffuse/remodiff_2.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/t2mgpt/t2mgpt_2.gif" width="160px" alt="gif"></td> <td><img src="docs/imgs/motiondiffuse/motiondiffuse_2.gif" width="160px" alt="gif"></td> </tr> </table> </p>

If our project is helpful for your research, please consider starring this repo and citing our paper:

@article{zou2024parco,
  title={ParCo: Part-Coordinating Text-to-Motion Synthesis},
  author={Zou, Qiran and Yuan, Shangyuan and Du, Shian and Wang, Yu and Liu, Chang and Xu, Yi and Chen, Jie and Ji, Xiangyang},
  journal={arXiv preprint arXiv:2403.18512},
  year={2024}
}

Computational resource consumption

Training

Time and GPU memory consumed for training (single A100 GPU):

|        | (Stage-1) VQ-VAE | (Stage-2) Part-Coordinated Transformer |
| ------ | ---------------- | -------------------------------------- |
| Time   | 20.5h            | 52.3h                                   |
| Memory | 3.5GB            | 28.4GB                                  |
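
Stage-2 training peaks at roughly 28.4GB of GPU memory, so it fits on an A100-40G but may not fit on smaller cards. You can check the free memory on your GPU beforehand with a standard nvidia-smi query (a generic check, not part of our scripts):

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv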

Inference

| Method       | Param (M) | FLOPs (G) | InferTime (s) |
| ------------ | --------- | --------- | ------------- |
| ReMoDiffuse  | 198.2     | 481.0     | 0.091         |
| T2M-GPT      | 237.6     | 292.3     | 0.544         |
| ParCo (Ours) | 168.4     | 211.7     | 0.036         |

Table of Contents

1. Quick Start Demo
2. Installation
3. Train ParCo
4. Evaluation
5. Pre-trained Models
6. ParCo with up&low body partition
7. Visualize Motion

1. Quick Start Demo

1.1. Colab Demo

👉 Try our Colab demo!

Our demo shows how to prepare the environment and run inference with ParCo, so you can conveniently explore our model.

If you wish to reproduce ParCo's visualization results, we recommend installing the environment locally following our tutorial and reproducing them there, as results differ between Colab and local runs. This is likely due to differences in the GPU and CUDA environments used for training and testing.

<p align="center"> <img src="docs/imgs/demo_screenshot.png" width="40%" /> </p>

1.2. Local Quick Inference

After installation is complete, you can directly generate motion (in .gif format) from your own text input as follows:

CUDA_VISIBLE_DEVICES=0 python visualize/infer_motion_npy.py \
--eval-exp-dir output/ParCo_official_HumanML3D/VQVAE-ParCo-t2m-default/00000-Trans-ParCo-default \
--select-ckpt fid \
--infer-mode userinput \
--input-text 'an idol trainee is dancing like a basketball dribbling.' \
--skip-path-check

The generated motion visualization is saved as output/visualize/XXXXX-userinput/skeleton_viz.gif.

<p align="center"> <img src="docs/imgs/demo_local_infer.gif" width="30%" /> </p>

2. Installation

2.1. Environment

Our model was trained and tested on a single A100-40G GPU with the following software environment: Python 3.7.11, PyTorch 1.10.1, CUDA 11.3.1, cuDNN 8.2.0, Ubuntu 20.04.
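
If you want to mirror this setup locally, one possible way to create a matching environment with conda is sketched below (the environment name is arbitrary, and the PyTorch line is the official install command for 1.10.1 with CUDA 11.3; install the repository's remaining Python dependencies afterwards):

conda create -n ParCo python=3.7.11 -y
conda activate ParCo
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge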

2.2. Feature extractors

We use the feature extractors provided by T2M for evaluation. Please download the extractors and the GloVe word vectorizer. Note that 'zip' should be pre-installed on your system; if not, run sudo apt-get install zip to install it.

bash dataset/prepare/download_glove.sh
bash dataset/prepare/download_extractor.sh

If you access Google Drive through a proxy, use the scripts below for downloading instead. The default proxy port in these scripts is 1087; modify the scripts to set your own port.

bash dataset/prepare/use_proxy/download_glove.sh
bash dataset/prepare/use_proxy/download_extractor.sh

2.3. Datasets

Our project uses two 3D human motion-language datasets: HumanML3D and KIT-ML. You can find instructions for acquiring and preparing both datasets [here].

You can also directly download these datasets processed by us: [Google Drive].

The dataset directory should look like this:

./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/
├── Mean.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D) 
├── Std.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D) 
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
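
As an optional sanity check (not one of our scripts), you can load a single motion file and confirm its feature dimension: HumanML3D motion features are 263-dimensional per frame, while KIT-ML features are 251-dimensional.

python -c "import glob, numpy as np; f = sorted(glob.glob('dataset/HumanML3D/new_joint_vecs/*.npy'))[0]; print(f, np.load(f).shape)"

The printed shape should be (num_frames, 263) for HumanML3D.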

3. Train ParCo

The experiment directory structure of our project is:

./output  (arg.out_dir)
 ├── 00000-DATASET  (exp_number + dataset_name)
 │   └── VQVAE-EXP_NAME-DESC  (VQVAE + args.exp_name + desc)
 │       ├── events.out.XXX
 │       ├── net_best_XXX.pth
 │       ...
 │       ├── run.log
 │       ├── test_vqvae
 │       │   ├── ...
 │       │   ...
 │       ├── 0000-Trans-EXP_NAME-DESC  (stage2_exp_number + Trans + args.exp_name + desc)
 │       │   ├── quantized_dataset  (The quantized motion using VQVAE)
 │       │   ├── events.out.XXX
 │       │   ├── net_best_XXX.pth
 │       │   ...
 │       │   ├── run.log
 │       │   └── test_trans
 │       │       ├── ...
 │       │       ...
 │       ├── 0001-Trans-EXP_NAME-DESC
 │       ...
 ├── 00001-DATASET  (exp_number + dataset_name)
 ...

3.1. VQ-VAE

For the KIT-ML dataset, set --dataname kit.

CUDA_VISIBLE_DEVICES=0 python train_ParCo_vq.py \
--out-dir output \
--exp-name ParCo \
--dataname t2m \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--vqvae-cfg default \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth
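
The events.out.XXX files in the experiment directory shown above are TensorBoard event logs, so you can monitor training curves with:

tensorboard --logdir output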

3.2. Part-Coordinated Transformer

Remember to set --vqvae-train-dir to the directory of the VQ-VAE you trained in Stage-1.

For the KIT-ML dataset, set --dataname kit.

CUDA_VISIBLE_DEVICES=0 python train_ParCo_trans.py \
--vqvae-train-dir output/00000-t2m-ParCo/VQVAE-ParCo-t2m-default/ \
--select-vqvae-ckpt last \
--exp-name ParCo \
--pkeep 0.4 \
--batch-size 128 \
--trans-cfg default \
--fuse-ver V1_3 \
--alpha 1.0 \
--num-layers 14 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--total-iter 300000 \
--eval-iter 10000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--dilation-growth-rate 3 \
--vq-act relu

4. Evaluation

4.1. VQ-VAE

Remember to set --vqvae-train-dir to the directory of the VQ-VAE you want to evaluate.

CUDA_VISIBLE_DEVICES=0 python eval_ParCo_vq.py --vqvae-train-dir output/00000-t2m-ParCo/VQVAE-ParCo-t2m-default/ --select-vqvae-ckpt last

4.2. Part-Coordinated Transformer

For evaluation on the KIT-ML dataset, set --select-ckpt last. If you want to evaluate MultiModality (which takes a long time), simply remove --skip-mmod.

Remember to set --eval-exp-dir to your trained ParCo's directory.

CUDA_VISIBLE_DEVICES=0 python eval_ParCo_trans.py \
--eval-exp-dir output/00000-t2m-ParCo/VQVAE-ParCo-t2m-default/00000-Trans-ParCo-default \
--select-ckpt fid \
--skip-mmod
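
Evaluation output is written under the experiment directory (the test_trans folder in the directory structure of Section 3 is one place to look). Exact file names may vary, but grepping the logs is a quick way to locate the reported metrics, e.g.:

grep -ri fid output/00000-t2m-ParCo/VQVAE-ParCo-t2m-default/00000-Trans-ParCo-default/test_trans/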

5. Pre-trained Models

Our pretrained models are provided at [Google Drive]. Extract the .zip files and put them under the output folder for evaluation.

You can also run the following commands to prepare the pretrained models (this requires the gdown package; install it with pip install gdown if needed):

mkdir output
cd output
gdown 1jmuX3xDEku3e_ldnTUm192eQRS3EEw99
unzip ParCo_official_model_weights_HumanML3D.zip
cd ..

For the model trained on KIT-ML, replace the gdown download command with gdown 1_D9vqIhMv5-oz6qfiTGKjsxhS5DNP0PB and the unzip command with unzip ParCo_official_model_weights_KIT-ML.zip.
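
That is, for the KIT-ML weights (assuming the output folder already exists from the step above):

cd output
gdown 1_D9vqIhMv5-oz6qfiTGKjsxhS5DNP0PB
unzip ParCo_official_model_weights_KIT-ML.zip
cd ..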

Since the pre-trained models' directory has been renamed, remember to set --skip-path-check when evaluating our Part-Coordinated Transformer. For example:

CUDA_VISIBLE_DEVICES=0 python eval_ParCo_trans.py \
--eval-exp-dir output/ParCo_official_HumanML3D/VQVAE-ParCo-t2m-default/00000-Trans-ParCo-default \
--select-ckpt fid \
--skip-mmod \
--skip-path-check

6. ParCo with up&low body partition

Our ParCo adopts a 6-part partition strategy. If you want to investigate ParCo with an upper- and lower-body partition, run the scripts below.

<details> <summary> Details </summary> </details>

7. Visualize Motion

Render SMPL mesh:

TODO

Acknowledgement

We are grateful for:

Other awesome public code from the text-to-motion community: ReMoDiffuse, AttT2M, etc.