MDM: Human Motion Diffusion Model

arXiv | Replicate demo: https://replicate.com/arielreplicate/motion_diffusion_model

The official PyTorch implementation of the paper "Human Motion Diffusion Model".

Please visit our webpage for more details.

(teaser figure)

MDM is now 40X faster 🤩🤩🤩 (~0.4 sec/sample)

How come?!?

(1) We released a 50-diffusion-step model (instead of 1000 steps) that runs 20X faster with comparable results.

(2) CLIP is now called just once per prompt and the result is cached, which makes all models run 2X faster. Please pull the latest code.
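For intuition, here is a minimal sketch of the caching idea, assuming the openai/CLIP package installed in the setup below; it is illustrative and not the repository's exact code:

```python
# Minimal sketch of the CLIP-caching idea (illustrative, not the repo's exact code).
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import clip
import torch

_device = "cuda" if torch.cuda.is_available() else "cpu"
_clip_model, _ = clip.load("ViT-B/32", device=_device)
_text_cache = {}  # prompt -> cached CLIP text embedding


def encode_text_cached(prompt: str) -> torch.Tensor:
    """Encode a prompt with CLIP once and reuse the result on later calls."""
    if prompt not in _text_cache:
        tokens = clip.tokenize([prompt]).to(_device)
        with torch.no_grad():
            _text_cache[prompt] = _clip_model.encode_text(tokens)
    return _text_cache[prompt]

# Before the fix the text encoder was effectively queried at every diffusion step;
# with the cache it is queried once per prompt, which is where the ~2X speedup comes from.
```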

MDM results on HumanML3D to cite in your paper (The original model used in the MDM paper)

The performance improvement is due to an evaluation bug fix; blue marks the entries that were fixed relative to the paper.

(fixed results table)

Bibtex

🔴🔴🔴NOTE: MDM and MotionDiffuse are NOT the same paper! For some reason, Google Scholar merged the two papers. The right way to cite MDM is:

@inproceedings{tevet2023human,
  title={Human Motion Diffusion Model},
  author={Guy Tevet and Sigal Raab and Brian Gordon and Yoni Shafir and Daniel Cohen-or and Amit Haim Bermano},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=SJ1kSyO2jwu}
}

News

📢 15/Apr/24 - Released a 50 diffusion steps model (instead of 1000 steps) which runs 20X faster 🤩🤩🤩 with comparable results.

📢 12/Apr/24 - MDM inference is now 2X faster 🤩🤩🤩 This was made possible by calling CLIP just once and caching the result, and is backward compatible with older models.

📢 25/Jan/24 - Fixed a bug in the evaluation code (#182) - Please use the fixed results when citing MDM.

📢 1/Jun/23 - Fixed generation issue (#104) - Please pull to improve generation results.

📢 23/Nov/22 - Fixed evaluation issue (#42) - Please pull and run bash prepare/download_t2m_evaluators.sh from the top of the repo to adapt.

📢 4/Nov/22 - Added sampling, training and evaluation of unconstrained tasks. Note the slight environment changes needed to adapt to the new code. If you already have an installed environment, run bash prepare/download_unconstrained_assets.sh; conda install -y -c anaconda scikit-learn to adapt.

📢 3/Nov/22 - Added in-between and upper-body editing.

📢 31/Oct/22 - Added sampling, training and evaluation of action-to-motion tasks.

📢 9/Oct/22 - Added training and evaluation scripts. Note the slight environment changes needed to adapt to the new code. If you already have an installed environment, run bash prepare/download_glove.sh; pip install clearml to adapt.

📢 6/Oct/22 - First release - sampling and rendering using pre-trained models.

Check out MDM Follow-ups (partial list)

🐉 SinMDM - Learns single motion motifs - even for non-humanoid characters.

👯 PriorMDM - Uses MDM as a generative prior, enabling new generation tasks with few examples or even no data at all.

💃 MAS - Generating intricate 3D motions (including non-humanoid) using 2D diffusion models trained on in-the-wild videos.

🐒 MoMo - Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

🏃 CAMDM - Taming Diffusion Probabilistic Models for Character Control - a real-time version of MDM.

Getting started

This code was tested on Ubuntu 18.04.5 LTS; the requirements are covered by the environment setup below.

1. Setup environment

Install ffmpeg (if not already installed):

sudo apt update
sudo apt install ffmpeg

For Windows, use this instead.

Setup conda env:

conda env create -f environment.yml
conda activate mdm
python -m spacy download en_core_web_sm
pip install git+https://github.com/openai/CLIP.git
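Optionally, you can sanity-check the environment with a short snippet (illustrative; ViT-B/32 is just a convenient CLIP variant for the check):

```python
# Optional sanity check after environment setup (illustrative).
import torch
import clip
import spacy

print("CUDA available:", torch.cuda.is_available())
clip.load("ViT-B/32")          # downloads CLIP weights on first call
spacy.load("en_core_web_sm")   # verifies the spaCy model installed above
print("Environment looks good.")
```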

Download dependencies:

<details> <summary><b>Text to Motion</b></summary>
bash prepare/download_smpl_files.sh
bash prepare/download_glove.sh
bash prepare/download_t2m_evaluators.sh
</details> <details> <summary><b>Action to Motion</b></summary>
bash prepare/download_smpl_files.sh
bash prepare/download_recognition_models.sh
</details> <details> <summary><b>Unconstrained</b></summary>
bash prepare/download_smpl_files.sh
bash prepare/download_recognition_models.sh
bash prepare/download_recognition_unconstrained_models.sh
</details>

2. Get data

<details> <summary><b>Text to Motion</b></summary>

There are two paths to get the data:

(a) Take the easy way if you just want to generate text-to-motion (this excludes editing, which requires motion capture data)

(b) Get full data to train and evaluate the model.

a. The easy way (text only)

HumanML3D - Clone HumanML3D, then copy the data dir to our repository:

cd ..
git clone https://github.com/EricGuo5513/HumanML3D.git
unzip ./HumanML3D/HumanML3D/texts.zip -d ./HumanML3D/HumanML3D/
cp -r HumanML3D/HumanML3D motion-diffusion-model/dataset/HumanML3D
cd motion-diffusion-model

b. Full data (text + motion capture)

HumanML3D - Follow the instructions in HumanML3D, then copy the result dataset to our repository:

cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D

KIT - Download from HumanML3D (no processing needed this time) and place the result in ./dataset/KIT-ML

</details> <details> <summary><b>Action to Motion</b></summary>

UESTC, HumanAct12

bash prepare/download_a2m_datasets.sh
</details> <details> <summary><b>Unconstrained</b></summary>

HumanAct12

bash prepare/download_unconstrained_datasets.sh
</details>

3. Download the pretrained models

Download the model(s) you wish to use, then unzip and place them in ./save/.
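For example, a minimal sketch of unpacking a downloaded checkpoint into ./save/ (the zip file name below is a placeholder for whichever model you downloaded):

```python
# Illustrative only: unpack a downloaded model zip into ./save/.
# "humanml_trans_enc_512.zip" is a placeholder name; use the file you actually downloaded.
import zipfile
from pathlib import Path

save_dir = Path("./save")
save_dir.mkdir(exist_ok=True)
with zipfile.ZipFile("humanml_trans_enc_512.zip") as zf:
    zf.extractall(save_dir)
```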

<details> <summary><b>Text to Motion</b></summary>

You need only the first one.

HumanML3D

humanml-encoder-512-50steps - Runs 20X faster with comparable performance!

humanml-encoder-512 (best model used in the paper)

humanml-decoder-512

humanml-decoder-with-emb-512

KIT

kit-encoder-512

</details> <details> <summary><b>Action to Motion</b></summary>

UESTC

uestc

uestc_no_fc

HumanAct12

humanact12

humanact12_no_fc

</details> <details> <summary><b>Unconstrained</b></summary>

HumanAct12

humanact12_unconstrained

</details>

Motion Synthesis

<details> <summary><b>Text to Motion</b></summary>

Generate from test set prompts

python -m sample.generate --model_path ./save/humanml_trans_enc_512/model000200000.pt --num_samples 10 --num_repetitions 3

Generate from your text file

python -m sample.generate --model_path ./save/humanml_trans_enc_512/model000200000.pt --input_text ./assets/example_text_prompts.txt

Generate a single prompt

python -m sample.generate --model_path ./save/humanml_trans_enc_512/model000200000.pt --text_prompt "the person walked forward and is picking up his toolbox."
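If you want to script many prompts, one simple approach (a sketch that relies only on the documented --input_text flag above) is to write them to a file and call the CLI from Python:

```python
# Illustrative wrapper around the documented CLI (sample.generate) for batch prompts.
import subprocess
from pathlib import Path

prompts = [
    "the person walked forward and is picking up his toolbox.",
    "a person jumps in place.",   # example prompt, not from the repo's assets
]
prompt_file = Path("my_prompts.txt")
prompt_file.write_text("\n".join(prompts))

subprocess.run([
    "python", "-m", "sample.generate",
    "--model_path", "./save/humanml_trans_enc_512/model000200000.pt",
    "--input_text", str(prompt_file),
], check=True)
```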
</details> <details> <summary><b>Action to Motion</b></summary>

Generate from test set actions

python -m sample.generate --model_path ./save/humanact12/model000350000.pt --num_samples 10 --num_repetitions 3

Generate from your actions file

python -m sample.generate --model_path ./save/humanact12/model000350000.pt --action_file ./assets/example_action_names_humanact12.txt

Generate a single action

python -m sample.generate --model_path ./save/humanact12/model000350000.pt --action_name "drink"
</details> <details> <summary><b>Unconstrained</b></summary>
python -m sample.generate --model_path ./save/unconstrained/model000450000.pt --num_samples 10 --num_repetitions 3

In total, num_samples * num_repetitions motions are generated, and they are visually arranged in a grid of num_samples rows and num_repetitions columns.

</details>

You may also define:

Running those will get you:

It will look something like this:

(example output)

You can stop here, or render the SMPL mesh using the following script.

Render SMPL mesh

To create an SMPL mesh per frame, run:

python -m visualize.render_mesh --input_path /path/to/mp4/stick/figure/file

This script outputs:

Notes:

Notes for 3d makers:

Motion Editing

Unconditioned editing

python -m sample.edit --model_path ./save/humanml_trans_enc_512/model000200000.pt --edit_mode in_between

You may also define:

The output will look like this (blue frames are from the input motion; orange were generated by the model):

(example output)
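Conceptually, in-between editing works like diffusion inpainting over time: at every denoising step the known prefix and suffix frames are written back over the sample, so the model only fills in the masked middle. A minimal sketch of that idea, where denoise_step and q_sample are placeholders rather than the repository's actual functions:

```python
# Conceptual sketch of diffusion in-betweening (masked inpainting over time).
# `denoise_step` and `q_sample` are placeholders for the model's reverse step and
# forward-noising function; this is not the repo's exact implementation.
import torch


def inbetween_sample(denoise_step, q_sample, input_motion, mask, num_steps):
    """
    input_motion: [batch, joints, feats, frames] reference motion
    mask:         same shape, True where frames are KNOWN (prefix/suffix)
    """
    x = torch.randn_like(input_motion)        # start from pure noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)                # one reverse-diffusion step
        known = q_sample(input_motion, t)     # known frames, noised to level t
        x = torch.where(mask, known, x)       # keep known frames, generate the middle
    return x
```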

Text conditioned editing

Just add the text conditioning using --text_condition. For example:

python -m sample.edit --model_path ./save/humanml_trans_enc_512/model000200000.pt --edit_mode upper_body --text_condition "A person throws a ball"

The output will look like this (blue joints are from the input motion; orange were generated by the model):

(example output)

Train your own MDM

<details> <summary><b>Text to Motion</b></summary>

HumanML3D

python -m train.train_mdm --save_dir save/my_humanml_trans_enc_512 --dataset humanml

KIT

python -m train.train_mdm --save_dir save/my_kit_trans_enc_512 --dataset kit
</details> <details> <summary><b>Action to Motion</b></summary>
python -m train.train_mdm --save_dir save/my_name --dataset {humanact12,uestc} --cond_mask_prob 0 --lambda_rcxyz 1 --lambda_vel 1 --lambda_fc 1
</details> <details> <summary><b>Unconstrained</b></summary>
python -m train.train_mdm --save_dir save/my_name --dataset humanact12 --cond_mask_prob 0 --lambda_rcxyz 1 --lambda_vel 1 --lambda_fc 1  --unconstrained
</details>
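For context on the --cond_mask_prob flag used above: during conditioned training the condition is randomly dropped with that probability, which is what enables classifier-free guidance at sampling time; setting it to 0, as in the action-to-motion and unconstrained recipes, disables the dropout. A hedged sketch of the idea, not the repository's exact code:

```python
# Sketch of condition dropout for classifier-free guidance (illustrative).
import torch


def mask_condition(cond_emb: torch.Tensor, cond_mask_prob: float) -> torch.Tensor:
    """Randomly zero out the condition embedding for a fraction of the batch during training."""
    if cond_mask_prob == 0.0:
        return cond_emb                              # --cond_mask_prob 0: condition always kept
    batch = cond_emb.shape[0]
    drop = torch.rand(batch, device=cond_emb.device) < cond_mask_prob
    return cond_emb * (~drop).float().unsqueeze(-1)  # dropped rows become the "unconditioned" case
```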

Evaluate

<details> <summary><b>Text to Motion</b></summary>

HumanML3D

python -m eval.eval_humanml --model_path ./save/humanml_trans_enc_512/model000475000.pt

KIT

python -m eval.eval_humanml --model_path ./save/kit_trans_enc_512/model000400000.pt
</details> <details> <summary><b>Action to Motion</b></summary>
python -m eval.eval_humanact12_uestc --model <path-to-model-ckpt> --eval_mode full

where path-to-model-ckpt can be a path to any of the pretrained action-to-motion models listed above, or to a checkpoint trained by the user.

</details> <details> <summary><b>Unconstrained</b></summary>
python -m eval.eval_humanact12_uestc --model ./save/unconstrained/model000450000.pt --eval_mode full

Precision and recall are not computed to save computing time. If you wish to compute them, edit the file eval/a2m/gru_eval.py and change the string fast=True to fast=False.

</details>

Acknowledgments

This code stands on the shoulders of giants. We thank the following projects that our code is based on:

guided-diffusion, MotionCLIP, text-to-motion, actor, joints2smpl, MoDi.

License

This code is distributed under an MIT license.

Note that our code depends on other libraries (including CLIP, SMPL, SMPL-X, and PyTorch3D) and uses datasets, each of which has its own license that must also be followed.