Awesome

InterControl

This repository is the official implementation of InterControl : Generate Human Motion Interactions by Controlling Every Joint

Zhenzhi Wang $^1$, Jingbo Wang $^2$, Yixuan Li $^1$, Dahua Lin $^{1,2}$, Bo Dai $^2$.

$^1$ CUHK, $^2$ Shanghai AI Lab.

Interaction Visualization

<table class="center"> <tr style="line-height: 1"> <td style="border: none; text-align: center">Three people are holding hands together.</td> <td style="border: none; text-align: center">Two people are fighting with another person, leading to a 2v1 fighting game.</td> <td style="border: none; text-align: center">Character animation version of 2v1 fighting in a physics simulator.</td> </tr> <tr> <td><img src="./assets/three people hold hands.gif"></td> <td><img src="./assets/three people fighting.gif"></td> <td><img src="./assets/animation-three-people-fighting.gif"></td> </tr> </table> <p style="margin-left: 3em; margin-top: 1em"></p> <table class="center"> <tr style="line-height: 1"> <td style="border: none; text-align: center">A person wins the fighting game and the referee holding his hands up to celebrate his success.</td> <td style="border: none; text-align: center">Two people are fighting with each other (1v1 fighting game).</td> <td style="border: none; text-align: center">Character animation version of 1v1 fighting in a physics simulator.</td> </tr> <tr> <td><img src="./assets/winner gesture in fighting games.gif"></td> <td><img src="./assets/two people fighting.gif"></td> <td><img src="./assets/animation-two-people-fighting.gif"></td> </tr> </table> <p style="margin-left: 2em; margin-top: 1em"></p> <table class="center"> <tr style="line-height: 1"> <td style="border: none; text-align: center">Two people are dancing together (sample 1).</td> <td style="border: none; text-align: center">Two people are dancing together (sample 2).</td> <td style="border: none; text-align: center">Two people are dancing together (sample 3).</td> </tr> <tr> <td><img src="./assets/two people dance-sample1.gif"></td> <td><img src="./assets/two people dance-sample2.gif"></td> <td><img src="./assets/two people dance-sample3.gif"></td> </tr> </table> <p style="margin-left: 2em; margin-top: 1em"></p>

Abstract

Text-conditioned human motion generation model has achieved great progress by introducing diffusion models and corresponding control signals. However, the interaction between humans are still under explored. To model interactions of arbitrary number of humans, we define interactions as human joint pairs that are either in contact or separated, and leverage Large Language Model (LLM) Planner to translate interaction descriptions into contact plans. Based on the contact plans, interaction generation could be achieved by spatially controllable motion generation methods by taking joint contacts as spatial conditions. We present a novel approach named InterControl for flexible spatial control of every joint in every person at any time by leveraging motion diffusion model only trained on single-person data. We incorporate a motion controlnet to generate coherent and realistic motions given sparse spatial control signals and a loss guidance module to precisely align any joint to the desired position in a classifier guidance manner via Inverse Kinematics (IK). Extensive experiments on HumanML3D and KIT-ML dataset demonstrate its effectiveness in versatile joint control. We also collect data of joint contact pairs by LLMs to show InterControl's ability in human interaction generation.

Getting started

Our code is developed from PriorMDM, therefore shares similar dependencies and setup instructions, which requires:

Python 3.8
conda3 or miniconda3
CUDA capable GPU (one is enough)

1. Setup environment (similar to PriorMDM)

Install ffmpeg (if not already installed):

sudo apt update
sudo apt install ffmpeg

Setup conda env:

conda env create -f environment.yml
conda activate InterControl
python -m spacy download en_core_web_sm
pip install git+https://github.com/openai/CLIP.git
pip install git+https://github.com/GuyTevet/smplx.git

2. Get MDM dependencies

<details> <summary><b>If you already have an installed MDM</b></summary>

Link from installed MDM

Before running the following bash script, first change the path to the full path to your installed MDM

bash prepare/link_mdm.sh

</details> <details> <summary><b>First time user</b></summary>

Download dependencies:

bash prepare/download_smpl_files.sh
bash prepare/download_glove.sh
bash prepare/download_t2m_evaluators.sh

Get HumanML3D dataset :

Follow the instructions in HumanML3D, then copy the result dataset to our repository:

cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D

</details>

3. Download weights trained on HumanML3D dataset

Download the model(s) you wish to use, then unzip and place it in ./save/.

InterControl weights with loss guidance on $\mu_t$

all joints control, finetuned for sparse signals in temporal mask0.25_bfgs5_posterior_all
all joints control, checkpoint for HumanML3D dataset evalution mask1_bfgs5_posterior_all

InterControl weights with loss guidance on $x_0$

all joints control mask1_x0_all
pelvis control mask1_x0_pelvis

MDM weights (needed for InterControl training)

my_humanml-encoder-512

Single-Person Motion Generation

Sampling

Loss Guidance on $\mu_t$

python -m sample.global_joint_control --model_path save/mask0.25_bfgs5_posterior_all/model000140000.pt \
--num_samples 32 --use_posterior --control_joint all

It will visualize generated motions in the format of skeletons. To render SMPL meshes, please refer to the following section.

Render SMPL mesh

The rendering part is exactly the same as PriorMDM. We make no changes to it, except for a little bug that they add the root offset to the mesh twice. The following is the original instruction from PriorMDM.

To create SMPL mesh per frame run:

python -m visualize.render_mesh --input_path /path/to/mp4/stick/figure/file

This script outputs:

sample##_rep##_smpl_params.npy - SMPL parameters (thetas, root translations, vertices and faces)
sample##_rep##_obj - Mesh per frame in .obj format.

Notes:

The .obj can be integrated into Blender/Maya/3DS-MAX and rendered using them.
This script is running SMPLify and needs GPU as well (can be specified with the --device flag).
Important - Do not change the original .mp4 path before running the script.

Notes for 3d makers:

You have two ways to animate the sequence:
1. Use the SMPL add-on and the theta parameters saved to sample##_rep##_smpl_params.npy (we always use beta=0 and the gender-neutral model).
2. A more straightforward way is using the mesh data itself. All meshes have the same topology (SMPL), so you just need to keyframe vertex locations. Since the OBJs are not preserving vertices order, we also save this data to the sample##_rep##_smpl_params.npy file for your convenience.

By adjusting the camera position and the lighting, you can get the same results as our interaction demo.

Evaluation

Select checkpoint to be evluated by sepcifying the model_path, and use replication_times for multiple evaluations and get average results, the following evaluation script will generate motions for 10 times.

Loss Guidance on $\mu_t$

python3 -m eval.eval_controlmdm --model_path save/mask1_bfgs5_posterior_all/model000120000.pt \
--replication_times 10 --mask_ratio 1 --bfgs_times_first 5 \
--bfgs_times_last 10 --bfgs_interval 1 --use_posterior \
--control_joint all

Loss Guidance on $x_0$

python3 -m eval.eval_controlmdm --model_path save/mask1_x0_all/model000160000.pt \
--replication_times 10 --mask_ratio 1 --bfgs_times_first 1 \
--bfgs_times_last 10 --bfgs_interval 1 \
--control_joint all

Human Interaction Generation

Sampling

Two-people Interaction Sampling It requires information in sample.json to generate interactions. The information could be copied from ./assets/all_plans.json (our collected interaction plans from LLM planner) to generate different interactions.

python -m sample.interactive_global_joint_control \
--model_path save/mask0.25_bfgs5_posterior_all/model000140000.pt \
--multi_person --bfgs_times_first 5 --bfgs_times_last 10 \
--interaction_json './assets/sample.json' \

It will visualize generated motions in the format of skeletons. To render SMPL meshes, please refer to rendering section in single-person motion generation.

More than 3 people interaction sampling, need hand-crafted masks for each person

python -m sample.more_people_global_joint_control \
--model_path save/mask0.25_bfgs5_posterior_all/model000140000.pt \
--multi_person --bfgs_times_first 5 --bfgs_times_last 10 --use_posterior \

Evaluation

Loss Guidance on $\mu_t$

python3 -m eval.eval_interaction --model_path save/mask0.25_bfgs5_posterior_all/model000140000.pt \
--replication_times 10 --bfgs_times_first 5 --bfgs_times_last 10 --bfgs_interval 1 \
--use_posterior  --control_joint all \
--interaction_json './assets/all_plans.json' \
--multi_person

Training InterControl on HumanML3D

The model will save in the directory ./save/ + values in --save_dir. It requires pretrained MDM weights, which can be downloaded from my_humanml-encoder-512. Put the downloaded weights in ./save/ and make sure the checkpoint location is ./save/humanml_trans_enc_512/model000475000.pt.

Loss Guidance on $\mu_t$

python3 -m train.train_global_joint_control --save_dir save/mask1_bfgs5_posterior_all \
--dataset humanml --inpainting_mask global_joint --lr 0.00001 --mask_ratio 1 --control_joint all \
--use_posterior --bfgs_times_first 5

Loss Guidance on $x_0$

python3 -m train.train_global_joint_control --save_dir save/mask1_x0_all \
--dataset humanml --inpainting_mask global_joint --lr 0.00001 --mask_ratio 1 --control_joint all \
--bfgs_times_first 0

Only for pelvis control

python3 -m train.train_global_joint_control --save_dir save/mask1_x0_pelvis \
--dataset humanml --inpainting_mask global_joint --lr 0.00001 --mask_ratio 1 --control_joint pelvis \
--bfgs_times_first 0

Bibtex

If you find this code useful in your research, please cite:

@article{wang2023intercontrol,
  title={InterControl: Generate Human Motion Interactions by Controlling Every Joint},
  author={Wang, Zhenzhi and Wang, Jingbo and Lin, Dahua and Dai, Bo},
  journal={arXiv preprint arXiv:2311.15864},
  year={2023}
}

Acknowledgments

This code is standing on the shoulders of giants. We want to thank the following contributors that our code is based on:

GMD, PriorMDM, MDM, guided-diffusion, MotionCLIP, text-to-motion, actor, joints2smpl, TEACH.

License

This code is distributed under an MIT LICENSE.

Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.