BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation
[Paper] [Project Page] [Colab]
If you find our code or paper helpful, please consider starring our repository and citing us.
@article{hosseyni2024bad,
title={BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation},
author={Hosseyni, S Rohollah and Rahmani, Ali Ahmad and Seyedmohammadi, S Jamal and Seyedin, Sanaz and Mohammadi, Arash},
journal={arXiv preprint arXiv:2409.10847},
year={2024}
}
News
📢 2024-09-24 --- Initialized the webpage and git project.
Getting Started
1. Conda Environment
For training and evaluation, we used the following conda environment, which is based on the MMM environment:
conda env create -f environment.yml
conda activate bad
pip install git+https://github.com/openai/CLIP.git
We encountered issues when using the above environment for generation and visualization, so we had to use a second environment. You may be able to make the first environment work by changing the versions of some packages, particularly numpy. The second environment is based on the Momask environment, with additional packages such as smplx from the MDM environment.
conda env create -f environment2.yml
conda activate bad2
pip install git+https://github.com/openai/CLIP.git
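To verify that the environment is usable, a minimal sanity check along the following lines can help; it assumes the ViT-B/32 CLIP backbone that is standard in related text-to-motion repositories, and it is only a sketch, not part of this codebase.

```python
# Quick environment sanity check (a sketch; the ViT-B/32 backbone is an assumption).
import numpy as np
import torch
import clip  # installed from https://github.com/openai/CLIP.git

print("numpy:", np.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Encode a sample caption with CLIP, as the text-to-motion pipeline does.
model, _ = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a person jauntily skips forward."])
with torch.no_grad():
    text_features = model.encode_text(tokens)
print("CLIP text feature shape:", tuple(text_features.shape))  # (1, 512)
```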
2. Models and Dependencies
Download Pre-trained Models
bash dataset/prepare/download_models.sh
Download SMPL Files
For rendering.
bash dataset/prepare/download_smpl_files.sh
Download Evaluation Models and GloVe
For evaluation only.
bash dataset/prepare/download_extractor.sh
bash dataset/prepare/download_glove.sh
Troubleshooting
If gdown fails with the error "Cannot retrieve the public link of the file. You may need to change the permission to 'Anyone with the link', or have had many accesses", a potential solution is to run pip install --upgrade --no-cache-dir gdown, as suggested in https://github.com/wkentaro/gdown/issues/43.
(Optional) Download Manually
Visit [Google Drive] to download the models and evaluators manually.
3. Get Data
HumanML3D - We use two 3D human motion-language datasets: HumanML3D and KIT-ML. For both datasets, you can find the details as well as the download links here.
./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/
├── Mean.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
├── Std.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D)
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
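Once the data is in place, a short check like the sketch below can confirm the layout matches the tree above; the paths are the defaults assumed by this README, and the 263-dimensional feature size is the standard HumanML3D representation.

```python
# Verify the HumanML3D folder layout (a sketch with assumed default paths).
import os
import numpy as np

root = "./dataset/HumanML3D"
expected = ["new_joint_vecs", "texts", "Mean.npy", "Std.npy",
            "train.txt", "val.txt", "test.txt", "train_val.txt", "all.txt"]
missing = [name for name in expected if not os.path.exists(os.path.join(root, name))]
print("missing entries:", missing or "none")

# Mean/Std are used to normalize the motion features.
mean = np.load(os.path.join(root, "Mean.npy"))
std = np.load(os.path.join(root, "Std.npy"))
print("feature dims:", mean.shape, std.shape)  # should be (263,) for HumanML3D
```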
KIT-ML - For KIT-ML dataset, you can download and extract it using the following files:
bash dataset/prepare/download_kit.sh
bash dataset/prepare/extract_kit.sh
If you face any issues, you can refer to this link.
Training
Stage 1: VQ-VAE
python train_vq.py --exp_name 'train_vq' \
--dataname t2m \
--total_batch_size 256
--exp_name: The name of your experiment.
--dataname: Dataset name; use t2m for HumanML3D and kit for the KIT-ML dataset.
Stage 2: Transformer
python train_t2m_trans.py --exp_name 'train_tr' \
--dataname t2m \
--time_cond \
--z_0_attend_to_all \
--unmasked_tokens_not_attend_to_mask_tokens \
--total_batch_size 256 \
--vq_pretrained_path ./output/vq/vq_last.pth
--z_0_attend_to_all: Specifies the causality condition for mask tokens, where each mask token attends to the last T-p+1 mask tokens. If z_0_attend_to_all is not activated, each mask token attends to the first p mask tokens.
--time_cond: Uses time as one of the conditions for training the transformer.
--unmasked_tokens_not_attend_to_mask_tokens: Prevents unmasked tokens from attending to mask tokens.
--vq_pretrained_path: The path to your pretrained VQ-VAE.
Evaluation
For sampling with Order-Agnostic Autoregressive Sampling (OAAS), rand_pos should be set to False. rand_pos=False means that the token with the highest probability is always selected, and no top_p, top_k, or temperature is applied. If rand_pos=True, the metrics worsen significantly, whereas for Confidence-Based Sampling (CBS) they improve significantly. We do not know why OAAS performance degrades with random sampling during generation. It may be a bug; we are not sure, and we would be extremely grateful if anyone could help fix this issue.
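The contrast between the two settings can be sketched as follows. This is an illustrative toy example, not the repository's sampling code; the logits, temperature, and top_k values are made up.

```python
# Toy contrast: rand_pos=False (greedy) vs. rand_pos=True (stochastic) token selection.
import torch

token_logits = torch.randn(512)  # hypothetical logits over a codebook of 512 motion tokens

# rand_pos=False: always take the most probable token; no top_k / top_p / temperature.
greedy_token = token_logits.argmax().item()

# rand_pos=True: sample from a tempered, top-k-truncated distribution.
temperature, top_k = 1.0, 50
topk_vals, topk_idx = token_logits.topk(top_k)
probs = torch.softmax(topk_vals / temperature, dim=-1)
random_token = topk_idx[torch.multinomial(probs, 1)].item()

print("greedy:", greedy_token, "| random:", random_token)
```

The OAAS evaluation itself is run as follows: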
python GPT_eval_multi.py --exp_name "eval" \
--sampling_type OAAS \
--z_0_attend_to_all \
--time_cond \
--unmasked_tokens_not_attend_to_mask_tokens \
--num_repeat_inner 1 \
--resume_pth ./output/vq/vq_last.pth \
--resume_trans ./output/t2m/trans_best_fid.pth
--sampling_type: Type of sampling.
--num_repeat_inner: If you want to calculate MModality, it should be above 10, e.g., 20. For the other metrics, 1 is enough.
--resume_pth: The path to your pretrained VQ-VAE.
--resume_trans: The path to your pretrained transformer.
For sampling using Confidence-Based Sampling (CBS), rand_pos=True significantly improves FID compared to CBS with rand_pos=False.
python GPT_eval_multi.py --exp_name "eval" \
--z_0_attend_to_all \
--time_cond \
--sampling_type CBS \
--rand_pos \
--unmasked_tokens_not_attend_to_mask_tokens \
--num_repeat_inner 1 \
--resume_pth ./output/vq/vq_last.pth \
--resume_trans ./output/t2m/trans_best_fid.pth
For evaluation of the four temporal editing tasks (inpainting, outpainting, prefix prediction, suffix prediction), you should use eval_edit.py. We used OAAS to report our results on temporal editing tasks in Table 3 of the paper.
python eval_edit.py --exp_name "eval" \
--edit_task inbetween \
--z_0_attend_to_all \
--time_cond \
--sampling_type OAAS \
--unmasked_tokens_not_attend_to_mask_tokens \
--num_repeat_inner 1 \
--resume_pth ./output/vq/vq_last.pth \
--resume_trans ./output/t2m/trans_best_fid.pth
--edit_task: Four edit tasks are available: inbetween, outpainting, prefix, and suffix.
Generation
To generate a motion sequence, run the following.
python generate.py --caption 'a person jauntily skips forward.' \
--length 196 \
--z_0_attend_to_all \
--time_cond \
--sampling_type OAAS \
--unmasked_tokens_not_attend_to_mask_tokens \
--resume_pth ./output/vq/vq_last.pth \
--resume_trans ./output/t2m/trans_best_fid.pth
--length: The length of the motion sequence. If not provided, a length estimator will be used to predict the length of the motion sequence based on the caption.
--caption: Text prompt used for generating the motion sequence.
For temporal editing, run the following.
python generate.py --temporal_editing \
--caption 'a person jauntily skips forward.' \
--caption_inbetween 'a man walks in a clockwise circle and then sits.' \
--length 196 \
--edit_task inbetween \
--z_0_attend_to_all \
--time_cond \
--sampling_type OAAS \
--unmasked_tokens_not_attend_to_mask_tokens \
--resume_pth ./output/vq/vq_last.pth \
--resume_trans ./output/t2m/trans_best_fid.pth
--caption_inbetween: Text prompt used for generating the inbetween/outpainting/prefix/suffix motion sequence.
--edit_task: Four edit tasks are available: inbetween, outpainting, prefix, and suffix.
For long sequence generation, run the following.
python generate.py --long_seq_generation \
--long_seq_captions 'a person runs forward and jumps.' 'a person crawls.' 'a person does a cart wheel.' 'a person walks forward up stairs and then climbs down.' 'a person sits on the chair and then steps up.' \
--long_seq_lengths 128 196 128 128 128 \
--z_0_attend_to_all \
--time_cond \
--sampling_type OAAS \
--unmasked_tokens_not_attend_to_mask_tokens \
--resume_pth ./output/vq/vq_last.pth \
--resume_trans ./output/t2m/trans_best_fid.pth
--long_seq_generation: Activates long sequence generation.
--long_seq_captions: Specifies multiple captions.
--long_seq_lengths: Specifies multiple lengths (between 40 and 196), corresponding to each caption.
Visualization
The above commands will save .bvh and .mp4 files in the ./output/visualization/ directory. The .bvh file can be rendered in Blender. Please refer to this link for more information.
To render the motion sequence in SMPL, you need to pass the .mp4 and .npy files generated by generate.py to visualization/render_mesh.py. The following command creates .obj files that can be easily imported into Blender. This script runs SMPLify and also requires a GPU.
python visualization/render_mesh.py \
--input_path output/visualization/animation/a_person_jauntily_skips_forwar_196/sample103_repeat0_len196.mp4 \
--npy_path output/visualization/joints/a_person_jauntily_skips_forwar_196/sample103_repeat0_len196.npy
--input_path: Path to the .mp4 file, created by generate.py.
--npy_path: Path to the .npy file, created by generate.py.
For rendering .obj files using Blender, you can use the scripts in the visualization/blender_scripts directory. First, open Blender, then go to File -> Import -> Wavefront (.obj), navigate to the directory containing the .obj files, and press A to select and import all of them. Next, copy and paste the script from visualization/blender_scripts/framing_coloring.py into the Scripting tab in Blender, and run the script. Finally, you can render the animation in the Render tab.
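If you prefer to import the meshes programmatically, a sketch like the one below can be pasted into Blender's Scripting tab instead of importing the .obj files by hand. It is not part of the repository's blender_scripts, and the obj_dir path is an assumption; point it at wherever render_mesh.py wrote the meshes.

```python
# Batch-import .obj meshes from Blender's Scripting tab (illustrative sketch).
import os
import bpy

obj_dir = "output/visualization/animation/sample_obj"  # assumed path; adjust to your output

for name in sorted(os.listdir(obj_dir)):
    if not name.endswith(".obj"):
        continue
    filepath = os.path.join(obj_dir, name)
    if bpy.app.version >= (4, 0, 0):
        bpy.ops.wm.obj_import(filepath=filepath)      # Blender 4.x OBJ importer
    else:
        bpy.ops.import_scene.obj(filepath=filepath)   # legacy importer (Blender 3.x and earlier)
```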
Acknowledgement
We would like to express our sincere gratitude to MMM, Momask, MDM, and T2M-GPT for their outstanding open-source contributions.