Text to Motion Retrieval

Overview

This is the official code for reproducing the results reported in the short paper Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language.

:fire: This paper won the Best Short Paper Award Honorable Mention at SIGIR 2023.

This repository is intended as a codebase that is easy to extend with new motion encoders, text encoders, and loss functions, in order to develop novel approaches for text-to-motion (and motion-to-text) retrieval.

<table> <tr> <td><i>"A person walks in a counterclockwise circle"</i></td> <td><img src="teaser/example_74.gif" alt="A person walks in a counterclockwise circle" width=100%></td> </tr> <tr> <td><i>"A person is kneeling down on all four legs and begins to crawl"</i></td> <td><img src="teaser/example_243.gif" alt="A person is kneeling down on all four legs and begins to crawl" width=100%></td> </tr> </table>

Performance

Results on the HumanML3D dataset:

| Text Model | Motion Model | R@1 | R@5 | R@10 | Mean rank | Median rank | SPICE | spaCy |
|---|---|---|---|---|---|---|---|---|
| BERT+LSTM | BiGRU | 2.9 | 11.8 | 19.8 | 253.9 | 55 | 0.250 | 0.768 |
| BERT+LSTM | UpperLowerGRU | 2.4 | 10.5 | 17.7 | 285.7 | 68 | 0.242 | 0.763 |
| BERT+LSTM | DG-STGCN | 2.0 | 8.4 | 14.4 | 242.0 | 73 | 0.231 | 0.767 |
| BERT+LSTM | MoT (ours) | 2.5 | 11.2 | 19.4 | 234.5 | 51 | 0.247 | 0.768 |
| CLIP | BiGRU | 3.4 | 14.3 | 23.1 | 201.9 | 43 | 0.272 | 0.780 |
| CLIP | UpperLowerGRU | 3.1 | 12.6 | 20.8 | 200.4 | 47 | 0.269 | 0.779 |
| CLIP | DG-STGCN | 4.1 | 16.0 | 26.5 | 159.6 | 33 | 0.291 | 0.789 |
| CLIP | MoT (ours) | 3.5 | 14.8 | 24.5 | 166.2 | 38 | 0.280 | 0.785 |

Results on the KIT Motion-Language (KIT-ML) dataset:

| Text Model | Motion Model | R@1 | R@5 | R@10 | Mean rank | Median rank | SPICE | spaCy |
|---|---|---|---|---|---|---|---|---|
| BERT+LSTM | BiGRU | 3.7 | 15.2 | 23.8 | 72.3 | 30 | 0.271 | 0.706 |
| BERT+LSTM | UpperLowerGRU | 3.2 | 15.7 | 25.3 | 90.2 | 34 | 0.263 | 0.697 |
| BERT+LSTM | DG-STGCN | 6.2 | 24.5 | 38.2 | 40.6 | 17 | 0.339 | 0.740 |
| BERT+LSTM | MoT (ours) | 5.3 | 21.3 | 32.0 | 51.1 | 20 | 0.318 | 0.723 |
| CLIP | BiGRU | 6.6 | 21.5 | 32.3 | 52.0 | 22 | 0.316 | 0.729 |
| CLIP | UpperLowerGRU | 6.4 | 22.0 | 32.2 | 52.3 | 22 | 0.321 | 0.732 |
| CLIP | DG-STGCN | 7.2 | 26.1 | 38.2 | 36.9 | 16 | 0.355 | 0.751 |
| CLIP | MoT (ours) | 6.5 | 26.4 | 42.6 | 35.5 | 14 | 0.352 | 0.748 |

Getting started

This code was tested on Ubuntu 18.04 LTS; the required dependencies are installed through the environment setup below.

1. Setup environment

Clone this repo and move into it:

git clone https://github.com/mesnico/text-to-motion-retrieval
cd text-to-motion-retrieval

Create a new conda environment and activate it:

conda create -n t2m python=3.10
conda activate t2m

Install dependencies:

pip install -r requirements.txt

2. Data preparation

HumanML3D - Follow the instructions in the HumanML3D repository, then copy the resulting dataset into this repository:

cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D

KIT - Download it from the HumanML3D repository (no processing is needed this time) and place the result in ./dataset/KIT-ML
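
As a quick sanity check (a minimal sketch; it only verifies that the dataset folders expected by this repository exist), you can run:

```python
from pathlib import Path

# Sanity check: both datasets should be in place under ./dataset
# before computing the text similarities below.
for name in ["HumanML3D", "KIT-ML"]:
    path = Path("dataset") / name
    print(f"{name}: {path} ({'found' if path.is_dir() else 'MISSING'})")
```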

Compute Text Similarities - This step pre-computes the relevances needed by the NDCG metric during validation and testing.

<!-- You have to download the precomputed text similarities from ... and place them under `outputs/computed_relevances`. -->

You can compute them by running the following command:

python text_similarity_utils/compute_relevance --set [val|test] --method spacy --dataset [kit|humanml]

Note 1: differently from the paper, here we compute only the spaCy relevance, since SPICE is slow and cumbersome to compute.

Note 2: to avoid errors, you should compute the similarities for all four parameter configurations, i.e., invoke the command for every combination of set (val, test) and dataset (kit, humanml).
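
For convenience, the four invocations can also be scripted. The following is a minimal sketch that simply shells out to the command above with the same arguments:

```python
import itertools
import subprocess

# Run the relevance computation for every (set, dataset) combination,
# mirroring the command shown above.
for split, dataset in itertools.product(["val", "test"], ["kit", "humanml"]):
    subprocess.run(
        ["python", "text_similarity_utils/compute_relevance",
         "--set", split, "--method", "spacy", "--dataset", dataset],
        check=True,
    )
```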

3. Train

Run the command

bash reproduce_train.sh

Modify the script as needed to include or exclude models and loss functions. Training creates a ./runs folder where checkpoints and training metrics (TensorBoard logs) are stored for each model.

4. Test

Run the command

bash reproduce_eval.sh

Modify the script so that it matches the models and loss functions included in reproduce_train.sh. Configurations that have not been trained will be skipped.

5. Result tables

Open the show.ipynb notebook to produce the tables shown in the paper.

NOTE: this notebook is still a work in progress, so it may raise runtime errors unless some changes are made to the code.

6. Visualize retrieval results

Run the command

bash render.sh

Modify the script to select a specific model and specific query IDs. The resulting videos are placed in the outputs/renders folder.

Implementation Details

This repo is fully modular and based on the Hydra framework, which makes it easy to extend our text-to-motion framework with custom motion encoders, text encoders, and loss functions.

The config folder is the root configuration path for Hydra.

To add a new motion encoder, text encoder, or loss function, you need to:

  1. Write the PyTorch code and place it inside the appropriate losses, motions, or texts folder (a minimal sketch follows this list)
  2. Expose your class as a module in the __init__.py file
  3. Write a .yaml configuration file, making sure that module._target_ points to that class
  4. Update the reproduce_train.sh and reproduce_eval.sh files to add the new configuration (without the .yaml extension) to the pool of experiments
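
As an illustration of step 1, here is a minimal sketch of a new motion encoder. The class name, constructor arguments, and tensor shapes are hypothetical; mirror an existing encoder in the motions folder for the exact interface this codebase expects.

```python
import torch
import torch.nn as nn

# Hypothetical motion encoder: a bidirectional GRU over the frame
# features, projected to the joint text-motion embedding space.
class MyMotionEncoder(nn.Module):
    def __init__(self, input_dim: int = 263, hidden_dim: int = 512, embed_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, num_frames, input_dim) -> (batch, embed_dim)
        _, hidden = self.gru(motion)                      # (2, batch, hidden_dim)
        hidden = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.proj(hidden)
```

Its .yaml configuration file would then point module._target_ to this class (the exact import path depends on how the class is exposed in __init__.py).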

Acknowledgements

This repo is largely based on the Motion Diffusion Model (MDM) repository. We are grateful to its authors.

License

This code is distributed under the MIT license.

Note that our code depends on other libraries, including TERN, TERAN, CLIP, SMPL, and SMPL-X, and uses datasets; each of these has its own license that must also be followed.