Language2Pose: Natural Language Grounded Pose Forecasting

[Paper][Webpage]

There are 5 steps to running this code: setting up the Python virtual environment, preparing the data, training, sampling, and rendering.


PS: The implementation of one of the baselines, proposed by Lin et al. [1], was not publicly available, so we use our own implementation of their model to generate all the results and animations marked as Lin et al. Due to differences in training hyperparameters, dataset and experiments, the numbers reported for Lin et al. in our paper differ from the ones in the original paper [1].

PS: This repo, at the moment, is functional at best. Feel free to create issues/pull requests however you see fit.


Python Virtual Environment

Anaconda is recommended for creating the virtual environment.

conda env create -f env.yaml
source activate torch

pycasper is used to handle the logistics of saving/loading models.

git clone https://github.com/chahuja/pycasper
cd src 
ln -s ../pycasper/pycasper .
cd ..

Data

Download

We use the KIT Motion-Language Dataset, which can be downloaded as follows:

wget https://motion-annotation.humanoids.kit.edu/downloads/4/2017-06-22.zip
mkdir -p dataset/kit-mocap
unzip 2017-06-22.zip -d dataset/kit-mocap
rm 2017-06-22.zip 
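
If wget or unzip is not available, the same step can be sketched in Python. This is a minimal sketch using only the standard library; the URL and target directory are the ones from the commands above, while the function names download and extract are hypothetical and not part of this repo.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the wget command above.
DATASET_URL = "https://motion-annotation.humanoids.kit.edu/downloads/4/2017-06-22.zip"


def download(url, zip_path):
    """Fetch the archive to zip_path (needs network access)."""
    urllib.request.urlretrieve(url, zip_path)
    return Path(zip_path)


def extract(zip_path, dest="dataset/kit-mocap"):
    """Unpack zip_path into dest and return the extracted top-level names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

Calling extract(download(DATASET_URL, "2017-06-22.zip")) from the repo root would mirror the shell commands, apart from deleting the archive afterwards.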

Download Word2Vec binaries

Download the binary file here and place it in src/s2v

Pre-trained Models

Download pretrained models here and place them in src/save

Preprocessing

python data/data.py -dataset KITMocap -path2data ../dataset/kit-mocap

Rendering Ground Truths

python render.py -dataset KITMocap -path2data ../dataset/kit-mocap/new_fke -feats_kind fke

Calculating mean+variance for Z-Normalization

python dataProcessing/meanVariance.py -mask '[0]' -feats_kind rifke -dataset KITMocap -path2data ../dataset/kit-mocap -f_new 8
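
For reference, Z-normalization rescales every feature to zero mean and unit variance, using statistics gathered over the data. The snippet below is a minimal standard-library sketch of that idea. It is not the code in dataProcessing/meanVariance.py: the function names and the eps guard against zero variance are assumptions made for illustration, and the actual script presumably works on arrays rather than Python lists.

```python
import math
from statistics import fmean, pvariance


def mean_variance(frames):
    """Per-feature mean and population variance.

    frames: list of equal-length feature vectors, one per pose frame.
    """
    columns = list(zip(*frames))  # transpose: one tuple per feature
    means = [fmean(col) for col in columns]
    variances = [pvariance(col, mu) for col, mu in zip(columns, means)]
    return means, variances


def z_norm(frame, means, variances, eps=1e-8):
    """Scale one frame to zero mean and unit variance per feature."""
    return [(x - m) / math.sqrt(v + eps)
            for x, m, v in zip(frame, means, variances)]


def inverse_z_norm(frame, means, variances, eps=1e-8):
    """Undo z_norm, e.g. before rendering predicted poses."""
    return [z * math.sqrt(v + eps) + m
            for z, m, v in zip(frame, means, variances)]
```

The inverse is needed because a model trained on normalized features also predicts normalized features, which have to be mapped back to the original scale before they can be drawn.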

Training

We train the models using the script train_wordConditioned.py. (Pardon the misnomer; initially it was supposed to be word-conditioned pose forecasting, but then I ended up adding sentence-conditioned pose forecasting as well and was too lazy to change the filename.)

All the arguments (and their corresponding help texts) used for training can be found in src/argsUtils.py. (PS: Some of them might be deprecated, but I have not removed them in case that breaks any of the other code that I might have written in the experimentation phase. Please raise an issue or send me an email if you have any questions about any of the arguments.) It is best to stick to the args used in the examples if you want to play with the models in the paper.

python train_wordConditioned.py -batch_size 100 -cpk jl2p -curriculum 1 -dataset KITMocap -early_stopping 1 -exp 1 -f_new 8 -feats_kind rifke -losses "['SmoothL1Loss']" -lr 0.001 -mask "[0]" -model Seq2SeqConditioned9 -modelKwargs "{'hidden_size':1024, 'use_tp':False, 's2v':'lstm'}" -num_epochs 1000 -path2data ../dataset/kit-mocap -render_list subsets/render_list -s2v 1 -save_dir save/model/ -tb 1 -time 16 -transforms "['zNorm']" 

-modelKwargs needs some explanation, as its keys can vary based on the model:

hidden_size: size of the joint embedding
use_tp: use a trajectory predictor [1]. False for JL2P models
s2v: sentence to vector model ('lstm' or 'bert')

The command above trains the JL2P model. For the Lin et al. [1] baseline, run

python train_seq2seq.py -batch_size 100 -cpk lin -curriculum 0 -dataset KITMocap -early_stopping 1 -exp 1 -f_new 8 -feats_kind rifke -losses "['MSELoss']" -lr 0.001 -mask "[0]" -model Seq2Seq -modelKwargs "{'hidden_size':1024, 'use_tp':True, 's2v':'lstm'}" -num_epochs 1000 -path2data ../dataset/kit-mocap -render_list subsets/render_list -s2v 1 -save_dir save/model -tb 1 -time 16 -transforms "['zNorm']"

This model, the Lin et al. [1] baseline, has 2 training steps. train_seq2seq.py first trains a seq2seq model to learn an embedding for pose sequences. Once that training is complete, train_wordConditioned.py is called to learn a mapping from language embeddings to pose embeddings.
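
The -modelKwargs value in the training commands above is a Python dict literal wrapped in a quoted string. One safe way to turn such a string into a dict is ast.literal_eval, sketched below. Whether src/argsUtils.py parses the flag this way is an assumption, and parse_model_kwargs is a hypothetical helper.

```python
import ast


def parse_model_kwargs(text):
    """Turn a dict literal passed on the command line into a real dict."""
    kwargs = ast.literal_eval(text)  # evaluates literals only, never code
    if not isinstance(kwargs, dict):
        raise ValueError("-modelKwargs must be a dict literal")
    return kwargs


# The JL2P settings from the training command above.
kwargs = parse_model_kwargs("{'hidden_size':1024, 'use_tp':False, 's2v':'lstm'}")
```

This also shows why the shell command wraps the value in double quotes and uses single quotes inside: the whole literal has to reach Python as one argument.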


Sampling

Sampling from trained Models

The training scripts will sample after the stopping criterion has been reached, but if you would like to sample manually, run the following script:

python sample_wordConditioned.py -load <path-to-weights.p>

<path-to-weights.p> ends in _weights.p
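
Since the file names are long, it can help to list the candidate checkpoints first. A small sketch, assuming the weights were written somewhere below the -save_dir used in training (save/model in the examples above); find_weights and sample_commands are hypothetical helpers, not part of this repo.

```python
from pathlib import Path


def find_weights(save_dir="save/model"):
    """Return checkpoint paths under save_dir whose names end in _weights.p."""
    return sorted(Path(save_dir).rglob("*_weights.p"))


def sample_commands(save_dir="save/model"):
    """Build one sampling command per checkpoint found."""
    return [
        f"python sample_wordConditioned.py -load {path.as_posix()}"
        for path in find_weights(save_dir)
    ]
```

Running print("\n".join(sample_commands())) from src prints the commands, which can then be copied into the shell.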

Using Pretrained Models

Make sure you have downloaded the pre-trained models as described in Pre-trained Models above.

python sample_wordConditioned.py -load save/jl2p/exp_726_cpk_jointSampleStart_model_Seq2SeqConditioned9_time_16_chunks_1_weights.p
python sample_wordConditioned.py -load save/lin-et-al/exp_700_cpk_mooney_model_Seq2SeqConditioned10_time_16_chunks_1_weights.p 

Rendering

After sampling, it is useful to see what animations the model generates. We only use the test samples for rendering.

If possible, use a machine with many CPU cores, as rendering animations with matplotlib is painfully slow. render.py uses all the available cores for parallel processing.
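
To illustrate why more cores help, here is a rough sketch of spreading independent rendering jobs over worker processes with the standard library. It does not reflect how render.py is actually written: render_one is a hypothetical placeholder for drawing one animation, and the .mp4 file names are made up.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_count(num_jobs, num_cores=None):
    """Use every available core, but never more workers than jobs."""
    cores = num_cores or os.cpu_count() or 1
    return max(1, min(cores, num_jobs))


def render_one(sample_id):
    """Hypothetical placeholder for drawing one animation with matplotlib."""
    return f"{sample_id}.mp4"


def render_all(sample_ids):
    """Render each sample in its own worker process, preserving input order."""
    workers = worker_count(len(sample_ids))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_one, sample_ids))
```

Each animation is independent of the others, so the total time drops roughly in proportion to the number of cores. On platforms that start workers by spawning a fresh interpreter, render_all should be called from under an if __name__ == "__main__": guard.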

Using your trained model

python render.py -dataset KITMocap -load <path-to-weights.p> -feats_kind fke -render_list subsets/render_list

Using pre-trained Models

python render.py -dataset KITMocap -load save/jl2p/exp_726_cpk_jointSampleStart_model_Seq2SeqConditioned9_time_16_chunks_1_weights.p -feats_kind fke -render_list subsets/render_list
python render.py -dataset KITMocap -load save/lin-et-al/exp_700_cpk_mooney_model_Seq2SeqConditioned10_time_16_chunks_1_weights.p -feats_kind fke -render_list subsets/render_list

References

[1]: Lin, Angela S., et al. "Generating Animated Videos of Human Activities from Natural Language Descriptions." Learning 2018 (2018).