
An implementation of "Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation"

Maksym Andriushchenko, Fan Yue

This is a TensorFlow implementation of the paper, which has become quite influential in human pose estimation (~450 citations).

Here are a few examples of joint detection on the FLIC dataset produced with our implementation:

<img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm1.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm2.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm3.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm4.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm5.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm6.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm7.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ap_sm8.png" height="250"/>

Main Idea

The authors propose a fully convolutional approach. As input they use multiple copies of the image at different resolutions, which capture spatial context of different sizes. These images are processed by a series of 5x5 convolutional and max-pooling layers. The feature maps from the different resolutions are then added up, followed by two large 9x9 convolutions. The final layer with 90x60xK feature maps (where K is the number of joints) contains the predicted heat maps. On top of them we apply a softmax and a cross-entropy loss against the ground-truth heat maps, which we form by placing a small 3x3 binomial kernel at the actual joint position (see data.py and the sketch below).
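As an illustration, here is a minimal numpy sketch of how such a ground-truth heat map can be formed; the function name and default sizes are our own for illustration, the actual code lives in data.py:

```python
import numpy as np
from scipy.signal import convolve2d

def make_gt_heatmap(joint_x, joint_y, height=60, width=90):
    """Hypothetical sketch of the ground-truth heat map construction:
    a delta at the joint location smoothed with a 3x3 binomial kernel
    (outer product of [1, 2, 1] / 4 with itself)."""
    heatmap = np.zeros((height, width), dtype=np.float32)
    heatmap[joint_y, joint_x] = 1.0
    binomial_1d = np.array([1.0, 2.0, 1.0]) / 4.0
    kernel = np.outer(binomial_1d, binomial_1d)  # 3x3 kernel, sums to 1
    return convolve2d(heatmap, kernel, mode='same')
```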

<!-- <img src="report/img/cnn_architecture.png" width=200/> -->

Note that the input resolution of 320x240 is not justified in the paper, and it is not clear how one can arrive at 98x68 or 90x60 feature maps after two max-pooling layers. It is quite hard to guess what the authors really did here. Instead, we do what makes more sense to us: we process the full-resolution 720x480 images, but apply the first convolution with stride=2, which makes all feature map dimensions comparable in size to those proposed in the paper.
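For concreteness, here is a rough TensorFlow sketch of this entry point; the number of filters and the layer names are illustrative assumptions, not the exact values from our code:

```python
import tensorflow as tf

# Illustrative sketch (TF 1.x style): full-resolution 720x480 input,
# first 5x5 convolution applied with stride=2 so that two subsequent
# max-pooling layers bring the maps down to 90x60.
images = tf.placeholder(tf.float32, [None, 480, 720, 3])
conv1 = tf.layers.conv2d(images, filters=128, kernel_size=5, strides=2,
                         padding='same', activation=tf.nn.relu)  # 240x360
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)   # 120x180
conv2 = tf.layers.conv2d(pool1, filters=128, kernel_size=5, strides=1,
                         padding='same', activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)   # 60x90
```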

The described part detector already gives good results; however, it also makes some mistakes that can potentially be ruled out by applying a spatial model. For example, in the third image below there are many false detections of hips (pink color), which clearly violate kinematic constraints w.r.t. the nose and shoulders, which are usually detected with very high accuracy.

<img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ex1.png" height="300"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ex2.png" height="300"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/ex3.png" height="300"/>

So the goal is to get rid of such false positives that clearly violate kinematic constraints. Traditionally, a probabilistic graphical model has been used for this purpose. One of the most popular choices is a tree-structured graphical model, since it admits exact inference that is efficient when Gaussian pairwise priors are used, which is most often the case. Some approaches combined exact inference with a hierarchical graphical model structure. Other approaches relied on approximate inference in a loopy graphical model, which makes it possible to establish connections between symmetric parts.

An important novelty of this paper is that the spatial model is formulated as a fully connected graphical model whose parameters are trained jointly with the part detector. Thus the graphical model can be learned from the data, and there is no need to design it for a specific task and dataset, which is a clear advantage.
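Concretely, the spatial model combines, in log space, each joint's heat map with the heat maps of all other joints convolved with learned pairwise priors. Below is a rough numpy sketch of one such pass under our reading of the paper; the data layout (dicts keyed by joint pairs) is an assumption made for illustration:

```python
import numpy as np
from scipy.signal import convolve2d

def softplus(x):
    return np.log1p(np.exp(x))

def spatial_model(heatmaps, pairwise, biases):
    """Hypothetical sketch of the spatial model: the refined heat map
    for joint a accumulates, in log space, the heat maps of all joints b
    convolved with learned pairwise priors e(a|b). `pairwise` and
    `biases` are dicts keyed by the pair (a, b)."""
    n_joints = len(heatmaps)
    refined = []
    for a in range(n_joints):
        log_marginal = 0.0
        for b in range(n_joints):
            prior = softplus(pairwise[(a, b)])  # keep priors non-negative
            msg = convolve2d(heatmaps[b], prior, mode='same')
            log_marginal = log_marginal + np.log(msg + softplus(biases[(a, b)]))
        refined.append(np.exp(log_marginal))
    return refined
```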

Below are a few examples of how our spatial model, trained jointly with the part detector, performs compared to the part detector alone. In the first example there is a detection of the hip of the backward-facing person. However, this hip does not have any other body parts in its vicinity, so it is ruled out by the spatial model. In the second example there are a few spurious joint detections on the person standing on the right, and also a minor detection of a wrist on the left (a small yellow cloud). All of them are ruled out by the spatial model. Note that some mistakes remain, but the rigorous model evaluation in the Evaluation section below reveals that we indeed get a significant improvement by applying the spatial model.

<img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/pd1.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/sm1.png" height="250"/>

<img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/pd2.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/sm2.png" height="250"/>

Reproducibility challenge

Surprisingly, we didn't find any implementation on the internet. This can be explained by the fact that the original paper lists no hyperparameters and does not provide all the necessary implementation details. Thus, it is extremely hard to reproduce, and we decided to add several reasonable modifications from recent papers to improve the results. However, we kept the architecture of the CNN and the graphical model unchanged.

Differences from the paper

We introduced the following improvements to the model:

Other important implementation details

Evaluation

The evaluation of our model is presented in the two plots below, followed by two plots from the original paper.

<img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/our_pd_detrate.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/our_pdsm_detrate.png" height="250"/>

<img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/orig_wrist_detrate.png" height="250"/> <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/orig_pdsm_detrate.png" height="250"/>

A few observations, considering a radius of 10 normalized pixels for the analysis:

Interpretation of the spatial model

The most interesting question is what kind of pairwise-potential parameters were learned by backpropagation. We show them below. Note that we show pre-softplus values; the post-softplus values look similar, except that negative values are mapped close to zero. White denotes high parameter values, and dark denotes low values.
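To see why the two visualizations look alike, here is a quick numpy check of how softplus behaves (our own illustration, not code from the repository):

```python
import numpy as np

def softplus(x):
    # Maps negative values close to zero while leaving large
    # positive values almost unchanged.
    return np.log1p(np.exp(x))

print(softplus(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# -> [0.0067 0.3133 0.6931 1.3133 5.0067]
```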

| Initialized parameters | Pairwise parameters after 60 epochs | Pairwise biases after 60 epochs |
| --- | --- | --- |
| <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/0epoch_nose_torso.png" height="180"/> | <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/60epoch_nose_torso.png" height="180"/> | <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/60epoch_bias_nose_torso.png" height="180"/> |
| initial energy of nose\|torso | energy nose\|torso after 60 epochs | bias nose\|torso after 60 epochs |
| <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/0epoch_rsho_torso.png" height="180"/> | <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/60epoch_rsho_torso.png" height="180"/> | <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/60epoch_bias_rsho_torso.png" height="180"/> |
| initial energy of rsho\|torso | energy rsho\|torso after 60 epochs | bias rsho\|torso after 60 epochs |
| <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/0epoch_relb_torso.png" height="180"/> | <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/60epoch_relb_torso.png" height="180"/> | <img src="https://raw.githubusercontent.com/max-andr/joint-cnn-mrf/master/report/img/60epoch_bias_relb_torso.png" height="180"/> |
| initial energy of relb\|torso | energy relb\|torso after 60 epochs | bias relb\|torso after 60 epochs |

We show only the potentials of joints conditioned on the torso, because they lead to the most distinct patterns. In contrast, e(lhip|rwri) is almost uniform, which means that this connection in the graphical model is redundant.

Our main observations:

How to run the code

  1. Download the FLIC dataset (FLIC.zip from here).
  2. Run python data.py to process the raw data into x_train_flic.npy, x_test_flic.npy, y_train_flic.npy, y_test_flic.npy. Note that there are two variables, meta_info_file = 'data_FLIC.mat' and images_dir = './images_FLIC/', that you may need to change. data_FLIC.mat is just another name for examples.mat from FLIC.zip, and images_FLIC/ is the directory with all the images, which is simply called images in FLIC.zip.
  3. Run python pairwise_distr.py to get the file pairwise_distribution.pickle, which contains a dictionary of numpy arrays with the empirical histograms of joint displacements. This is the smart initialization of the spatial model described in the paper (see the sketch after this list).
  4. Now you can run the training of the model in a multi-GPU setting: python main.py --gpus 2 3 --train --data_augm --use_sm --n_epochs=60 --batch_size=14 --optimizer=adam --lr=0.001 --lmbd=0.001
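For reference, the empirical displacement histograms from step 3 could look roughly like the sketch below; the function and argument names are our own illustration, not the exact code of pairwise_distr.py:

```python
import numpy as np

def empirical_displacement_histogram(joints_a, joints_b, height=60, width=90):
    """Hypothetical sketch of the smart initialization: a 2D histogram
    of displacements of joint a relative to joint b, accumulated over
    the training set and normalized to sum to one."""
    hist = np.zeros((2 * height - 1, 2 * width - 1), dtype=np.float32)
    for (xa, ya), (xb, yb) in zip(joints_a, joints_b):
        dx, dy = xa - xb, ya - yb
        hist[dy + height - 1, dx + width - 1] += 1.0
    return hist / hist.sum()
```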

Supported options (or simply run python main.py --help):

Note that the script main.py saves TensorBoard summaries (folder tb, which you can inspect with tensorboard --logdir=tb) and model parameters (folder models_ex).

Contact

For any questions regarding the code, please contact Maksym Andriushchenko (m.my surname@gmail.com). Any suggestions are always welcome.

Citation

You can cite the original paper as:

```
@inproceedings{tompson2014joint,
  title={Joint training of a convolutional network and a graphical model for human pose estimation},
  author={Tompson, Jonathan J and Jain, Arjun and LeCun, Yann and Bregler, Christoph},
  booktitle={Advances in Neural Information Processing Systems},
  pages={1799--1807},
  year={2014}
}
```