Awesome

Multi-Person Pose Estimation Based on Gaussian Response Heatmaps

Code and pre-trained models for our paper.

This repo is the Part A of our work. We associate the keypoints of individual poses using body parts.

Pat B is in the repo on GitHub, Improved-Body-Parts.

Introduction

A bottom-up approach for the problem of multi-person pose estimation. This Part is based on the network backbones in CMU-Pose (namely OpenPose). A modified network is also trained and evalutated.

method FL2

Training
Evaluation
Demo

Task Lists

Add more descriptions and complete the project

Project Features

Implement the models using Keras with TensorFlow backend
VGG as the backbone
No batch normalization layer
Supprot training on multiple GPUs and slice training samples among GPUs
Fast data preparing and augmentation during training
Different learning rate at different layers

Prepare

Install packages:

Python=3.6, Keras=2.1.2, TensorFlow-gpu=1.3.0rc2 and other packages needed (please refer to requirements.txt). We haven't tested other platforms and packages of different versions.
Download the MSCOCO dataset.
Download the pre-trained models from Dropbox.
Change the paths in the code according to your environment.

Run a Demo

python demo_image.py

Evaluation Steps

The corresponding code is in pure python without multiprocess for now.

python testing/evaluation.py

Results on MSCOCO2017 Dataset

Results on MSCOCO 2017 validation subset (model trained without val data, + focal L2 loss, default size 368, 4 scales + flip)

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.607
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.817
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.661
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.581
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.652
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.647
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.837
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.692
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.600
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.717

Update

We have tried new network structures.

Results of posenet/model3 on MSCOCO 2017 validation subset (model trained with val data, + focal L2 loss, default size 368, 4 scales + flip).

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.622
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.828
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.674
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.594
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.669
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.659
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.844
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.706
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.613
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.730

Results on MSCOCO 2017 test-dev subset (model trained with val data, + focal L2 loss, default size 368, 8 scales + flip)

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.599
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.825
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.647
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.580
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.634
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.848
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.686
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.598
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.705

According to our results, the performance of posenet/model3 in this repo is similar to CMU-Net (the cascaed CNN used in CMU-Pose ), which means only merging different feature maps with diferent receptive fields at low resolution (stride=8) could not help much (without offset regression). And the limitted capacity of the network is also a bottleneck of the estimation accuracy.

FYI

New pre-trained IMHN models achieving over 0.69 AP on the MSCOCO test-dev dataset is to be shared soon as long as the review is done.

News!

Recently, we are lucky to have time and machine to utilize. Thus, we revisit our previous work. More accurate results had been achieved after we adopted more powerful Network and use higher resolution of heatmaps (stride=4). Enhanced models with body part representation, variant loss functions and training parameters have been tried.

<font color="#0000dd">Please also refer to our new repo</font>: Improved-Body-Parts (highly recommended)

improved

Results on MSCOCO 2017 test-dev subset (focal L2 loss, default size 512, 5 scales + flip）

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.685
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.867
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.749
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.664
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.719
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 20 ] = 0.728
 Average Recall     (AR) @[ IoU=0.50      | area=   all | maxDets= 20 ] = 0.892
 Average Recall     (AR) @[ IoU=0.75      | area=   all | maxDets= 20 ] = 0.782
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.688
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.784

Training Steps

Before training, prepare the training data using ''training/coco_masks_hdf5.py''.

The training code is available

python training/train_pose.py

Notice: change the sample slicing ratios between different GPUs at this line in the code as you want.

Referred Repositories (mainly)

Citation

If this work help your research, please cite the corresponding paper:

@inproceedings{li2020simple,
  title={Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation},
  author={Li, Jia and Su, Wen and Wang, Zengfu},
  booktitle={Proceedings of the AAAI conference on artificial intelligence},
  volume={34},
  number={07},
  pages={11354--11361},
  year={2020}
}