Awesome
Reimplementation of human keypoint detection in mxnet
-
You can download mxnet model and parameters(coco and MPII) from google drive:
https://drive.google.com/drive/folders/0BzffphMuhDDMV0RZVGhtQWlmS1U
or check caffe_to_mxnet folder to download original caffe model and transfer it to mxnet model.
install heatmap and pafmap cython: cython/rebuild.sh
-
Test demo based on model of coco dataset: testModel.ipynb
-
Test demo based on model of MPII dataset: testModel_mpi.ipynb
-
Train with vgg model warm up. You can download mxnet model and parameters for vgg19 from here
python TrainWeightOnVgg.py
Train from CMU's converted model
python TrainWeight.py
-
Check if heat map, part affinity graph map, mask are generated correctly in training: test_generateLabel.ipynb
-
Evaluation on coco validation dataset with transfered mxnet model: evaluation_coco.py
The result is as following, the mean average precision (AP) over 10 OKS threshold on the first 2644 images in the val set is 0.550, which is 0.577 in original implementation.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.550
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.800
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.610
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.541
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.576
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.591
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.812
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.644
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.549
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.651
Cite paper Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
@article{cao2016realtime,
title={Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields},
author={Zhe Cao and Tomas Simon and Shih-En Wei and Yaser Sheikh},
journal={arXiv preprint arXiv:1611.08050},
year={2016}
}
original caffe training https://github.com/CMU-Perceptual-Computing-Lab/caffe_rtpose
TODO:
- Test demo
- Train demo
- Add image augmentation: rotation, flip
- Add weight vector
- Train all images
- Train from vgg model
- Evaluation code
- Generate heat map and part affinity graph map in C++
- Enhancement: feature pyramid backend in training, symbol and iterator in featurePyramidCPM.py
Training with vgg warm up
python TrainWeightOnVgg.py
(1) Before We tested the code using two K80 GPUS on COCO dataset, with batch size set to 10 and learning rate set to 0.00004. and using vgg pretrained vgg model to initialize our parameters. After 20 epochs, we tested our model on COCO validation dataset(only 50 images) and we got only 0.048 as mAP, very low compared to original implementation. Please reach us if you have some ideas about this issue.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.048
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.183
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.019
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.078
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.035
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.066
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.224
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.022
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.075
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.054
(2) Fix the iterator bug, no data augmentation
We tested the code using one TITAN X (Pascal) on COCO dataset, with batch size set to 10 and learning rate set to 0.00004. and using pretrained vgg model to initialize our parameters. After 4 epochs, we tested our model on COCO validation dataset(only first 50 images) and we got only 0.115 as mAP, the original transfered model gots 0.530 .
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.115
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.350
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.030
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.168
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.091
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.141
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.373
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.067
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.164
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.117
After 18 epochs, we tested our model on COCO validation dataset(only first 50 images) and we got only 0.226 as mAP.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.226
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.434
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.201
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.254
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.226
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.250
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.440
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.239
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.252
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.261
After 23 epochs, we tested our model on COCO validation dataset(only first 50 images) and we got only 0.231 as mAP.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.231
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.466
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.230
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.245
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.249
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.251
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.470
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.261
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.243
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.278
After 36 epochs, we tested our model on COCO validation dataset(only first 50 images) and we got only 0.229 as mAP.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.229
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.442
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.218
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.233
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.260
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.257
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.455
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.269
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.232
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.302
(3) batch size set to 10 and learning rate set to 0.00004, GTX 1080
First level
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.190
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.403
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.146
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.218
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.185
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.216
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.418
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.187
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.216
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.224
Six level
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.258
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.478
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.251
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.280
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.268
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.284
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.493
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.291
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.280
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.307
The traning process is not so easy, I found this model even can't converge if all layers are initialized randomly, I guess one reason is that this model uses many convolution layers with a large kernel, whose big pad may introduce much noise, and another reason may be the fact that this model uses MSE as loss function, and maybe it's better to use sigmoid as the activation function of the last layer and use entropy loss function instead.