Introduction

TransPose is a human pose estimation model built from a CNN feature extractor, a Transformer encoder, and a prediction head. Given an image, the attention layers built into the Transformer can efficiently capture long-range spatial relationships between keypoints and reveal which dependencies the predicted keypoint locations rely on.
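
To make the three stages concrete, here is a minimal PyTorch sketch of the pipeline. It is not the repository's actual implementation: the tiny stand-in backbone, the layer sizes, and the omission of positional embeddings are simplifications chosen only to show how feature-map locations become tokens for self-attention before the heatmap head.

import torch
import torch.nn as nn

class TransPoseSketch(nn.Module):
    """Toy pipeline: CNN feature extractor -> Transformer encoder -> heatmap head."""
    def __init__(self, feat_dim=256, num_layers=4, num_heads=8, num_keypoints=17):
        super().__init__()
        # Stand-in CNN backbone: downsamples the image 4x and outputs feat_dim channels.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        # Positional embeddings are omitted here for brevity.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           dim_feedforward=1024)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # 1x1 conv prediction head mapping encoded features to per-keypoint heatmaps.
        self.head = nn.Conv2d(feat_dim, num_keypoints, kernel_size=1)

    def forward(self, x):
        feat = self.backbone(x)                    # (B, C, H/4, W/4)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).permute(2, 0, 1)  # (H*W, B, C): one token per location
        tokens = self.encoder(tokens)              # self-attention over all spatial positions
        feat = tokens.permute(1, 2, 0).reshape(b, c, h, w)
        return self.head(feat)                     # (B, num_keypoints, H/4, W/4) heatmaps

heatmaps = TransPoseSketch()(torch.randn(1, 3, 256, 192))  # -> torch.Size([1, 17, 64, 48])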

Architecture

[arxiv 2012.14214] [paper] [demo-notebook]

TransPose: Keypoint Localization via Transformer, Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang, ICCV 2021

Model Zoo

We choose two types of CNNs as the backbone candidates: ResNet and HRNet. The derived convolutional blocks are ResNet-Small, HRNet-Small-W32, and HRNet-Small-W48.

| Model | Backbone | #Attention layers | d | h | #Heads | #Params | AP (COCO val, gt bbox) | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TransPose-R-A3 | ResNet-S | 3 | 256 | 1024 | 8 | 5.2M | 73.8 | model |
| TransPose-R-A4 | ResNet-S | 4 | 256 | 1024 | 8 | 6.0M | 75.1 | model |
| TransPose-H-S | HRNet-S-W32 | 4 | 64 | 128 | 1 | 8.0M | 76.1 | model |
| TransPose-H-A4 | HRNet-S-W48 | 4 | 96 | 192 | 1 | 17.3M | 77.5 | model |
| TransPose-H-A6 | HRNet-S-W48 | 6 | 96 | 192 | 1 | 17.5M | 78.1 | model |

Quick use

Try out the Web Demo: Hugging Face Spaces

You can directly load the TransPose-R-A4 or TransPose-H-A4 model, pretrained on the COCO train2017 dataset, from Torch Hub, simply by:

import torch

tpr = torch.hub.load('yangsenius/TransPose:main', 'tpr_a4_256x192', pretrained=True)
tph = torch.hub.load('yangsenius/TransPose:main', 'tph_a4_256x192', pretrained=True)
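
A quick sanity check after loading might look like the following sketch. The 256x192 input size matches the model names; the output is assumed to be COCO keypoint heatmaps at 1/4 of the input resolution, so verify the exact shape against the repository.

import torch

tpr.eval()
with torch.no_grad():
    img = torch.randn(1, 3, 256, 192)  # dummy image batch; use a normalized person crop in practice
    heatmaps = tpr(img)
print(heatmaps.shape)  # expected to be per-keypoint heatmaps, e.g. (1, 17, 64, 48)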

Results on COCO val2017, using a person detector that has a human AP of 56.4 on COCO val2017

| Model | Input size | FPS* | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TransPose-R-A3 | 256x192 | 141 | 8.0 | 0.717 | 0.889 | 0.788 | 0.680 | 0.786 | 0.771 | 0.930 | 0.836 | 0.727 | 0.835 |
| TransPose-R-A4 | 256x192 | 138 | 8.9 | 0.726 | 0.891 | 0.799 | 0.688 | 0.798 | 0.780 | 0.931 | 0.845 | 0.735 | 0.844 |
| TransPose-H-S | 256x192 | 45 | 10.2 | 0.742 | 0.896 | 0.808 | 0.706 | 0.810 | 0.795 | 0.935 | 0.855 | 0.752 | 0.856 |
| TransPose-H-A4 | 256x192 | 41 | 17.5 | 0.753 | 0.900 | 0.818 | 0.717 | 0.821 | 0.803 | 0.939 | 0.861 | 0.761 | 0.865 |
| TransPose-H-A6 | 256x192 | 38 | 21.8 | 0.758 | 0.901 | 0.821 | 0.719 | 0.828 | 0.808 | 0.939 | 0.864 | 0.764 | 0.872 |

Note:

Results on COCO test-dev2017, using a person detector that has a human AP of 60.9 on COCO test-dev2017

| Model | Input size | #Params | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TransPose-H-S | 256x192 | 8.0M | 10.2 | 0.734 | 0.916 | 0.811 | 0.701 | 0.793 | 0.786 | 0.950 | 0.856 | 0.745 | 0.843 |
| TransPose-H-A4 | 256x192 | 17.3M | 17.5 | 0.747 | 0.919 | 0.822 | 0.714 | 0.807 | 0.799 | 0.953 | 0.866 | 0.758 | 0.854 |
| TransPose-H-A6 | 256x192 | 17.5M | 21.8 | 0.750 | 0.922 | 0.823 | 0.713 | 0.811 | 0.801 | 0.954 | 0.867 | 0.759 | 0.859 |

Visualization

Jupyter Notebook Demo

Given an input image, a pretrained TransPose model, and the predicted locations, we can visualize the spatial dependencies of the predicted locations by thresholding the attention scores (a rough sketch of this follows the examples below).

TransPose-R-A4 with threshold=0.00 example

TransPose-R-A4 with threshold=0.01

TransPose-H-A4 with threshold=0.00 example

TransPose-H-A4 with threshold=0.00075 example
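
As a rough illustration of the thresholding idea (the notebook demo is the authoritative reference), the sketch below assumes you have already extracted one encoder layer's attention matrix attn of shape (H*W, H*W) over a 64x48 feature map for a 256x192 input; the helper name, its arguments, and the way the matrix is obtained are all hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def show_dependency_area(image, attn, keypoint_xy, heatmap_size=(64, 48), threshold=0.01):
    """Overlay the attention-defined dependency area of one predicted keypoint.

    image: HxWx3 array (e.g. 256x192x3); attn: (H*W, H*W) attention matrix over the
    feature-map positions; keypoint_xy: predicted (x, y) location in image coordinates.
    """
    h, w = heatmap_size
    x, y = keypoint_xy
    # Map the image-space keypoint location to its feature-map token index.
    qx = min(int(x * w / image.shape[1]), w - 1)
    qy = min(int(y * h / image.shape[0]), h - 1)
    query_idx = qy * w + qx
    # One row of the attention matrix = how much this location attends to every other one.
    dep = attn[query_idx].reshape(h, w)
    dep = np.where(dep >= threshold, dep, 0.0)  # keep only scores above the threshold
    # Upsample the dependency map to image size and overlay it.
    up = np.kron(dep, np.ones((image.shape[0] // h, image.shape[1] // w)))
    plt.imshow(image)
    plt.imshow(up, cmap='jet', alpha=0.5)
    plt.scatter([x], [y], c='white', s=20)
    plt.axis('off')
    plt.show()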

Getting started

Installation

  1. Clone this repository; we'll refer to the cloned directory as ${POSE_ROOT}

    git clone https://github.com/yangsenius/TransPose.git
    
  2. Install PyTorch>=1.6 and torchvision>=0.7 from the PyTorch official website

  3. Install package dependencies. Make sure the Python version is >=3.7

    pip install -r requirements.txt
    
  4. Make the output (trained models and files) and log (TensorBoard logs) directories under ${POSE_ROOT}, and build the libs

    mkdir output log
    cd ${POSE_ROOT}/lib
    make
    
  5. Download the pretrained models from this repo's releases into the directory layout below

    ${POSE_ROOT}
     `-- models
         `-- pytorch
             |-- imagenet
             |   |-- hrnet_w32-36af842e.pth
             |   |-- hrnet_w48-8ef0771d.pth
             |   |-- resnet50-19c8e357.pth
             |-- transpose_coco
             |   |-- tp_r_256x192_enc3_d256_h1024_mh8.pth
             |   |-- tp_r_256x192_enc4_d256_h1024_mh8.pth
             |   |-- tp_h_32_256x192_enc4_d64_h128_mh1.pth
             |   |-- tp_h_48_256x192_enc4_d96_h192_mh1.pth
             |   |-- tp_h_48_256x192_enc6_d96_h192_mh1.pth    
    

Data Preparation

We follow the steps of HRNet to prepare the COCO train/val/test datasets and the annotations. The detected person results can be downloaded from OneDrive or GoogleDrive. Download or link everything to ${POSE_ROOT}/data/coco/ so that it looks like this:

${POSE_ROOT}/data/coco/
|-- annotations
|   |-- person_keypoints_train2017.json
|   `-- person_keypoints_val2017.json
|-- person_detection_results
|   |-- COCO_val2017_detections_AP_H_56_person.json
|   `-- COCO_test-dev2017_detections_AP_H_609_person.json
`-- images
	|-- train2017
	|   |-- 000000000009.jpg
	|   |-- ... 
	`-- val2017
		|-- 000000000139.jpg
		|-- ... 

Training & Testing

Testing on COCO val2017 dataset

python tools/test.py --cfg experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml TEST.USE_GT_BBOX True

Training on COCO train2017 dataset

python tools/train.py --cfg experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml

Acknowledgements

Great thanks to these papers and their open-source code: HRNet, DETR, DarkPose

License

This repository is released under the MIT LICENSE.

Citation

If you find this repository useful, please give it a star 🌟 or consider citing our work:

@inproceedings{yang2021transpose,
  title={TransPose: Keypoint Localization via Transformer},
  author={Yang, Sen and Quan, Zhibin and Nie, Mu and Yang, Wankou},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}