
<h1 align="left">ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation<a href="https://arxiv.org/abs/2204.12484"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg" ></a> </h1>


<p align="center"> <a href="#Results">Results</a> | <a href="#Updates">Updates</a> | <a href="#Usage">Usage</a> | <a href='#Todo'>Todo</a> | <a href="#Acknowledge">Acknowledge</a> </p> <p align="center"> <a href="https://giphy.com/gifs/UfPQB1qKir7Vqem6sL/fullscreen"><img src="https://media.giphy.com/media/ZewXwZuixYKS2lZmNL/giphy.gif"></a> <a href="https://giphy.com/gifs/DCvf1DrWZgbwPa8bWZ/fullscreen"><img src="https://media.giphy.com/media/2AEeuicbIjwqp2mbug/giphy.gif"></a> </p> <p align="center"> <a href="https://giphy.com/gifs/r3GaZz7H1H6zpuIvPI/fullscreen"><img src="https://media.giphy.com/media/13oe6zo6b2B7CdsOac/giphy.gif"></a> <a href="https://giphy.com/gifs/FjzrGJxsOzZAXaW7Vi/fullscreen"><img src="https://media.giphy.com/media/4JLERHxOEgH0tt5DZO/giphy.gif"></a> </p>

This branch contains the PyTorch implementation of <a href="https://arxiv.org/abs/2204.12484">ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation</a> and <a href="https://arxiv.org/abs/2212.04246">ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation</a>. It obtains 81.1 AP on the MS COCO Keypoint test-dev set.

<img src="figures/Throughput.png" class="left" width='80%'>

Web Demo

MAE Pre-trained model

Results from this repo on MS COCO val set (single-task training)

Using detection results from a detector that obtains 56 mAP on person. The configs here are for both training and test.

With classic decoder

| Model | Pretrain | Resolution | AP | AR | config | log | weight |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-S | MAE | 256x192 | 73.8 | 79.2 | config | log | Onedrive |
| ViTPose-B | MAE | 256x192 | 75.8 | 81.1 | config | log | Onedrive |
| ViTPose-L | MAE | 256x192 | 78.3 | 83.5 | config | log | Onedrive |
| ViTPose-H | MAE | 256x192 | 79.1 | 84.1 | config | log | Onedrive |

With simple decoder

| Model | Pretrain | Resolution | AP | AR | config | log | weight |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-S | MAE | 256x192 | 73.5 | 78.9 | config | log | Onedrive |
| ViTPose-B | MAE | 256x192 | 75.5 | 80.9 | config | log | Onedrive |
| ViTPose-L | MAE | 256x192 | 78.2 | 83.4 | config | log | Onedrive |
| ViTPose-H | MAE | 256x192 | 78.9 | 84.0 | config | log | Onedrive |

Results with multi-task training

Note (*): There may be duplicate images between the CrowdPose training set and the validation images of other datasets, as discussed in issue #24. Please be careful when using these models for evaluation. We provide the results without the CrowdPose dataset for reference.

Human datasets (MS COCO, AIC, MPII, CrowdPose)

Results on MS COCO val set

Using detection results from a detector that obtains 56 mAP on person. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AR | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-B | COCO+AIC+MPII | 256x192 | 77.1 | 82.2 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII | 256x192 | 78.7 | 83.8 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII | 256x192 | 79.5 | 84.5 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII | 576x432 | 81.0 | 85.6 | | |
| ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 77.5 | 82.6 | config | Onedrive |
| ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 79.1 | 84.1 | config | Onedrive |
| ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 79.8 | 84.8 | config | Onedrive |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 75.8 | 82.6 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 77.0 | 82.6 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 78.6 | 84.1 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 79.4 | 84.8 | config | log \| Onedrive |

Results on OCHuman test set

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AR | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-B | COCO+AIC+MPII | 256x192 | 88.0 | 89.6 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII | 256x192 | 90.9 | 92.2 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII | 256x192 | 90.9 | 92.3 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII | 576x432 | 93.3 | 94.3 | | |
| ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 88.2 | 90.0 | config | Onedrive |
| ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 91.5 | 92.8 | config | Onedrive |
| ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 91.6 | 92.8 | config | Onedrive |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 78.4 | 80.6 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.6 | 84.8 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 85.7 | 87.5 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 85.7 | 87.4 | config | log \| Onedrive |

Results on MPII val set

Using groundtruth bounding boxes. Note the configs here are only for evaluation. The metric is PCKh.

| Model | Dataset | Resolution | Mean | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-B | COCO+AIC+MPII | 256x192 | 93.3 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII | 256x192 | 94.0 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII | 256x192 | 94.1 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII | 576x432 | 94.3 | | |
| ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 93.4 | config | Onedrive |
| ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 93.9 | config | Onedrive |
| ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 94.1 | config | Onedrive |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 92.7 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 92.8 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 94.0 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 94.2 | config | log \| Onedrive |

Results on AI Challenger test set

Using groundtruth bounding boxes. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AR | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-B | COCO+AIC+MPII | 256x192 | 32.0 | 36.3 | config | Onedrive |
| ViTPose-L | COCO+AIC+MPII | 256x192 | 34.5 | 39.0 | config | Onedrive |
| ViTPose-H | COCO+AIC+MPII | 256x192 | 35.4 | 39.9 | config | Onedrive |
| ViTPose-G | COCO+AIC+MPII | 576x432 | 43.2 | 47.1 | | |
| ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 31.9 | 36.3 | config | Onedrive |
| ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 34.6 | 39.0 | config | Onedrive |
| ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 35.3 | 39.8 | config | Onedrive |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 29.7 | 34.3 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 31.8 | 36.3 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 34.3 | 38.9 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 34.8 | 39.1 | config | log \| Onedrive |

Results on CrowdPose test set

Using the YOLOv3 human detector. Note the configs here are only for evaluation.

| Model | Dataset | Resolution | AP | AP(H) | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose-B* | COCO+AIC+MPII+CrowdPose | 256x192 | 74.7 | 63.3 | config | Onedrive |
| ViTPose-L* | COCO+AIC+MPII+CrowdPose | 256x192 | 76.6 | 65.9 | config | Onedrive |
| ViTPose-H* | COCO+AIC+MPII+CrowdPose | 256x192 | 76.3 | 65.6 | config | Onedrive |

Animal datasets (AP10K, APT36K)

Results on AP-10K test set

| Model | Dataset | Resolution | AP | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 71.4 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 74.5 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 80.4 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.4 | config | log \| Onedrive |

Results on APT-36K val set

| Model | Dataset | Resolution | AP | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 74.2 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 75.9 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 80.8 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 82.3 | config | log \| Onedrive |

WholeBody dataset

| Model | Dataset | Resolution | AP | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose+-S | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 54.4 | config | log \| Onedrive |
| ViTPose+-B | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 57.4 | config | log \| Onedrive |
| ViTPose+-L | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 60.6 | config | log \| Onedrive |
| ViTPose+-H | COCO+AIC+MPII+AP10K+APT36K+WholeBody | 256x192 | 61.2 | config | log \| Onedrive |

Transfer results on the hand dataset (InterHand2.6M)

| Model | Dataset | Resolution | AUC | config | weight |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ViTPose+-S | COCO+AIC+MPII+WholeBody | 256x192 | 86.5 | config | Coming Soon |
| ViTPose+-B | COCO+AIC+MPII+WholeBody | 256x192 | 87.0 | config | Coming Soon |
| ViTPose+-L | COCO+AIC+MPII+WholeBody | 256x192 | 87.5 | config | Coming Soon |
| ViTPose+-H | COCO+AIC+MPII+WholeBody | 256x192 | 87.6 | config | Coming Soon |

Updates

[2023-01-10] Update ViTPose+! It uses MoE strategies to jointly deal with human, animal, and whole-body pose estimation tasks; a conceptual sketch of the idea is given after this update list.

[2022-05-24] Upload the single-task training code, single-task pre-trained models, and multi-task pretrained models.

[2022-05-06] Upload the logs for the base, large, and huge models!

[2022-04-27] Our ViTPose with ViTAE-G obtains 81.1 AP on COCO test-dev set!
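
The task-MoE idea behind ViTPose+ can be pictured with a minimal PyTorch sketch. This is an illustration of the concept only, not the exact module used in this repo; the layer sizes, number of tasks, and the shared/expert split below are assumptions.

```python
import torch
from torch import nn

class TaskMoEFFN(nn.Module):
    """Conceptual sketch: a transformer FFN whose hidden features combine a
    part shared by all tasks with a small expert selected per task/dataset."""

    def __init__(self, dim=768, shared_hidden=2304, expert_hidden=768, num_tasks=6):
        super().__init__()
        self.fc1_shared = nn.Linear(dim, shared_hidden)
        self.fc1_experts = nn.ModuleList(
            [nn.Linear(dim, expert_hidden) for _ in range(num_tasks)]
        )
        self.act = nn.GELU()
        self.fc2 = nn.Linear(shared_hidden + expert_hidden, dim)

    def forward(self, x, task_id: int):
        # concatenate shared hidden features with the selected task expert's features
        h = torch.cat([self.fc1_shared(x), self.fc1_experts[task_id](x)], dim=-1)
        return self.fc2(self.act(h))

tokens = torch.randn(2, 196, 768)      # dummy patch tokens
out = TaskMoEFFN()(tokens, task_id=0)  # route through the expert for task 0
```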

Applications of the ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose estimation | remote sensing | matting | VSA | ViTDet

Usage

We use PyTorch 1.9.0 (or the NGC PyTorch docker image 21.06) and mmcv 1.3.9 for the experiments.

```bash
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ViTPose
pip install -v -e .
```

After installing the two repos, install timm and einops:

```bash
pip install timm==0.4.9 einops
```
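
If the installation succeeded, a quick import check should run without errors (the printed version numbers will depend on your environment):

```bash
python -c "import torch, mmcv, mmpose, timm, einops; print(torch.__version__, mmcv.__version__, mmpose.__version__)"
```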

After downloading the pretrained models, please conduct the experiments by running

```bash
# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH> --seed 0

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch --seed 0
```
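
For example, a single-machine run on 8 GPUs with the ViTPose-B COCO config could look like the following; the config and MAE checkpoint paths are illustrative, so adjust them to where the files live in your setup:

```bash
bash tools/dist_train.sh \
    configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py 8 \
    --cfg-options model.pretrained=pretrained/mae_pretrain_vit_base.pth --seed 0
```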

To test the pretrained models' performance, please run

```bash
bash tools/dist_test.sh <Config PATH> <Checkpoint PATH> <NUM GPUs>
```
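
For instance, evaluating a downloaded ViTPose-B checkpoint on 8 GPUs might look like this (config and checkpoint paths are illustrative):

```bash
bash tools/dist_test.sh \
    configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/ViTPose_base_coco_256x192.py \
    vitpose-b.pth 8
```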

For ViTPose+ pre-trained models, please first re-organize the pre-trained weights using

```bash
python tools/model_split.py --source <Pretrained PATH>
```
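
For example, assuming the multi-task checkpoint was saved as vitpose+_base.pth (file name illustrative):

```bash
python tools/model_split.py --source vitpose+_base.pth
```

The resulting task-specific weights can then be evaluated with the corresponding configs via tools/dist_test.sh as shown above.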

Todo

This repo currently contains modifications including:

Acknowledge

We acknowledge the excellent implementation from mmpose and MAE.

Citing ViTPose

For ViTPose

```bibtex
@inproceedings{
  xu2022vitpose,
  title={Vi{TP}ose: Simple Vision Transformer Baselines for Human Pose Estimation},
  author={Yufei Xu and Jing Zhang and Qiming Zhang and Dacheng Tao},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
}
```

For ViTPose+

```bibtex
@article{xu2022vitpose+,
  title={ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation},
  author={Xu, Yufei and Zhang, Jing and Zhang, Qiming and Tao, Dacheng},
  journal={arXiv preprint arXiv:2212.04246},
  year={2022}
}
```

For ViTAE and ViTAEv2, please refer to:

```bibtex
@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
```