Awesome Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey
Authors: Yang Liu, Changzhen Qiu, Zhiyong Zhang*
School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, Guangdong, China
Overview
This is the regularly updated project page for Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey, a review that focuses on deep learning approaches to 3D human pose estimation and human mesh recovery. The survey covers recent state-of-the-art publications (2019-present) from mainstream computer vision conferences and journals.
Please open an issue if you have any suggestions!
Citation
Please cite the paper if our work is useful for your research.
@article{liu2024deep,
  title={Deep learning for 3D human pose estimation and mesh recovery: A survey},
  author={Liu, Yang and Qiu, Changzhen and Zhang, Zhiyong},
  journal={Neurocomputing},
  pages={128049},
  year={2024},
  issn={0925-2312},
  doi={10.1016/j.neucom.2024.128049},
  publisher={Elsevier}
}
3D Human Pose Estimation
- Single Person
  - In Images
    - Solving Depth Ambiguity
    - Solving Body Structure Understanding
    - Solving Occlusion Problems
    - Solving Data Scarcity
  - In Videos
    - Solving Single-frame Limitations
    - Solving Real-time Problems
    - Solving Body Structure Understanding
    - Solving Occlusion Problems
    - Solving Data Scarcity
- Multi-person
  - Top-down
  - Bottom-up
    - Solving Real-time Problems: Fabbri et al. [paper]
  - Others
Human Mesh Recovery
- Template-based (regressing the parameters of a parametric body model such as SMPL; see the sketch after this list)
  - Naked
    - Multimodal Methods
    - Utilizing Attention Mechanisms
    - Exploiting Temporal Information
    - Multi-view Methods
    - Boosting Efficiency
    - Developing Various Representations
    - Utilizing Structural Information
    - Choosing Appropriate Learning Strategies
      - Self-improving: SPIN [paper], ReFit [paper], You et al. [paper]
      - Novel losses: Zanfir et al. [paper], Jiang et al. [paper]
      - Unsupervised learning: Madadi et al. [paper], Yu et al. [paper]
      - Bilevel online adaptation: Guan et al. [paper]
      - Single-shot: Pose2UV [paper]
      - Contrastive learning: JOTR [paper]
      - Domain adaptation: Nam et al. [paper]
  - Detailed
    - With Clothes
    - With Hands
  - Whole Body
- Template-free
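Template-based methods regress the pose and shape parameters of a parametric body model, most commonly SMPL, and read the mesh off the model's forward pass. Below is a minimal, illustrative sketch using the smplx package; the model directory is a placeholder and assumes the SMPL model files have been downloaded separately.

```python
import torch
import smplx

# Placeholder path: a directory containing the downloaded SMPL model files.
model = smplx.create("models/", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # shape coefficients
global_orient = torch.zeros(1, 3)  # root rotation, axis-angle
body_pose = torch.zeros(1, 69)     # 23 body joints x 3, axis-angle

# The forward pass maps (pose, shape) to a full mesh; all zeros give the T-pose.
output = model(betas=betas, global_orient=global_orient, body_pose=body_pose)
vertices = output.vertices  # (1, 6890, 3) mesh vertices
joints = output.joints      # regressed 3D joints
```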
An overview of the mainstream datasets.
Dataset | Type | Data | Size | Feature | Download link |
---|---|---|---|---|---|
Human3.6M | 3D/Mesh | Video | 3.6M | multi-view | Website |
3DPW | 3D/Mesh | Video | 51K | multi-person | Website |
MPI-INF-3DHP | 2D/3D | Video | 2K | in-the-wild | Website |
HumanEva | 3D | Video | 40K | multi-view | Website |
CMU-Panoptic | 3D | Video | 1.5M | multi-view/multi-person | Website |
MuCo-3DHP | 3D | Image | 8K | multi-person/occluded scene | Website |
SURREAL | 2D/3D/Mesh | Video | 6.0M | synthetic model | Website |
3DOH50K | 2D/3D/Mesh | Image | 51K | object-occluded | Website |
3DCP | Mesh | Mesh | 190 | contact | Website |
AMASS | Mesh | Motion | 11K | soft-tissue dynamics | Website |
DensePose | Mesh | Image | 50K | multi-person | Website |
UP-3D | 3D/Mesh | Image | 8K | sport scene | Website |
THuman2.0 | Mesh | Image | 7K | textured surface | Website |
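The comparison tables below report MPJPE (mean per-joint position error) and PA-MPJPE (MPJPE after Procrustes alignment), both in mm. A minimal sketch of how these metrics are typically computed, assuming `pred` and `gt` are (J, 3) arrays of joint positions in mm, root-aligned as the protocol requires:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # orthogonal Procrustes via SVD
    if np.linalg.det(Vt.T @ U.T) < 0:   # avoid an improper rotation
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```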
Comparisons of 3D pose estimation methods on Human3.6M (MPJPE and PA-MPJPE in mm).
Method | Year | Publication | Highlight | MPJPE↓ | PA-MPJPE↓ | Code |
---|---|---|---|---|---|---|
Graformer | 2022 | CVPR'22 | graph-based transformer | 35.2 | - | Code |
GLA-GCN | 2023 | ICCV'23 | adaptive GCN | 34.4 | 37.8 | Code |
PoseDA | 2023 | arXiv'23 | domain adaptation | 49.4 | 34.2 | Code |
GFPose | 2023 | CVPR'23 | gradient fields | 35.6 | 30.5 | Code |
TP-LSTMs | 2022 | TPAMI'22 | pose similarity metric | 40.5 | 31.8 | - |
FTCM | 2023 | TCSVT'23 | frequency-temporal collaborative | 28.1 | - | Code |
VideoPose3D | 2019 | CVPR'19 | semi-supervised | 46.8 | 36.5 | Code |
PoseFormer | 2021 | ICCV'21 | spatio-temporal transformer | 44.3 | 34.6 | Code |
STCFormer | 2023 | CVPR'23 | spatio-temporal transformer | 40.5 | 31.8 | Code |
3Dpose_ssl | 2020 | TPAMI'20 | self-supervised | 63.6 | 63.7 | Code |
MTF-Transformer | 2022 | TPAMI'22 | multi-view temporal fusion | 26.2 | - | Code |
AdaptPose | 2022 | CVPR'22 | cross-dataset adaptation | 42.5 | 34.0 | Code |
3D-HPE-PAA | 2022 | TIP'22 | part aware attention | 43.1 | 33.7 | Code |
DeciWatch | 2022 | ECCV'22 | efficient framework | 52.8 | - | Code |
Diffpose | 2023 | CVPR'23 | pose refinement | 36.9 | 28.7 | Code |
Elepose | 2022 | CVPR'22 | unsupervised | - | 36.7 | Code |
Uplift and Upsample | 2023 | WACV'23 | efficient transformers | 48.1 | 37.6 | Code |
RS-Net | 2023 | TIP'23 | regular splitting graph network | 48.6 | 38.9 | Code |
HSTFormer | 2023 | arXiv'23 | spatial-temporal transformers | 42.7 | 33.7 | Code |
PoseFormerV2 | 2023 | CVPR'23 | frequency domain | 45.2 | 35.6 | Code |
DiffPose | 2023 | ICCV'23 | diffusion models | 42.9 | 30.8 | Code |
Comparisons of 3D pose estimation methods on MPI-INF-3DHP (MPJPE in mm; PCK and AUC in %).
Method | Year | Publication | Highlight | MPJPE↓ | PCK↑ | AUC↑ | Code |
---|---|---|---|---|---|---|---|
HSTFormer | 2023 | arXiv'23 | spatial-temporal transformers | 28.3 | 98.0 | 78.6 | Code |
PoseFormerV2 | 2023 | CVPR'23 | frequency domain | 27.8 | 97.9 | 78.8 | Code |
Uplift and Upsample | 2023 | WACV'23 | efficient transformers | 46.9 | 95.4 | 67.6 | Code |
RS-Net | 2023 | TIP'23 | regular splitting graph network | - | 85.6 | 53.2 | Code |
Diffpose | 2023 | CVPR'23 | pose refinement | 29.1 | 98.0 | 75.9 | Code |
FTCM | 2023 | TCSVT'23 | frequency-temporal collaborative | 31.2 | 97.9 | 79.8 | Code |
STCFormer | 2023 | CVPR'23 | spatio-temporal transformer | 23.1 | 98.7 | 83.9 | Code |
PoseDA | 2023 | arXiv'23 | domain adaptation | 61.3 | 92.0 | 62.5 | Code |
TP-LSTMs | 2022 | TPAMI'22 | pose similarity metric | 48.8 | 82.6 | 81.3 | - |
AdaptPose | 2022 | CVPR'22 | cross-dataset adaptation | 77.2 | 88.4 | 54.2 | Code |
3D-HPE-PAA | 2022 | TIP'22 | part aware attention | 69.4 | 90.3 | 57.8 | Code |
Elepose | 2022 | CVPR'22 | unsupervised | 54.0 | 86.0 | 50.1 | Code |
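On MPI-INF-3DHP, results are additionally reported as PCK (the percentage of joints whose error falls below a threshold, 150 mm by convention) and AUC (the area under the PCK curve over a range of thresholds, commonly 0-150 mm). A minimal sketch under those assumptions, with per-joint errors in mm:

```python
import numpy as np

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error is below the threshold (mm)."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * (errors < threshold).mean()

def auc(pred, gt, thresholds=np.linspace(0.0, 150.0, 31)):
    """Area under the PCK curve, averaged over evenly spaced thresholds."""
    return float(np.mean([pck(pred, gt, t) for t in thresholds]))
```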
Comparisons of human mesh recovery methods on Human3.6M and 3DPW (all errors in mm).
Method | Publication | Highlight | Human3.6M MPJPE↓ | Human3.6M PA-MPJPE↓ | 3DPW MPJPE↓ | 3DPW PA-MPJPE↓ | 3DPW PVE↓ | Code |
---|---|---|---|---|---|---|---|---|
VirtualMarker | CVPR'23 | novel intermediate representation | 47.3 | 32.0 | 67.5 | 41.3 | 77.9 | Code |
NIKI | CVPR'23 | inverse kinematics | - | - | 71.3 | 40.6 | 86.6 | Code |
TORE | ICCV'23 | efficient transformer | 59.6 | 36.4 | 72.3 | 44.4 | 88.2 | Code |
JOTR | ICCV'23 | contrastive learning | - | - | 76.4 | 48.7 | 92.6 | Code |
HMDiff | ICCV'23 | reverse diffusion process | 49.3 | 32.4 | 72.7 | 44.5 | 82.4 | Code |
ReFit | ICCV'23 | recurrent fitting network | 48.4 | 32.2 | 65.8 | 41.0 | - | Code |
PyMAF-X | TPAMI'23 | regression-based one-stage whole-body recovery | - | - | 74.2 | 45.3 | 87.0 | Code |
PointHMR | CVPR'23 | vertex-relevant feature extraction | 48.3 | 32.9 | 73.9 | 44.9 | 85.5 | - |
PLIKS | CVPR'23 | inverse kinematics | 47.0 | 34.5 | 60.5 | 38.5 | 73.3 | Code |
ProPose | CVPR'23 | learning analytical posterior probability | 45.7 | 29.1 | 68.3 | 40.6 | 79.4 | Code |
POTTER | CVPR'23 | pooling attention transformer | 56.5 | 35.1 | 75.0 | 44.8 | 87.4 | Code |
PoseExaminer | ICCV'23 | automated testing of out-of-distribution robustness | - | - | 74.5 | 46.5 | 88.6 | Code |
MotionBERT | ICCV'23 | pretrained human representations | 43.1 | 27.8 | 68.8 | 40.6 | 79.4 | Code |
3DNBF | ICCV'23 | analysis-by-synthesis approach | - | - | 88.8 | 53.3 | - | Code |
FastMETRO | ECCV'22 | efficient architecture | 52.2 | 33.7 | 73.5 | 44.6 | 84.1 | Code |
CLIFF | ECCV'22 | multi-modality inputs | 47.1 | 32.7 | 69.0 | 43.0 | 81.2 | Code |
PARE | ICCV'21 | part-driven attention | - | - | 74.5 | 46.5 | 88.6 | Code |
Graphormer | ICCV'21 | graph-convolution-reinforced transformer | 51.2 | 34.5 | 74.7 | 45.6 | 87.7 | Code |
PSVT | CVPR'23 | spatio-temporal encoder | - | - | 73.1 | 43.5 | 84.0 | - |
GLoT | CVPR'23 | short-term and long-term temporal correlations | 67.0 | 46.3 | 80.7 | 50.6 | 96.3 | Code |
MPS-Net | CVPR'22 | temporally adjacent representations | 69.4 | 47.4 | 91.6 | 54.0 | 109.6 | Code |
MAED | ICCV'21 | multi-level attention | 56.4 | 38.7 | 79.1 | 45.7 | 92.6 | Code |
Lee et al. | ICCV'21 | uncertainty-aware | 58.4 | 38.4 | 92.8 | 52.2 | 106.1 | - |
TCMR | CVPR'21 | temporal consistency | 62.3 | 41.1 | 95.0 | 55.8 | 111.3 | - |
VIBE | CVPR'20 | self-attention temporal network | 65.6 | 41.4 | 82.9 | 51.9 | 99.1 | Code |
ImpHMR | CVPR'23 | implicitly imagines the person in 3D space | - | - | 74.3 | 45.4 | 87.1 | - |
SGRE | ICCV'23 | sequential global rotation estimation | - | - | 78.4 | 49.6 | 93.3 | Code |
PMCE | ICCV'23 | pose and mesh co-evolution network | 53.5 | 37.7 | 69.5 | 46.7 | 84.8 | Code |
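PVE (per-vertex error) in the table above is the mesh-surface analogue of MPJPE: the mean Euclidean distance between predicted and ground-truth mesh vertices (6,890 vertices for SMPL). A minimal sketch, assuming both meshes share the SMPL topology and are expressed in mm:

```python
import numpy as np

def pve(pred_vertices, gt_vertices):
    """Per-vertex error: mean Euclidean distance over all mesh vertices."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=-1).mean()
```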