# EPro-PnP

📢 NEWS: We have released EPro-PnP-v2. An updated preprint can be found on arXiv.

EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation <br> In CVPR 2022 (Oral, Best Student Paper). [paper][video] <br> Hansheng Chen*<sup>1,2</sup>, Pichao Wang†<sup>2</sup>, Fan Wang<sup>2</sup>, Wei Tian†<sup>1</sup>, Lu Xiong<sup>1</sup>, Hao Li<sup>2</sup>

<sup>1</sup>Tongji University, <sup>2</sup>Alibaba Group <br> *Part of work done during an internship at Alibaba Group. <br> †Corresponding Authors: Pichao Wang, Wei Tian.

## Introduction

EPro-PnP is a probabilistic Perspective-n-Points (PnP) layer for end-to-end 6DoF pose estimation networks. Broadly speaking, it is essentially a continuous counterpart of the widely used categorical Softmax layer, and is theoretically generalizable to other learning models with nested <!-- $\mathrm{arg\,min}$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?%5Cmathrm%7Barg%5C%2Cmin%7D"> optimization.

<img src="intro.png" width="500" alt=""/>

Given the layer input: an <!-- $N$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?N">-point correspondence set <!-- $X = \left\{x^\text{3D}_i,x^\text{2D}_i,w^\text{2D}_i\,\middle|\,i=1\cdots N\right\}$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?X%20%3D%20%5Cleft%5C%7Bx%5E%5Ctext%7B3D%7D_i%2Cx%5E%5Ctext%7B2D%7D_i%2Cw%5E%5Ctext%7B2D%7D_i%5C%2C%5Cmiddle%7C%5C%2Ci%3D1%5Ccdots%20N%5Cright%5C%7D"> consisting of 3D object coordinates <!-- $x^\text{3D}_i \in \mathbb{R}^3$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?x%5E%5Ctext%7B3D%7D_i%20%5Cin%20%5Cmathbb%7BR%7D%5E3">, 2D image coordinates <!-- $x^\text{2D}_i \in \mathbb{R}^2$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?x%5E%5Ctext%7B2D%7D_i%20%5Cin%20%5Cmathbb%7BR%7D%5E2">, and 2D weights <!-- $w^\text{2D}_i \in \mathbb{R}^2_+ $ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?w%5E%5Ctext%7B2D%7D_i%20%5Cin%20%5Cmathbb%7BR%7D%5E2_%2B">, a conventional PnP solver searches for an optimal pose <!-- $y^\ast$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?y%5E%5Cast"> (rigid transformation in SE(3)) that minimizes the weighted reprojection error. Previous work tries to backpropagate through the PnP operation, yet <!-- $y^\ast$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?y%5E%5Cast"> is inherently non-differentiable due to the inner <!-- $\mathrm{arg\,min}$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?%5Cmathrm%7Barg%5C%2Cmin%7D"> operation. This leads to convergence issues if all the components in <!-- $X$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?X"> must be learned by the network.
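For reference, below is a minimal sketch of the weighted reprojection cost that a conventional PnP solver minimizes over the pose. The pinhole projection model and all function names are illustrative assumptions, not code from this repository.

```python
# Minimal sketch (illustrative assumptions, not the repository implementation)
# of the weighted reprojection cost minimized by a conventional PnP solver.
import numpy as np

def project(x3d, R, t, K):
    """Project object-frame 3D points into the image under pose (R, t) and intrinsics K."""
    x_cam = x3d @ R.T + t                  # (N, 3) points in the camera frame
    x_img = x_cam @ K.T                    # apply the intrinsic matrix
    return x_img[:, :2] / x_img[:, 2:3]    # perspective division -> (N, 2) pixel coordinates

def weighted_reprojection_cost(x3d, x2d, w2d, R, t, K):
    """Half the sum of squared, element-wise weighted reprojection residuals."""
    residuals = w2d * (project(x3d, R, t, K) - x2d)   # (N, 2)
    return 0.5 * np.sum(residuals ** 2)
```

A conventional solver returns only the pose that minimizes this cost, and it is this inner minimization that blocks gradient flow as described above.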

In contrast, our probabilistic PnP layer outputs a posterior distribution of pose, whose probability density <!-- $p(y|X)$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?p(y%7CX)"> can be derived for proper backpropagation. The distribution is approximated via Monte Carlo sampling. With EPro-PnP, the correspondences <!-- $X$ --> <img style="transform: translateY(0.1em); background: white;" src="https://latex.codecogs.com/svg.latex?X"> can be learned from scratch altogether by minimizing the KL divergence between the predicted and target pose distribution.
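As a conceptual illustration (with generic names that are not taken from this repository), the training loss can be sketched as the negative log posterior density of the ground-truth pose, where the pose likelihood is proportional to exp(-cost) and the normalizing integral is estimated by importance sampling:

```python
# Simplified sketch (illustrative assumptions only) of the Monte Carlo pose loss:
# minimize -log p(y_gt | X), with p(y | X) proportional to exp(-cost(y)) and the
# normalizing integral estimated by importance sampling from a proposal q.
import math
import torch

def monte_carlo_pose_nll(cost_fn, y_gt, y_samples, log_q):
    """
    cost_fn:   maps a batch of pose samples (M, D) to reprojection costs (M,)
    y_gt:      ground-truth pose, shape (D,), in whatever parameterization cost_fn expects
    y_samples: (M, D) pose samples drawn from the proposal distribution q
    log_q:     (M,) log proposal densities of those samples
    """
    # log of the normalizing integral  ∫ exp(-cost(y)) dy,
    # estimated as the mean of the importance weights exp(-cost(y_j)) / q(y_j)
    log_weights = -cost_fn(y_samples) - log_q
    log_norm = torch.logsumexp(log_weights, dim=0) - math.log(y_samples.shape[0])
    # negative log posterior density at the ground-truth pose
    return cost_fn(y_gt.unsqueeze(0)).squeeze(0) + log_norm
```

Because the cost enters both terms, gradients reach the 3D coordinates, 2D coordinates, and weights in X, which is what enables learning the correspondences end to end; in the actual method the proposal distribution is adapted iteratively rather than fixed as in this sketch.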

## Models

### V1 models in this repository

EPro-PnP-6DoF for 6DoF pose estimation<br>

<img src="EPro-PnP-6DoF/viz.gif" width="500" alt=""/>

EPro-PnP-Det for 3D object detection

<img src="EPro-PnP-Det/resources/viz.gif" width="500" alt=""/>

### New V2 models

EPro-PnP-Det v2: state-of-the-art monocular 3D object detector

Main differences to v1b:

At the time of submission (Aug 30, 2022), EPro-PnP-Det v2 ranks 1st among all camera-based single-frame object detection models on the official nuScenes benchmark (test split, without extra data).

| Method | TTA | Backbone | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | Schedule |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| EPro-PnP-Det v2 (ours) | Y | R101 | 0.490 | 0.423 | 0.547 | 0.236 | 0.302 | 1.071 | 0.123 | 12 ep |
| PETR | N | Swin-B | 0.483 | 0.445 | 0.627 | 0.249 | 0.449 | 0.927 | 0.141 | 24 ep |
| BEVDet-Base | Y | Swin-B | 0.482 | 0.422 | 0.529 | 0.236 | 0.395 | 0.979 | 0.152 | 20 ep |
| EPro-PnP-Det v2 (ours) | N | R101 | 0.481 | 0.409 | 0.559 | 0.239 | 0.325 | 1.090 | 0.115 | 12 ep |
| PolarFormer | N | R101 | 0.470 | 0.415 | 0.657 | 0.263 | 0.405 | 0.911 | 0.139 | 24 ep |
| BEVFormer-S | N | R101 | 0.462 | 0.409 | 0.650 | 0.261 | 0.439 | 0.925 | 0.147 | 24 ep |
| PETR | N | R101 | 0.455 | 0.391 | 0.647 | 0.251 | 0.433 | 0.933 | 0.143 | 24 ep |
| EPro-PnP-Det v1 | Y | R101 | 0.453 | 0.373 | 0.605 | 0.243 | 0.359 | 1.067 | 0.124 | 12 ep |
| PGD | Y | R101 | 0.448 | 0.386 | 0.626 | 0.245 | 0.451 | 1.509 | 0.127 | 24+24 ep |
| FCOS3D | Y | R101 | 0.428 | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124 | - |

EPro-PnP-6DoF v2 for 6DoF pose estimation<br>

Main differences to v1b:

With these updates, the v2 model can be trained without 3D models to achieve better performance (ADD 0.1d = 93.83) than GDRNet (ADD 0.1d = 93.6), unleashing the full potential of simple end-to-end training.

## Use EPro-PnP in Your Own Model

We provide a demo demonstrating the usage of the EPro-PnP layer.
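For a rough picture of where the layer sits in a pose estimation network, here is a hypothetical sketch of a head that predicts the correspondence set X; all class and variable names below are placeholders and do not reflect the demo's actual API.

```python
# Hypothetical sketch of a network head predicting the correspondence set
# X = {x3d_i, x2d_i, w2d_i}. Names are placeholders, not the demo API.
import torch.nn as nn

class CorrespondenceHead(nn.Module):
    """Predicts N correspondences (x3d, x2d, w2d) from a flattened image feature."""

    def __init__(self, in_channels, num_points):
        super().__init__()
        # 3 channels for x3d, 2 for x2d, 2 for the positive 2D weights w2d
        self.fc = nn.Linear(in_channels, num_points * 7)
        self.num_points = num_points

    def forward(self, feat):
        out = self.fc(feat).view(-1, self.num_points, 7)
        x3d = out[..., 0:3]          # 3D object coordinates
        x2d = out[..., 3:5]          # 2D image coordinates
        w2d = out[..., 5:7].exp()    # keep the 2D weights positive
        return x3d, x2d, w2d
```

During training, the predicted (x3d, x2d, w2d) would be passed, together with the camera intrinsics and the ground-truth pose, to the EPro-PnP layer to compute the Monte Carlo pose loss; at inference, a standard PnP solve over the same correspondences gives the pose estimate.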

## Citation

If you find this project useful in your research, please consider citing:

@inproceedings{epropnp, 
  author = {Hansheng Chen and Pichao Wang and Fan Wang and Wei Tian and Lu Xiong and Hao Li}, 
  title = {EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation}, 
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, 
  year = {2022}
}