🚀 Metric3D Project 🚀

Official PyTorch implementation of Metric3Dv1 and Metric3Dv2:

[1] Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

[2] Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

<a href='https://jugghm.github.io/Metric3Dv2'><img src='https://img.shields.io/badge/project%20page-@Metric3D-yellow.svg'></a> <a href='https://arxiv.org/abs/2307.10984'><img src='https://img.shields.io/badge/arxiv-@Metric3Dv1-green'></a> <a href='https://arxiv.org/abs/2404.15506'><img src='https://img.shields.io/badge/arxiv-@Metric3Dv2-red'></a> <a href='https://huggingface.co/spaces/JUGGHM/Metric3D'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a>

🏆 Champion in CVPR2023 Monocular Depth Estimation Challenge

🌼 Abstract

Metric3D is a strong and robust geometry foundation model for high-quality and zero-shot metric depth and surface normal estimation from a single image. It excels at solving in-the-wild scene reconstruction. It can directly help you measure the size of structures from a single image. Now it achieves SOTA performance on over 10 depth and normal benchmarks.

📝 Benchmarks

Metric Depth

Our models rank 1st on the routine KITTI and NYU benchmarks.

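| Method | Backbone | KITTI δ1 ↑ | KITTI δ2 ↑ | KITTI AbsRel ↓ | KITTI RMSE ↓ | KITTI RMS_log ↓ | NYU δ1 ↑ | NYU δ2 ↑ | NYU AbsRel ↓ | NYU RMSE ↓ | NYU log10 ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ZoeDepth | ViT-Large | 0.971 | 0.995 | 0.053 | 2.281 | 0.082 | 0.953 | 0.995 | 0.077 | 0.277 | 0.033 |
| ZeroDepth | ResNet-18 | 0.968 | 0.996 | 0.057 | 2.087 | 0.083 | 0.954 | 0.995 | 0.074 | 0.269 | 0.103 |
| IEBins | SwinT-Large | 0.978 | 0.998 | 0.050 | 2.011 | 0.075 | 0.936 | 0.992 | 0.087 | 0.314 | 0.031 |
| DepthAnything | ViT-Large | 0.982 | 0.998 | 0.046 | 1.985 | 0.069 | 0.984 | 0.998 | 0.056 | 0.206 | 0.024 |
| Ours | ViT-Large | 0.985 | 0.998 | 0.044 | 1.985 | 0.064 | 0.989 | 0.998 | 0.047 | 0.183 | 0.020 |
| Ours | ViT-giant2 | 0.989 | 0.998 | 0.039 | 1.766 | 0.060 | 0.987 | 0.997 | 0.045 | 0.187 | 0.015 |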
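
For readers unfamiliar with the column headers, the sketch below computes δ1, AbsRel, and RMSE as they are commonly defined in the depth-estimation literature. It is a minimal illustration on plain Python lists, not the evaluation code of this repository; the function name `depth_metrics` and the toy values are assumptions made for the example.

```python
import math

def depth_metrics(pred, gt):
    """Common depth metrics over paired predictions and ground truth.

    Hypothetical helper for illustration; only entries with gt > 0 are used,
    and predictions are assumed to be strictly positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    # delta_1: fraction of pixels with max(pred/gt, gt/pred) < 1.25
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    # AbsRel: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    return {"delta1": delta1, "abs_rel": abs_rel, "rmse": rmse}

# Toy values; the last ground-truth entry is 0 and is therefore ignored.
metrics = depth_metrics(pred=[1.0, 2.2, 4.0, 9.0], gt=[1.0, 2.0, 5.0, 0.0])
```

Higher is better for δ1, lower is better for AbsRel and RMSE, which matches the arrows in the table headers.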

Affine-invariant Depth

Even compared to recent affine-invariant depth methods (Marigold and Depth Anything), our metric-depth (and normal) models still show superior performance.

| Method | #Data for Pretrain and Train | KITTI AbsRel ↓ | KITTI δ1 ↑ | NYUv2 AbsRel ↓ | NYUv2 δ1 ↑ | DIODE-Full AbsRel ↓ | DIODE-Full δ1 ↑ | Eth3d AbsRel ↓ | Eth3d δ1 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| OmniData (v2, ViT-L) | 1.3M + 12.2M | 0.069 | 0.948 | 0.074 | 0.945 | 0.149 | 0.835 | 0.166 | 0.778 |
| MariGold (LDMv2) | 5B + 74K | 0.099 | 0.916 | 0.055 | 0.961 | 0.308 | 0.773 | 0.127 | 0.960 |
| DepthAnything (ViT-L) | 142M + 63M | 0.076 | 0.947 | 0.043 | 0.981 | 0.277 | 0.759 | 0.065 | 0.882 |
| Ours (ViT-L) | 142M + 16M | 0.042 | 0.979 | 0.042 | 0.980 | 0.141 | 0.882 | 0.042 | 0.987 |
| Ours (ViT-g) | 142M + 16M | 0.043 | 0.982 | 0.043 | 0.981 | 0.136 | 0.895 | 0.042 | 0.983 |

Surface Normal

Our models also show powerful performance on normal benchmarks.

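| Method | NYU 11.25° ↑ | NYU Mean ↓ | NYU RMS ↓ | ScanNet 11.25° ↑ | ScanNet Mean ↓ | ScanNet RMS ↓ | iBims 11.25° ↑ | iBims Mean ↓ | iBims RMS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| EESNU | 0.597 | 16.0 | 24.7 | 0.711 | 11.8 | 20.3 | 0.585 | 20.0 | - |
| IronDepth | - | - | - | - | - | - | 0.431 | 25.3 | 37.4 |
| PolyMax | 0.656 | 13.1 | 20.4 | - | - | - | - | - | - |
| Ours (ViT-L) | 0.688 | 12.0 | 19.2 | 0.760 | 9.9 | 16.4 | 0.694 | 19.4 | 34.9 |
| Ours (ViT-g) | 0.662 | 13.2 | 20.2 | 0.778 | 9.2 | 15.3 | 0.697 | 19.6 | 35.2 |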
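
The normal metrics above are statistics of the per-pixel angular error between predicted and ground-truth normals. The following sketch shows one common way to compute them; `normal_metrics` and the two sample vectors are hypothetical and are not taken from this repository's evaluation code.

```python
import math

def normal_metrics(pred, gt, threshold_deg=11.25):
    """Angular-error statistics between predicted and ground-truth normals.

    Hypothetical helper; `pred` and `gt` are lists of non-zero 3-vectors.
    """
    errors = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
        errors.append(math.degrees(math.acos(cos)))
    n = len(errors)
    return {
        "within": sum(e < threshold_deg for e in errors) / n,  # 11.25° column
        "mean": sum(errors) / n,                               # Mean column
        "rms": math.sqrt(sum(e * e for e in errors) / n),      # RMS column
    }

# One perfect prediction and one that is 45 degrees off.
stats = normal_metrics(
    pred=[(0.0, 0.0, 1.0), (0.0, 1.0, 1.0)],
    gt=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
)
```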

🌈 DEMOs

Zero-shot monocular metric depth & surface normal

<img src="media/gifs/demo_1.gif" width="600" height="337"> <img src="media/gifs/demo_12.gif" width="600" height="337">

Zero-shot metric 3D recovery

<img src="media/gifs/demo_2.gif" width="600" height="337">

Improving monocular SLAM

<img src="media/gifs/demo_22.gif" width="600" height="337">

🔨 Installation

One-line Installation

For the ViT models, use the following environment:

pip install -r requirements_v2.txt

For ConvNeXt-L, use:

pip install -r requirements_v1.txt

dataset annotation components

To use off-the-shelf depth datasets, we need to generate JSON annotations compatible with this codebase, organized as follows:

{
	'files': [
		{
			'rgb': 'data/kitti_demo/rgb/xxx.png',
			'depth': 'data/kitti_demo/depth/xxx.png',
			'depth_scale': 1000.0,  # the depth scale of the gt depth image
			'cam_in': [fx, fy, cx, cy],
		},

		{
			...
		},

		...
	]
}

To generate such annotations, please refer to the "Inference" section.
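
As a rough illustration of the structure described above, the sketch below assembles such an annotation dictionary and writes it to a JSON file. The helper `build_annotations`, the output file name, and the intrinsic values are assumptions made for this example; the repository's own generator is the script referenced in the "Inference" section.

```python
import json
import os
import tempfile

def build_annotations(samples, depth_scale=1000.0):
    """Assemble the {'files': [...]} structure described above.

    `samples` is an iterable of (rgb_path, depth_path, cam_in) tuples, where
    cam_in is [fx, fy, cx, cy]. Hypothetical helper, not part of the repo.
    """
    files = []
    for rgb_path, depth_path, cam_in in samples:
        files.append({
            "rgb": rgb_path,
            "depth": depth_path,
            "depth_scale": depth_scale,  # the depth scale of the gt depth image
            "cam_in": [float(v) for v in cam_in],
        })
    return {"files": files}

annotations = build_annotations([
    ("data/kitti_demo/rgb/xxx.png", "data/kitti_demo/depth/xxx.png",
     [700.0, 700.0, 600.0, 180.0]),  # placeholder intrinsics, not real values
])

# Write to a temporary directory so the example has no side effects.
out_path = os.path.join(tempfile.mkdtemp(), "test_annotations.json")
with open(out_path, "w") as f:
    json.dump(annotations, f, indent=2)
```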

configs

In mono/configs we provide different config setups.

The intrinsics of the canonical camera are set below:

    canonical_space = dict(
        img_size=(512, 960),
        focal_length=1000.0,
    ),

where cx and cy are set to half of the image width and height, respectively.
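
To make the canonical camera concrete, here is a minimal sketch that builds its 3x3 intrinsic matrix from the config values. It assumes that `img_size` is given as (height, width); the function name `canonical_intrinsics` is hypothetical.

```python
def canonical_intrinsics(img_size=(512, 960), focal_length=1000.0):
    """Return the 3x3 intrinsic matrix of the canonical camera.

    Assumes img_size is (height, width); the principal point is placed
    at the image centre, as stated in the text.
    """
    height, width = img_size
    cx, cy = width / 2.0, height / 2.0
    return [
        [focal_length, 0.0, cx],
        [0.0, focal_length, cy],
        [0.0, 0.0, 1.0],
    ]

K = canonical_intrinsics()
```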

Inference settings are defined as

    depth_range=(0, 1),
    depth_normalize=(0.3, 150),
    crop_size = (512, 1088),

where images are first resized to crop_size and then fed into the model.
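
Whenever an image is resized, its intrinsics must be rescaled by the same factor. The sketch below illustrates this under two assumptions that are not stated in this README: the resize preserves the aspect ratio and pads the remainder on the bottom and right, and sizes are given as (height, width). The function name `resize_with_intrinsics` and the sample values are hypothetical.

```python
def resize_with_intrinsics(image_hw, cam_in, crop_size=(512, 1088)):
    """Compute the resized shape, padding, and rescaled intrinsics.

    Assumes an aspect-preserving resize followed by bottom/right padding up
    to crop_size; cam_in is [fx, fy, cx, cy]. Illustrative only.
    """
    h, w = image_hw
    target_h, target_w = crop_size
    scale = min(target_h / h, target_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    # Focal lengths and principal point scale linearly with the image.
    scaled_cam_in = [v * scale for v in cam_in]
    pad_h, pad_w = target_h - new_h, target_w - new_w
    return (new_h, new_w), (pad_h, pad_w), scaled_cam_in

new_size, padding, cam_in = resize_with_intrinsics(
    (1024, 2048), [2000.0, 2000.0, 1024.0, 512.0]
)
```

If the padding were instead split evenly around the image, cx and cy would additionally shift by the left and top padding offsets.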

✈️ Training

Please refer to training/README.md. Now we provide complete json files for KITTI fine-tuning.

✈️ Inference

News: Improved ONNX support with dynamic shapes (feature contributed by @xenova; many thanks for this outstanding contribution 🚩🚩🚩)

ONNX support is now available for all three models with varying input shapes. Refer to issue117 for more details.

Improved ONNX Checkpoints Available now

| Model | Encoder | Decoder | Link |
|---|---|---|---|
| v2-S-ONNX | DINO2reg-ViT-Small | RAFT-4iter | Download 🤗 |
| v2-L-ONNX | DINO2reg-ViT-Large | RAFT-8iter | Download 🤗 |
| v2-g-ONNX | DINO2reg-ViT-giant2 | RAFT-8iter | Download 🤗 |

One additional reminder for using these onnx models is reported by @norbertlink.

News: PyTorch Hub is supported

Now you can use Metric3D via PyTorch Hub with just a few lines of code:

import torch
model = torch.hub.load('yvanyin/metric3d', 'metric3d_vit_small', pretrain=True)
pred_depth, confidence, output_dict = model.inference({'input': rgb})
pred_normal = output_dict['prediction_normal'][:, :3, :, :] # only available for Metric3Dv2 i.e., ViT models
normal_confidence = output_dict['prediction_normal'][:, 3, :, :] # see https://arxiv.org/abs/2109.09881 for details

Supported models: metric3d_convnext_tiny, metric3d_convnext_large, metric3d_vit_small, metric3d_vit_large, metric3d_vit_giant2.

We also provide a minimal working example in hubconf.py, which hopefully makes everything clearer.

News: ONNX Export and Inference are supported

We also provide a flexible working example in metric3d_onnx_export.py to export the PyTorch Hub model to ONNX format. You can test it with the following commands:

# Export the model to ONNX model
python3 onnx/metric_3d_onnx_export.py metric3d_vit_small # metric3d_vit_large/metric3d_convnext_large

# Test the inference of the ONNX model
python3 onnx/test_onnx.py metric3d_vit_small.onnx

ros2_vision_inference provides a Python example showcasing a pipeline from images to point clouds, integrated into ROS2 systems.

Download Checkpoint

| Model | Encoder | Decoder | Link |
|---|---|---|---|
| v1-T | ConvNeXt-Tiny | Hourglass-Decoder | Download 🤗 |
| v1-L | ConvNeXt-Large | Hourglass-Decoder | Download |
| v2-S | DINO2reg-ViT-Small | RAFT-4iter | Download |
| v2-L | DINO2reg-ViT-Large | RAFT-8iter | Download |
| v2-g | DINO2reg-ViT-giant2 | RAFT-8iter | Download 🤗 |

Dataset Mode

  1. Put the trained checkpoint file model.pth in weight/.
  2. Generate the data annotation by following data/gene_annos_kitti_demo.py. Each entry includes 'rgb', (optional) 'intrinsic', (optional) 'depth', and (optional) 'depth_scale'.
  3. Change 'test_data_path' in test_*.sh to the *.json path.
  4. Run source test_kitti.sh or source test_nyu.sh.

In-the-Wild Mode

  1. Put the trained checkpoint file model.pth in weight/.
  2. Change 'test_data_path' in test.sh to the image folder path.
  3. Run source test_vit.sh for transformers and source test.sh for convnets. As no intrinsics are provided, 9 focal-length settings are used by default.

Metric3D and Droid-Slam

If you are interested in combining Metric3D with a monocular visual SLAM system to achieve metric SLAM, you can refer to this repo.

❓ Q & A

Q1: Why do depth maps look good while point clouds are distorted?

Because the focal length is not properly set! Please find a proper focal length by modifying the code here yourself.
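
To see why the focal length matters, consider the standard pinhole back-projection used to turn a depth map into a point cloud. The sketch below is a generic illustration with made-up values, not the repository's point-cloud code; the function name `unproject` is hypothetical.

```python
def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) to 3D points with a pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # X and Y are inversely proportional to the focal length; Z is not.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

depth = [[2.0, 2.0], [2.0, 2.0]]
good = unproject(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
bad = unproject(depth, fx=50.0, fy=50.0, cx=0.5, cy=0.5)  # focal length halved
```

With the focal length halved, every point keeps the same Z but its X and Y double, so the same depth map yields a laterally stretched point cloud.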

Q2: Why are the point clouds so slow to generate?

Because the images are too large! Use smaller ones instead.

Q3: Why are the predicted depth maps not satisfactory?

First, make sure all black padding regions at the image boundaries are cropped out, then try again. Besides, Metric3D is not almighty: some objects (chandeliers, drones...) and camera views (aerial view, BEV...) do not occur frequently in the training datasets. We will look deeper into this and release more powerful solutions.

📧 Citation

If you use this toolbox in your research or wish to refer to the baseline results published here, please use the following BibTeX entries:

@misc{Metric3D,
  author =       {Yin, Wei and Hu, Mu},
  title =        {{Metric3D}: A Toolbox for Zero-shot Metric Depth Estimation},
  howpublished = {\url{https://github.com/YvanYin/Metric3D}},
  year =         {2024}
}

Please also cite our papers if this helps your research.

@article{hu2024metric3d,
  title={Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation},
  author={Hu, Mu and Yin, Wei and Zhang, Chi and Cai, Zhipeng and Long, Xiaoxiao and Chen, Hao and Wang, Kaixuan and Yu, Gang and Shen, Chunhua and Shen, Shaojie},
  journal={arXiv preprint arXiv:2404.15506},
  year={2024}
}
@inproceedings{yin2023metric,
  title={Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image},
  author={Yin, Wei and Zhang, Chi and Chen, Hao and Cai, Zhipeng and Yu, Gang and Wang, Kaixuan and Chen, Xiaozhi and Shen, Chunhua},
  booktitle={ICCV},
  year={2023}
}

License and Contact

The Metric3D code is released under a 2-clause BSD License. For commercial inquiries, please contact Dr. Wei Yin [yvanwy@outlook.com] and Mr. Mu Hu [mhuam@connect.ust.hk].