
GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding (ICCV 2023)

<h3 align="center"> <a href="https://arxiv.org/abs/2303.11325">arXiv</a> </h3>

Figure: GeoMIM pipeline overview.

Introduction

Welcome to the official repository of GeoMIM, a pretraining approach for multi-view camera-based 3D perception. This repository provides the pretraining and finetuning code, along with pretrained models, to reproduce the results presented in our paper.

The pretraining implementation is based on BEVFusion. See the pretrain folder for further details.

After pretraining, we finetune the pretrained Swin Transformer for multi-view camera-based 3D perception. We use BEVDet for finetuning and provide models with the different techniques used in BEVDet, including CBGS, 4D, Depth, and Stereo. We also provide models for occupancy prediction using the implementation in the BEVDet repo. See the bevdet folder for further details.
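
As a minimal sketch of what reusing the pretrained backbone might look like outside the provided configs (the checkpoint filename, the `state_dict`/`backbone.` key layout, and the `img_backbone` attribute below are assumptions for illustration, not the repository's documented API):

```python
# Minimal sketch (assumed filenames and key layout): loading GeoMIM-pretrained
# Swin weights into a detector backbone before BEVDet finetuning.
import torch

# Assumed local filename for the downloaded Swin-Base checkpoint.
ckpt = torch.load("geomim_swin_base.pth", map_location="cpu")

# mmcv-style checkpoints typically nest the weights under a 'state_dict' key.
state_dict = ckpt.get("state_dict", ckpt)

# Keep only backbone weights and drop the prefix so they match a plain Swin module.
backbone_weights = {
    k.replace("backbone.", "", 1): v
    for k, v in state_dict.items()
    if k.startswith("backbone.")
}
print(f"Found {len(backbone_weights)} backbone tensors")

# In practice the BEVDet configs would point `load_from` (or the backbone's
# `init_cfg`) at the checkpoint; a manual load would look roughly like:
# detector.img_backbone.load_state_dict(backbone_weights, strict=False)
```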

Key Results

We provide the GeoMIM-pretrained Swin-Base and Swin-Large checkpoints.

| Model | Download |
| --- | --- |
| Swin-Base | Model |
| Swin-Large | Model |

We have achieved strong performance on the nuScenes benchmark with GeoMIM. Here are some quantitative results on 3D detection:

| Config | mAP | NDS | Download |
| --- | --- | --- | --- |
| bevdet-swinb-4d-256x704-cbgs | 33.98 | 47.19 | Model |
| bevdet-swinb-4d-256x704-cbgs-geomim | 42.25 | 53.1 | Model |
| bevdet-swinb-4d-stereo-256x704-cbgs-geomim | 45.33 | 55.1 | Model |
| bevdet-swinb-4d-stereo-512x1408-cbgs | 47.2 | 57.6 | Model (#) |
| bevdet-swinb-4d-stereo-512x1408-cbgs-geomim | 52.04 | 60.92 | Model |

Here are some quantitative results on occupancy prediction:

| Config | mIoU | Download |
| --- | --- | --- |
| bevdet-occ-swinb-4d-stereo-2x (*) | 42.0 | Model (#) |
| bevdet-occ-swinb-4d-stereo-2x-geomim | 45.0 | Model |
| bevdet-occ-swinb-4d-stereo-2x-geomim (*) | 45.73 | Model |
| bevdet-occ-swinl-4d-stereo-2x-geomim | 46.27 | Model |

(*) Initialized from the corresponding 3D detection checkpoint. (#) Original BEVDet checkpoint.

Getting Started

Citation

If you find GeoMIM beneficial for your research, please consider citing our paper:

@inproceedings{liu2023geomim,
  title={GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding},
  author={Liu, Jihao and Wang, Tai and Liu, Boxiao and Zhang, Qihang and Liu, Yu and Li, Hongsheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}

Contact

For any questions or inquiries, please feel free to reach out to the authors: Jihao Liu (email) and Tai Wang (email).