
GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding (ICCV 2023)

<h3 align="center"> <a href="https://arxiv.org/abs/2303.11325">arXiv</a> </h3>

Figure: GeoMIM pipeline overview.

Introduction

Welcome to the official repository of GeoMIM, a pretraining approach for multi-view camera-based 3D perception. This repository provides the pretraining and finetuning code, along with pretrained models, to reproduce the results presented in our paper.

The pretraining implementation is based on BEVFusion. See the pretrain folder for further details.

After pretraining, we finetune the pretrained Swin Transformer for multi-view camera-based 3D perception. We use BEVDet for finetuning and provide models with the different techniques used in BEVDet, including CBGS, 4D, Depth, and Stereo. We also provide models for occupancy prediction using the implementation in the BEVDet repo. See the bevdet folder for further details.
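
As a minimal sketch of what reusing the pretrained backbone might look like outside the provided configs (the checkpoint filename, the `state_dict`/`backbone.` key layout, and the `img_backbone` attribute below are assumptions for illustration, not the repository's documented API):

```python
# Minimal sketch (assumed filenames and key layout): loading GeoMIM-pretrained
# Swin weights into a detector backbone before BEVDet finetuning.
import torch

# Assumed local filename for the downloaded Swin-Base checkpoint.
ckpt = torch.load("geomim_swin_base.pth", map_location="cpu")

# mmcv-style checkpoints typically nest the weights under a 'state_dict' key.
state_dict = ckpt.get("state_dict", ckpt)

# Keep only backbone weights and drop the prefix so they match a plain Swin module.
backbone_weights = {
    k.replace("backbone.", "", 1): v
    for k, v in state_dict.items()
    if k.startswith("backbone.")
}
print(f"Found {len(backbone_weights)} backbone tensors")

# In practice the BEVDet configs would point `load_from` (or the backbone's
# `init_cfg`) at the checkpoint; a manual load would look roughly like:
# detector.img_backbone.load_state_dict(backbone_weights, strict=False)
```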

Key Results

We provide the GeoMIM-pretrained Swin-Base and Swin-Large checkpoints.

| Model | Download |
| --- | --- |
| Swin-Base | Model |
| Swin-Large | Model |

We have achieved strong performance on the nuScenes benchmark with GeoMIM. Here are some quantitative results on 3D detection:

| Config | mAP | NDS | Download |
| --- | --- | --- | --- |
| bevdet-swinb-4d-256x704-cbgs | 33.98 | 47.19 | Model |
| bevdet-swinb-4d-256x704-cbgs-geomim | 42.25 | 53.1 | Model |
| bevdet-swinb-4d-stereo-256x704-cbgs-geomim | 45.33 | 55.1 | Model |
| bevdet-swinb-4d-stereo-512x1408-cbgs | 47.2 | 57.6 | Model (#) |
| bevdet-swinb-4d-stereo-512x1408-cbgs-geomim | 52.04 | 60.92 | Model |

Here are some quantitative results on occupancy prediction:

| Config | mIoU | Download |
| --- | --- | --- |
| bevdet-occ-swinb-4d-stereo-2x (*) | 42.0 | Model (#) |
| bevdet-occ-swinb-4d-stereo-2x-geomim | 45.0 | Model |
| bevdet-occ-swinb-4d-stereo-2x-geomim (*) | 45.73 | Model |
| bevdet-occ-swinl-4d-stereo-2x-geomim | 46.27 | Model |

(*) Initialized from the corresponding 3D detection checkpoint. (#) Original BEVDet checkpoint.

Getting Started

Citation

If you find GeoMIM beneficial for your research, please consider citing our paper:

@inproceedings{liu2023geomim,
  title={GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding},
  author={Liu, Jihao and Wang, Tai and Liu, Boxiao and Zhang, Qihang and Liu, Yu and Li, Hongsheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2023}
}

Contact

For any questions or inquiries, please feel free to reach out to the authors: Jihao Liu (email) and Tai Wang (email).