Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders

Repository for our TIV 2023 paper "Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders".

Introduction

Mask-based pre-training has achieved great success in self-supervised learning for images and language without manually annotated supervision. However, it has not yet been studied for large-scale point clouds, which carry highly redundant spatial information. In this work, we propose a masked voxel autoencoder network for pre-training on large-scale point clouds, dubbed Voxel-MAE. Our key idea is to transform the point clouds into a voxel representation, mask a large fraction of the voxels, and train the network to classify whether each voxel contains points. This simple yet effective strategy makes the network aware of object shapes at the voxel level, improving performance on downstream tasks such as 3D object detection. Thanks to the high spatial redundancy of large-scale point clouds, our Voxel-MAE can still learn representative features even with a 90% masking ratio. We also validate Voxel-MAE on unsupervised domain adaptation tasks, which demonstrates its generalization ability. Voxel-MAE shows that it is feasible to pre-train on large-scale point clouds without data annotations to enhance the perception ability of autonomous vehicles. Extensive experiments demonstrate the effectiveness of our pre-training method with 3D object detectors (SECOND, CenterPoint, and PV-RCNN) on three popular datasets (KITTI, Waymo, and nuScenes).

<p align="center"> <img src="docs/Voxel-MAE.png" width="100%"/>Flowchart of Voxel-MAE </p>
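To make the key idea concrete, below is a minimal NumPy sketch of the masked-occupancy setup, not the repository's implementation: points are voxelized, a large fraction of the occupied voxels is masked out, and the network is trained to predict a binary occupancy grid. All names, voxel sizes, and point-cloud ranges here are illustrative assumptions.

```python
import numpy as np

def voxelize(points, voxel_size=(0.1, 0.1, 0.2), pc_range=(0, -40, -3, 70.4, 40, 1)):
    """Map each LiDAR point to a voxel index and return the set of occupied voxels."""
    mins, maxs = np.array(pc_range[:3]), np.array(pc_range[3:])
    inside = np.all((points[:, :3] >= mins) & (points[:, :3] < maxs), axis=1)
    idx = ((points[inside, :3] - mins) / np.array(voxel_size)).astype(np.int64)
    return np.unique(idx, axis=0)                       # (M, 3) indices of occupied voxels

def mask_voxels(occupied, mask_ratio=0.9, rng=None):
    """Randomly mask out a high ratio of occupied voxels; the rest form the encoder input."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = rng.random(len(occupied)) >= mask_ratio
    return occupied[keep], occupied[~keep]              # visible voxels, masked voxels

def occupancy_target(occupied, grid_size):
    """Binary occupancy grid used as the reconstruction target (1 = voxel contains points)."""
    target = np.zeros(grid_size, dtype=np.float32)
    target[tuple(occupied.T)] = 1.0
    return target

# Example: random points, keep ~10% of the occupied voxels as network input.
points = np.random.rand(100000, 4) * np.array([70.4, 80, 4, 1]) - np.array([0, 40, 3, 0])
occupied = voxelize(points)
visible, masked = mask_voxels(occupied, mask_ratio=0.9)
target = occupancy_target(occupied, grid_size=(704, 800, 20))
```

The decoder then predicts occupancy for the full grid from the visible voxels only, and a binary classification loss against `target` drives the pre-training.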

Installation

Please refer to INSTALL.md for the installation of OpenPCDet (v0.5).

Getting Started

Please refer to GETTING_STARTED.md.

Usage

First, pre-train Voxel-MAE

KITTI:

Train with multiple GPUs:
bash ./scripts/dist_train_voxel_mae.sh ${NUM_GPUS}  --cfg_file cfgs/kitti_models/voxel_mae_kitti.yaml --batch_size ${BATCH_SIZE}

Train with a single GPU:
python3 train_voxel_mae.py  --cfg_file cfgs/kitti_models/voxel_mae_kitti.yaml --batch_size ${BATCH_SIZE}

Waymo:

python3 train_voxel_mae.py  --cfg_file cfgs/kitti_models/voxel_mae_waymo.yaml --batch_size ${BATCH_SIZE}

nuScenes:

python3 train_voxel_mae.py  --cfg_file cfgs/kitti_models/voxel_mae_nuscenes.yaml --batch_size ${BATCH_SIZE}

Then, train OpenPCDet

Training is the same as in OpenPCDet, except that the detector is initialized with the pre-trained model from our Voxel-MAE.

bash ./scripts/dist_train.sh ${NUM_GPUS}  --cfg_file cfgs/kitti_models/second.yaml --batch_size ${BATCH_SIZE} --pretrained_model ../output/kitti/voxel_mae/ckpt/check_point_10.pth
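As a rough illustration of what `--pretrained_model` does conceptually, the sketch below shows one common way such a checkpoint can be transferred: load it, keep only the parameters whose names and shapes match the detector, and load them non-strictly so the detection heads remain randomly initialized. The checkpoint key `model_state` and the function name are assumptions, not the exact OpenPCDet interface.

```python
import torch

def load_pretrained_backbone(detector, ckpt_path):
    """Partially initialize a detector from a Voxel-MAE pre-training checkpoint (sketch)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model_state", ckpt)          # assumed key; fall back to a flat dict
    current = detector.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in current and v.shape == current[k].shape}
    detector.load_state_dict(matched, strict=False)     # heads absent from the MAE stay random
    return detector
```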

Performance

KITTI Dataset

The results are the 3D detection performance at moderate difficulty on the val set of the KITTI dataset. Results of OpenPCDet are from here.

|  | Car@R11 | Pedestrian@R11 | Cyclist@R11 |
| --- | --- | --- | --- |
| SECOND | 78.62 | 52.98 | 67.15 |
| Voxel-MAE+SECOND | 78.90 | 53.14 | 68.08 |
| SECOND-IoU | 79.09 | 55.74 | 71.31 |
| Voxel-MAE+SECOND-IoU | 79.22 | 55.79 | 72.22 |
| PV-RCNN | 83.61 | 57.90 | 70.47 |
| Voxel-MAE+PV-RCNN | 83.82 | 59.37 | 71.99 |

Waymo Open Dataset

Similar to OpenPCDet, all models are trained with a single frame on 20% of the training samples (~32k frames), and each cell reports mAP/mAPH computed with the official Waymo evaluation metrics on the whole validation set (version 1.2).

| Performance@(train with 20% Data) | Vec_L1 | Vec_L2 | Ped_L1 | Ped_L2 | Cyc_L1 | Cyc_L2 |
| --- | --- | --- | --- | --- | --- | --- |
| SECOND | 70.96/70.34 | 62.58/62.02 | 65.23/54.24 | 57.22/47.49 | 57.13/55.62 | 54.97/53.53 |
| Voxel-MAE+SECOND | 71.12/70.58 | 62.67/62.34 | 67.21/55.68 | 59.03/48.79 | 57.73/56.18 | 55.62/54.17 |
| CenterPoint | 71.33/70.76 | 63.16/62.65 | 72.09/65.49 | 64.27/58.23 | 68.68/67.39 | 66.11/64.87 |
| Voxel-MAE+CenterPoint | 71.89/71.33 | 64.05/63.53 | 73.85/67.12 | 65.78/59.62 | 70.29/69.03 | 67.76/66.53 |
| PV-RCNN (AnchorHead) | 75.41/74.74 | 67.44/66.80 | 71.98/61.24 | 63.70/53.95 | 65.88/64.25 | 63.39/61.82 |
| Voxel-MAE+PV-RCNN (AnchorHead) | 75.94/75.28 | 67.94/67.34 | 74.02/63.48 | 64.91/55.57 | 67.21/65.49 | 64.62/63.02 |
| PV-RCNN (CenterHead) | 75.95/75.43 | 68.02/67.54 | 75.94/69.40 | 67.66/61.62 | 70.18/68.98 | 67.73/66.57 |
| Voxel-MAE+PV-RCNN (CenterHead) | 77.29/76.81 | 68.71/68.21 | 77.70/71.13 | 69.53/63.46 | 70.55/69.39 | 68.11/66.95 |
| PV-RCNN++ | 77.82/77.32 | 69.07/68.62 | 77.99/71.36 | 69.92/63.74 | 71.80/70.71 | 69.31/68.26 |
| Voxel-MAE+PV-RCNN++ | 78.23/77.72 | 69.54/69.12 | 79.85/73.23 | 71.07/64.96 | 71.80/70.64 | 69.31/68.26 |

nuScenes Dataset

|  | mAP | NDS | mATE | mASE | mAOE | mAVE | mAAE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SECOND-MultiHead (CBGS) | 50.59 | 62.29 | 31.15 | 25.51 | 26.64 | 26.26 | 20.46 |
| Voxel-MAE+SECOND-MultiHead | 50.82 | 62.45 | 31.02 | 25.23 | 26.12 | 26.11 | 20.04 |
| CenterPoint (voxel_size=0.1) | 56.03 | 64.54 | 30.11 | 25.55 | 38.28 | 21.94 | 18.87 |
| Voxel-MAE+CenterPoint | 56.45 | 65.02 | 29.73 | 25.17 | 38.38 | 21.47 | 18.65 |

License

Our code is released under the Apache 2.0 license.

Acknowledgement

This repository is based on OpenPCDet.

Citation

If you find this project useful in your research, please consider citing:

@article{min2023occupancy,
  title={Occupancy-MAE: Self-Supervised Pre-Training Large-Scale LiDAR Point Clouds With Masked Occupancy Autoencoders},
  author={Min, Chen and Xiao, Liang and Zhao, Dawei and Nie, Yiming and Dai, Bin},
  journal={IEEE Transactions on Intelligent Vehicles},
  year={2023},
  publisher={IEEE}
}