[ICCV 2023] Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
<!-- ## Introduction --> <div align="center"> <img src="figs/cmt_eva.png" width="900" /><em> Performance comparison and Robustness under sensor failure. All statistics are measured on a single Tesla A100 GPU using the best model of official repositories. All models use spconv Voxelization module. </em>
</div><br/>CMT is a robust 3D detector for end-to-end multi-modal 3D detection. A DETR-like framework is designed for both multi-modal detection (CMT) and LiDAR-only detection (CMT-L), which obtain 74.1% NDS (SoTA without TTA or model ensembling) and 70.1% NDS, respectively, on the nuScenes benchmark. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. CMT can serve as a strong baseline for further research.
Preparation
- Environments
Python == 3.8
CUDA == 11.1
pytorch == 1.9.0
mmcv-full == 1.6.0
mmdet == 2.24.0
mmsegmentation == 0.29.1
mmdet3d == 1.0.0rc5
spconv-cu111 == 2.1.21
flash-attn == 0.2.2
- Data
Follow the mmdet3d data preparation instructions to process the nuScenes dataset.
The PKL files and image-pretrained weights are available on Google Drive.
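The pinned versions in the Environments list can be checked against the current Python environment with a short script. This is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the pip distribution names (for example `torch` for pytorch) are assumptions. Python and CUDA versions are not covered here.

```python
from importlib import metadata

# Pins copied from the Environments list. The pip distribution names
# (e.g. "torch" for pytorch) are assumptions.
PINNED = {
    "torch": "1.9.0",
    "mmcv-full": "1.6.0",
    "mmdet": "2.24.0",
    "mmsegmentation": "0.29.1",
    "mmdet3d": "1.0.0rc5",
    "spconv-cu111": "2.1.21",
    "flash-attn": "0.2.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (wanted, found) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu111" when comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every listed package matches its pin; a `found` value of `None` means the package is not installed.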
Train & inference
# train
bash tools/dist_train.sh /path_to_your_config 8
# inference
bash tools/dist_test.sh /path_to_your_config /path_to_your_pth 8 --eval bbox
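The two commands above share an argument order: script, config, an optional checkpoint, the number of GPUs, then extra flags. The sketch below assembles them from Python, which can help when scripting several runs. The helper `dist_command` is hypothetical and only builds the argument list; the placeholder paths are the ones used above.

```python
import shlex


def dist_command(script, config, num_gpus, checkpoint=None, extra_args=()):
    """Build the argument list for a distributed launch script.

    Order follows the commands above: script, config, [checkpoint],
    GPU count, then any extra flags.
    """
    args = ["bash", script, config]
    if checkpoint is not None:
        args.append(checkpoint)
    args.append(str(num_gpus))
    args.extend(extra_args)
    return args


train_cmd = dist_command("tools/dist_train.sh", "/path_to_your_config", 8)
test_cmd = dist_command(
    "tools/dist_test.sh",
    "/path_to_your_config",
    8,
    checkpoint="/path_to_your_pth",
    extra_args=["--eval", "bbox"],
)

print(shlex.join(train_cmd))
print(shlex.join(test_cmd))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` from the repository root.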
Main Results
Results on the nuScenes val set. The default batch size is 2 per GPU. All FPS values are measured on a single Tesla A100 GPU. (15e + 5e means 20 epochs in total, with the last 5 epochs trained without GTsample.)
Config | Modality | mAP | NDS | Schedule | Inference FPS |
---|---|---|---|---|---|
vov_1600x640 | C | 40.6% | 46.0% | 20e | 8.4 |
voxel0075 | L | 62.14% | 68.6% | 15e+5e | 18.1 |
voxel0100_r50_800x320 | C+L | 67.9% | 70.8% | 15e+5e | 14.2 |
voxel0075_vov_1600x640 | C+L | 70.3% | 72.9% | 15e+5e | 6.4 |
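The 15e+5e schedule in the table can be read as a per-epoch switch for GTsample. The sketch below only illustrates that reading; the function name `gt_sample_enabled` and the 0-based epoch indexing are assumptions, and how the switch is actually implemented in the configs is not described here.

```python
def gt_sample_enabled(epoch, total_epochs=20, final_epochs_without=5):
    """Return True if GTsample is active in the given 0-indexed epoch.

    With the defaults, epochs 0-14 use GTsample and epochs 15-19 do not,
    matching the "15e+5e" schedule.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError(f"epoch must be in [0, {total_epochs})")
    return epoch < total_epochs - final_epochs_without


# Number of epochs trained with GTsample under the default schedule.
epochs_with_gt_sample = sum(gt_sample_enabled(e) for e in range(20))
```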
Results on the nuScenes test set. To reproduce our results, replace `ann_file=data_root + '/nuscenes_infos_train.pkl'` in the training config with `ann_file=[data_root + '/nuscenes_infos_train.pkl', data_root + '/nuscenes_infos_val.pkl']`:
Config | Modality | mAP | NDS | Schedule | Inference FPS |
---|---|---|---|---|---|
vov_1600x640 | C | 42.9% | 48.1% | 20e | 8.4 |
voxel0075 | L | 65.3% | 70.1% | 15e+5e | 18.1 |
voxel0075_vov_1600x640 | C+L | 72.0% | 74.1% | 15e+5e | 6.4 |
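The annotation-file change for the test-set results amounts to passing a list of two PKL files instead of a single path. A minimal sketch of the two values follows; the `data_root` value and the variable names are hypothetical, and where `ann_file` sits inside a given config may differ.

```python
# Hypothetical dataset root; use the value from your own config.
data_root = 'data/nuscenes'

# Default setting: train on the train split only.
train_ann_file = data_root + '/nuscenes_infos_train.pkl'

# Test-set setting: train on the train and val splits together.
trainval_ann_file = [
    data_root + '/nuscenes_infos_train.pkl',
    data_root + '/nuscenes_infos_val.pkl',
]
```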
Citation
If you find CMT helpful in your research, please consider citing:
@article{yan2023cross,
title={Cross Modal Transformer via Coordinates Encoding for 3D Object Detection},
author={Yan, Junjie and Liu, Yingfei and Sun, Jianjian and Jia, Fan and Li, Shuailin and Wang, Tiancai and Zhang, Xiangyu},
journal={arXiv preprint arXiv:2301.01283},
year={2023}
}
Contact
If you have any questions, feel free to open an issue or contact us at yanjunjie@megvii.com, liuyingfei@megvii.com, sunjianjian@megvii.com or wangtiancai@megvii.com.