SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention (ECCV 2022)

This is the official repository for SpatialDETR, which will be published at ECCV 2022.

https://user-images.githubusercontent.com/46648831/190854092-9f9c7fc5-f890-4ec1-809a-9a3807b5993a.mp4

Authors: Simon Doll, Richard Schulz, Lukas Schneider, Viviane Benzin, Markus Enzweiler, Hendrik P.A. Lensch

Abstract

Based on the key idea of DETR, this paper introduces an object-centric 3D object detection framework that operates on a limited number of 3D object queries instead of dense bounding box proposals followed by non-maximum suppression. After image feature extraction, a decoder-only transformer architecture is trained on a set-based loss. SpatialDETR infers the classification and bounding box estimates based on attention both spatially within each image and across the different views. To fuse the multi-view information in the attention block, we introduce a novel geometric positional encoding that incorporates the view ray geometry to explicitly consider the extrinsic and intrinsic camera setup. This way, the spatially-aware cross-view attention exploits arbitrary receptive fields to integrate cross-sensor data and therefore global context. Extensive experiments on the nuScenes benchmark demonstrate the potential of global attention and result in state-of-the-art performance.
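
To make the geometric encoding more concrete, the snippet below is a minimal, illustrative sketch (not the repository's actual code) of how per-pixel view rays can be computed from the camera intrinsics and extrinsics; the function name `view_ray_encoding` and the argument layout are assumptions, not the plugin's API.

```python
# Illustrative sketch only -- not the repository's implementation.
# Computes, for every cell of a feature map, the unit view ray in the
# global frame from the camera intrinsics K and the camera-to-global
# extrinsic transform. Names and shapes are assumptions.
import torch

def view_ray_encoding(intrinsics: torch.Tensor,
                      cam_to_global: torch.Tensor,
                      feat_h: int,
                      feat_w: int) -> torch.Tensor:
    """intrinsics: (3, 3), cam_to_global: (4, 4) -> rays: (feat_h, feat_w, 3)."""
    # Pixel-center coordinates of every feature-map cell.
    ys, xs = torch.meshgrid(
        torch.arange(feat_h, dtype=torch.float32) + 0.5,
        torch.arange(feat_w, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (H, W, 3)

    # Backproject through the inverse intrinsics: ray directions in the camera frame.
    rays_cam = pix @ torch.inverse(intrinsics).T

    # Rotate into the global frame; direction vectors are unaffected by translation.
    rays_global = rays_cam @ cam_to_global[:3, :3].T

    # Normalize so the encoding captures only the viewing direction.
    return rays_global / rays_global.norm(dim=-1, keepdim=True)
```

Directions like these could then be embedded (e.g. by a small MLP) and injected into the keys of the attention block, which is roughly the role the geometric positional encoding plays in SpatialDETR.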

If you find this repository useful, please cite

@inproceedings{Doll2022ECCV,
  author = {Doll, Simon and Schulz, Richard and Schneider, Lukas and Benzin, Viviane and Enzweiler, Markus and Lensch, Hendrik P.A.},
  title = {SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2022}
}

You can find the paper here.

Setup

To set up the repository and run trainings, see getting_started.md.
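
For orientation, here is a minimal Python sketch, assuming the mmdetection3d rc1.0 API and the DETR3D-style plugin mechanism this repository builds on; the config path is illustrative, and getting_started.md remains the authoritative reference.

```python
# Hedged sketch, assuming mmdetection3d rc1.0 and a DETR3D-style plugin
# config; the config path below is illustrative.
import importlib

from mmcv import Config
from mmdet3d.models import build_model

cfg = Config.fromfile("configs/query_proj_value_proj.py")  # illustrative path

# DETR3D-style repositories register their custom modules (attention,
# positional encodings, ...) by importing the plugin package named in
# the config before the model is built.
if cfg.get("plugin", False):
    importlib.import_module(cfg.plugin_dir.rstrip("/").replace("/", "."))

model = build_model(cfg.model, test_cfg=cfg.get("test_cfg"))
model.init_weights()
```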

Changelog

06/22

Experimental results

The baseline models were trained on 4 V100 GPUs, the submission models on 8 A100 GPUs. For more details, we refer to the corresponding configuration and log files. Keep in mind that performance can vary between runs and that the current codebase uses mmdetection3d@rc1.0.

| Config | Logfile | Set | #GPUs | mmdet3d | mAP | ATE | ASE | AOE | AVE | AAE | NDS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| query_proj_value_proj.py (baseline) | log / model | val | 4 | rc1.0 | 0.315 | 0.843 | 0.279 | 0.497 | 0.787 | 0.208 | 0.396 |
| query_proj_value_proj.py | log | val | 4 | 0.17 | 0.313 | 0.850 | 0.274 | 0.494 | 0.814 | 0.213 | 0.392 |
| query_center_proj_no_value_proj_shared.py | log | val | 8 | 0.17 | 0.351 | 0.772 | 0.274 | 0.395 | 0.847 | 0.217 | 0.425 |
| query_center_proj_no_value_proj_shared_cbgs_vovnet_trainval.py | log | test | 8 | 0.17 | 0.425 | 0.614 | 0.253 | 0.402 | 0.857 | 0.131 | 0.487 |

Qualitative results

License

See license_infos.md for details.

Acknowledgement

This repo contains the implementation of SpatialDETR. Our implementation is a plugin to MMDetection3D and also uses a fork of DETR3D. Full credit belongs to the contributors of those frameworks, and we truly thank them for enabling our research!