
<div align="center"> <h2>ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers</h2> </div>

<video src="https://github.com/user-attachments/assets/a5856329-0210-4e3a-bbfb-16580e47ba9e" controls="controls" width="500" height="300"></video>

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers, ECCV 2024

## News

## Abstract

3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose our learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, including map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods.
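To make the occupancy representation concrete, here is a minimal, illustrative sketch of quantizing 3D points into a binary grid map, as described above. The function name, parameters, and grid bounds are hypothetical choices for illustration, not the paper's actual API or configuration.

```python
import numpy as np

def voxelize_occupancy(points, grid_min, grid_max, voxel_size):
    """Quantize 3D points into a binary occupancy grid.

    Illustrative only: names and parameters are hypothetical,
    not taken from the ViewFormer codebase.
    """
    points = np.asarray(points, dtype=np.float64)
    grid_min = np.asarray(grid_min, dtype=np.float64)
    grid_max = np.asarray(grid_max, dtype=np.float64)
    # Number of voxels along each axis.
    shape = np.ceil((grid_max - grid_min) / voxel_size).astype(int)
    grid = np.zeros(shape, dtype=bool)
    # Keep only points that fall inside the grid bounds.
    mask = np.all((points >= grid_min) & (points < grid_max), axis=1)
    # Map each surviving point to its voxel index and mark it occupied.
    idx = ((points[mask] - grid_min) / voxel_size).astype(int)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Example: two points landing in distinct voxels of a 4x4x4 grid.
occ = voxelize_occupancy([[0.1, 0.1, 0.1], [1.6, 0.2, 0.3]],
                         grid_min=[0, 0, 0], grid_max=[2, 2, 2],
                         voxel_size=0.5)
print(occ.shape, int(occ.sum()))  # (4, 4, 4) 2
```

Real occupancy benchmarks additionally assign a semantic class per voxel; the binary grid here only conveys the quantization idea.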

## Methods

<div align="center"> <img src="figs/framework.png" width="800"/> </div><br/> <div align="center"> <img src="figs/task.png" width="800"/> </div><br/>

## Getting Started

Please follow our documentation to get started.

  1. Environment Setup.
  2. Data Preparation.
  3. Training and Inference.

## Results on Occ3D (based on nuScenes) Val Set

| Method | Backbone | Pretrain | Lr Schd | mIoU | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViewFormer | R50 | R50-depth | 90ep | 41.85 | config | model |
| ViewFormer | InternT | COCO | 24ep | 43.61 | config | model |


## Results on FlowOcc3D Val Set

| Method | Backbone | Pretrain | Lr Schd | mIoU | mAVE | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViewFormer | InternT | COCO | 24ep | 42.54 | 0.412 | config | model |


## Acknowledgements

We are grateful for the great works and open-source codebases that made this project possible.

Please also check out our visualization tool Oviz if you are interested in the visualizations in our paper.

## Bibtex

If this work is helpful for your research, please consider citing it with the following BibTeX entry.

    @article{li2024viewformer,
        title={ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers}, 
        author={Jinke Li and Xiao He and Chonghua Zhou and Xiaoqiang Cheng and Yang Wen and Dan Zhang},
        journal={arXiv preprint arXiv:2405.04299},
        year={2024},
    }