UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception

This repo is the official implementation of the ICCV 2023 paper UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as its follow-ups. UniTR achieves state-of-the-art performance on the nuScenes dataset with a truly unified, weight-sharing multi-modal (e.g., cameras and LiDARs) backbone. UniTR is built upon the DSVT codebase. We have made every effort to keep the code clean, concise, easy to read, and state-of-the-art, with minimal dependencies.

<div align="center"> <img src="assets/Figure1.png" width="700"/> </div>

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang*, Hao Tang*, Shaoshuai Shi $^\dagger$, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang $^\dagger$

Contact: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn), Shaoshuai Shi (shaoshuaics@gmail.com)

🚀 Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.

🔥 👀 Honestly, the partition step in UniTR is slow and takes about 40% of the total time. Better strategies or some engineering effort could reduce this cost to zero, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in industry wants to improve this, we believe it could be halved. Importantly, this cost does not scale with model size, which makes the design friendly to larger models.
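
To make the potential gain concrete, here is a back-of-the-envelope sketch that applies Amdahl's law to the figure quoted above. Only the 40% share comes from this README; the function name and the chosen speed-up factors are illustrative assumptions, not measurements.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: end-to-end speed-up when a stage that takes `fraction`
    of the total runtime is made `factor` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


PARTITION_SHARE = 0.40  # share of total time quoted in the note above

# Hypothetical scenarios: partition cost halved, and partition cost removed.
halved = overall_speedup(PARTITION_SHARE, 2.0)
removed = overall_speedup(PARTITION_SHARE, float("inf"))

print(f"partition halved:  {halved:.2f}x end-to-end")
print(f"partition removed: {removed:.2f}x end-to-end")
```

Under these assumptions, halving the partition cost gives roughly a 1.25x end-to-end speed-up, and removing it entirely gives roughly 1.67x.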

📘 I am going to share my understanding of, and future plans for, a general 3D perception foundation model without reservation. Please refer to 🔥 Potential Research 🔥. If you find it useful or inspiring for your research, feel free to join me in building this blueprint.

Interpretive Articles: [CVer] [自动驾驶之心] [ReadPaper] [知乎] [CSDN] [TechBeat (将门创投)]

News

Overview

TODO

Introduction

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, which leads to additional computational overhead and inefficient collaboration between data from different sensors.

<div align="center"> <img src="assets/Figure2.png" width="500"/> </div>

In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, improving 3D object detection by +1.1 NDS and BEV map segmentation by +12.0 mIoU, with lower inference latency.

<div align="center"> <img src="assets/Figure3.png" width="800"/> </div>

Main results

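3D Object Detection (on nuScenes validation)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | ckpt | Log |
|---|---|---|---|---|---|---|---|---|---|
| UniTR | 73.0 | 70.1 | 26.3 | 24.7 | 26.8 | 24.6 | 17.9 | ckpt | Log |
| UniTR+LSS | 73.3 | 70.5 | 26.0 | 24.4 | 26.8 | 24.8 | 18.7 | ckpt | Log |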
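
For readers unfamiliar with the headline metric, the sketch below recomputes NDS for the UniTR validation row using the standard nuScenes definition, NDS = (5 * mAP + sum of (1 - min(1, error)) over the five true-positive errors) / 10. It assumes the error columns in the table are reported multiplied by 100; the function name is illustrative.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err))) / 10 over the five TP errors.

    All inputs are fractions (e.g. 0.701), not percentages.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected mATE, mASE, mAOE, mAVE and mAAE")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# UniTR row of the validation table, converted from the x100 scale (assumed).
unitr_val = nuscenes_detection_score(
    mean_ap=0.701,
    tp_errors=[0.263, 0.247, 0.268, 0.246, 0.179],  # mATE, mASE, mAOE, mAVE, mAAE
)
print(f"NDS = {100 * unitr_val:.1f}")
```

This reproduces the reported 73.0 for UniTR. Because the table values are rounded, recomputing other rows this way can differ from the reported NDS by about 0.1.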

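3D Object Detection (on nuScenes test)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| UniTR | 74.1 | 70.5 | 24.4 | 23.3 | 25.7 | 24.1 | 13.0 |
| UniTR+LSS | 74.5 | 70.9 | 24.1 | 22.9 | 25.6 | 24.0 | 13.1 |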

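BEV Map Segmentation (on nuScenes validation)

| Model | mIoU | Drivable | Ped. Cross. | Walkway | StopLine | Carpark | Divider | ckpt | Log |
|---|---|---|---|---|---|---|---|---|---|
| UniTR | 73.2 | 90.4 | 73.1 | 78.2 | 66.6 | 67.3 | 63.8 | ckpt | Log |
| UniTR+LSS | 74.7 | 90.7 | 74.0 | 79.3 | 68.2 | 72.9 | 64.2 | ckpt | Log |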

What's new here?

🔥 Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation

Our approach has achieved the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.

3D Object Detection
<div align="left"> <img src="assets/Figure4.png" width="700"/> </div>
BEV Map Segmentation
<div align="left"> <img src="assets/Figure5.png" width="700"/> </div>

🔥 Weight-Sharing among all modalities

We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.
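
The idea of a single set of encoder weights serving every modality can be illustrated with a deliberately tiny sketch. This is not the DSVT-based block used in this repository: it is a plain single-head self-attention layer written in dependency-free Python, and the class name, token counts, and feature width are all assumptions made for illustration. For brevity one layer is reused for both stages, whereas a real model would stack several layers.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class SharedSelfAttention:
    """Single-head self-attention with one weight set for every modality."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One (dim x dim) projection each for queries, keys and values.
        self.wq, self.wk, self.wv = (
            [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]
            for _ in range(3)
        )

    def __call__(self, tokens):
        q, k, v = (matmul(tokens, w) for w in (self.wq, self.wk, self.wv))
        k_t = [list(col) for col in zip(*k)]  # transpose of k
        scale = math.sqrt(self.dim)
        attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        return matmul(attn, v)


rng = random.Random(1)
dim = 4
image_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
lidar_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)]

encoder = SharedSelfAttention(dim)

# Modal-wise stage: the same weights process each modality separately.
image_feats = encoder(image_tokens)
lidar_feats = encoder(lidar_tokens)

# Cross-modal stage: tokens from both modalities attend to each other,
# still through the same weights, with no separate fusion module.
fused = encoder(image_feats + lidar_feats)
print(len(fused), "tokens of width", len(fused[0]))
```

The point of the sketch is only that no parameter is specific to cameras or LiDAR: every call goes through the same three projection matrices.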

🔥 Prerequisite for 3D vision foundation models

A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.

Quick Start

Installation

conda create -n unitr python=3.8
conda activate unitr
# Install torch (only tested with PyTorch 1.10)
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/Haiyang-W/UniTR
cd UniTR

# Install extra dependencies
pip install -r requirements.txt

# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5

# Develop
python setup.py develop
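
After installation, a quick sanity check of the pinned packages can save debugging time later. The sketch below uses only the Python standard library (`importlib.metadata`, available from Python 3.8); the helper name and the output format are illustrative, and the version pins are copied from the commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Pins taken from the install commands above.
EXPECTED = {
    "torch": "1.10.1",
    "torchvision": "0.11.2",
    "nuscenes-devkit": "1.0.5",
}

for name, found in installed_versions(EXPECTED).items():
    wanted = EXPECTED[name]
    # Local build tags such as "+cu113" are ignored when comparing.
    ok = found is not None and found.split("+")[0] == wanted
    status = "OK" if ok else "CHECK"
    print(f"{name:16s} wanted {wanted:8s} found {found or 'not installed':16s} {status}")
```

A `CHECK` line does not necessarily mean the setup is broken; it only flags a version that differs from the one the authors report testing.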

Dataset Preparation

Organize the downloaded nuScenes data as follows:

OpenPCDet
├── data
│   ├── nuscenes
│   │   ├── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   ├── samples
│   │   │   ├── sweeps
│   │   │   ├── maps
│   │   │   ├── v1.0-trainval
├── pcdet
├── tools

For BEV map segmentation, the maps folder should also contain the nuScenes map expansion files:

OpenPCDet
├── maps
│   ├── ......
│   ├── boston-seaport.json
│   ├── singapore-onenorth.json
│   ├── singapore-queenstown.json
│   ├── singapore-hollandvillage.json

Generate the dataset info files and the ground-truth databases:

# Create the dataset info files and the LiDAR and image GT databases
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval \
    --with_cam \
    --with_cam_gt \
    # --share_memory # enable shared memory for LiDAR and image GT sampling (about 24G+143G or 12G+72G)
# Shared memory greatly improves training speed, but needs about 150G or 75G of extra cache memory.
# NOTE: all experiments used shared memory. Shared memory does not affect performance.

The generated data is laid out as follows:

OpenPCDet
├── data
│   ├── nuscenes
│   │   ├── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   ├── samples
│   │   │   ├── sweeps
│   │   │   ├── maps
│   │   │   ├── v1.0-trainval
│   │   │   ├── img_gt_database_10sweeps_withvelo
│   │   │   ├── gt_database_10sweeps_withvelo
│   │   │   ├── nuscenes_10sweeps_withvelo_lidar.npy (optional) # if shared memory is enabled
│   │   │   ├── nuscenes_10sweeps_withvelo_img.npy (optional) # if shared memory is enabled
│   │   │   ├── nuscenes_infos_10sweeps_train.pkl
│   │   │   ├── nuscenes_infos_10sweeps_val.pkl
│   │   │   ├── nuscenes_dbinfos_10sweeps_withvelo.pkl
├── pcdet
├── tools
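
Before starting a long training run, it can help to confirm that the directory layout shown above is complete. The following is a minimal sketch using only `pathlib`; the entry names are copied from the trees above, while the helper name and the assumption that the script runs from the repository root are illustrative.

```python
from pathlib import Path

# Entries listed in the directory trees above, relative to the version folder.
RAW_ENTRIES = ["samples", "sweeps", "maps", "v1.0-trainval"]
GENERATED_ENTRIES = [
    "img_gt_database_10sweeps_withvelo",
    "gt_database_10sweeps_withvelo",
    "nuscenes_infos_10sweeps_train.pkl",
    "nuscenes_infos_10sweeps_val.pkl",
    "nuscenes_dbinfos_10sweeps_withvelo.pkl",
]


def missing_entries(version_dir, expected):
    """Return the expected names that do not exist under `version_dir`."""
    root = Path(version_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Assumed to be run from the repository root, as in the trees above.
    version_dir = Path("data/nuscenes/v1.0-trainval")
    for label, expected in (("raw", RAW_ENTRIES), ("generated", GENERATED_ENTRIES)):
        missing = missing_entries(version_dir, expected)
        summary = "complete" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {summary}")
```

The optional shared-memory `.npy` files are left out of the check on purpose, since they only exist when `--share_memory` was used.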

Training

Please download the pretrained checkpoint unitr_pretrain.pth and copy the file into the root folder, e.g., UniTR/unitr_pretrain.pth. This file contains the weights of DSVT pretrained on the ImageNet and nuImages datasets.

3D object detection:

# multi-gpu training
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

BEV Map Segmentation:

# multi-gpu training
# note that we don't use image pretraining for BEV map segmentation
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000

## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000
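
The four training commands differ only in the config file and in whether they load pretrained weights or enable map evaluation. The sketch below captures that pattern in a small helper for launching runs from a script. All config names and flags are copied from the commands above; the helper itself and its parameter names are hypothetical and not part of this repository.

```python
import shlex

# Config names are copied from the commands above.
TRAIN_CONFIGS = {
    ("detection", False): "unitr.yaml",
    ("detection", True): "unitr+lss.yaml",
    ("map", False): "unitr_map.yaml",
    ("map", True): "unitr_map+lss.yaml",
}


def train_command(task, use_lss=False, num_gpus=8, pretrained="../unitr_pretrain.pth"):
    """Build the dist_train.sh argument list for one of the four setups."""
    cfg = TRAIN_CONFIGS[(task, use_lss)]
    cmd = [
        "bash", "scripts/dist_train.sh", str(num_gpus),
        "--cfg_file", f"./cfgs/nuscenes_models/{cfg}",
        "--sync_bn",
    ]
    if task == "detection":
        # Detection starts from the pretrained image weights.
        cmd += ["--pretrained_model", pretrained]
    else:
        # Map segmentation skips image pretraining and enables map evaluation.
        cmd += ["--eval_map"]
    cmd += ["--logger_iter_interval", "1000"]
    return cmd


for task, use_lss in TRAIN_CONFIGS:
    print(shlex.join(train_command(task, use_lss)))
```

The returned list can be passed to `subprocess.run(cmd, cwd="tools", check=True)`, which mirrors the `cd tools` step in the commands above.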

Testing

3D object detection:

# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <CHECKPOINT_FILE>

## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <CHECKPOINT_FILE>

BEV Map Segmentation:

# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <CHECKPOINT_FILE> --eval_map

## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <CHECKPOINT_FILE> --eval_map
# NOTE: evaluation results are not written to *.log; they are only printed in the terminal

Cache Testing

# Only for 3D Object Detection
## normal
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

## add LSS
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+LSS_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

## add LSS
### cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+LSS_cache_plus.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

Performance of cache testing on nuScenes validation (some variations in camera parameters)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| UniTR (Cache Backbone) | 72.6 (-0.4) | 69.4 (-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
| UniTR+LSS (Cache Backbone) | 73.1 (-0.2) | 70.2 (-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 |
| UniTR+LSS (Cache Backbone and LSS) | 72.6 (-0.7) | 69.3 (-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 |
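
The deltas in parentheses are measured against the non-cached validation results reported earlier in this README. The short sketch below recomputes them from the table values; the dictionary layout and function name are illustrative.

```python
# (NDS, mAP) on nuScenes validation, copied from the tables in this README.
BASELINE = {"UniTR": (73.0, 70.1), "UniTR+LSS": (73.3, 70.5)}
CACHED = {
    "UniTR (Cache Backbone)": ("UniTR", 72.6, 69.4),
    "UniTR+LSS (Cache Backbone)": ("UniTR+LSS", 73.1, 70.2),
    "UniTR+LSS (Cache Backbone and LSS)": ("UniTR+LSS", 72.6, 69.3),
}


def cache_deltas(baseline, cached):
    """Return {variant: (NDS delta, mAP delta)}, rounded to one decimal."""
    deltas = {}
    for variant, (base_name, nds, mean_ap) in cached.items():
        base_nds, base_map = baseline[base_name]
        deltas[variant] = (round(nds - base_nds, 1), round(mean_ap - base_map, 1))
    return deltas


for variant, (d_nds, d_map) in cache_deltas(BASELINE, CACHED).items():
    print(f"{variant}: NDS {d_nds:+.1f}, mAP {d_map:+.1f}")
```

The recomputed values match the parenthesized numbers: caching only the backbone mapping costs at most 0.4 NDS, while also caching LSS costs 0.7 NDS and 1.2 mAP.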

Potential Research

<div align="center"> <img src="assets/Figure6.png" width="800"/> </div>

Possible Issues

Citation

Please consider citing our work as follows if it is helpful.

@inproceedings{wang2023unitr,
    title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
    author={Haiyang Wang and Hao Tang and Shaoshuai Shi and Aoxue Li and Zhenguo Li and Bernt Schiele and Liwei Wang},
    booktitle={ICCV},
    year={2023}
}

Acknowledgments

UniTR uses code from a few open-source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!