Deployment of BEV 3D Detection on TensorRT

This repository is a deployment project for BEV 3D detection (including BEVFormer and BEVDet) on TensorRT, supporting FP32/FP16/INT8 inference. To improve the inference speed of BEVFormer on TensorRT, this project implements several TensorRT ops that support nv_half, nv_half2 and INT8. With almost no loss of accuracy, the inference speed of BEVFormer base can be increased by more than four times, the engine size reduced by more than 90%, and GPU memory usage cut by more than 80%. In addition, the project supports common 2D object detection models from MMDetection, which can be quantized to INT8 and deployed with TensorRT with only a few code changes.
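
For example, with the custom nv_half2 plugins, the BEVFormer base FP16/INT8 engine reported below runs at 8.6 FPS versus 1.5 FPS for the FP32 baseline (x5.73), shrinks from 1689 MB to 159 MB (about 91% smaller), and uses 2479 MB instead of 13893 MB of GPU memory (about 82% less).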

Benchmarks

BEVFormer

BEVFormer PyTorch

| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVFormer tiny<br />download | NuScenes | 1 | NDS: 0.354<br/>mAP: 0.252 | 15.9 | 383 | 2167 | RTX 3090 |
| BEVFormer small<br />download | NuScenes | 1 | NDS: 0.478<br/>mAP: 0.370 | 5.1 | 680 | 3147 | RTX 3090 |
| BEVFormer base<br />download | NuScenes | 1 | NDS: 0.517<br/>mAP: 0.416 | 2.4 | 265 | 5435 | RTX 3090 |

BEVFormer TensorRT with MMDeploy Plugins (Only Supports FP32)

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354<br/>mAP: 0.252 | 37.9 (x1) | 136 (x1) | 2159 (x1) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.354<br/>mAP: 0.252 | 69.2 (x1.83) | 74 (x0.54) | 1729 (x0.80) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.353<br/>mAP: 0.249 | 65.1 (x1.72) | 58 (x0.43) | 1737 (x0.80) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.353<br/>mAP: 0.249 | 70.7 (x1.87) | 54 (x0.40) | 1665 (x0.77) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478<br/>mAP: 0.370 | 6.6 (x1) | 245 (x1) | 4663 (x1) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.478<br/>mAP: 0.370 | 12.8 (x1.94) | 126 (x0.51) | 3719 (x0.80) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.476<br/>mAP: 0.367 | 8.7 (x1.32) | 158 (x0.64) | 4079 (x0.87) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.477<br/>mAP: 0.368 | 13.3 (x2.02) | 106 (x0.43) | 3441 (x0.74) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32 | - | NDS: 0.517<br/>mAP: 0.416 | 1.5 (x1) | 1689 (x1) | 13893 (x1) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517<br/>mAP: 0.416 | 1.8 (x1.20) | 849 (x0.50) | 11865 (x0.85) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.516<br/>mAP: 0.414 | 1.8 (x1.20) | 426 (x0.25) | 12429 (x0.89) | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.515<br/>mAP: 0.414 | 2.2 (x1.47) | 244 (x0.14) | 11011 (x0.79) | RTX 3090 |

* These engines ran out of memory during onnx2trt with TensorRT-8.5.1.7 but converted successfully with TensorRT-8.4.3.1, so they were built with TensorRT-8.4.3.1.

BEVFormer TensorRT with Custom Plugins (Supports nv_half, nv_half2 and INT8)

FP16 Plugins with nv_half

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS/Improve | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354<br/>mAP: 0.252 | 40.0 (x1.06) | 135 (x0.99) | 1693 (x0.78) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.355<br/>mAP: 0.252 | 81.2 (x2.14) | 70 (x0.51) | 1203 (x0.56) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.351<br/>mAP: 0.249 | 90.1 (x2.38) | 58 (x0.43) | 1105 (x0.51) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.351<br/>mAP: 0.249 | 107.4 (x2.83) | 52 (x0.38) | 1095 (x0.51) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478<br/>mAP: 0.37 | 7.4 (x1.12) | 250 (x1.02) | 2585 (x0.55) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479<br/>mAP: 0.37 | 15.8 (x2.40) | 127 (x0.52) | 1729 (x0.37) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.477<br/>mAP: 0.367 | 17.9 (x2.71) | 166 (x0.68) | 1637 (x0.35) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.476<br/>mAP: 0.366 | 20.4 (x3.10) | 108 (x0.44) | 1467 (x0.31) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32 | - | NDS: 0.517<br/>mAP: 0.416 | 3.0 (x2.00) | 292 (x0.17) | 5715 (x0.41) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517<br/>mAP: 0.416 | 4.9 (x3.27) | 148 (x0.09) | 3417 (x0.25) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.515<br/>mAP: 0.414 | 6.9 (x4.60) | 202 (x0.12) | 3307 (x0.24) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.514<br/>mAP: 0.413 | 8.0 (x5.33) | 131 (x0.08) | 2429 (x0.17) | RTX 3090 |

FP16 Plugins with nv_half2

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.355<br/>mAP: 0.251 | 84.2 (x2.22) | 72 (x0.53) | 1205 (x0.56) | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.354<br/>mAP: 0.250 | 108.3 (x2.86) | 52 (x0.38) | 1093 (x0.51) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479<br/>mAP: 0.371 | 18.6 (x2.82) | 124 (x0.51) | 1725 (x0.37) | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.477<br/>mAP: 0.368 | 22.9 (x3.47) | 110 (x0.45) | 1487 (x0.32) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517<br/>mAP: 0.416 | 6.6 (x4.40) | 146 (x0.09) | 3415 (x0.25) | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.516<br/>mAP: 0.415 | 8.6 (x5.73) | 159 (x0.09) | 2479 (x0.18) | RTX 3090 |

BEVDet

BEVDet PyTorch

| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVDet R50 CBGS | NuScenes | 1 | NDS: 0.38<br/>mAP: 0.298 | 29.0 | 170 | 1858 | RTX 2080Ti |

BEVDet TensorRT

With the custom plugin bev_pool_v2 (supports nv_half, nv_half2 and INT8), modified from the official BEVDet.

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVDet R50 CBGS | NuScenes | 1 | FP32 | - | NDS: 0.38<br/>mAP: 0.298 | 44.6 | 245 | 1032 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP16 | - | NDS: 0.38<br/>mAP: 0.298 | 135.1 | 86 | 790 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP32/INT8 | PTQ entropy<br />per-tensor | NDS: 0.355<br/>mAP: 0.274 | 234.7 | 44 | 706 | RTX 2080Ti |
| BEVDet R50 CBGS | NuScenes | 1 | FP16/INT8 | PTQ entropy<br />per-tensor | NDS: 0.357<br/>mAP: 0.277 | 236.4 | 44 | 706 | RTX 2080Ti |

2D Detection Models

This project also supports common 2D object detection models from MMDetection with only minor modifications. The following are deployment examples of YOLOx and CenterNet.

YOLOx

| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| YOLOx<br />download | COCO | PyTorch | 32 | FP32 | - | mAP: 0.506 | 63.1 | 379 | 7617 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32 | - | mAP: 0.506 | 71.3 (x1) | 546 (x1) | 9943 (x1) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16 | - | mAP: 0.506 | 296.8 (x4.16) | 192 (x0.35) | 4567 (x0.46) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32/INT8 | PTQ entropy<br />per-tensor | mAP: 0.488 | 556.4 (x7.80) | 99 (x0.18) | 5225 (x0.53) | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16/INT8 | PTQ entropy<br />per-tensor | mAP: 0.479 | 550.6 (x7.72) | 99 (x0.18) | 5119 (x0.51) | RTX 3090 |

CenterNet

| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CenterNet<br />download | COCO | PyTorch | 32 | FP32 | - | mAP: 0.299 | 337.4 | 56 | 5171 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32 | - | mAP: 0.299 | 475.6 (x1) | 58 (x1) | 8241 (x1) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16 | - | mAP: 0.297 | 1247.1 (x2.62) | 29 (x0.50) | 5183 (x0.63) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32/INT8 | PTQ entropy<br />per-tensor | mAP: 0.27 | 1534.0 (x3.22) | 20 (x0.34) | 6549 (x0.79) | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16/INT8 | PTQ entropy<br />per-tensor | mAP: 0.285 | 1889.0 (x3.97) | 17 (x0.29) | 6453 (x0.78) | RTX 3090 |

Clone

git clone git@github.com:DerryHub/BEVFormer_tensorrt.git
cd BEVFormer_tensorrt
PROJECT_DIR=$(pwd)

Data Preparation

MS COCO (For 2D Detection)

Download the COCO 2017 datasets to /path/to/coco and unzip them.

cd ${PROJECT_DIR}/data
ln -s /path/to/coco coco

NuScenes and CAN bus (For BEVFormer)

Download the nuScenes V1.0 full dataset and the CAN bus expansion data HERE, as /path/to/nuscenes and /path/to/can_bus.

Prepare the nuScenes data as in BEVFormer.

cd ${PROJECT_DIR}/data
ln -s /path/to/nuscenes nuscenes
ln -s /path/to/can_bus can_bus

cd ${PROJECT_DIR}
sh samples/bevformer/create_data.sh

Tree

${PROJECT_DIR}/data/.
├── can_bus
│   ├── scene-0001_meta.json
│   ├── scene-0001_ms_imu.json
│   ├── scene-0001_pose.json
│   └── ...
├── coco
│   ├── annotations
│   ├── test2017
│   ├── train2017
│   └── val2017
└── nuscenes
    ├── maps
    ├── samples
    ├── sweeps
    └── v1.0-trainval

Install

With Docker

cd ${PROJECT_DIR}
docker build -t trt85 -f docker/Dockerfile .
docker run -it --gpus all -v ${PROJECT_DIR}:/workspace/BEVFormer_tensorrt/ \
-v /path/to/can_bus:/workspace/BEVFormer_tensorrt/data/can_bus \
-v /path/to/coco:/workspace/BEVFormer_tensorrt/data/coco \
-v /path/to/nuscenes:/workspace/BEVFormer_tensorrt/data/nuscenes \
--shm-size 8G trt85 /bin/bash

# in container
cd /workspace/BEVFormer_tensorrt/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/usr
make -j$(nproc)
make install
cd /workspace/BEVFormer_tensorrt/third_party/bev_mmdet3d
python setup.py build develop --user

NOTE: You can download the Docker Image HERE.

From Source

CUDA/cuDNN/TensorRT

Download and install the CUDA-11.6/cuDNN-8.6.0/TensorRT-8.5.1.7 following NVIDIA.
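
A quick sanity check of the installed toolchain (a minimal sketch; the cuDNN header location below is the common default and may differ on your system):

nvcc --version
# cuDNN version (adjust the header path to your install)
grep -A 2 CUDNN_MAJOR /usr/include/cudnn_version.h
# TensorRT version reported by the Python bindings
python -c "import tensorrt; print(tensorrt.__version__)"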

PyTorch

Install PyTorch and TorchVision following the official instructions.

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

MMCV-full

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.5.0
pip install -r requirements/optional.txt
MMCV_WITH_OPS=1 pip install -e .

MMDetection

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout v2.25.1
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.

MMDeploy

git clone git@github.com:open-mmlab/mmdeploy.git
cd mmdeploy
git checkout v0.10.0

git clone git@github.com:NVIDIA/cub.git third_party/cub
cd third_party/cub
git checkout c3cceac115

# go back to third_party directory and git clone pybind11
cd ..
git clone git@github.com:pybind/pybind11.git pybind11
cd pybind11
git checkout 70a58c5

Build TensorRT Plugins of MMDeploy

Make sure cmake version >= 3.14.0 and gcc version >= 7.
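
For example, both can be checked with (g++-7 being the compiler passed to cmake below):

cmake --version
g++-7 --version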

export MMDEPLOY_DIR=/the/root/path/of/MMDeploy
export TENSORRT_DIR=/the/path/of/tensorrt
export CUDNN_DIR=/the/path/of/cudnn

export LD_LIBRARY_PATH=$TENSORRT_DIR/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_DIR/lib64:$LD_LIBRARY_PATH

cd ${MMDEPLOY_DIR}
mkdir -p build
cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) 
make install

Install MMDeploy

cd ${MMDEPLOY_DIR}
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.

Install this Project

cd ${PROJECT_DIR}
pip install -r requirements.txt

Build and Install Custom TensorRT Plugins

NOTE: requires CUDA >= 11.4 and a GPU with compute capability (SM) >= 7.5.
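
The compute capability of the installed GPU can be checked with PyTorch (installed above); an RTX 3090, for example, reports (8, 6), which satisfies SM >= 7.5:

python -c "import torch; print(torch.cuda.get_device_capability(0))"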

cd ${PROJECT_DIR}/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/path/to/TensorRT
make -j$(nproc)
make install

Run Unit Test of Custom TensorRT Plugins

cd ${PROJECT_DIR}
sh samples/test_trt_ops.sh

Build and Install Part of Ops in MMDetection3D

cd ${PROJECT_DIR}/third_party/bev_mmdet3d
python setup.py build develop

Prepare the Checkpoints

Download above PyTorch checkpoints to ${PROJECT_DIR}/checkpoints/pytorch/. The ONNX files and TensorRT engines will be saved in ${PROJECT_DIR}/checkpoints/onnx/ and ${PROJECT_DIR}/checkpoints/tensorrt/.
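
Assuming none of these directories exist yet, they can be created up front:

cd ${PROJECT_DIR}
mkdir -p checkpoints/pytorch checkpoints/onnx checkpoints/tensorrt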

Custom TensorRT Plugins

Custom plugins are provided for the common TensorRT ops used in BEVFormer.

Each op is implemented in two versions: FP32/FP16 (nv_half)/INT8 and FP32/FP16 (nv_half2)/INT8.

For detailed speed comparisons, see Custom TensorRT Plugins.
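
As a quick sanity check that the custom plugins are visible to TensorRT, trtexec can load the built plugin library while parsing an ONNX model. The library and ONNX paths below are only placeholders; point them at whatever make install and the pth2onnx script actually produced on your machine:

# hypothetical paths; adjust to your build output and exported ONNX file
trtexec --onnx=checkpoints/onnx/bevformer_base.onnx \
        --plugins=${PROJECT_DIR}/TensorRT/lib/libtensorrt_ops.so \
        --fp16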

Run

The following tutorial uses BEVFormer base as an example.

cd ${PROJECT_DIR}
# default gpu_id is 0
sh samples/bevformer/base/pth_evaluate.sh -d ${gpu_id}
# convert .pth to .onnx
sh samples/bevformer/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP32)
sh samples/bevformer/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16)
sh samples/bevformer/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32)
sh samples/bevformer/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16)
sh samples/bevformer/base/trt_evaluate_fp16.sh -d ${gpu_id}

# Quantization
# calibration and convert .onnx to TensorRT engine (FP32/INT8)
sh samples/bevformer/base/onnx2trt_int8.sh -d ${gpu_id}
# calibration and convert .onnx to TensorRT engine (FP16/INT8)
sh samples/bevformer/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32/INT8)
sh samples/bevformer/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16/INT8)
sh samples/bevformer/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# quantization aware train
# default gpu_ids is 0,1,2,3,4,5,6,7
sh samples/bevformer/base/quant_aware_train.sh -d ${gpu_ids}
# then following the post training quantization process
# nv_half
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP32)
sh samples/bevformer/plugin/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32)
sh samples/bevformer/plugin/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/trt_evaluate_fp16.sh -d ${gpu_id}

# nv_half2
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_2.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/onnx2trt_fp16_2.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/trt_evaluate_fp16_2.sh -d ${gpu_id}

# Quantization
# nv_half
# calibration and convert .onnx to TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8.sh -d ${gpu_id}
# calibration and convert .onnx to TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# nv_half2
# calibration and convert .onnx to TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16_2.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16_2.sh -d ${gpu_id}

Acknowledgement

This project is mainly based on these excellent open source projects: