<p align="center"> <a href="https://arxiv.org/abs/2408.16530"> <img width="350" alt="image" src="assets/image.jpg"> </a> </p> <p align="center"> <strong>Yu Wang</strong> · <strong>Shaohua Wang</strong> · <strong>Yicheng Li</strong> · <strong>Mingchun Liu</strong> </p> <p align="center"> <a href='https://arxiv.org/abs/2408.16530'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> </p>

A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions

This repository is associated with the review paper titled “A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions,” which provides an extensive overview of recent advancements in 3D object perception for autonomous driving systems. The review covers key approaches including Camera-Based Detection, LiDAR-Based Detection, and Fusion Detection Techniques.

We provide a thorough analysis of the strengths and limitations of each method, highlighting advancements in accuracy and robustness. Furthermore, the review discusses future directions such as Temporal Perception, Occupancy Grids, and End-to-End Learning Frameworks, as well as Cooperative Perception methods, which extend the perception range through collaborative communication.

This repository will be actively maintained with continuous updates on the latest advancements in 3D object detection for autonomous driving. By offering a comprehensive view of the current state and future developments, we aim to provide a valuable resource for researchers and practitioners in the field.

Overview

Abstract

In recent years, 3D object perception has become a crucial component in the development of autonomous driving systems, providing essential environmental awareness. However, as perception tasks in autonomous driving evolve, their variants have increased, leading to diverse insights from industry and academia. Currently, there is a lack of comprehensive surveys that collect and summarize these perception tasks and their developments from a broader perspective. This review extensively summarizes traditional 3D object detection methods, focusing on camera-based, LiDAR-based, and fusion detection techniques. We provide a comprehensive analysis of the strengths and limitations of each approach, highlighting advancements in accuracy and robustness. Furthermore, we discuss future directions, including methods to improve accuracy such as temporal perception, occupancy grids, and end-to-end learning frameworks. We also explore cooperative perception methods that extend the perception range through collaborative communication. By providing a holistic view of the current state and future developments in 3D object perception, we aim to offer a more comprehensive understanding of perception tasks for autonomous driving.

Key Contributions

  1. To our knowledge, this is the first survey to summarize and analyze the different development trends in environmental perception for autonomous driving, providing a holistic view of the evolution and future directions of 3D object perception.

  2. We provide a comprehensive summary, classification, and analysis of the latest methods in camera-based, LiDAR-based, and fusion-based 3D object detection.

  3. We offer a panoramic view of perception in autonomous driving environments, not only summarizing the perception methods comprehensively but also compiling datasets and evaluation metrics used by different methods to promote research insights.

Dataset Resources

Comprehensive Autonomous Driving Datasets

  1. KITTI

    • Type: Real, Lidar (L), Camera (C)
    • Use Cases: 2D/3D object detection, stereo vision, depth estimation.
    • Link: KITTI Dataset
  2. nuScenes

    • Type: Real, Lidar (L), Camera (C), Radar
    • Use Cases: 360-degree perception, object tracking, sensor fusion.
    • Link: nuScenes Dataset
  3. Waymo Open Dataset

    • Type: Real, Lidar (L), Camera (C)
    • Use Cases: 3D object detection, tracking, segmentation.
    • Link: Waymo Open Dataset

3D Occupancy Prediction Datasets

  1. Occ3D

    • Type: Real, Lidar (L), Camera (C)
    • Use Cases: 3D occupancy, semantic segmentation.
    • Link: Occ3D Dataset
  2. SemanticKITTI

    • Type: Real, Lidar (L)
    • Use Cases: Point cloud semantic segmentation, semantic scene completion.
    • Link: SemanticKITTI Dataset
  3. KITTI-360

    • Type: Real, Lidar (L), Camera (C)
    • Use Cases: 360-degree scene reconstruction, mapping, localization.
    • Link: KITTI-360 Dataset

Vehicle-Road Collaboration Datasets

  1. OPV2V / V2X-Sim / V2XSet

    • Type: Simulated, Lidar (L), Camera (C)
    • Use Cases: V2V/V2X cooperative perception.
    • Link: OPV2V Dataset
  2. DAIR-V2X / Rope3D

    • Type: Real, Lidar (L), Camera (C)
    • Use Cases: Vehicle-infrastructure cooperative 3D detection, roadside monocular 3D detection.
    • Link: Rope3D Dataset
  3. TUMTraf V2X

    • Type: Real, Lidar (L), Camera (C)
    • Use Cases: V2X cooperative perception, 3D detection and tracking.

Simulators

CARLA Simulator

LGSVL Simulator

Microsoft AirSim
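
For readers who want to try these simulators, below is a minimal sketch of driving a sensor-equipped vehicle through the CARLA Python API (assuming a CARLA 0.9.x server running on localhost:2000; the blueprint IDs and sensor settings are illustrative):

```python
import carla

# Connect to a CARLA server assumed to be running locally on the default port.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn a vehicle at one of the map's predefined spawn points.
blueprint_library = world.get_blueprint_library()
vehicle_bp = blueprint_library.filter("vehicle.tesla.model3")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)

# Attach a roof-mounted LiDAR and stream measurements to a callback.
lidar_bp = blueprint_library.find("sensor.lidar.ray_cast")
lidar_bp.set_attribute("range", "70.0")
lidar = world.spawn_actor(lidar_bp, carla.Transform(carla.Location(z=2.0)),
                          attach_to=vehicle)
lidar.listen(lambda data: print("LiDAR frame", data.frame))

vehicle.set_autopilot(True)  # let the traffic manager drive while data streams
```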

High-Quality Papers on 3D Object Detection

Camera-Based 3D Object Detection

  1. Mono3D: Monocular 3D Object Detection for Autonomous Driving

    • Description: Introduces Mono3D, a method for predicting 3D object proposals from a single image using a multi-view approach. Pioneering work in monocular 3D object detection.
    • Link: Mono3D
    • Year: 2016
  2. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

    • Description: Presents a region proposal network that directly predicts 3D bounding boxes from monocular images, advancing the state of the art in monocular 3D detection (the projection geometry these monocular methods share is sketched after this list).
    • Link: M3D-RPN
    • Year: 2019
  3. MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

    • Description: A recent method that leverages depth-aware transformers for monocular 3D object detection, achieving state-of-the-art results.
    • Link: MonoDTR
    • Year: 2022
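
All three monocular methods above ultimately reason about the same pinhole geometry: a 3D box hypothesis is consistent with the image only through its projection. A minimal NumPy sketch of that shared step, projecting a KITTI-style (center, size, yaw) box into the image (the intrinsic matrix and box values are illustrative):

```python
import numpy as np

def box3d_corners(center, size, yaw):
    """Corners of a 3D box in a KITTI-style camera frame (x right, y down,
    z forward); the box center sits on the bottom face, per KITTI convention."""
    l, w, h = size
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h], dtype=float)
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    rot = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                    [ 0.0,         1.0, 0.0        ],
                    [-np.sin(yaw), 0.0, np.cos(yaw)]])
    return rot @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

def project_to_image(pts3d, K):
    """Pinhole projection: apply intrinsics K (3x3), then divide by depth."""
    uvw = K @ pts3d
    return uvw[:2] / uvw[2]

# Illustrative values: a car 10 m ahead and a KITTI-like intrinsic matrix.
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0,   0.0,   1.0]])
corners = box3d_corners(center=(0.0, 1.65, 10.0), size=(4.0, 1.8, 1.5), yaw=0.1)
print(project_to_image(corners, K).T)  # eight (u, v) points outlining the box
```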

Lidar-Based 3D Object Detection

  1. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

    • Description: One of the first methods to directly apply deep learning on raw point clouds, using voxelization and 3D convolutional networks for 3D object detection.
    • Link: VoxelNet
    • Year: 2018
  2. PointPillars: Fast Encoders for Object Detection from Point Clouds

    • Description: Uses vertical columns (pillars) to encode point cloud data, enabling fast and efficient 3D object detection; widely used in real-time applications (a sketch of the pillar encoding follows this list).
    • Link: PointPillars
    • Year: 2019
  3. CenterPoint: Center-based 3D Object Detection and Tracking

    • Description: A more recent approach that represents objects as points and shows strong performance on both detection and tracking tasks.
    • Link: CenterPoint
    • Year: 2020
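
As referenced in the PointPillars entry, the pillar encoding amounts to quantizing points into vertical columns and scattering per-pillar features into a BEV "pseudo-image" for a 2D CNN. A minimal NumPy sketch, with mean pooling standing in for the paper's learned PointNet-style encoder (grid extents and resolution follow common KITTI settings but are otherwise illustrative):

```python
import numpy as np

def pillar_bev(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68), res=0.16):
    """Scatter a point cloud (N, 4: x, y, z, intensity) into a BEV grid of
    per-pillar mean features, the 'pseudo-image' consumed by a 2D CNN."""
    nx = int((x_range[1] - x_range[0]) / res)
    ny = int((y_range[1] - y_range[0]) / res)

    # Keep points inside the grid and compute their pillar indices.
    ix = ((points[:, 0] - x_range[0]) / res).astype(int)
    iy = ((points[:, 1] - y_range[0]) / res).astype(int)
    mask = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy, pts = ix[mask], iy[mask], points[mask]

    # Mean-pool point features per pillar (the paper learns this step instead).
    feat = np.zeros((ny, nx, pts.shape[1]))
    count = np.zeros((ny, nx, 1))
    np.add.at(feat, (iy, ix), pts)
    np.add.at(count, (iy, ix), 1)
    return feat / np.maximum(count, 1)

cloud = np.random.rand(1000, 4) * [60, 60, 3, 1] + [0, -30, -1, 0]  # toy cloud
print(pillar_bev(cloud).shape)  # (496, 432, 4) BEV pseudo-image
```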

Fusion-Based 3D Object Detection

  1. MV3D: Multi-View 3D Object Detection Network

    • Description: Combines RGB images and lidar point clouds for robust 3D object detection; a pioneering work in multi-sensor fusion (the point-to-image alignment underlying such fusion is sketched after this list).
    • Link: MV3D
    • Year: 2017
  2. AVOD: Aggregated View Object Detection in Autonomous Driving

    • Description: A two-stage object detection framework that combines lidar point clouds with RGB images, achieving high accuracy in 3D detection.
    • Link: AVOD
    • Year: 2018
  3. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

    • Description: A recent method that uses transformers to achieve robust lidar-camera fusion for 3D object detection, pushing the boundaries of fusion techniques.
    • Link: TransFusion
    • Year: 2022
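
A primitive these fusion methods share, referenced in the MV3D entry above, is aligning the two modalities by projecting LiDAR points into the camera image; each point can then be "decorated" with the image content it lands on. A minimal sketch (it assumes points already transformed into the camera frame and an illustrative intrinsic matrix; the actual per-method fusion operators are more elaborate):

```python
import numpy as np

def decorate_points(points_cam, image, K):
    """Append the RGB value under each projected LiDAR point to its features.
    points_cam: (N, 3) in the camera frame; image: (H, W, 3); K: (3, 3)."""
    pts = points_cam[points_cam[:, 2] > 0]    # keep points in front of camera
    uvw = K @ pts.T
    uv = (uvw[:2] / uvw[2]).T.astype(int)     # pixel coordinates (u, v)

    h, w = image.shape[:2]
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    pts, uv = pts[valid], uv[valid]

    rgb = image[uv[:, 1], uv[:, 0]]           # sample the image at each point
    return np.hstack([pts, rgb])              # (M, 6): x, y, z, r, g, b

# Illustrative intrinsics and random stand-ins for a real image and cloud.
K = np.array([[720.0, 0.0, 640.0], [0.0, 720.0, 360.0], [0.0, 0.0, 1.0]])
points = np.random.randn(500, 3) * [5, 2, 3] + [0, 0, 15]
image = np.random.rand(720, 1280, 3)
print(decorate_points(points, image, K).shape)
```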

High-Quality Papers on 3D Occupancy Prediction

  1. Occupancy Networks: Learning 3D Reconstruction in Function Space (OccNet)

    • Description: Introduces Occupancy Networks (OccNet), a deep learning framework that predicts continuous occupancy values at arbitrary 3D locations, enabling high-quality and detailed 3D reconstruction from sparse inputs (a query sketch follows this list).
    • Link: OccNet
    • Year: 2019
  2. Predicting Sharp and Accurate Occupancy Grids Using Variational Autoencoders

    • Description: Proposes the use of variational autoencoders (VAEs) to predict sharp and accurate 3D occupancy grids from lidar data, improving the reliability of occupancy predictions in autonomous driving.
    • Link: Predicting Sharp Occupancy Grids
    • Year: 2020
  3. Deep Occupancy Flow: 3D Motion Prediction from Partial Observations

    • Description: A recent work that combines occupancy prediction with motion flow estimation, enabling the prediction of dynamic 3D occupancy grids from partial observations, particularly useful for autonomous driving.
    • Link: Deep Occupancy Flow
    • Year: 2022
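
As noted in the OccNet entry, the core abstraction of occupancy networks is a learned function mapping a 3D coordinate (plus a scene or shape encoding) to an occupancy probability; geometry is recovered by querying many points and thresholding. A minimal PyTorch sketch (the small MLP and latent size are illustrative stand-ins for the paper's architecture):

```python
import torch
import torch.nn as nn

class OccupancyMLP(nn.Module):
    """f(point, latent) -> probability that the point is inside the surface.
    Illustrative architecture, not the paper's."""
    def __init__(self, latent_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, latent):
        # points: (B, N, 3); latent: (B, latent_dim) scene/shape encoding.
        latent = latent.unsqueeze(1).expand(-1, points.shape[1], -1)
        return torch.sigmoid(self.net(torch.cat([points, latent], -1))).squeeze(-1)

# Query a dense grid and threshold it to obtain a discrete occupancy volume.
model = OccupancyMLP()
axis = torch.linspace(-1, 1, 32)
grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), -1)
occ = model(grid.reshape(1, -1, 3), torch.randn(1, 128)) > 0.5
print(occ.reshape(32, 32, 32).sum())  # number of occupied voxels
```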

High-Quality Papers on Streaming Perception

  1. StreamYOLO: Real-Time Object Detection for Streaming Perception

    • Description: StreamYOLO is designed for real-time object detection in streaming perception scenarios, optimizing latency and accuracy by employing a novel feature alignment mechanism. It’s particularly suited for applications requiring continuous perception, such as autonomous driving.
    • Link: StreamYOLO
    • Year: 2022
  2. Towards Streaming Perception

    • Description: This paper introduces the concept of streaming perception, focusing on real-time processing of sensor data to maintain continuous perception in dynamic environments. The authors propose new benchmarks and models to handle the challenges of streaming data (the latency-aware evaluation rule is sketched after this list).
    • Link: Towards Streaming Perception
    • Year: 2020
  3. STM: SpatioTemporal Modeling for Efficient Online Video Object Detection

    • Description: STM (SpatioTemporal Modeling) enhances online video object detection by integrating spatial and temporal features to maintain high accuracy and efficiency in streaming perception scenarios, addressing the challenges of real-time processing.
    • Link: STM
    • Year: 2020
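
The pairing rule referenced in the "Towards Streaming Perception" entry is simple to state: each ground-truth frame is scored against the most recent prediction that had finished before that frame's timestamp, so inference latency directly costs accuracy. A minimal sketch (timestamps are illustrative):

```python
import bisect

def streaming_pairs(gt_times, pred_done_times):
    """For each ground-truth timestamp, return the index of the latest
    prediction whose processing finished before it (None if no prediction
    is available yet). Accuracy is then scored on these pairs."""
    pairs = []
    for t in gt_times:
        i = bisect.bisect_left(pred_done_times, t) - 1
        pairs.append(i if i >= 0 else None)
    return pairs

gt_times = [0.00, 0.03, 0.06, 0.09, 0.12]           # 30 Hz ground truth
pred_done_times = [0.05, 0.10]                      # a slow, 50 ms detector
print(streaming_pairs(gt_times, pred_done_times))   # [None, None, 0, 0, 1]
```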

High-Quality Papers on End-to-End Autonomous Driving

  1. End-to-End Driving via Conditional Imitation Learning

    • Description: This paper introduces a framework for end-to-end autonomous driving using conditional imitation learning. The model learns to predict driving actions directly from sensory input based on high-level commands, effectively bridging the gap between perception and control (the branching architecture is sketched after this list).
    • Link: Conditional Imitation Learning
    • Year: 2018
  2. ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst

    • Description: ChauffeurNet combines imitation learning with data augmentation techniques to handle challenging driving scenarios. The model learns end-to-end driving policies by imitating expert drivers while also synthesizing difficult driving situations to improve robustness.
    • Link: ChauffeurNet
    • Year: 2018
  3. Learning by Cheating

    • Description: This paper first trains a privileged agent with access to ground-truth simulator state ("cheating"), then trains a purely vision-based agent to imitate the privileged agent. Decomposing end-to-end driving into these two imitation stages substantially improves closed-loop performance in simulation.
    • Link: Learning by Cheating
    • Year: 2019
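
The command-conditioned branching referenced in the conditional imitation learning entry routes a shared perception feature through one output head per high-level command. A minimal PyTorch sketch (the encoder, layer sizes, and four-command vocabulary are illustrative simplifications of the paper's network):

```python
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    """Shared encoder + one output branch per high-level command."""
    def __init__(self, feat_dim=512, n_commands=4, n_actions=3):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the paper's conv net
            nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU(),
        )
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                          nn.Linear(256, n_actions))
            for _ in range(n_commands)
        )

    def forward(self, image, command):
        # image: (B, 3, 64, 64); command: (B,) integer in [0, n_commands).
        feat = self.encoder(image)
        out = torch.stack([b(feat) for b in self.branches], dim=1)  # (B, C, A)
        return out[torch.arange(image.shape[0]), command]  # commanded branch

policy = ConditionalPolicy()
actions = policy(torch.randn(2, 3, 64, 64), torch.tensor([0, 2]))
print(actions.shape)  # (2, 3): e.g. steer, throttle, brake per sample
```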

High-Quality Papers on Vehicle-Road Collaboration (V2X Communication)

  1. Cooperative Perception with V2X Communication: Exploring the Design Space

    • Description: This paper explores the design space for cooperative perception using V2X communication, analyzing the trade-offs between different communication strategies and their impact on perception performance in autonomous driving scenarios. It provides a comprehensive study of how vehicle-road collaboration can enhance situational awareness.
    • Link: Cooperative Perception with V2X Communication
    • Year: 2020
  2. V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction

    • Description: V2VNet introduces a framework for vehicle-to-vehicle (V2V) communication that enables joint perception and prediction across multiple vehicles, improving accuracy and robustness by sharing compressed intermediate feature maps between nearby vehicles (a warp-and-fuse sketch follows this list).
    • Link: V2VNet
    • Year: 2020
  3. V2XSet: An Extended Dataset for Vehicle-to-Everything (V2X) Cooperative Perception

    • Description: V2XSet extends existing datasets to include V2X scenarios, providing a rich resource for training and evaluating vehicle-road collaboration algorithms. The paper discusses the challenges of V2X communication and presents baseline results for cooperative perception tasks.
    • Link: V2XSet
    • Year: 2022
  4. V2X-Sim: A Simulation Dataset for Multi-Agent Collaborative Perception

    • Description: V2X-Sim is a simulation dataset designed for multi-agent collaborative perception in V2X scenarios. It provides diverse and challenging environments to train and evaluate vehicle-road collaboration models, focusing on the interaction between vehicles and infrastructure in a simulated setting.
    • Link: V2X-Sim
    • Year: 2021
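
The warp-and-fuse step referenced in the V2VNet entry brings each neighbor's broadcast BEV feature map into the ego frame before aggregation. A minimal PyTorch sketch (the planar SE(2) warp and mean aggregation are simplifications; V2VNet compresses features and aggregates with a learned GNN):

```python
import math

import torch
import torch.nn.functional as F

def warp_to_ego(feat, rel_pose, bev_extent=50.0):
    """Warp a neighbor's BEV feature map (1, C, H, W) into the ego frame.
    rel_pose = (dx, dy, yaw): neighbor pose relative to ego (metres, radians)."""
    dx, dy, yaw = rel_pose
    cos, sin = math.cos(yaw), math.sin(yaw)
    # 2x3 affine matrix in normalized grid coords ([-1, 1] spans bev_extent m).
    theta = torch.tensor([[cos, -sin, -dx / bev_extent],
                          [sin,  cos, -dy / bev_extent]]).unsqueeze(0)
    grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

def fuse(ego_feat, neighbor_feats, rel_poses):
    """Average the ego map with each warped neighbor map (V2VNet aggregates
    with a learned GNN; a mean keeps the sketch simple)."""
    warped = [warp_to_ego(f, p) for f, p in zip(neighbor_feats, rel_poses)]
    return torch.stack([ego_feat, *warped]).mean(dim=0)

ego = torch.randn(1, 64, 128, 128)
neighbor = torch.randn(1, 64, 128, 128)
print(fuse(ego, [neighbor], [(10.0, -5.0, 0.3)]).shape)  # (1, 64, 128, 128)
```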

Related Work

CVPR 2024 Papers Autonomous Driving

This repository is continuously updated. We prioritize including articles that have already been submitted to arXiv.

We kindly invite you to our platform, Auto Driving Heart, for paper interpretation and sharing. If you would like to promote your paper, please feel free to contact us.

1) End-to-End Autonomous Driving

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Visual Point Cloud Forecasting enables Scalable Autonomous Driving

PlanKD: Compressing End-to-End Motion Planner for Autonomous Driving

VLP: Vision Language Planning for Autonomous Driving

2) LLM Agent

ChatSim: Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding

One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

RegionGPT: Towards Region Understanding Vision Language Model

Towards Learning a Generalist Model for Embodied Navigation

3) SSC: Semantic Scene Completion

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

SemCity: Semantic Scene Generation with Triplane Diffusion

4) OCC: Occupancy Prediction

SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation

5) World Model

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

6) Lane Detection

Lane2Seq: Towards Unified Lane Detection via Sequence Generation

7) Pre-training

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

8) AIGC: AI-Generated Content

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

SemCity: Semantic Scene Generation with Triplane Diffusion

BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation

9) 3D Object Detection

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects

VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection

CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection

CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images

UniMODE: Unified Monocular 3D Object Detection

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection

RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features

IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

RCBEVDet: Radar-camera Fusion in Bird’s Eye View for 3D Object Detection

MonoCD: Monocular 3D Object Detection with Complementary Depths

10) Stereo Matching

MoCha-Stereo: Motif Channel Attention Network for Stereo Matching

Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching

Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching

Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching

Neural Markov Random Field for Stereo Matching

11) Cooperative Perception

RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception

12) SLAM

SNI-SLAM: Semantic Neural Implicit SLAM

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Implicit Event-RGBD Neural SLAM

13) Scene Flow Estimation

DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement

3DSFLabeling: Boosting 3D Scene Flow Estimation by Pseudo Auto Labeling

Regularizing Self-supervised 3D Scene Flows with Surface Awareness and Cyclic Consistency

14) Point Cloud

Point Transformer V3: Simpler, Faster, Stronger

Rethinking Few-shot 3D Point Cloud Semantic Segmentation

PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation

Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle

GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds

15) Efficient Network

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

RepViT: Revisiting Mobile CNN From ViT Perspective

16) Segmentation

OMG-Seg: Is One Model Good Enough For All Segmentation?

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning

17) Radar

DART: Doppler-Aided Radar Tomography

RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation

18) NeRF and Gaussian Splatting

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

Dynamic LiDAR Re-simulation using Compositional Neural Fields

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

19) MOT: Multi-Object Tracking

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking

DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking

20) Multi-label Atomic Activity Recognition

Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes

21) Motion Prediction

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

22) Trajectory Prediction

Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory

Producing and Leveraging Online Map Uncertainty in Trajectory Prediction

23) Depth Estimation

AFNet: Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

24) Event Camera

Seeing Motion at Nighttime with an Event Camera

Citation

If you find this review useful in your research, please consider citing:

@misc{wang2024comprehensivereview3dobject,
      title={A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions},
      author={Yu Wang and Shaohua Wang and Yicheng Li and Mingchun Liu},
      year={2024},
      eprint={2408.16530},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.16530},
}