<p align="center">
  <a href="https://arxiv.org/abs/2408.16530">
    <img width="350" alt="image" src="assets/image.jpg">
  </a>
</p>
<p align="center">
  <strong>Yu Wang</strong> · <strong>Shaohua Wang</strong> · <strong>Yicheng Li</strong> · <strong>Mingchun Liu</strong>
</p>
<p align="center">
  <a href='https://arxiv.org/abs/2408.16530'>
    <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'>
  </a>
</p>

# A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions
This repository is associated with the review paper titled “A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions,” which provides an extensive overview of recent advancements in 3D object perception for autonomous driving systems. The review covers key approaches including Camera-Based Detection, LiDAR-Based Detection, and Fusion Detection Techniques.
We provide a thorough analysis of the strengths and limitations of each method, highlighting advancements in accuracy and robustness. Furthermore, the review discusses future directions such as Temporal Perception, Occupancy Grids, End-to-End Learning Frameworks, and Cooperative Perception methods, which extend the perception range through collaborative communication.
This repository will be actively maintained with continuous updates on the latest advancements in 3D object detection for autonomous driving. By offering a comprehensive view of the current state and future developments in the field, we aim to provide a valuable resource for researchers and practitioners.
## Overview
- Paper Title: A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions
- Authors: Yu Wang, Shaohua Wang, Yicheng Li, Mingchun Liu
- Link to Paper: [arXiv](https://arxiv.org/abs/2408.16530)
## Abstract
In recent years, 3D object perception has become a crucial component in the development of autonomous driving systems, providing essential environmental awareness. However, as perception tasks in autonomous driving evolve, their variants have increased, leading to diverse insights from industry and academia. Currently, there is a lack of comprehensive surveys that collect and summarize these perception tasks and their developments from a broader perspective. This review extensively summarizes traditional 3D object detection methods, focusing on camera-based, LiDAR-based, and fusion detection techniques. We provide a comprehensive analysis of the strengths and limitations of each approach, highlighting advancements in accuracy and robustness. Furthermore, we discuss future directions, including methods to improve accuracy such as temporal perception, occupancy grids, and end-to-end learning frameworks. We also explore cooperative perception methods that extend the perception range through collaborative communication. By providing a holistic view of the current state and future developments in 3D object perception, we aim to offer a more comprehensive understanding of perception tasks for autonomous driving.
## Key Contributions
- To our knowledge, this is the first work to summarize and analyze the different development trends in autonomous driving environmental perception, providing a holistic view of the evolution and future trends in 3D object perception.
- We provide a comprehensive summary, classification, and analysis of the latest methods in camera-based, LiDAR-based, and fusion-based 3D object detection.
- We offer a panoramic view of perception in autonomous driving environments, not only summarizing the perception methods comprehensively but also compiling the datasets and evaluation metrics used by different methods to promote research insights.
## Dataset Resources
### Comprehensive Autonomous Driving Datasets
- **KITTI**
  - Type: Real, LiDAR (L), Camera (C)
  - Use Cases: 2D/3D object detection, stereo vision, depth estimation (see the loading sketch after this list).
  - Link: KITTI Dataset
- **nuScenes**
  - Type: Real, LiDAR (L), Camera (C), Radar
  - Use Cases: 360-degree perception, object tracking, sensor fusion.
  - Link: nuScenes Dataset
- **Waymo Open Dataset**
  - Type: Real, LiDAR (L), Camera (C)
  - Use Cases: 3D object detection, tracking, segmentation.
  - Link: Waymo Open Dataset
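As a quick illustration of working with these datasets, the sketch below loads one KITTI LiDAR scan with NumPy. The file path is hypothetical; the flat float32 (x, y, z, reflectance) layout is KITTI's documented velodyne format.

```python
import numpy as np

# Each KITTI velodyne scan is a flat binary file of float32 values,
# stored as consecutive (x, y, z, reflectance) tuples.
scan = np.fromfile("kitti/training/velodyne/000000.bin", dtype=np.float32)
points = scan.reshape(-1, 4)

# In the velodyne frame +x points forward, so this keeps points ahead of the car.
front = points[points[:, 0] > 0.0]
print(f"{len(points)} points total, {len(front)} in front of the ego vehicle")
```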
### 3D Occupancy Prediction Datasets
- **Occ3D**
  - Type: Real, LiDAR (L), Camera (C)
  - Use Cases: 3D occupancy, semantic segmentation.
  - Link: Occ3D Dataset
- **SemanticKITTI**
  - Type: Real, LiDAR (L)
  - Use Cases: Semantic segmentation, SLAM (see the label-decoding sketch after this list).
  - Link: SemanticKITTI Dataset
- **KITTI-360**
  - Type: Real, LiDAR (L), Camera (C)
  - Use Cases: 360-degree scene reconstruction, mapping, localization.
  - Link: KITTI-360 Dataset
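For SemanticKITTI in particular, every scan has a companion .label file in which each uint32 packs the semantic class in its lower 16 bits and an instance id in its upper 16 bits. A minimal decoding sketch (the path is hypothetical):

```python
import numpy as np

# One uint32 per LiDAR point: low 16 bits = semantic class id,
# high 16 bits = instance id within that class.
labels = np.fromfile("sequences/00/labels/000000.label", dtype=np.uint32)
semantic = labels & 0xFFFF
instance = labels >> 16

print("classes present in this scan:", np.unique(semantic))
```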
### Vehicle-Road Collaboration Datasets
- **OPV2V / V2X-Sim / V2XSet**
  - Type: Simulated, LiDAR (L), Camera (C)
  - Use Cases: V2V communication.
  - Link: OPV2V Dataset
- **DAIR-V2X / Rope3D**
  - Type: Real, Camera (C), GPS/IMU
  - Use Cases: Roadside perception, vehicle-infrastructure cooperative 3D detection.
  - Link: Rope3D Dataset
- **TUMTraf V2X**
  - Type: Real, LiDAR (L), Camera (C), GPS/IMU
  - Use Cases: V2I cooperative perception.
  - Link: [TUMTraf V2X Dataset](https://tum-traffic-dataset.github.io/tumtraf-v2x/)
## Simulators
### CARLA Simulator
- Description: CARLA is an open-source simulator for autonomous driving research. It provides realistic environments and supports various sensors, including cameras, LiDAR, and GPS. It is widely used for training and testing autonomous driving models in a controlled and customizable environment (see the API sketch below).
- Link: CARLA Simulator
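To give a flavor of CARLA's Python API, here is a minimal sketch that connects to a locally running server, spawns a vehicle, and attaches a roof-mounted LiDAR. The host, port, blueprint choice, and sensor settings are illustrative assumptions, not the only valid configuration.

```python
import carla

# Connect to a CARLA server assumed to be running on localhost:2000.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn an arbitrary vehicle at the first predefined spawn point.
blueprints = world.get_blueprint_library()
vehicle_bp = blueprints.filter("vehicle.*")[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

# Attach a ray-cast LiDAR above the roof and stream point clouds.
lidar_bp = blueprints.find("sensor.lidar.ray_cast")
lidar_bp.set_attribute("range", "100")
lidar = world.spawn_actor(
    lidar_bp, carla.Transform(carla.Location(z=2.5)), attach_to=vehicle
)
# Recent CARLA versions pack each point as four float32s (x, y, z, intensity).
lidar.listen(lambda data: print(f"scan with {len(data.raw_data) // 16} points"))
```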
### LGSVL Simulator
- Description: LGSVL (now known as SVL Simulator) is an autonomous vehicle simulator developed by LG Electronics. It offers high-fidelity simulations and integrates with popular frameworks like Autoware and Apollo. It's ideal for testing vehicle perception, planning, and control algorithms in complex scenarios.
- Link: SVL Simulator
### Microsoft AirSim
- Description: AirSim is a cross-platform open-source simulator developed by Microsoft. It supports the simulation of drones, cars, and other vehicles in 3D environments. AirSim is built on Unreal Engine and offers APIs for deep integration with machine learning frameworks (a minimal usage sketch follows).
- Link: Microsoft AirSim
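In the same spirit, a minimal sketch of AirSim's Python client in car mode, assuming the simulator is already running with a car vehicle; the throttle value and sleep duration are arbitrary.

```python
import time
import airsim

# Connect to an AirSim instance already running in car mode.
client = airsim.CarClient()
client.confirmConnection()
client.enableApiControl(True)

# Apply mild throttle and let the car roll for a few seconds.
controls = airsim.CarControls()
controls.throttle = 0.4
client.setCarControls(controls)
time.sleep(3)

# Read back the simulated vehicle state.
state = client.getCarState()
print("speed (m/s):", state.speed)
```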
## High-Quality Papers on 3D Object Detection
### Camera-Based 3D Object Detection
- **Mono3D: Monocular 3D Object Detection for Autonomous Driving**
  - Description: Introduces Mono3D, a method for generating 3D object proposals from a single image; a pioneering work in monocular 3D object detection.
  - Link: Mono3D
  - Year: 2016
- **M3D-RPN: Monocular 3D Region Proposal Network for Object Detection**
  - Description: Presents a region proposal network that directly predicts 3D bounding boxes from monocular images, advancing the state of the art in monocular 3D detection.
  - Link: M3D-RPN
  - Year: 2019
- **MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer**
  - Description: Leverages depth-aware transformers for monocular 3D object detection, achieving state-of-the-art results.
  - Link: MonoDTR
  - Year: 2022
### LiDAR-Based 3D Object Detection
- **VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection**
  - Description: One of the first methods to apply deep learning directly to raw point clouds, using voxelization and 3D convolutional networks for 3D object detection.
  - Link: VoxelNet
  - Year: 2018
- **PointPillars: Fast Encoders for Object Detection from Point Clouds**
  - Description: Encodes point cloud data in vertical columns (pillars), enabling fast and efficient 3D object detection; widely used in real-time applications (see the pillar-encoding sketch after this list).
  - Link: PointPillars
  - Year: 2019
- **CenterPoint: Center-based 3D Object Detection and Tracking**
  - Description: A more recent approach that represents objects as points, showing strong performance on both detection and tracking tasks.
  - Link: CenterPoint
  - Year: 2020
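The pillar encoding used by PointPillars (above) is easy to convey: discretize the ground plane into a grid, bucket points by cell, and feed each non-empty cell to a small learned encoder. The sketch below is an illustrative NumPy grouping under assumed ranges, not the paper's implementation.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.16):
    """Bucket LiDAR points (N x 4) into vertical ground-plane pillars.

    Returns a dict mapping (ix, iy) grid cells to the points that fall in
    them; PointPillars would pass each bucket through a PointNet-style
    encoder to produce a pseudo-image for a 2D detection backbone.
    """
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(np.int64)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(np.int64)
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    pillars = {}
    for i, j, p in zip(ix[keep], iy[keep], points[keep]):
        pillars.setdefault((int(i), int(j)), []).append(p)
    return pillars
```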
### Fusion-Based 3D Object Detection
- **MV3D: Multi-View 3D Object Detection Network**
  - Description: Combines RGB images and LiDAR point clouds for robust 3D object detection; a pioneering work in multi-sensor fusion.
  - Link: MV3D
  - Year: 2017
- **AVOD: Aggregated View Object Detection in Autonomous Driving**
  - Description: A two-stage detection framework that combines LiDAR point clouds with RGB images, achieving high accuracy in 3D detection.
  - Link: AVOD
  - Year: 2018
- **TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers**
  - Description: Uses transformers to achieve robust LiDAR-camera fusion for 3D object detection, pushing the boundaries of fusion techniques.
  - Link: TransFusion
  - Year: 2022
## High-Quality Papers on 3D Occupancy Prediction
- **OccNet: Occupancy Networks for 3D Reconstruction in a Single Forward Pass**
  - Description: Introduces Occupancy Networks (OccNet), a deep learning framework for predicting continuous occupancy values in 3D space, enabling high-quality, detailed 3D reconstruction from sparse inputs (a voxelization sketch follows this list).
  - Link: OccNet
  - Year: 2019
- **Predicting Sharp and Accurate Occupancy Grids Using Variational Autoencoders**
  - Description: Proposes variational autoencoders (VAEs) to predict sharp and accurate 3D occupancy grids from LiDAR data, improving the reliability of occupancy predictions in autonomous driving.
  - Link: Predicting Sharp Occupancy Grids
  - Year: 2020
- **Deep Occupancy Flow: 3D Motion Prediction from Partial Observations**
  - Description: Combines occupancy prediction with motion flow estimation, enabling dynamic 3D occupancy grids to be predicted from partial observations; particularly useful for autonomous driving.
  - Link: Deep Occupancy Flow
  - Year: 2022
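To make the occupancy-grid representation these papers predict concrete, the sketch below voxelizes a raw point cloud into a binary 3D grid; learned methods output such a grid (often probabilistic or semantic) directly from sensor data. The ranges and resolution here are arbitrary choices.

```python
import numpy as np

def occupancy_grid(points, lo=(-40.0, -40.0, -3.0), hi=(40.0, 40.0, 3.0), voxel=0.4):
    """Mark every voxel containing at least one point as occupied."""
    lo = np.asarray(lo, dtype=np.float32)
    hi = np.asarray(hi, dtype=np.float32)
    shape = np.ceil((hi - lo) / voxel).astype(np.int64)

    idx = np.floor((points[:, :3] - lo) / voxel).astype(np.int64)
    keep = np.all((idx >= 0) & (idx < shape), axis=1)

    grid = np.zeros(shape, dtype=bool)
    grid[tuple(idx[keep].T)] = True
    return grid  # here a 200 x 200 x 15 boolean array
```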
## High-Quality Papers on Streaming Perception
- **StreamYOLO: Real-Time Object Detection for Streaming Perception**
  - Description: StreamYOLO targets real-time object detection in streaming perception scenarios, optimizing latency and accuracy with a feature alignment mechanism; particularly suited to applications requiring continuous perception, such as autonomous driving.
  - Link: StreamYOLO
  - Year: 2023
- **Towards Streaming Perception**
  - Description: Introduces the concept of streaming perception, focusing on real-time processing of sensor data to maintain continuous perception in dynamic environments; the authors propose new benchmarks and models for the challenges of streaming data (the matching rule is sketched after this list).
  - Link: Towards Streaming Perception
  - Year: 2020
- **STM: SpatioTemporal Modeling for Efficient Online Video Object Detection**
  - Description: STM integrates spatial and temporal features to maintain high accuracy and efficiency in online video object detection, addressing the challenges of real-time processing.
  - Link: STM
  - Year: 2020
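The central idea of streaming evaluation (as in "Towards Streaming Perception" above) is that at every query time the benchmark scores whatever output the system has already finished computing, so latency directly costs accuracy. A toy sketch of that matching rule, with made-up timestamps:

```python
import bisect

def latest_finished(finish_times, outputs, query_t):
    """Return the newest output whose computation finished by query_t.

    Streaming benchmarks compare this (possibly stale) output against the
    ground truth at query_t, penalizing slow models automatically.
    """
    i = bisect.bisect_right(finish_times, query_t) - 1
    return outputs[i] if i >= 0 else None

# Detections for frames at 0.1 s, 0.2 s, 0.3 s, each finishing 80 ms later.
finish = [0.18, 0.28, 0.38]
outs = ["dets@0.1s", "dets@0.2s", "dets@0.3s"]
print(latest_finished(finish, outs, query_t=0.30))  # -> dets@0.2s
```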
## High-Quality Papers on End-to-End Autonomous Driving
- **End-to-End Driving via Conditional Imitation Learning**
  - Description: Introduces a framework for end-to-end autonomous driving via conditional imitation learning; the model predicts driving actions directly from sensory input conditioned on high-level commands, bridging perception and control.
  - Link: Conditional Imitation Learning
  - Year: 2018
- **ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst**
  - Description: Combines imitation learning with data augmentation to handle challenging driving scenarios; the model learns end-to-end driving policies by imitating expert drivers while synthesizing difficult situations to improve robustness.
  - Link: ChauffeurNet
  - Year: 2018
- **Learning by Cheating: Imitating Features from Graphical Environments for Real-World Reinforcement Learning**
  - Description: Presents an approach in which a model is first trained in a simulated environment with rich graphical features and then fine-tuned for the real world, transferring end-to-end driving skills from simulation to reality while leveraging the advantages of both.
  - Link: Learning by Cheating
  - Year: 2020
## High-Quality Papers on Vehicle-Road Collaboration (V2X Communication)
- **Cooperative Perception with V2X Communication: Exploring the Design Space**
  - Description: Explores the design space for cooperative perception with V2X communication, analyzing the trade-offs between communication strategies and their impact on perception performance; a comprehensive study of how vehicle-road collaboration can enhance situational awareness (a late-fusion sketch follows this list).
  - Link: Cooperative Perception with V2X Communication
  - Year: 2020
- **V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction**
  - Description: Introduces a framework for vehicle-to-vehicle (V2V) communication enabling joint perception and prediction across multiple vehicles, improving accuracy and robustness by sharing sensor data and predictions between vehicles.
  - Link: V2VNet
  - Year: 2020
- **V2XSet: An Extended Dataset for Vehicle-to-Everything (V2X) Cooperative Perception**
  - Description: Extends existing datasets to V2X scenarios, providing a rich resource for training and evaluating vehicle-road collaboration algorithms; discusses the challenges of V2X communication and presents baseline results for cooperative perception tasks.
  - Link: V2XSet
  - Year: 2022
- **V2X-Sim: A Simulation Dataset for Multi-Agent Collaborative Perception**
  - Description: A simulation dataset for multi-agent collaborative perception in V2X scenarios, providing diverse and challenging environments that focus on the interaction between vehicles and infrastructure in a simulated setting.
  - Link: V2X-Sim
  - Year: 2021
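A common baseline across these works is late fusion: each agent detects objects locally, broadcasts its boxes together with its pose, and the receiver transforms them into its own frame before merging. Below is a simplified bird's-eye-view sketch that assumes exact poses and handles only box centers; real systems also rotate headings, compensate latency, and deduplicate overlapping boxes (e.g. with NMS).

```python
import numpy as np

def to_ego_frame(centers, sender_pose, ego_pose):
    """Map BEV box centers (N x 2) from a sender's frame into the ego frame.

    Each pose is (x, y, yaw) in a shared world frame.
    """
    def rot(yaw):
        c, s = np.cos(yaw), np.sin(yaw)
        return np.array([[c, -s], [s, c]])

    # Sender frame -> world frame.
    world = centers @ rot(sender_pose[2]).T + np.asarray(sender_pose[:2])
    # World frame -> ego frame (inverse rigid transform of the ego pose).
    return (world - np.asarray(ego_pose[:2])) @ rot(ego_pose[2])

boxes = np.array([[10.0, 0.0]])                               # 10 m ahead of sender
print(to_ego_frame(boxes, (0.0, 0.0, 0.0), (5.0, 0.0, 0.0)))  # -> [[5. 0.]]
```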
## Related Work
### CVPR 2024 Papers on Autonomous Driving
This repository is continuously updated. We prioritize including articles that have already been submitted to arXiv.
We kindly invite you to visit our platform, Auto Driving Heart, for paper interpretations and sharing. If you would like to promote your paper, please feel free to contact me.
#### 1) End-to-End Autonomous Driving

- Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
- Visual Point Cloud Forecasting enables Scalable Autonomous Driving
- PlanKD: Compressing End-to-End Motion Planner for Autonomous Driving
- VLP: Vision Language Planning for Autonomous Driving
#### 2) LLM Agents

- ChatSim: Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration
- LMDrive: Closed-Loop End-to-End Driving with Large Language Models
- MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding
- One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
- PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
- RegionGPT: Towards Region Understanding Vision Language Model
- Towards Learning a Generalist Model for Embodied Navigation
#### 3) SSC: Semantic Scene Completion

- Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
- PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness
- SemCity: Semantic Scene Generation with Triplane Diffusion
#### 4) OCC: Occupancy Prediction

- SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
- Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications
- PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
#### 5) World Model

- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
#### 6) Lane Detection

- Lane2Seq: Towards Unified Lane Detection via Sequence Generation
#### 7) Pre-training

- UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
#### 8) AIGC (AI-Generated Content)

- Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
- SemCity: Semantic Scene Generation with Triplane Diffusion
  - Paper:
  - Code: https://github.com/zoomin-lee/SemCity
- BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation
#### 9) 3D Object Detection

- PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection
- SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects
- VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection
- CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection
- CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images
- UniMODE: Unified Monocular 3D Object Detection
- Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors
- SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection
- RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
- IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection
- MonoCD: Monocular 3D Object Detection with Complementary Depths
  - Paper:
  - Code: https://github.com/dragonfly606/MonoCD
#### 10) Stereo Matching

- MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
- Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching
- Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching
- Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching
- Neural Markov Random Field for Stereo Matching
#### 11) Cooperative Perception

- RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception
#### 12) SLAM

- SNI-SLAM: Semantic Neural Implicit SLAM
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition
- Implicit Event-RGBD Neural SLAM
#### 13) Scene Flow Estimation

- DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement
- 3DSFLabeling: Boosting 3D Scene Flow Estimation by Pseudo Auto Labeling
- Regularizing Self-supervised 3D Scene Flows with Surface Awareness and Cyclic Consistency
#### 14) Point Cloud

- Point Transformer V3: Simpler, Faster, Stronger
- Rethinking Few-shot 3D Point Cloud Semantic Segmentation
- PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation
- Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle
  - Paper:
  - Code: https://github.com/jihun1998/AO
- GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds
  - Paper:
  - Code: https://github.com/GLiDR-CVPR2024/GLiDR
#### 15) Efficient Networks

- Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
- RepViT: Revisiting Mobile CNN From ViT Perspective
#### 16) Segmentation

- OMG-Seg: Is One Model Good Enough For All Segmentation?
- Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
- SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation
- SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
- Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning
#### 17) Radar (Millimeter-Wave)

- DART: Doppler-Aided Radar Tomography
- RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation
#### 18) NeRF and Gaussian Splatting

- DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
- Dynamic LiDAR Re-simulation using Compositional Neural Fields
  - Paper: https://arxiv.org/pdf/2312.05247.pdf
  - Code: https://github.com/prs-eth/Dynamic-LiDAR-Resimulation
- NARUTO: Neural Active Reconstruction from Uncertain Target Observations
- DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
#### 19) MOT: Multi-Object Tracking

- Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
- DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking
#### 20) Multi-label Atomic Activity Recognition

- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
#### 21) Motion Prediction

- SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction
#### 22) Trajectory Prediction

- Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory
- Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
  - Paper: https://arxiv.org/pdf/2403.16439.pdf
  - Code: https://github.com/alfredgu001324/MapUncertaintyPrediction
#### 23) Depth Estimation

- AFNet: Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving
#### 24) Event Camera

- Seeing Motion at Nighttime with an Event Camera
## Citation
If you find this review useful in your research, please consider citing:

    @misc{wang2024comprehensivereview3dobject,
          title={A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions},
          author={Yu Wang and Shaohua Wang and Yicheng Li and Mingchun Liu},
          year={2024},
          eprint={2408.16530},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2408.16530},
    }