
A Survey on Multimodal Large Language Models for Autonomous Driving

We have added new references from CVPR 2024 to this repo; some references are from 自动驾驶之心 (Autonomous Driving Heart).

:boom: News: MAPLM (Tencent, UIUC) and LaMPilot (Purdue University) from our team have been accepted to CVPR 2024.

News: The LLVM-AD Workshop was successfully held at WACV 2024.

[On-site photo: the LLVM-AD Workshop at WACV 2024]

WACV 2024 Proceedings | arXiv | Workshop | Report by 机器之心

Summary of the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD)

Abstract

With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems that benefit from large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite their immense potential, there is still a lack of a comprehensive understanding of the key challenges, opportunities, and future directions for applying LLMs in driving systems. In this repo, we present a systematic investigation of this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models built on LLMs, and the history of autonomous driving. Then, we overview existing MLLM tools for driving, transportation, and map systems, together with existing datasets and benchmarks. Moreover, we summarize the works presented at the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), the first workshop of its kind on LLMs in autonomous driving. To further promote the development of this field, we also discuss several important problems that both academia and industry need to solve before MLLMs can be applied in autonomous driving systems.

Awesome Papers

MLLMs for Perception, Planning, and Control in Autonomous Driving

Please ping us if you find any interesting new papers in this area; we will add them to the table below, and they will be included in the next version of the survey paper.

| Model | Year | Backbone | Task | Modality | Learning | Input | Output |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Driving with LLMs | 2023 | LLaMA | Perception, Control | Vision, Language | Finetuning | Vector, Query | Response / Actions |
| Talk2BEV | 2023 | Flan5XXL, Vicuna-13b | Perception, Planning | Vision, Language | In-context learning | Image, Query | Response |
| GAIA-1 | 2023 | - | Planning | Vision, Language | Pretraining | Video, Prompt | Video |
| Dilu | 2023 | GPT-3.5, GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
| Drive as You Speak | 2023 | GPT-4 | Planning | Language | In-context learning | Text | Code |
| Receive, Reason, and React | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
| Drive Like a Human | 2023 | GPT-3.5 | Planning, Control | Language | In-context learning | Text | Action |
| GPT-Driver | 2023 | GPT-3.5 | Planning | Vision, Language | In-context learning | Text | Trajectory |
| SurrealDriver | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Text / Action |
| LanguageMPC | 2023 | GPT-3.5 | Planning | Language | In-context learning | Text | Action |
| DriveGPT4 | 2023 | Llama 2 | Planning, Control | Vision, Language | In-context learning | Image, Text, Action | Text / Action |
| Domain Knowledge Distillation from LLMs | 2023 | GPT-3.5 | Text Generation | Language | In-context learning | Text | Concept |
| LaMPilot | 2023 | GPT-4 / LLaMA-2 / PaLM 2 | Planning (Code Generation) | Language | In-context learning | Text | Code as action |
| Language Agent | 2023 | GPT-3.5 | Planning | Language | Training | Text | Action |
| LMDrive | 2023 | CARLA + LLaVA | Planning, Control | Vision, Language | Training | RGB Image, LiDAR, Text | Control Signal |
| On the Road with GPT-4V(ision) | 2023 | GPT-4V | Perception | Vision, Language | In-context learning | RGB Image, Text | Text Description |
| DriveLLM | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
| DriveMLM | 2023 | LLaMA + Q-Former | Perception, Planning | Vision, Language | Training | RGB Image, LiDAR, Text | Decision State |
| DriveLM | 2023 | GVQA | Perception, Planning | Vision, Language | Training | RGB Image, Text | Text / Action |
| LangProp | 2024 | IL, DAgger, RL + ChatGPT | Planning (Code/Action Generation) | CARLA simulator, Vision, Language | Training | CARLA simulator, Text | Code as action |
| LimSim++ | 2024 | LimSim, GPT-4 | Planning | Simulator BEV, Language | In-context learning | Simulator Vision, Language | Text / Action |
| DriveVLM | 2024 | Qwen-VL | Planning | Sequence of Images, Language | Training | Vision, Language | Text / Action |
| RAG-Driver | 2024 | Vicuna1.5-7B | Planning, Control | Video, Language | Training | Vision, Language | Text / Action |
| ChatSim | 2024 | GPT-4 | Perception (Image Editing) | Image, Language | In-context learning | Vision, Language | Image |
| VLP | 2024 | CLIP Text Encoder | Planning | Image, Language | Training | Vision, Language | Text / Action |
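
Many of the rows above share a simple in-context-learning recipe: a textual scene description plus a few demonstrations go into an off-the-shelf LLM, and a discrete driving action comes back. The sketch below is a hypothetical, minimal illustration of that pattern only; the `SceneDescription` fields, the prompt wording, and the `call_llm` hook are our own placeholders, not the API of any listed paper.

```python
# Minimal sketch (hypothetical, not from any paper in the table) of the common
# in-context-learning pattern: text scene description in, discrete action out.
from dataclasses import dataclass


@dataclass
class SceneDescription:
    ego_speed_mps: float
    lane: str
    lead_vehicle_gap_m: float


FEW_SHOT_EXAMPLES = """\
Scene: ego at 20 m/s in the right lane, lead vehicle 15 m ahead and braking.
Action: DECELERATE

Scene: ego at 10 m/s in the left lane, no vehicle within 50 m.
Action: ACCELERATE
"""


def build_prompt(scene: SceneDescription) -> str:
    """Format the scene as text and prepend few-shot examples (in-context learning)."""
    return (
        "You are a driving planner. Reply with one of: "
        "ACCELERATE, DECELERATE, KEEP, CHANGE_LANE_LEFT, CHANGE_LANE_RIGHT.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Scene: ego at {scene.ego_speed_mps} m/s in the {scene.lane} lane, "
        f"lead vehicle {scene.lead_vehicle_gap_m} m ahead.\n"
        "Action:"
    )


def plan(scene: SceneDescription, call_llm) -> str:
    """`call_llm` is a placeholder for any chat-completion backend (GPT-3.5/4, LLaMA, ...)."""
    reply = call_llm(build_prompt(scene))
    return reply.strip().split()[0]  # keep only the action token


if __name__ == "__main__":
    fake_llm = lambda prompt: "DECELERATE"  # stub so the sketch runs offline
    print(plan(SceneDescription(20.0, "right", 12.0), fake_llm))
```

Any chat-completion backend can be dropped in for `fake_llm`; the stub only keeps the sketch runnable offline.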

Datasets

This table is inspired by the comparison and statistics in DriveLM.

| Dataset | Base Dataset | Language Form | Perspectives | Scale | Release? |
| --- | --- | --- | --- | --- | --- |
| BDD-X 2018 | BDD | Description | Planning Description & Justification | 8M frames, 20k text strings | :heavy_check_mark: |
| HAD HRI Advice 2019 | HDD | Advice | Goal-oriented & stimulus-driven advice | 5,675 video clips, 45k text strings | :heavy_check_mark: |
| Talk2Car 2019 | nuScenes | Description | Goal Point Description | 30K frames, 10K text strings | :heavy_check_mark: |
| SUTD-TrafficQA 2021 | Self-Collected | QA | QA | 10k frames, 62k text strings | :heavy_check_mark: |
| DRAMA 2022 | Self-Collected | Description | QA + Captions | 18k frames, 100k text strings | :heavy_check_mark: |
| nuScenes-QA 2023 | nuScenes | QA | Perception Result | 30K frames, 460K generated QA pairs | nuScenes-QA |
| Reason2Drive 2023 | nuScenes, Waymo, ONCE | QA | Perception, Prediction and Reasoning | 600K video-text pairs | Reason2Drive |
| Rank2Tell 2023 | Self-Collected | QA | Risk Localization and Ranking | 116 video clips (20s each) | Rank2Tell |
| DriveLM 2023 | nuScenes | QA + Scene Description | Perception, Prediction and Planning with Logic | 30K frames, 360k annotated QA pairs | DriveLM |
| MAPLM 2023 | THMA | QA + Scene Description | Perception, Prediction and HD Map Annotation | 2M frames, 16M annotated HD map descriptions + 13K released QA pairs | MAPLM |
| LingoQA 2023 | Collected by Wayve | QA | Perception and Planning | 28K frames, 419.9K QA + Captioning | LingoQA |

Other Survey Papers

| Survey | Year | Focus |
| --- | --- | --- |
| Vision Language Models in Autonomous Driving and Intelligent Transportation Systems | 2023 | Vision-Language Models for Transportation Systems |
| LLM4Drive: A Survey of Large Language Models for Autonomous Driving | 2023 | Language Models for Autonomous Driving |
| Towards Knowledge-driven Autonomous Driving | 2023 | How large language models, world models, and neural rendering can contribute to a more holistic, adaptive, and intelligent autonomous driving system |
| Applications of Large Scale Foundation Models for Autonomous Driving | 2023 | Large-Scale Foundation Models (LLMs, VLMs, VFMs, World Models) for Autonomous Driving |
| Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies | 2023 | Closed-Loop Autonomous Driving |
| A Survey on Autonomous Driving Datasets: Data Statistic, Annotation, and Outlook | 2024 | Autonomous Driving Datasets |
| A Survey for Foundation Models in Autonomous Driving | 2024 | Multimodal Foundation Models for Autonomous Driving |

Papers Accepted by WACV 2024 LLVM-AD

A Survey on Multimodal Large Language Models for Autonomous Driving

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

A Game of Bundle Adjustment - Learning Efficient Convergence (accepted as a tech report accompanying the authors' ICCV 2023 paper)

VLAAD: Vision and Language Assistant for Autonomous Driving

A Safer Vision-based Autonomous Planning System for Quadrotor UAVs with Dynamic Obstacle Trajectory Prediction and Its Application with LLMs

Human-Centric Autonomous Systems With LLMs for User Command Reasoning

NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations

Latency Driven Spatially Sparse Optimization for Multi-Branch CNNs for Semantic Segmentation

LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization

Future Directions

Social Behavior for Autonomous Driving (UIUC, Purdue University)

Personalized Autonomous Driving (Purdue University, UIUC)

Hardware Support for LLMs in Autonomous Driving (SambaNova Systems)

LLMs for HD Maps (Tencent)

Code as Action for Autonomous Driving (Purdue University, UIUC)
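
As a rough illustration of the "code as action" direction above (in the spirit of LaMPilot and LangProp, but not their actual APIs), the hypothetical sketch below executes an LLM-generated snippet against a toy, whitelisted vehicle interface; `VehicleAPI` and `execute_code_action` are made-up names used for illustration only.

```python
# Hypothetical sketch of "code as action": the LLM returns a short Python
# program against a small vehicle interface, and the system executes it.
# This is NOT the API of any listed paper or benchmark.

class VehicleAPI:
    """Toy stand-in for a real driving stack; every method just logs."""

    def set_speed(self, mps: float) -> None:
        print(f"[vehicle] target speed set to {mps} m/s")

    def change_lane(self, direction: str) -> None:
        print(f"[vehicle] changing lane: {direction}")


def execute_code_action(llm_generated_code: str, vehicle: VehicleAPI) -> None:
    """Run LLM-generated code in a restricted namespace that exposes only `vehicle`."""
    restricted_globals = {"__builtins__": {}}  # no builtins -> no file or network access
    exec(llm_generated_code, restricted_globals, {"vehicle": vehicle})


if __name__ == "__main__":
    # In a real system this string would come from the LLM given a user command
    # such as "overtake the car in front"; here it is hard-coded for illustration.
    generated = (
        "vehicle.change_lane('left')\n"
        "vehicle.set_speed(30.0)\n"
    )
    execute_code_action(generated, VehicleAPI())
```

A real system would of course need proper sandboxing, verification, and safety checks before executing generated code; the restricted `exec` namespace here is only a gesture in that direction.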

Citation

If the survey and our workshop inspire you, please cite our work:

@inproceedings{cui2024survey,
  title={A survey on multimodal large language models for autonomous driving},
  author={Cui, Can and Ma, Yunsheng and Cao, Xu and Ye, Wenqian and Zhou, Yang and Liang, Kaizhao and Chen, Jintai and Lu, Juanwu and Yang, Zichong and Liao, Kuei-Da and others},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={958--979},
  year={2024}
}