
A Survey on Multimodal Large Language Models for Autonomous Driving

We have added new references from CVPR 2024 to the repo; some of them come from 自动驾驶之心.

:boom: News: MAPLM (Tencent, UIUC) and LaMPilot (Purdue University) from our team have been accepted to CVPR 2024.

News: The LLVM-AD Workshop was successfully held at WACV 2024.


WACV 2024 Proceedings | arXiv | Workshop | Report by 机器之心

Summary of the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD)

Abstract

With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems built on large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite this immense potential, there is still no comprehensive understanding of the key challenges, opportunities, and future directions for applying LLMs to driving systems. In this repo, we present a systematic investigation of the field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models using LLMs, and the history of autonomous driving. We then give an overview of existing MLLM tools for driving, transportation, and map systems, together with existing datasets and benchmarks. We also summarize the works presented at the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), the first workshop of its kind on LLMs in autonomous driving. To further promote the development of this field, we discuss several important open problems in using MLLMs in autonomous driving systems that need to be solved by both academia and industry.

Awesome Papers

MLLM for Perception & Planning & Control for Autonomous Driving

Please ping us if you find any interesting new papers in this area. We will add them to the table, and all of them will be included in the next version of the survey paper.

Model	Year	Backbone	Task	Modality	Learning	Input	Output
Driving with LLMs	2023	LLaMA	Perception Control	Vision, Language	Finetuning	Vector Query	Response / Actions
Talk2BEV	2023	FlanT5-XXL Vicuna-13b	Perception Planning	Vision, Language	In-context learning	Image Query	Response
GAIA-1	2023	-	Planning	Vision, Language	Pretraining	Video Prompt	Video
DiLu	2023	GPT-3.5 GPT-4	Planning Control	Language	In-context learning	Text	Action
Drive as You Speak	2023	GPT-4	Planning	Language	In-context learning	Text	Code
Receive, Reason, and React	2023	GPT-4	Planning Control	Language	In-context learning	Text	Action
Drive Like a Human	2023	GPT-3.5	Planning Control	Language	In-context learning	Text	Action
GPT-Driver	2023	GPT-3.5	Planning	Vision, Language	In-context learning	Text	Trajectory
SurrealDriver	2023	GPT-4	Planning Control	Language	In-context learning	Text	Text / Action
LanguageMPC	2023	GPT-3.5	Planning	Language	In-context learning	Text	Action
DriveGPT4	2023	Llama 2	Planning Control	Vision, Language	In-context learning	Image Text Action	Text / Action
Domain Knowledge Distillation from LLMs	2023	GPT-3.5	Text Generation	Language	In-context learning	Text	Concept
LaMPilot	2023	GPT-4 / LLaMA-2 / PaLM2	Planning (Code Generation)	Language	In-context learning	Text	Code as action
Language Agent	2023	GPT-3.5	Planning	Language	Training	Text	Action
LMDrive	2023	CARLA + LLaVA	Planning Control	Vision, Language	Training	RGB Image LiDAR Text	Control Signal
On the Road with GPT-4V(ision)	2023	GPT-4Vision	Perception	Vision, Language	In-context learning	RGB Image Text	Text Description
DriveLLM	2023	GPT-4	Planning Control	Language	In-context learning	Text	Action
DriveMLM	2023	LLaMA+Q-Former	Perception Planning	Vision, Language	Training	RGB Image LiDAR Text	Decision State
DriveLM	2023	GVQA	Perception Planning	Vision, Language	Training	RGB Image Text	Text / Action
LangProp	2024	IL, DAgger, RL + ChatGPT	Planning (Code/Action Generation)	CARLA simulator Vision, Language	Training	CARLA simulator Text	Code as action
LimSim++	2024	LimSim, GPT-4	Planning	Simulator BEV, Language	In-context learning	Simulator Vision, Language	Text / Action
DriveVLM	2024	Qwen-VL	Planning	Vision, Language	Training	Sequence of Images, Language	Text / Action
RAG-Driver	2024	Vicuna1.5-7B	Planning Control	Vision, Language	Training	Video, Language	Text / Action
ChatSim	2024	GPT-4	Perception (Image Editing)	Vision, Language	In-context learning	Image, Language	Image
VLP	2024	CLIP Text Encoder	Planning	Vision, Language	Training	Image, Language	Text / Action

Datasets

This table is inspired by the "Comparison and stats" section of DriveLM.

Dataset	Base Dataset	Language Form	Perspectives	Scale	Release?
BDD-X 2018	BDD	Description	Planning Description & Justification	8M frames, 20k text strings	:heavy_check_mark:
HAD HRI Advice 2019	HDD	Advice	Goal-oriented & stimulus-driven advice	5,675 video clips, 45k text strings	:heavy_check_mark:
Talk2Car 2019	nuScenes	Description	Goal Point Description	30K frames, 10K text strings	:heavy_check_mark:
SUTD-TrafficQA 2021	Self-Collected	QA	QA	10k frames, 62k text strings	:heavy_check_mark:
DRAMA 2022	Self-Collected	Description	QA + Captions	18k frames, 100k text strings	:heavy_check_mark:
nuScenes-QA 2023	nuScenes	QA	Perception Result	30K frames, 460K generated QA pairs	nuScenes-QA
Reason2Drive 2023	nuScenes, Waymo, ONCE	QA	Perception, Prediction and Reasoning	600K video-text pairs	Reason2Drive
Rank2Tell 2023	Self-Collected	QA	Risk Localization and Ranking	116 video clips (20s each)	Rank2Tell
DriveLM 2023	nuScenes	QA + Scene Description	Perception, Prediction and Planning with Logic	30K frames, 360k annotated QA pairs	DriveLM
MAPLM 2023	THMA	QA + Scene Description	Perception, Prediction and HD Map Annotation	2M frames, 16M annotated HD map descriptions + 13K released QA pairs	MAPLM
LingoQA 2023	Collected by Wayve	QA	Perception and Planning	28K frames, 419.9K QA + Captioning	LingoQA

Other Survey Papers

Survey	Year	Focus
Vision Language Models in Autonomous Driving and Intelligent Transportation Systems	2023	Vision-Language Models for Transportation Systems
LLM4Drive: A Survey of Large Language Models for Autonomous Driving	2023	Language Models for Autonomous Driving
Towards Knowledge-driven Autonomous Driving	2023	Summary on how to use large language models, world models, and neural rendering to contribute to a more holistic, adaptive, and intelligent autonomous driving system.
Applications of Large Scale Foundation Models for Autonomous Driving	2023	Large Scale Foundation Models (LLMs, VLMs, VFMs, World Models) for Autonomous Driving
Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies	2023	Closed-Loop Autonomous Driving
A Survey on Autonomous Driving Datasets: Data Statistic, Annotation, and Outlook	2024	Autonomous Driving Datasets
A Survey for Foundation Models in Autonomous Driving	2024	Multimodal Foundation Models for Autonomous Driving

Papers Accepted by WACV 2024 LLVM-AD

A Survey on Multimodal Large Language Models for Autonomous Driving

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

A Game of Bundle Adjustment - Learning Efficient Convergence (accepted as a tech report for the authors' ICCV 2023 paper)

VLAAD: Vision and Language Assistant for Autonomous Driving

A Safer Vision-based Autonomous Planning System for Quadrotor UAVs with Dynamic Obstacle Trajectory Prediction and Its Application with LLMs

Human-Centric Autonomous Systems With LLMs for User Command Reasoning

NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations

Latency Driven Spatially Sparse Optimization for Multi-Branch CNNs for Semantic Segmentation

LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization

Future Directions Section

Social Behavior for Autonomous Driving (UIUC, Purdue University)

Personalized Autonomous Driving (Purdue University, UIUC)

Hardware Support for LLMs in Autonomous Driving (SambaNova Systems)

LLMs for HD Maps (Tencent)

Code as Action for Autonomous Driving (Purdue University, UIUC)

Citation

If the survey and our workshop inspire you, please cite our work:

@inproceedings{cui2024survey,
  title={A survey on multimodal large language models for autonomous driving},
  author={Cui, Can and Ma, Yunsheng and Cao, Xu and Ye, Wenqian and Zhou, Yang and Liang, Kaizhao and Chen, Jintai and Lu, Juanwu and Yang, Zichong and Liao, Kuei-Da and others},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={958--979},
  year={2024}
}