A Survey on Multimodal Large Language Models for Autonomous Driving
We have added new references from CVPR 2024 to the repo; some of them come from 自动驾驶之心.
:boom: News: MAPLM (Tencent, UIUC) and LaMPilot (Purdue University) from our team have been accepted to CVPR 2024.
News: The LLVM-AD Workshop was successfully held at WACV 2024.
WACV 2024 Proceedings | arXiv | Workshop | Report by 机器之心
Summary of the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD)
Abstract
With the emergence of Large Language Models (LLMs) and Vision Foundation Models (VFMs), multimodal AI systems built on large models have the potential to perceive the real world, make decisions, and control tools as humans do. In recent months, LLMs have attracted widespread attention in autonomous driving and map systems. Despite this immense potential, there is still no comprehensive understanding of the key challenges, opportunities, and future directions for applying LLMs to driving systems. In this repo, we present a systematic investigation of this field. We first introduce the background of Multimodal Large Language Models (MLLMs), the development of multimodal models using LLMs, and the history of autonomous driving. Then, we give an overview of existing MLLM tools for driving, transportation, and map systems, together with existing datasets and benchmarks. Moreover, we summarize the works presented at the 1st WACV Workshop on Large Language and Vision Models for Autonomous Driving (LLVM-AD), which is the first workshop of its kind on LLMs in autonomous driving. To further promote the development of this field, we also discuss several important open problems in using MLLMs in autonomous driving systems that need to be solved by both academia and industry.
Awesome Papers
MLLM for Perception & Planning & Control for Autonomous Driving
Please ping us if you find any interesting new papers in this area. We will add them to the table, and all of them will be included in the next version of the survey paper.
Model | Year | Backbone | Task | Modality | Learning | Input | Output |
---|---|---|---|---|---|---|---|
Driving with LLMs | 2023 | LLaMA | Perception, Control | Vision, Language | Finetuning | Vector, Query | Response / Actions |
Talk2BEV | 2023 | Flan5XXL, Vicuna-13b | Perception, Planning | Vision, Language | In-context learning | Image, Query | Response |
GAIA-1 | 2023 | - | Planning | Vision, Language | Pretraining | Video, Prompt | Video |
DiLu | 2023 | GPT-3.5, GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
Drive as You Speak | 2023 | GPT-4 | Planning | Language | In-context learning | Text | Code |
Receive, Reason, and React | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
Drive Like a Human | 2023 | GPT-3.5 | Planning, Control | Language | In-context learning | Text | Action |
GPT-Driver | 2023 | GPT-3.5 | Planning | Vision, Language | In-context learning | Text | Trajectory |
SurrealDriver | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Text / Action |
LanguageMPC | 2023 | GPT-3.5 | Planning | Language | In-context learning | Text | Action |
DriveGPT4 | 2023 | Llama 2 | Planning, Control | Vision, Language | In-context learning | Image, Text, Action | Text / Action |
Domain Knowledge Distillation from LLMs | 2023 | GPT-3.5 | Text Generation | Language | In-context learning | Text | Concept |
LaMPilot | 2023 | GPT-4 / LLaMA-2 / PaLM2 | Planning (Code Generation) | Language | In-context learning | Text | Code as action |
Language Agent | 2023 | GPT-3.5 | Planning | Language | Training | Text | Action |
LMDrive | 2023 | CARLA + LLaVA | Planning, Control | Vision, Language | Training | RGB Image, LiDAR, Text | Control Signal |
On the Road with GPT-4V(ision) | 2023 | GPT-4Vision | Perception | Vision, Language | In-context learning | RGB Image, Text | Text Description |
DriveLLM | 2023 | GPT-4 | Planning, Control | Language | In-context learning | Text | Action |
DriveMLM | 2023 | LLaMA+Q-Former | Perception, Planning | Vision, Language | Training | RGB Image, LiDAR, Text | Decision State |
DriveLM | 2023 | GVQA | Perception, Planning | Vision, Language | Training | RGB Image, Text | Text / Action |
LangProp | 2024 | IL, DAgger, RL + ChatGPT | Planning (Code/Action Generation) | CARLA simulator Vision, Language | Training | CARLA simulator Text | Code as action |
LimSim++ | 2024 | LimSim, GPT-4 | Planning | Simulator BEV, Language | In-context learning | Simulator Vision, Language | Text / Action |
DriveVLM | 2024 | Qwen-VL | Planning | Sequence of Images, Language | Training | Vision, Language | Text / Action |
RAG-Driver | 2024 | Vicuna1.5-7B | Planning, Control | Video, Language | Training | Vision, Language | Text / Action |
ChatSim | 2024 | GPT-4 | Perception (Image Editing) | Image, Language | In-context learning | Vision, Language | Image |
VLP | 2024 | CLIP Text Encoder | Planning | Image, Language | Training | Vision, Language | Text / Action |
Datasets
This table is inspired by the comparison and statistics in DriveLM.
Dataset | Base Dataset | Language Form | Perspectives | Scale | Release? |
---|---|---|---|---|---|
BDD-X 2018 | BDD | Description | Planning Description & Justification | 8M frames, 20k text strings | :heavy_check_mark: |
HAD HRI Advice 2019 | HDD | Advice | Goal-oriented & stimulus-driven advice | 5,675 video clips, 45k text strings | :heavy_check_mark: |
Talk2Car 2019 | nuScenes | Description | Goal Point Description | 30K frames, 10K text strings | :heavy_check_mark: |
SUTD-TrafficQA 2021 | Self-Collected | QA | QA | 10k frames, 62k text strings | :heavy_check_mark: |
DRAMA 2022 | Self-Collected | Description | QA + Captions | 18k frames, 100k text strings | :heavy_check_mark: |
nuScenes-QA 2023 | nuScenes | QA | Perception Result | 30K frames, 460K generated QA pairs | nuScenes-QA |
Reason2Drive 2023 | nuScenes, Waymo, ONCE | QA | Perception, Prediction and Reasoning | 600K video-text pairs | Reason2Drive |
Rank2Tell 2023 | Self-Collected | QA | Risk Localization and Ranking | 116 video clips (20s each) | Rank2Tell |
DriveLM 2023 | nuScenes | QA + Scene Description | Perception, Prediction and Planning with Logic | 30K frames, 360k annotated QA pairs | DriveLM |
MAPLM 2023 | THMA | QA + Scene Description | Perception, Prediction and HD Map Annotation | 2M frames, 16M annotated HD map descriptions + 13K released QA pairs | MAPLM |
LingoQA 2023 | Collected by Wayve | QA | Perception and Planning | 28K frames, 419.9K QA + Captioning | LingoQA |
Other Survey Papers
Papers Accepted by WACV 2024 LLVM-AD
A Survey on Multimodal Large Language Models for Autonomous Driving
Drive Like a Human: Rethinking Autonomous Driving with Large Language Models
A Game of Bundle Adjustment - Learning Efficient Convergence (accepted as a tech report for the authors' ICCV 2023 paper)
VLAAD: Vision and Language Assistant for Autonomous Driving
Human-Centric Autonomous Systems With LLMs for User Command Reasoning
Latency Driven Spatially Sparse Optimization for Multi-Branch CNNs for Semantic Segmentation
LIP-Loc: LiDAR Image Pretraining for Cross-Modal Localization
Future Directions Section
Social Behavior for Autonomous Driving (UIUC, Purdue University)
Personalized Autonomous Driving (Purdue University, UIUC)
Hardware Support for LLMs in Autonomous Driving (SambaNova Systems)
LLMs for HD Maps (Tencent)
Code as Action for Autonomous Driving (Purdue University, UIUC)
Citation
If the survey and our workshop inspire you, please cite our work:
@inproceedings{cui2024survey,
title={A survey on multimodal large language models for autonomous driving},
author={Cui, Can and Ma, Yunsheng and Cao, Xu and Ye, Wenqian and Zhou, Yang and Liang, Kaizhao and Chen, Jintai and Lu, Juanwu and Yang, Zichong and Liao, Kuei-Da and others},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={958--979},
year={2024}
}