Awesome-Remote-Sensing-Multimodal-Large-Language-Models

🔥🔥🔥 Multimodal Large Language Models for Remote Sensing: A Survey
[Project Page] This Page

School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University

<div align='center'> :sparkles: The <b>first survey</b> for Multimodal Large Language Models for Remote Sensing (RS-MLLMs). </div>

✨✨✨ Behold our meticulously curated trove of RS-MLLMs resources!!!

🎉🚀💡 The website will be updated in real time to track the latest state of RS-MLLMs!!!

πŸ“‘πŸ“šπŸ” Feast your eyes on an assortment of model architecture, training pipelines, datasets, comprehensive evaluation benchmarks, intelligent agents for remote sensing, techniques for instruction tuning, and much more.

🌟🔥📢 A collection of remote sensing multimodal large language model papers focusing on the vision-language domain.

<p align="center"> <img src="./images/1-timeline.jpg" width="100%" height="100%"> </p>

<font size=7><div align='center' > :apple: Multimodal Large Language Models for Remote Sensing </div></font>

<p align="center"> <img src="./images/6-timeline-agent.jpg" width="70%" height="100%"> </p>

<font size=7><div align='center' > :apple: Intelligent Agents for Remote Sensing </div></font>

Please share a <font color='orange'>STAR ⭐</font> if this project helps you.

📢 Latest Updates

In this repository, we collect and document researchers and their outstanding work on remote sensing multimodal large language models (vision-language).


<font size=5><center><b> Table of Contents </b> </center></font>


Awesome Papers

Multimodal Large Language Models for Remote Sensing

| Title | Venue | Date | Code | Note |
| --- | --- | --- | --- | --- |
| SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding <br> J. Luo et al. | arXiv | 2024-06-14 | Github | - |
| RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery <br> Y. Bazi, L. Bashmal, M. M. Al Rahhal, R. Ricci, and F. Melgani. | Remote Sensing | 2024-04-23 | Github | - |
| H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model <br> C. Pang, W. Jiang, L. Jiayu, L. Yi, S. Jiaxing, L. Weijia, W. Xingxing, W. Shuai, F. Litong, X. Guisong, H. Conghui. | arXiv | 2024-03-29 | Github | - |
| Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery <br> W. Zhang, M. Cai, T. Zhang, G. Lei, Y. Zhuang, and X. Mao. | arXiv | 2024-03-06 | - | - |
| Large Language Models for Captioning and Retrieving Remote Sensing Images <br> J. D. Silva, J. Magalhaes, and D. Tuia. | arXiv | 2024-02-09 | - | - |
| LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model <br> D. Muhtar, Z. Li, F. Gu, X. Zhang, and P. Xiao. | arXiv | 2024-02-04 | Github | - |
| EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain <br> W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao. | arXiv | 2024-01-30 | Github | - |
| SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model <br> Y. Zhan, Z. Xiong, and Y. Yuan. | arXiv | 2024-01-18 | Github | Dataset |
| GeoChat: Grounded Large Vision-Language Model for Remote Sensing <br> K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan. | arXiv | 2023-11-24 | Github | Accepted by CVPR-24 |
| RSGPT: A Remote Sensing Vision Language Model and Benchmark <br> Y. Hu, J. Yuan, and C. Wen. | arXiv | 2023-07-28 | Github | - |

Intelligent Agents for Remote Sensing

| Title | Venue | Date | Code | Note |
| --- | --- | --- | --- | --- |
| RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents <br> W. Xu, Z. Yu, Y. Wang, J. Wang, and M. Peng. | arXiv | 2024-06-11 | - | - |
| GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots <br> S. Singh, M. Fore, D. Stamoulis, and D. Group. | arXiv | 2024-04-23 | - | - |
| Evaluating Tool-Augmented Agents in Remote Sensing Platforms <br> S. Singh, M. Fore, and D. Stamoulis. | arXiv | 2024-04-23 | - | - |
| Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis <br> C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi. | arXiv | 2024-04-01 | Github | - |
| Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models <br> H. Guo, X. Su, C. Wu, B. Du, L. Zhang, and D. Li. | arXiv | 2024-01-17 | Github | - |
| Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis <br> S. Du, S. Tang, W. Wang, X. Li, and R. Guo. | arXiv | 2023-10-07 | - | - |

Vision-Language Pre-training Models for Remote Sensing

| Title | Venue | Date | Code | Note |
| --- | --- | --- | --- | --- |
| RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing <br> Z. Zhang, T. Zhao, Y. Guo, and J. Yin. | arXiv | 2024-01-02 | Github | - |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing <br> F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, and J. Zhou. | T-GRS | 2024-04-18 | Github | arXiv |
| Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment <br> U. Mall, C. P. Phoo, M. K. Liu, C. Vondrick, B. Hariharan, and K. Bala. | ICLR | 2024-01-16 | Project | arXiv |
| RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision <br> X. Li, C. Wen, Y. Hu, and N. Zhou. | JAG | 2023-09-18 | Github | - |
| Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval <br> Y. Yuan, Y. Zhan, and Z. Xiong. | T-GRS | 2023-08-28 | Github | arXiv |

Survey Papers for Remote Sensing Vision-Language Tasks

| Title | Venue | Date | Code | Note |
| --- | --- | --- | --- | --- |
| Towards Vision-Language Geo-Foundation Model: A Survey <br> Y. Zhou, L. Feng, Y. Ke, X. Jiang, J. Yan, and W. Zhang. | arXiv | 2024-06-13 | Github | arXiv |
| Vision-Language Models in Remote Sensing: Current progress and future trends <br> X. Li, C. Wen, Y. Hu, Z. Yuan, and X. X. Zhu. | MGRS | 2024-04-22 | - | - |
| Language Integration in Remote Sensing: Tasks, datasets, and future directions <br> L. Bashmal, Y. Bazi, F. Melgani, M. M. Al Rahhal, and M. A. Al Zuair. | MGRS | 2023-10-11 | - | - |
| Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey <br> L. Jiao et al. | JSTARS | 2023-09-18 | - | - |

Others

| Title | Venue | Date | Code | Note |
| --- | --- | --- | --- | --- |
| On the Foundations of Earth and Climate Foundation Models <br> X. X. Zhu et al. | arXiv | 2024-05-07 | Github | - |
| On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications <br> C. Tan et al. | arXiv | 2023-12-23 | - | - |
| Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs <br> J. Roberts, T. Lüddecke, R. Sheikh, K. Han, and S. Albanie. | arXiv | 2023-11-24 | Github | - |
| The Potential of Visual ChatGPT for Remote Sensing <br> L. P. Osco, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, and J. Marcato Junior. | Remote Sensing | 2023-06-22 | - | - |

Awesome Datasets

Datasets of Pre-Training for Alignment

| Title | Venue | Date | Code | Note |
| --- | --- | --- | --- | --- |
| ChatEarthNet: A Global-Scale, High-Quality Image-Text Dataset for Remote Sensing <br> Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu. | arXiv | 2024-02-17 | Github | Link |
| RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing <br> Z. Zhang, T. Zhao, Y. Guo, and J. Yin. | arXiv | 2024-01-02 | Github | - |
| SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing <br> Z. Wang, R. Prabha, T. Huang, J. Wu, and R. Rajagopal. | AAAI | 2024-03-24 | Github | arXiv |

<p align="center"> <img src="./images/itpair.jpg" width="80%" height="100%"> </p>

Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Note |
| --- | --- | --- | --- |
| FIT-RS | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Link | 1800.8k |
| RS-GPT4V | RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Link | 991k |
| RS-instructions | RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | Link | 7,058 |
| SkyEye-968k | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Link | 968k |
| Multi-task Instruction | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Link | 42,322 |
| MMRS-1M | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Link | >1M |
| RS-ClsQaGrd-Instruct | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Link | 78k |
| MMShip | Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery | Link | 81k |
| RS-Specialized-Instruct | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Link | 29.8k |
| RS multimodal instruction | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Link | 318k |
| LHRS-Instruct | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Link | 39.8k |
| HqDC-Instruct | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Link | 30k |

<p align="center"> <img src="./images/instruct.jpg" width="80%" height="100%"> </p>

Latest Evaluation Benchmarks for Remote Sensing Vision-Language Tasks

Remote Sensing Image Captioning and Aerial Video Captioning

<p align="center"> <img src="./images/caption.jpg" width="80%" height="100%"> </p>

Remote Sensing Visual Question Answering and Remote Sensing Visual Grounding

<p align="center"> <img src="./images/vqavg.jpg" width="80%" height="100%"> </p>

Remote Sensing Image-Text Retrieval

<p align="center"> <img src="./images/itretrieval.jpg" width="80%" height="100%"> </p>

Remote Sensing Scene Classification

<p align="center"> <img src="./images/rssc.jpg" width="80%" height="100%"> </p>

🤖 Contact

If you have any questions about this project, please feel free to contact zhanyangnwpu@gmail.com.