Awesome-Remote-Sensing-Multimodal-Large-Language-Models

🔥🔥🔥 Multimodal Large Language Models for Remote Sensing: A Survey
[Project Page] This Page

School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University

<div align='center'> :sparkles: The <b>first survey</b> for Multimodal Large Language Models for Remote Sensing (RS-MLLMs). </div>

✨✨✨ Behold our meticulously curated trove of RS-MLLM resources!!!

🎉🚀💡 The website will be updated in real time to track the latest state of RS-MLLMs!!!

📑📚🔍 Feast your eyes on an assortment of model architectures, training pipelines, datasets, comprehensive evaluation benchmarks, intelligent agents for remote sensing, techniques for instruction tuning, and much more.

🌟🔥📢 A collection of remote sensing multimodal large language model papers focusing on the vision-language domain.

<p align="center"> <img src="./images/1-timeline.jpg" width="100%" height="100%"> </p>

<font size=7><div align='center' > :apple: Multimodal Large Language Models for Remote Sensing </div></font>

<p align="center"> <img src="./images/6-timeline-agent.jpg" width="70%" height="100%"> </p> <font size=7><div align='center' > :apple: Intelligent Agents for Remote Sensing </div></font>

Please share a <font color='orange'>STAR ⭐</font> if this project helps you.

📢 Latest Updates

In this repository, we collect and document outstanding work on remote sensing multimodal large language models (vision-language), along with the researchers behind it.


<font size=5><center><b> Table of Contents </b> </center></font>


Awesome Papers

Multimodal Large Language Models for Remote Sensing

| Title | Venue | Date | Code | Note |
|:------|:-----:|:----:|:----:|:----:|
| RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks <br>P. Wang, H. Hu, B. Tong, Z. Zhang, F. Yao, Y. Feng, Z. Zhu, H. Chang, W. Diao, Q. Ye, and X. Sun | T-GRS | 2024-12-04 | - | - |
| GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding <br>Y. Zhou, M. Lan, X. Li, Y. Ke, X. Jiang, L. Feng, and W. Zhang | arXiv | 2024-11-16 | Github | - |
| Large Vision-Language Models for Remote Sensing Visual Question Answering <br>S. Siripong, A. Chaiyapan, and T. Phonchai | arXiv | 2024-11-16 | - | - |
| LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation <br>Z. Li, D. Muhtar, F. Gu, X. Zhang, P. Xiao, G. He, and X. Zhu | arXiv | 2024-11-14 | Github | - |
| RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering <br>Lin, H., Hong, D., Ge, S., Luo, C., Jiang, K., Jin, H., and Wen, C. | arXiv | 2024-11-03 | - | - |
| GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing <br>Elgendy, H., Sharshar, A., Aboeitta, A., Ashraf, Y., and Guizani, M. | arXiv | 2024-10-25 | Github | - |
| TEOChat: Large Language and Vision Assistant for Temporal Earth Observation Data <br>J. A. Irvin et al. | arXiv | 2024-10-08 | Github | - |
| EarthMarker: A Visual Prompting MLLM for Region-level and Point-level Remote Sensing Imagery Comprehension <br>Zhang, W., Cai, M., Zhang, T., Zhuang, Y., and Mao, X. | arXiv | 2024-07-18 | Github | - |
| SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding <br>J. Luo et al. | arXiv | 2024-06-14 | Github | - |
| RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery <br>Y. Bazi, L. Bashmal, M. M. Al Rahhal, R. Ricci, and F. Melgani. | Remote Sensing | 2024-04-23 | Github | - |
| H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model <br>C. Pang, W. Jiang, L. Jiayu, L. Yi, S. Jiaxing, L. Weijia, W. Xingxing, W. Shuai, F. Litong, X. Guisong, H. Conghui. | arXiv | 2024-03-29 | Github | - |
| Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery <br>W. Zhang, M. Cai, T. Zhang, G. Lei, Y. Zhuang, and X. Mao. | arXiv | 2024-03-06 | - | - |
| Large Language Models for Captioning and Retrieving Remote Sensing Images <br>J. D. Silva, J. Magalhaes, and D. Tuia. | arXiv | 2024-02-09 | - | - |
| LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model <br>D. Muhtar, Z. Li, F. Gu, X. Zhang, and P. Xiao. | arXiv | 2024-02-04 | Github | - |
| EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain <br>W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao. | arXiv | 2024-01-30 | Github | accepted by IEEE-TGRS |
| SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model <br>Y. Zhan, Z. Xiong, and Y. Yuan. | arXiv | 2024-01-18 | Github | Dataset |
| GeoChat: Grounded Large Vision-Language Model for Remote Sensing <br>K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan. | arXiv | 2023-11-24 | Github | accepted by CVPR-24 |
| RSGPT: A Remote Sensing Vision Language Model and Benchmark <br>Y. Hu, J. Yuan, and C. Wen. | arXiv | 2023-07-28 | Github | - |

Intelligent Agents for Remote Sensing

| Title | Venue | Date | Code | Note |
|:------|:-----:|:----:|:----:|:----:|
| RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents <br>W. Xu, Z. Yu, Y. Wang, J. Wang, and M. Peng. | arXiv | 2024-06-11 | - | - |
| GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots <br>S. Singh, M. Fore, D. Stamoulis, and D. Group. | arXiv | 2024-04-23 | - | - |
| Evaluating Tool-Augmented Agents in Remote Sensing Platforms <br>S. Singh, M. Fore, and D. Stamoulis. | arXiv | 2024-04-23 | - | - |
| Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis <br>C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi. | arXiv | 2024-04-01 | Github | - |
| Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models <br>H. Guo, X. Su, C. Wu, B. Du, L. Zhang, and D. Li. | arXiv | 2024-01-17 | Github | - |
| Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis <br>S. Du, S. Tang, W. Wang, X. Li, and R. Guo. | arXiv | 2023-10-07 | - | - |

Vision-Language Pre-training Models for Remote Sensing

| Title | Venue | Date | Code | Note |
|:------|:-----:|:----:|:----:|:----:|
| RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing <br>Z. Zhang, T. Zhao, Y. Guo, and J. Yin. | arXiv | 2024-01-02 | Github | accepted by IEEE-TGRS |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing <br>F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, and J. Zhou. | T-GRS | 2024-04-18 | Github | arXiv |
| Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment <br>U. Mall, C. P. Phoo, M. K. Liu, C. Vondrick, B. Hariharan, and K. Bala. | ICLR | 2024-01-16 | Project | arXiv |
| RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision <br>X. Li, C. Wen, Y. Hu, and N. Zhou. | JAG | 2023-09-18 | Github | - |
| Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval <br>Y. Yuan, Y. Zhan, and Z. Xiong. | T-GRS | 2023-08-28 | Github | arXiv |

Survey Papers for Remote Sensing Vision-Language Tasks

| Title | Venue | Date | Code | Note |
|:------|:-----:|:----:|:----:|:----:|
| Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey <br>C. Liu, J. Zhang, K. Chen, M. Wang, Z. Zou, and Z. Shi | arXiv | 2024-12-03 | Github | arXiv |
| From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing <br>X. Sun, B. Peng, C. Zhang, F. Jin, Q. Niu, J. Liu, K. Chen, M. Li, P. Feng, Z. Bi, M. Liu, and Y. Zhang. | arXiv | 2024-11-05 | - | - |
| Foundation Models for Remote Sensing and Earth Observation: A Survey <br>A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya. | arXiv | 2024-10-22 | Github | arXiv |
| Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques <br>L. Tao, H. Zhang, H. Jing, Y. Liu, K. Yao, C. Li, and X. Xue. | arXiv | 2024-10-15 | Github | arXiv |
| Towards Vision-Language Geo-Foundation Model: A Survey <br>Y. Zhou, L. Feng, Y. Ke, X. Jiang, J. Yan, and W. Zhang. | arXiv | 2024-06-13 | Github | arXiv |
| Vision-Language Models in Remote Sensing: Current progress and future trends <br>X. Li, C. Wen, Y. Hu, Z. Yuan, and X. X. Zhu. | MGRS | 2024-04-22 | - | - |
| Language Integration in Remote Sensing: Tasks, datasets, and future directions <br>L. Bashmal, Y. Bazi, F. Melgani, M. M. Al Rahhal, and M. A. Al Zuair. | MGRS | 2023-10-11 | - | - |
| Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey <br>L. Jiao et al. | JSTARS | 2023-09-18 | - | - |

Others

| Title | Venue | Date | Code | Note |
|:------|:-----:|:----:|:----:|:----:|
| On the Foundations of Earth and Climate Foundation Models <br>X. X. Zhu et al. | arXiv | 2024-05-07 | Github | - |
| On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications <br>C. Tan et al. | arXiv | 2023-12-23 | - | - |
| Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs <br>J. Roberts, T. Lüddecke, R. Sheikh, K. Han, and S. Albanie. | arXiv | 2023-11-24 | Github | - |
| The Potential of Visual ChatGPT for Remote Sensing <br>L. P. Osco, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, and J. Marcato Junior. | Remote Sensing | 2023-06-22 | - | - |

Awesome Datasets

Datasets of Pre-Training for Alignment

| Title | Venue | Date | Code | Note |
|:------|:-----:|:----:|:----:|:----:|
| RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models <br>J. Ge, Y. Zheng, K. Guo, and J. Liang. | arXiv | 2024-08-27 | Github | Link |
| ChatEarthNet: A Global-Scale, High-Quality Image-Text Dataset for Remote Sensing <br>Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu. | arXiv | 2024-02-17 | Github | Link |
| RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing <br>Z. Zhang, T. Zhao, Y. Guo, and J. Yin. | arXiv | 2024-01-02 | Github | - |
| SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing <br>Z. Wang, R. Prabha, T. Huang, J. Wu, and R. Rajagopal. | AAAI | 2024-03-24 | Github | arXiv |

<p align="center"> <img src="./images/itpair.jpg" width="80%" height="100%"> </p>

Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Note |
|:-----|:------|:----:|:----:|
| DDFAV | DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark | Link | 27.7k |
| VRSBench | VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding | Link | 29.6k |
| FIT-RS | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Link | 1800.8k |
| RS-GPT4V | RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding | Link | 991k |
| RS-instructions | RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | Link | 7,058 |
| SkyEye-968k | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Link | 968k |
| Multi-task Instruction | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Link | 42,322 |
| MMRS-1M | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | Link | >1M |
| RS-ClsQaGrd-Instruct | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Link | 78k |
| MMShip | Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery | Link | 81k |
| RS-Specialized-Instruct | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Link | 29.8k |
| RS multimodal instruction | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Link | 318k |
| LHRS-Instruct | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Link | 39.8k |
| HqDC-Instruct | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | Link | 30k |

<p align="center"> <img src="./images/instruct.jpg" width="80%" height="100%"> </p>

Latest Evaluation Benchmarks for Remote Sensing Vision-Language Tasks

Remote Sensing Image Captioning and Aerial Video Captioning

<p align="center"> <img src="./images/caption.jpg" width="80%" height="100%"> </p>

Remote Sensing Visual Question Answering and Remote Sensing Visual Grounding

<p align="center"> <img src="./images/vqavg.jpg" width="80%" height="100%"> </p>

Remote Sensing Image-Text Retrieval

<p align="center"> <img src="./images/itretrieval.jpg" width="80%" height="100%"> </p>

Remote Sensing Scene Classification

<p align="center"> <img src="./images/rssc.jpg" width="80%" height="100%"> </p>

🤖 Contact

If you have any questions about this project, please feel free to contact zhanyangnwpu@gmail.com.