📒Awesome VLMs in RS

<div align='center'> <img src="images/Awesome VLMs in RS.png"> </div>

This is the repository for *Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques*, a systematic survey of recent VLM studies in remote sensing covering datasets, capabilities, and enhancement techniques. For details, please refer to:

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques [paper]

<div align='center'> <img src="https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg"> <img src="https://img.shields.io/badge/Release-v1.0-brightgreen.svg"> <img src="https://img.shields.io/badge/License-GPLv3.0-turquoise.svg"> <img src="https://img.shields.io/badge/arxiv-2410.17283-ccf.svg"> </div>

©️Abstract

Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Differing from previous AI approaches, which generally formulated tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling them to handle more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLMs, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs, and provide a detailed introduction and comparison of these methods.

©️Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{tao2024advancementsvisuallanguagemodels,
      title={Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques}, 
      author={Lijie Tao and Haokui Zhang and Haizhao Jing and Yu Liu and Kelu Yao and Chao Li and Xizhe Xue},
      year={2024},
      eprint={2410.17283},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.17283}, 
}
```

📖Contents

📖Recent Advances in VLMs for RS (©️back👆🏻)

<div id="Recent-Advances-in-Visual-Language-Models-for-Remote-Sensing"></div>

📖Contrastive Methods (©️back👆🏻)

<div id="Contrastive-Methods"></div>
| Published in | Title | Paper | Code/Project |
| --- | --- | --- | --- |
| TGRS 2024 | [RemoteCLIP] Remoteclip: A vision language foundation model for remote sensing | link | RemoteCLIP |
| RS 2024 | [CRSR] Cross-modal retrieval and semantic refinement for remote sensing image captioning | link | |
| arXiv 2024 | [ProGEO] Progeo: Generating prompts through image-text contrastive learning for visual geo-localization | pdf | ProGEO |
| ICLR 2024 | [GRAFT] Remote sensing vision-language foundation models without annotations via ground remote alignment | pdf | |
| TGRS 2024 | [GeoRSCLIP] Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing | pdf | GeoRSCLIP |
| ISPRS 2024 | [ChangeCLIP] Changeclip: Remote sensing change detection with multimodal vision-language representation learning | link | ChangeCLIP |
| CVPR 2023 | [APPLeNet] Applenet: Visual attention parameterized prompt learning for few-shot remote sensing image generalization using clip | pdf | APPLeNet |
| TGRS 2023 | [MGVLF] Rsvg: Exploring data and models for visual grounding on remote sensing data | pdf | MGVLF |

📖Conversational Methods (©️back👆🏻)

<div id="Conversational-Methods"></div>
| Published in | Title | Paper | Code/Project |
| --- | --- | --- | --- |
| RS 2024 | [RS-LLaVA] Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery | link | RS-LLaVA |
| arXiv 2023 | [H2RSVLM] H2rsvlm: Towards helpful and honest remote sensing large vision language model | pdf | H2RSVLM |
| arXiv 2024 | [SkySenseGPT] Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding | pdf | SkySenseGPT |
| CVPR 2024 | [GeoChat] Geochat: Grounded large vision-language model for remote sensing | pdf | GeoChat |
| arXiv 2023 | [RSGPT] RSGPT: A Remote Sensing Vision Language Model and Benchmark | pdf | RSGPT |
| arXiv 2024 | [Skyeyegpt] Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model | pdf | Skyeyegpt |
| arXiv 2024 | [RS-CapRet] Large language models for captioning and retrieving remote sensing images | pdf | |
| arXiv 2024 | [LHRS-Bot] Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model | pdf | LHRS-Bot |
| TGRS 2024 | [EarthGPT] Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain | pdf | EarthGPT |
| TGRS 2023 | A decoupling paradigm with prompt learning for remote sensing image change captioning | link | code |

📖Other Methods (©️back👆🏻)

<div id="Other-Methods"></div>
| Published in | Title | Paper | Code/Project |
| --- | --- | --- | --- |
| TIP 2023 | [Txt2Img] Txt2img-mhn: Remote sensing image generation from text using modern hopfield networks | pdf | Txt2Img |
| WACV 2024 | [CPSeg] Cpseg: Finer-grained image semantic segmentation via chain-of-thought language prompting | pdf | |
| TGRS 2023 | [SHRNet] A spatial hierarchical reasoning network for remote sensing visual question answering | link | |
| SIGIR 2023 | [MGeo] Mgeo: Multi-modal geographic language model pre-training | pdf | MGeo |
| NeurIPS 2023 | [GeoCLIP] Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization | pdf | GeoCLIP |
| TPAMI 2024 | [SpectralGPT] Spectralgpt: Spectral remote sensing foundation model | pdf | SpectralGPT |
| TGRS 2023 | [TEMO] Few-shot object detection in aerial imagery guided by text-modal knowledge | link | |

📖Datasets in VLMs for RS (©️back👆🏻)

<div id="Datasets"></div>

📖Manual Datasets (©️back👆🏻)

<div id="Manual-Datasets"></div>
| Published in | Title | Images | Paper | Code/Project |
| --- | --- | --- | --- | --- |
| CVPR 2024 | [Hallusionbench] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 346 | pdf | Hallusionbench |
| arXiv 2023 | [RSICap] RSGPT: A Remote Sensing Vision Language Model and Benchmark | 2,585 | pdf | RSICap |
| TGRS 2023 | [CRSVQA] Multistep Question-Driven Visual Question Answering for Remote Sensing | 4,639 | pdf | CRSVQA |

📖Combining Datasets (©️back👆🏻)

<div id="Combining-Datasets"></div>
| Published in | Title | Images | Paper | Code/Project |
| --- | --- | --- | --- | --- |
| ICCV 2023 | [SATIN] Satin: A multi-task metadataset for classifying satellite imagery using vision-language models | ≈775K | pdf | SATIN |
| ICCV 2023 | [GeoPile] Towards geospatial foundation models via continual pretraining | 600K | pdf | GeoPile |
| ICCV 2023 | [SatlasPretrain] Satlaspretrain: A large-scale dataset for remote sensing image understanding | 856K | pdf | SatlasPretrain |
| TGRS 2023 | [RSVGD] Rsvg: Exploring data and models for visual grounding on remote sensing data | 17,402 | pdf | RSVGD |
| TGRS 2024 | [RefsegRS] Rrsis: Referring remote sensing image segmentation | 4,420 | pdf | RefsegRS |
| arXiv 2024 | [SkyEye-968K] Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model | 968K | pdf | SkyEye-968K |
| TGRS 2024 | [MMRS-1M] Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain | 1M | pdf | MMRS-1M |
| arXiv 2023 | [RSSA] H2rsvlm: Towards helpful and honest remote sensing large vision language model | 44K | pdf | RSSA |
| TGRS 2024 | [FineGrip] Panoptic perception: A novel task and fine-grained dataset for universal remote sensing image interpretation | 2,649 | pdf | |
| CVPR 2024 | [RRSIS-D] Rotated multiscale interaction network for referring remote sensing image segmentation | 17,402 | pdf | RRSIS-D |
| TGRS 2022 | [RingMo] Ringmo: A remote sensing foundation model with masked image modeling | 2,096,640 | link | |
| arXiv 2023 | [GRAFT] Remote sensing vision-language foundation models without annotations via ground remote alignment | - | pdf | |
| CVPR 2024 | [SkySense] Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery | 21.5M | pdf | |
| AAAI 2024 | [EarthVQA] Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering | 6,000 | pdf | EarthVQA |
| TGRS 2024 | [GeoSense] Generative convnet foundation model with sparse modeling and low-frequency reconstruction for remote sensing image interpretation | ≈9M | link | GeoSense |

📖Automatically Annotated Datasets (©️back👆🏻)

<div id="Automatically-Annoteted-Datasets"></div>
| Published in | Title | Images | Paper | Code/Project |
| --- | --- | --- | --- | --- |
| TGRS 2024 | [RS5M] Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing | 5M | pdf | RS5M |
| AAAI 2024 | [SkyScript] Skyscript: A large and semantically diverse vision-language dataset for remote sensing | 2.6M | pdf | |
| arXiv 2024 | [LHRS-Align] Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model | 1.15M | pdf | LHRS-Align |
| CVPR 2024 | [GeoChat] Geochat: Grounded large vision-language model for remote sensing | 318K | pdf | GeoChat |
| ICML 2024 | [GeoReasoner] Georeasoner: Geo-localization with reasoning in street views using a large vision-language model | 70K+ | pdf | GeoReasoner |
| arXiv 2023 | [HqDC-1.4M] H2rsvlm: Towards helpful and honest remote sensing large vision language model | ≈1.4M | pdf | HqDC-1.4M |
| CVPR 2024 | [ChatEarthNet] ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models | 163,488 | pdf | ChatEarthNet |
| arXiv 2024 | [VRSBench] Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding | 29,614 | pdf | VRSBench |
| arXiv 2024 | [FIT-RS] Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding | 1800.8K | pdf | FIT-RS |

📖Capabilities in VLMs for RS (©️back👆🏻)

<div id="Capabilities"></div> <div align='center'> <img src="images/capabilities.png"> </div>

©️License

GNU General Public License v3.0

🎉Contribute

Feel free to star this repo and submit a PR!