
📒Awesome VLMs in RS

This repository accompanies the paper Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques, a systematic survey of recent VLM research in remote sensing. For details, please refer to:

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques [paper]

©️Abstract

Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Differing from previous AI approaches that generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling the handling of more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLMs, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.

©️Citation

If you find our work useful in your research, please consider citing:

@misc{tao2024advancementsvisuallanguagemodels,
      title={Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques}, 
      author={Lijie Tao and Haokui Zhang and Haizhao Jing and Yu Liu and Kelu Yao and Chao Li and Xizhe Xue},
      year={2024},
      eprint={2410.17283},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.17283}, 
}

📖Contents

📖Recent VLMs for RS
- 📖Contrastive Methods
- 📖Conversational Methods🔥
- 📖Other Methods
📖Datasets in VLMs for RS
- 📖Manual Datasets
- 📖Combining Datasets
- 📖Automatically Annotated Datasets🔥
📖Capabilities in VLMs for RS

📖Recent VLMs for RS (©️back👆🏻)

📖Contrastive Methods (©️back👆🏻)

Published in	Title	Paper	Code/Project
TGRS 2024	[RemoteCLIP] Remoteclip: A vision language foundation model for remote sensing	link	RemoteCLIP
RS 2024	[CRSR] Cross-modal retrieval and semantic refinement for remote sensing image captioning	link
arXiv 2024	[ProGEO] Progeo: Generating prompts through image-text contrastive learning for visual geo-localization	pdf	ProGEO
ICLR 2024	[GRAFT] Remote sensing vision-language foundation models without annotations via ground remote alignment	pdf
TGRS 2024	[GeoRSCLIP] Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing	pdf	GeoRSCLIP
ISPRS 2024	[ChangeCLIP] Changeclip: Remote sensing change detection with multimodal vision-language representation learning	link	ChangeCLIP
CVPR 2023	[APPLeNet] Applenet: Visual attention parameterized prompt learning for few-shot remote sensing image generalization using clip	pdf	APPLeNet
TGRS 2023	[MGVLF] Rsvg: Exploring data and models for visual grounding on remote sensing data	pdf	MGVLF

📖Conversational Methods (©️back👆🏻)

Published in	Title	Paper	Code/Project
RS 2024	[RS-LLaVA] Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery	link	RS-LLaVA
arXiv 2023	[H2RSVLM] H2rsvlm: Towards helpful and honest remote sensing large vision language model	pdf	H2RSVLM
arXiv 2024	[SkySenseGPT] Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding	pdf	SkySenseGPT
CVPR 2024	[GeoChat] Geochat: Grounded large vision-language model for remote sensing	pdf	GeoChat
arXiv 2023	[RSGPT] RSGPT: A Remote Sensing Vision Language Model and Benchmark	pdf	RSGPT
arXiv 2024	[Skyeyegpt] Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model	pdf	Skyeyegpt
arXiv 2024	[RS-CapRet] Large language models for captioning and retrieving remote sensing images	pdf
arXiv 2024	[LHRS-Bot] Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model	pdf	LHRS-Bot
TGRS 2024	[EarthGPT] Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain	pdf	EarthGPT
TGRS 2023	A decoupling paradigm with prompt learning for remote sensing image change captioning	link	code

📖Other Methods (©️back👆🏻)

Published in	Title	Paper	Code/Project
TIP 2023	[Txt2Img] Txt2img-mhn: Remote sensing image generation from text using modern hopfield networks	pdf	Txt2Img
WACV 2024	[CPSeg] Cpseg: Finer-grained image semantic segmentation via chain-of-thought language prompting	pdf
TGRS 2023	[SHRNet] A spatial hierarchical reasoning network for remote sensing visual question answering	link
SIGIR 2023	[MGeo] Mgeo: Multi-modal geographic language model pre-training	pdf	Mgeo
NeurIPS 2023	[GeoCLIP] Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization	pdf	GeoCLIP
TPAMI 2024	[SpectralGPT] Spectralgpt: Spectral remote sensing foundation model	pdf	SpectralGPT
TGRS 2023	[TEMO] Few-shot object detection in aerial imagery guided by text-modal knowledge	link

📖Datasets in VLMs for RS (©️back👆🏻)

📖Manual Datasets (©️back👆🏻)

Published in	Title	Images	Paper	Code/Project
CVPR 2024	[Hallusionbench] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models	346	pdf	Hallusionbench
arXiv 2023	[RSICap] RSGPT: A Remote Sensing Vision Language Model and Benchmark	2585	pdf	RSICap
TGRS 2023	[CRSVQA] Multistep Question-Driven Visual Question Answering for Remote Sensing	4639	pdf	CRSVQA

📖Combining Datasets (©️back👆🏻)

Published in	Title	Images	Paper	Code/Project
ICCV 2023	[SATIN] Satin: A multi-task metadataset for classifying satellite imagery using vision-language models	≈775K	pdf	SATIN
ICCV 2023	[GeoPile] Towards geospatial foundation models via continual pretraining	600K	pdf	GeoPile
ICCV 2023	[SatlasPretrain] Satlaspretrain: A large-scale dataset for remote sensing image understanding	856K	pdf	SatlasPretrain
TGRS 2023	[RSVGD] Rsvg: Exploring data and models for visual grounding on remote sensing data	17402	pdf	RSVGD
TGRS 2024	[RefsegRS] Rrsis: Referring remote sensing image segmentation	4420	pdf	RefsegRS
arXiv 2024	[SkyEye-968K] Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model	968K	pdf	SkyEye-968K
TGRS 2024	[MMRS-1M] Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain	1M	pdf	MMRS-1M
arXiv 2023	[RSSA] H2rsvlm: Towards helpful and honest remote sensing large vision language model	44K	pdf	RSSA
TGRS 2024	[FineGrip] Panoptic perception: A novel task and fine-grained dataset for universal remote sensing image interpretation	2649	pdf
CVPR 2024	[RRSIS-D] Rotated multiscale interaction network for referring remote sensing image segmentation	17402	pdf	RRSIS-D
TGRS 2022	[RingMo] Ringmo: A remote sensing foundation model with masked image modeling	2096640	link
arXiv 2023	[GRAFT] Remote sensing vision-language foundation models without annotations via ground remote alignment	-	pdf
CVPR 2024	[SkySense] Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery	21.5M	pdf
AAAI 2024	[EarthVQA] Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering	6000	pdf	EarthVQA
TGRS 2024	[GeoSense] Generative convnet foundation model with sparse modeling and low-frequency reconstruction for remote sensing image interpretation	≈9M	link	GeoSense

📖Automatically Annotated Datasets (©️back👆🏻)

Published in	Title	Images	Paper	Code/Project
TGRS 2024	[RS5M] Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing	5M	pdf	RS5M
AAAI 2024	[SkyScript] Skyscript: A large and semantically diverse vision-language dataset for remote sensing	2.6M	pdf
arXiv 2024	[LHRS-Align] Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model	1.15M	pdf	LHRS-Align
CVPR 2024	[GeoChat] Geochat: Grounded large vision-language model for remote sensing	318K	pdf	GeoChat
ICML 2024	[GeoReasoner] Georeasoner: Geo-localization with reasoning in street views using a large vision-language model	70K+	pdf	GeoReasoner
arXiv 2023	[HqDC-1.4M] H2rsvlm: Towards helpful and honest remote sensing large vision language model	≈1.4M	pdf	HqDC-1.4M
CVPR 2024	[ChatEarthNet] ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models	163488	pdf	ChatEarthNet
arXiv 2024	[VRSBench] Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding	29614	pdf	VRSBench
arXiv 2024	[FIT-RS] Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding	1800.8K	pdf	FIT-RS

©️License

GNU General Public License v3.0

🎉Contribute

Stars and pull requests are welcome!