📒Awesome VLMs in RS
<div align='center'> <img src="images\Awesome VLMs in RS.png"> </div>

This repository accompanies Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques, a systematic survey of recent VLM studies in remote sensing that covers datasets, capabilities, and enhancement techniques. For details, please refer to:
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques [paper]
<div align='center'> <img src=https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg > <img src=https://img.shields.io/badge/Release-v1.0-brightgreen.svg > <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg > <img src=https://img.shields.io/badge/arxiv-2410.17283-ccf.svg > </div>

©️Abstract
Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Unlike previous AI approaches, which generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling them to handle more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLMs, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.
©️Citation
If you find our work useful in your research, please consider citing:
@misc{tao2024advancementsvisuallanguagemodels,
title={Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques},
author={Lijie Tao and Haokui Zhang and Haizhao Jing and Yu Liu and Kelu Yao and Chao Li and Xizhe Xue},
year={2024},
eprint={2410.17283},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.17283},
}
📖Contents
📖Recent Advances in VLMs for RS (©️back👆🏻)
<div id="Recent-Advances-in-Visual-Language-Models-for-Remote-Sensing"></div>

📖Contrastive Methods (©️back👆🏻)
<div id="Contrastive-Methods"></div>

Published in | Title | Paper | Code/Project |
---|---|---|---|
TGRS 2024 | [RemoteCLIP] Remoteclip: A vision language foundation model for remote sensing | link | RemoteCLIP |
RS 2024 | [CRSR] Cross-modal retrieval and semantic refinement for remote sensing image captioning | link | |
arXiv 2024 | [ProGEO] Progeo: Generating prompts through image-text contrastive learning for visual geo-localization | ProGEO | |
ICLR 2024 | [GRAFT] Remote sensing vision-language foundation models without annotations via ground remote alignment | ||
TGRS 2024 | [GeoRSCLIP] Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing | GeoRSCLIP | |
ISPRS 2024 | [ChangeCLIP] Changeclip: Remote sensing change detection with multimodal vision-language representation learning | link | ChangeCLIP |
CVPR 2023 | [APPLeNet] Applenet: Visual attention parameterized prompt learning for few-shot remote sensing image generalization using clip | APPLeNet | |
TGRS 2023 | [MGVLF] Rsvg: Exploring data and models for visual grounding on remote sensing data | MGVLF | |
📖Conversational Methods (©️back👆🏻)
<div id="Conversational-Methods"></div>

Published in | Title | Paper | Code/Project |
---|---|---|---|
RS 2024 | [RS-LLaVA] Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery | link | RS-LLaVA |
arXiv 2023 | [H2RSVLM] H2rsvlm: Towards helpful and honest remote sensing large vision language model | H2RSVLM | |
arXiv 2024 | [SkySenseGPT] Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding | SkySenseGPT | |
CVPR 2024 | [GeoChat] Geochat: Grounded large vision-language model for remote sensing | GeoChat | |
arXiv 2023 | [RSGPT] RSGPT: A Remote Sensing Vision Language Model and Benchmark | RSGPT | |
arXiv 2024 | [Skyeyegpt] Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model | Skyeyegpt | |
arXiv 2024 | [RS-CapRet] Large language models for captioning and retrieving remote sensing images | ||
arXiv 2024 | [LHRS-Bot] Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model | LHRS-Bot | |
TGRS 2024 | [EarthGPT] Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain | EarthGPT | |
TGRS 2023 | A decoupling paradigm with prompt learning for remote sensing image change captioning | link | code |
📖Other Methods (©️back👆🏻)
<div id="Other-Methods"></div>

Published in | Title | Paper | Code/Project |
---|---|---|---|
TIP 2023 | [Txt2Img] Txt2img-mhn: Remote sensing image generation from text using modern hopfield networks | Txt2Img | |
WACV 2024 | [CPSeg] Cpseg: Finer-grained image semantic segmentation via chain-of-thought language prompting | ||
TGRS 2023 | [SHRNet] A spatial hierarchical reasoning network for remote sensing visual question answering | link | |
SIGIR 2023 | [MGeo] Mgeo: Multi-modal geographic language model pre-training | Mgeo | |
NeurIPS 2023 | [GeoCLIP] Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization | GeoCLIP | |
TPAMI 2024 | [SpectralGPT] Spectralgpt: Spectral remote sensing foundation model | SpectralGPT | |
TGRS 2023 | [TEMO] Few-shot object detection in aerial imagery guided by text-modal knowledge | link | |
📖Datasets in VLMs for RS (©️back👆🏻)
<div id="Datasets"></div>

📖Manual Datasets (©️back👆🏻)
<div id="Manual-Datasets"></div>

Published in | Title | Image | Paper | Code/Project |
---|---|---|---|---|
CVPR 2024 | [Hallusionbench] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 346 | Hallusionbench | |
arXiv 2023 | [RSICap] RSGPT: A Remote Sensing Vision Language Model and Benchmark | 2585 | RSICap | |
TGRS 2023 | [CRSVQA] Multistep Question-Driven Visual Question Answering for Remote Sensing | 4639 | CRSVQA | |
📖Combining Datasets (©️back👆🏻)
<div id="Combining-Datasets"></div>

Published in | Title | Image | Paper | Code/Project |
---|---|---|---|---|
ICCV 2023 | [SATIN] Satin: A multi-task metadataset for classifying satellite imagery using vision-language models | ≈775K | SATIN | |
ICCV 2023 | [GeoPile] Towards geospatial foundation models via continual pretraining | 600K | GeoPile | |
ICCV 2023 | [SatlasPretrain] Satlaspretrain: A large-scale dataset for remote sensing image understanding | 856K | SatlasPretrain | |
TGRS 2023 | [RSVGD] Rsvg: Exploring data and models for visual grounding on remote sensing data | 17402 | RSVGD | |
TGRS 2024 | [RefsegRS] Rrsis: Referring remote sensing image segmentation | 4420 | RefsegRS | |
arXiv 2024 | [SkyEye-968K] Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model | 968K | SkyEye-968K | |
TGRS 2024 | [MMRS-1M] Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain | 1M | MMRS-1M | |
arXiv 2023 | [RSSA] H2rsvlm: Towards helpful and honest remote sensing large vision language model | 44K | RSSA | |
TGRS 2024 | [FineGrip] Panoptic perception: A novel task and fine-grained dataset for universal remote sensing image interpretation | 2649 | ||
CVPR 2024 | [RRSIS-D] Rotated multiscale interaction network for referring remote sensing image segmentation | 17402 | RRSIS-D | |
TGRS 2022 | [RingMo] Ringmo: A remote sensing foundation model with masked image modeling | 2096640 | link | |
arXiv 2023 | [GRAFT] Remote sensing vision-language foundation models without annotations via ground remote alignment | - | ||
CVPR 2024 | [SkySense] Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery | 21.5M | ||
AAAI 2024 | [EarthVQA] Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering | 6000 | EarthVQA | |
TGRS 2024 | [GeoSense] Generative convnet foundation model with sparse modeling and low-frequency reconstruction for remote sensing image interpretation | ≈9M | link | GeoSense |
📖Automatically Annotated Datasets (©️back👆🏻)
<div id="Automatically-Annoteted-Datasets"></div>

Published in | Title | Image | Paper | Code/Project |
---|---|---|---|---|
TGRS 2024 | [RS5M] Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing | 5M | RS5M | |
AAAI 2024 | [SkyScript] Skyscript: A large and semantically diverse vision-language dataset for remote sensing | 2.6M | ||
arXiv 2024 | [LHRS-Align] Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model | 1.15M | LHRS-Align | |
CVPR 2024 | [GeoChat] Geochat: Grounded large vision-language model for remote sensing | 318K | GeoChat | |
ICML 2024 | [GeoReasoner] Georeasoner: Geo-localization with reasoning in street views using a large vision-language model | 70K+ | GeoReasoner | |
arXiv 2023 | [HqDC-1.4M] H2rsvlm: Towards helpful and honest remote sensing large vision language model | ≈1.4M | HqDC-1.4M | |
CVPR 2024 | [ChatEarthNet] ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models | 163488 | ChatEarthNet | |
arXiv 2024 | [VRSBench] Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding | 29614 | VRSBench | |
arXiv 2024 | [FIT-RS] Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding | 1800.8K | FIT-RS | |
📖Capabilities in VLMs for RS (©️back👆🏻)
<div id="Capabilities"></div>

<div align='center'> <img src="images\capabilities.png"> </div>

©️License
GNU General Public License v3.0
🎉Contribute
Stars and pull requests to this repo are welcome!