Awesome remote sensing vision language models

This repository collects vision-language models for remote sensing, covering advanced methods and commonly used datasets across applications such as image-text retrieval, visual question answering, and pretraining.

If you find a relevant paper that is not included here, please feel free to open a pull request at any time.

Surveys

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Vision-Language Models in Remote Sensing: Current Progress and Future Trends | arXiv 2023 | - |
| The Potential of Visual ChatGPT For Remote Sensing | arXiv 2023 | - |
| Brain-inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | J-STARS 2023 | - |

Remote Sensing Vision Language Model

| Paper | Published in | Code/Project |
| --- | --- | --- |
| RSGPT: A Remote Sensing Vision Language Model and Benchmark | arXiv 2023 | code |
| RemoteGLM | 2023 | code |
| Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis | arXiv 2023 | - |
| Towards Automatic Satellite Images Captions Generation Using Large Language Models | arXiv 2023 | - |
| GeoChat: Grounded Large Vision-Language Model for Remote Sensing | arXiv 2023 | code |
| SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | AAAI 2024 | code |

Applications

Pretraining

| Paper | Published in | Code/Project |
| --- | --- | --- |
| S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions | arXiv 2023 | code |
| RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | arXiv 2023 | code |
| RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | arXiv 2023 | Project |

Image Captioning

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Deep Semantic Understanding of High Resolution Remote Sensing Image | CITS 2016 | - |
| Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? | TGRS 2017 | - |
| Exploring models and data for remote sensing image caption generation | TGRS 2017 | code |
| Natural language description of remote sensing images based on deep learning | IGARSS 2017 | - |
| Description Generation for Remote Sensing Images Using Attribute Attention Mechanism | Remote Sensing 2019 | - |
| VAA: Visual aligning attention model for remote sensing image captioning | IEEE Access 2019 | - |
| Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning | IEEE Access 2019 | - |
| A multi-level attention model for remote sensing image captions | Remote Sensing 2020 | - |
| Remote sensing image captioning via variational autoencoder and reinforcement learning | Knowledge-Based Systems 2020 | - |
| Truncation cross entropy loss for remote sensing image captioning | TGRS 2020 | - |
| Word–Sentence Framework for Remote Sensing Image Captioning | TGRS 2020 | code |
| A novel SVM-based decoder for remote sensing image captioning | TGRS 2021 | - |
| High-resolution remote sensing image captioning based on structured attention | TGRS 2021 | code |
| Exploring transformer and multilabel classification for remote sensing image captioning | GRSL 2022 | - |
| NWPU-captions dataset and MLCA-Net for remote sensing image captioning | TGRS 2022 | - |
| Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset | TGRS 2022 | code |
| Transforming remote sensing images to textual descriptions | INT J APPL EARTH OBS 2022 | - |
| Remote-sensing image captioning based on multilayer aggregated transformer | GRSL 2022 | - |
| VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning | J SYST ENG ELECTRON 2023 | - |
| Multi-source interactive stair attention for remote sensing image captioning | Remote Sensing 2023 | - |
| Changes to Captions: An Attentive Network for Remote Sensing Change Captioning | arXiv 2023 | code |
| Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning | arXiv 2023 | code |

Text-based Image Generation

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Retro-Remote Sensing: Generating Images From Ancient Texts | J-STARS 2019 | - |
| Remote sensing image augmentation based on text description for waterside change detection | Remote Sensing 2021 | - |
| Text-to-remote-sensing-image generation with structured generative adversarial networks | GRSL 2021 | - |
| Txt2Img-MHN: Remote sensing image generation from text using modern Hopfield network | arXiv 2022 | code |

Image-text Retrieval

| Paper | Published in | Code/Project |
| --- | --- | --- |
| TextRS: Deep bidirectional triplet network for matching text to remote sensing images | Remote Sensing 2020 | - |
| Deep unsupervised embedding for remote sensing image retrieval using textual cues | Applied Sciences 2020 | - |
| A deep semantic alignment network for the cross-modal image-text retrieval in remote sensing | J-STARS 2021 | - |
| A lightweight multi-scale crossmodal text-image retrieval method in remote sensing | TGRS 2021 | code |
| Remote sensing cross-modal text-image retrieval based on global and local information | TGRS 2022 | code |
| Multilanguage transformer for improved text to remote sensing image retrieval | J-STARS 2022 | - |
| Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval | TGRS 2022 | code |
| Contrasting dual transformer architectures for multi-modal remote sensing image retrieval | Applied Sciences 2023 | - |
| Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval | arXiv 2023 | - |
| Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval | arXiv 2023 | - |

Visual Question Answering

| Paper | Published in | Code/Project |
| --- | --- | --- |
| RSVQA: Visual question answering for remote sensing data | TGRS 2020 | code |
| Mutual Attention Inception Network for Remote Sensing Visual Question Answering | TGRS 2021 | code |
| How to find a good image-text embedding for remote sensing visual question answering? | ECML-PKDD 2021 | - |
| Cross-Modal Visual Question Answering for Remote Sensing Data | DICTA 2021 | - |
| RSVQA meets BigEarthNet: a new, large-scale, visual question answering dataset for remote sensing | IGARSS 2021 | code |
| Self-Paced Curriculum Learning for Visual Question Answering on Remote Sensing Data | IGARSS 2021 | - |
| From easy to hard: Learning language-guided curriculum for visual question answering on remote sensing data | TGRS 2022 | code |
| Language transformers for remote sensing visual question answering | IGARSS 2022 | - |
| Open-ended remote sensing visual question answering with transformers | IJRS 2022 | - |
| Bi-modal transformer-based approach for visual question answering in remote sensing imagery | TGRS 2022 | - |
| Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering | CVPRW 2022 | - |
| Change detection meets visual question answering | TGRS 2022 | code |
| A spatial hierarchical reasoning network for remote sensing visual question answering | TGRS 2023 | - |
| Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images | JURSE 2023 | - |
| LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | IGARSS 2023 | code |
| Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | arXiv 2023 | code |

Visual Grounding

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Visual Grounding in Remote Sensing Images | ACMMM 2022 | data |
| RSVG: Exploring data and models for visual grounding on remote sensing data | TGRS 2023 | code |

Scene Classification

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Zero-shot scene classification for high spatial resolution remote sensing images | TGRS 2017 | - |
| Fine-grained object recognition and zero-shot learning in remote sensing imagery | TGRS 2017 | - |
| Structural alignment based zero-shot classification for remote sensing scenes | ICECE 2018 | - |
| A distance-constrained semantic autoencoder for zero-shot remote sensing scene classification | J-STARS 2021 | - |
| Learning deep crossmodal embedding networks for zero-shot remote sensing image scene classification | TGRS 2021 | - |
| Generative adversarial networks for zero-shot remote sensing scene classification | Applied Sciences 2022 | - |
| APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP | CVPR 2023 | code |

Object Detection

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Text semantic fusion relation graph reasoning for few-shot object detection on remote sensing images | Remote Sensing 2023 | - |
| Few-shot object detection in aerial imagery guided by text-modal knowledge | TGRS 2023 | - |

Semantic Segmentation

| Paper | Published in | Code/Project |
| --- | --- | --- |
| Semi-supervised contrastive learning for few-shot segmentation of remote sensing images | Remote Sensing 2022 | - |
| Few-shot segmentation of remote sensing images using deep metric learning | GRSL 2022 | - |
| Language-aware domain generalization network for cross-scene hyperspectral image classification | TGRS 2023 | code |
| RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model | arXiv 2023 | code |
| RRSIS: Referring Remote Sensing Image Segmentation | arXiv 2023 | - |
| CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting | arXiv 2023 | - |

Others

Dataset

Image Captioning Dataset

| Dataset | Home/Github | Download link |
| --- | --- | --- |
| RSICD | Github | [BaiduYun] [Google Drive] |
| Sydney-Captions | Github | [BaiduYun] |
| UCM-Captions | Github | [BaiduYun] |
| NWPU-RESISC45 | Github | [BaiduYun] [OneDrive] |
| DIOR-Captions | - | - |
| RS-5M | Github | [HuggingFace] |
| LEVIR-CC | Github | [Google Drive] |
| SkyScript | Github | - |

Text-based Image Generation Dataset

Text-based Image Retrieval Dataset

| Dataset | Home/Project | Download link |
| --- | --- | --- |
| RSITMD | Github | [BaiduYun] [Google Drive] |

Visual Question Answering Dataset

| Dataset | Home/Project | Download link |
| --- | --- | --- |
| RSVQA | Home | [data] |
| RSVQA×BEN | [Github] [Home] | - |
| RSIVQA | Github | - |
| CDVQA | Github | - |

Visual Grounding Dataset

| Dataset | Home/Project | Download link |
| --- | --- | --- |
| DIOR-RSVG | Github | [Google Drive] |

Scene Classification Dataset

| Dataset | Home/Project | Download link |
| --- | --- | --- |
| NWPU-RESISC45 | Home | [OneDrive] [BaiduYun] |
| AID | Home | [OneDrive] [BaiduYun] |
| UC Merced Land-Use (UCM) | Home | - |
| SATIN | Home | [HuggingFace] |

Object Detection Dataset

| Dataset | Home/Project | Download link |
| --- | --- | --- |
| NWPU VHR-10 | Home | [OneDrive] [BaiduYun] |
| DIOR | Home | [Google Drive] [BaiduYun] |
| FAIR1M | - | [BaiduYun] |

Semantic Segmentation Dataset

| Dataset | Home/Project | Download link |
| --- | --- | --- |
| Vaihingen | Home | [BaiduYun] |
| Potsdam | Home | [BaiduYun] |
| Toronto | Home | - |
| GID | Home | [BaiduYun code:GID5] [OneDrive] |