# VLM4EO: Visual Language Models for Earth Observation

A curated list of visual language model papers and resources for Earth Observation (VLM4EO).

This list is created and maintained by Ali Koteich and Hasan Moughnieh from the GEOspatial Artificial Intelligence (GEOAI) research group at the National Center for Remote Sensing - CNRS, Lebanon.

We encourage you to contribute to this project; please follow the contribution guidelines.

---

If you find this repository useful, please consider giving it a ⭐

## Table of Contents

- [Foundation Models](#foundation-models)
- [Image Captioning](#image-captioning)
- [Text-Image Retrieval](#text-image-retrieval)
- [Visual Grounding](#visual-grounding)
- [Visual Question Answering](#visual-question-answering)
- [Vision-Language Remote Sensing Datasets](#vision-language-remote-sensing-datasets)
- [Related Repositories & Libraries](#related-repositories--libraries)

## Foundation Models

| Year | Title | Paper | Code | Venue |
|------|-------|-------|------|-------|
| 2024 | EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain | paper | | |
| 2024 | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | paper | code | |
| 2024 | Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models | paper | code | |
| 2024 | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | paper | code | |
| 2024 | VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | paper | code | |
| 2023 | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | paper | code | |
| 2023 | Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | paper | | |
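
Several of the foundation models above (e.g., RemoteCLIP) are CLIP-style dual encoders that can be queried zero-shot through a shared image-text embedding space. Below is a minimal sketch of that interface using the generic OpenAI CLIP checkpoint from Hugging Face `transformers`; the image path and prompt set are placeholders, and the remote-sensing checkpoints ship with their own loaders but expose the same embedding pattern.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint used only for illustration; RemoteCLIP and similar
# RS models provide their own weights with an equivalent interface.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.png")  # placeholder aerial image
prompts = [f"a satellite photo of a {c}" for c in ("airport", "farmland", "forest", "harbor")]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_prompts) similarity scores
print(prompts[logits.argmax(-1).item()])      # best-matching scene label
```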

## Image Captioning

| Year | Title | Paper | Code | Venue |
|------|-------|-------|------|-------|
| 2024 | A Lightweight Transformer for Remote Sensing Image Change Captioning | paper | code | |
| 2024 | RSCaMa: Remote Sensing Image Change Captioning with State Space Model | paper | code | |
| 2023 | Captioning Remote Sensing Images Using Transformer Architecture | paper | | International Conference on Artificial Intelligence in Information and Communication |
| 2023 | Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning | paper | | MDPI Remote Sensing |
| 2023 | Progressive Scale-aware Network for Remote Sensing Image Change Captioning | paper | | |
| 2023 | Towards Unsupervised Remote Sensing Image Captioning and Retrieval with Pre-Trained Language Models | paper | | Proceedings of the Japanese Association for Natural Language Processing |
| 2022 | A Joint-Training Two-Stage Method for Remote Sensing Image Captioning | paper | | IEEE TGRS |
| 2022 | A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning | paper | | MDPI Remote Sensing |
| 2022 | Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis | paper | | IEEE TGRS |
| 2022 | Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning | paper | code | IEEE GRSL |
| 2022 | Generating the captions for remote sensing images: A spatial-channel attention based memory-guided transformer approach | paper | code | Engineering Applications of Artificial Intelligence |
| 2022 | Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning | paper | | IEEE TGRS |
| 2022 | High-Resolution Remote Sensing Image Captioning Based on Structured Attention | paper | | IEEE TGRS |
| 2022 | Meta captioning: A meta learning based remote sensing image captioning framework | paper | code | Elsevier PHOTO |
| 2022 | Multiscale Multiinteraction Network for Remote Sensing Image Captioning | paper | | IEEE JSTARS |
| 2022 | NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning | paper | code | IEEE TGRS |
| 2022 | Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning | paper | | IEEE TGRS |
| 2022 | Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset | paper | | IEEE TGRS |
| 2022 | Transforming remote sensing images to textual descriptions | paper | | Int J Appl Earth Obs Geoinf |
| 2022 | Using Neural Encoder-Decoder Models with Continuous Outputs for Remote Sensing Image Captioning | paper | | IEEE Access |
| 2021 | A Novel SVM-Based Decoder for Remote Sensing Image Captioning | paper | | IEEE TGRS |
| 2021 | SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning | paper | code | IEEE TGRS |
| 2021 | Truncation Cross Entropy Loss for Remote Sensing Image Captioning | paper | | IEEE TGRS |
| 2021 | Word-Sentence Framework for Remote Sensing Image Captioning | paper | | IEEE TGRS |
| 2020 | A multi-level attention model for remote sensing image captions | paper | | MDPI Remote Sensing |
| 2020 | Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning | paper | | Elsevier Knowledge-Based Systems |
| 2020 | Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective | paper | | IEEE JSTARS |
| 2019 | LAM: Remote sensing image captioning with attention-based language model | paper | | IEEE TGRS |
| 2019 | Learning to Caption Remote Sensing Images by Geospatial Feature Driven Attention Mechanism | paper | | IEEE JSTARS |
| 2019 | Remote Sensing Image Captioning by Deep Reinforcement Learning with Geospatial Features | paper | | IEEE TGRS |
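
Most of the captioning papers above follow the same encoder-decoder recipe: a visual encoder summarizes the image and an autoregressive language decoder emits the caption token by token. A minimal sketch of that interface with an off-the-shelf BLIP checkpoint (trained on natural images, so its captions on overhead imagery will be rough; the image path is a placeholder and this is not any specific method from the table):

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Natural-image captioner used only to illustrate the encoder-decoder pipeline.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("aerial_scene.png")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)  # autoregressive decoding
print(processor.decode(out[0], skip_special_tokens=True))
```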

## Text-Image Retrieval

| Year | Title | Paper | Code | Venue |
|------|-------|-------|------|-------|
| 2024 | Composed Image Retrieval for Remote Sensing | paper | code | |
| 2024 | Multi-Spectral Remote Sensing Image Retrieval using Geospatial Foundation Models | paper | code | |
| 2024 | Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval | paper | code | |
| 2023 | A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval | paper | code | ACM MM 2023 (Oral) |
| 2023 | A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing | paper | | MDPI Remote Sensing |
| 2023 | An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval | paper | | MDPI Mathematics |
| 2023 | Contrasting Dual Transformer Architectures for Multi-Modal Remote Sensing Image Retrieval | paper | | MDPI Applied Sciences |
| 2023 | Hypersphere-Based Remote Sensing Cross-Modal Text–Image Retrieval via Curriculum Learning | paper | code | IEEE TGRS |
| 2023 | Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval | paper | | IEEE TGRS |
| 2023 | Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval | paper | code | ICMR '23 |
| 2022 | A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing | paper | code | IEEE TGRS |
| 2022 | An Unsupervised Cross-Modal Hashing Method Robust to Noisy Training Image-Text Correspondences in Remote Sensing | paper | code | IEEE ICIP |
| 2022 | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | paper | | Virginia Polytechnic Institute and State University |
| 2022 | Knowledge-Aware Cross-Modal Text-Image Retrieval for Remote Sensing Images | paper | | |
| 2022 | MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing | paper | code | Int J Appl Earth Obs Geoinf |
| 2022 | Multilanguage Transformer for Improved Text to Remote Sensing Image Retrieval | paper | | IEEE JSTARS |
| 2022 | Multisource Data Reconstruction-Based Deep Unsupervised Hashing for Unisource Remote Sensing Image Retrieval | paper | code | IEEE TGRS |
| 2022 | Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information | paper | code | IEEE TGRS |
| 2022 | Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing | paper | code | IEEE ICASSP |
| 2021 | Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval | paper | code | IEEE TGRS |
| 2020 | Deep unsupervised embedding for remote sensing image retrieval using textual cues | paper | | MDPI Applied Sciences |
| 2020 | TextRS: Deep bidirectional triplet network for matching text to remote sensing images | paper | | MDPI Remote Sensing |
| 2020 | Toward Remote Sensing Image Retrieval under a Deep Image Captioning Perspective | paper | | IEEE JSTARS |
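
At inference time, cross-modal retrieval in most of the works above reduces to ranking images by the cosine similarity between a text-query embedding and precomputed image embeddings. A sketch of that ranking step with a generic CLIP dual encoder (checkpoint and file names are placeholders; the RS-specific retrieval models embed and rank the same way):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["img0.png", "img1.png", "img2.png"]  # placeholder image archive
images = [Image.open(p) for p in paths]

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=["two airplanes parked near a terminal"], return_tensors="pt", padding=True)
    )

# Cosine similarity = dot product of L2-normalized embeddings; highest score wins.
sims = F.normalize(txt_emb) @ F.normalize(img_emb).T
print(paths[sims.argmax().item()])
```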

## Visual Grounding

| Year | Title | Paper | Code | Venue |
|------|-------|-------|------|-------|
| 2024 | GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding | paper | code | |
| 2023 | LaLGA: Multi-Scale Language-Aware Visual Grounding on Remote Sensing Data | paper | code | |
| 2023 | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models | paper | code | |
| 2022 | RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data | paper | code | IEEE TGRS |
| 2022 | Visual Grounding in Remote Sensing Images | paper | | ACM MM |
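
Visual grounding localizes the image region described by a free-form referring expression. The RS models above each ship their own code, so for a self-contained illustration the sketch below uses OWL-ViT, a generic open-vocabulary detector with a comparable text-queried localization interface (image path, query string, and threshold are illustrative, not from any paper in the table):

```python
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("airport.png")                      # placeholder image
queries = [["the white airplane on the left runway"]]  # one query list per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into scored boxes in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
result = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]
for score, box in zip(result["scores"], result["boxes"]):
    print(f"score={score:.2f}, box={box.tolist()}")
```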

## Visual Question Answering

| Year | Title | Paper | Code | Venue |
|------|-------|-------|------|-------|
| 2023 | A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering | paper | | IEEE TGRS |
| 2023 | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | paper | code | AAAI 2024 |
| 2023 | LIT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing | paper | code | IEEE IGARSS |
| 2023 | Multistep Question-Driven Visual Question Answering for Remote Sensing | paper | code | IEEE TGRS |
| 2023 | RSGPT: A Remote Sensing Vision Language Model and Benchmark | paper | code | |
| 2023 | RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering | paper | code | |
| 2022 | Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery | paper | | IEEE TGRS |
| 2022 | Change Detection Meets Visual Question Answering | paper | code | IEEE TGRS |
| 2022 | From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data | paper | code | IEEE TGRS |
| 2022 | Language Transformers for Remote Sensing Visual Question Answering | paper | | IEEE IGARSS |
| 2022 | Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing | paper | code | SPIE Image and Signal Processing for Remote Sensing |
| 2022 | Mutual Attention Inception Network for Remote Sensing Visual Question Answering | paper | code | IEEE TGRS |
| 2022 | Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering | paper | | CVPRW |
| 2021 | How to find a good image-text embedding for remote sensing visual question answering? | paper | | CEUR Workshop Proceedings |
| 2021 | RSVQA meets BigEarthNet: a new, large-scale, visual question answering dataset for remote sensing | paper | code | IEEE IGARSS |
| 2020 | RSVQA: Visual Question Answering for Remote Sensing Data | paper | code | IEEE TGRS |
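
RSVQA-style systems typically fuse an image encoding with a question encoding and classify over a fixed answer vocabulary. A sketch of that classification-style interface with ViLT fine-tuned on natural-image VQA (the image path and question are placeholders; the RS-specific models above follow the same pattern with aerial-domain training data):

```python
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

# Natural-image VQA model used only to illustrate the answer-classification setup.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("residential_area.png")  # placeholder image
question = "How many buildings are in the image?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits  # one logit per answer in the fixed vocabulary
print(model.config.id2label[logits.argmax(-1).item()])
```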

## Vision-Language Remote Sensing Datasets

| Name | Link | Paper Link | Description |
|------|------|------------|-------------|
| RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model | Link | Paper Link | Size: 5 million remote sensing images with English descriptions <br> Resolution: 256 x 256 <br> Platforms: 11 publicly available image-text paired datasets |
| SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | Link | Paper Link | Size: 5.2 million remote sensing image-text pairs in total, covering more than 29K distinct semantic tags |
| Remote Sensing Visual Question Answering Low Resolution Dataset (RSVQA LR) | Link | Paper Link | Size: 772 images & 77,232 questions and answers <br> Resolution: 256 x 256 <br> Platforms: Sentinel-2 and OpenStreetMap <br> Use: Remote Sensing Visual Question Answering |
| Remote Sensing Visual Question Answering High Resolution Dataset (RSVQA HR) | Link | Paper Link | Size: 10,659 images & 955,664 questions and answers <br> Resolution: 512 x 512 <br> Platforms: USGS and OpenStreetMap <br> Use: Remote Sensing Visual Question Answering |
| Remote Sensing Visual Question Answering BigEarthNet Dataset (RSVQA x BEN) | Link | Paper Link | Size: 140,758,150 image/question/answer triplets <br> Resolution: high-resolution (15 cm) <br> Platforms: Sentinel-2, BigEarthNet and OpenStreetMap <br> Use: Remote Sensing Visual Question Answering |
| Remote Sensing Image Visual Question Answering (RSIVQA) | Link | Paper Link | Size: 37,264 images and 111,134 image-question-answer triplets <br> A small part of RSIVQA is annotated by humans; the rest is automatically generated from existing scene classification and object detection datasets <br> Use: Remote Sensing Visual Question Answering |
| FloodNet Visual Question Answering Dataset | Link | Paper Link | Size: 11,000 question-image pairs <br> Resolution: 224 x 224 <br> Platforms: UAV-DJI Mavic Pro quadcopters, after Hurricane Harvey <br> Use: Remote Sensing Visual Question Answering |
| Change Detection-Based Visual Question Answering Dataset | Link | Paper Link | Size: 2,968 pairs of multitemporal images and more than 122,000 question-answer pairs <br> Classes: 6 <br> Resolution: 512 x 512 pixels <br> Platforms: based on the semantic change detection dataset (SECOND) <br> Use: Remote Sensing Visual Question Answering |
| LAION-EO | Link | Paper Link | Size: 24,933 samples with 40.1% English captions as well as other common languages from LAION-5B <br> Mean height of 633.0 pixels (up to 9,999) and mean width of 843.7 pixels (up to 19,687) <br> Platforms: based on LAION-5B |
| CapERA: Captioning Events in Aerial Videos | Link | Paper Link | Size: 2,864 videos and 14,320 captions, where each video is paired with five unique captions |
| Remote Sensing Image Captioning Dataset (RSICap) | Link | Paper Link | Size: 2,585 human-annotated captions with rich, high-quality information <br> Offers detailed descriptions for each image, covering scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position) <br> Use: Remote Sensing Image Captioning |
| Remote Sensing Image Captioning Evaluation Dataset (RSIEval) | Link | Paper Link | Size: 100 human-annotated captions and 936 visual question-answer pairs with rich information and open-ended questions and answers <br> Use: Remote Sensing Image Captioning and Visual Question Answering |
| Revised Remote Sensing Image Captioning Dataset (RSICD) | Link | Paper Link | Size: 10,921 images with five captions per image <br> Number of Classes: 30 <br> Resolution: 224 x 224 <br> Platforms: Google Earth, Baidu Map, MapABC and Tianditu <br> Use: Remote Sensing Image Captioning |
| Revised University of California Merced dataset (UCM-Captions) | Link | Paper Link | Size: 2,100 images with five captions per image <br> Number of Classes: 21 <br> Resolution: 256 x 256 <br> Platforms: USGS National Map Urban Area Imagery collection <br> Use: Remote Sensing Image Captioning |
| Revised Sydney-Captions Dataset | Link | Paper Link | Size: 613 images with five captions per image <br> Number of Classes: 7 <br> Resolution: 500 x 500 <br> Platforms: Google Earth <br> Use: Remote Sensing Image Captioning |
| LEVIR-CC dataset | Link | Paper Link | Size: 10,077 pairs of RS images and 50,385 corresponding sentences <br> Number of Classes: 10 <br> Resolution: 1024 x 1024 pixels <br> Platforms: Beihang University <br> Use: Remote Sensing Image Captioning |
| NWPU-Captions dataset | images_Link, info_Link | Paper Link | Size: 31,500 images with 157,500 sentences <br> Number of Classes: 45 <br> Resolution: 256 x 256 pixels <br> Platforms: based on the NWPU-RESISC45 dataset <br> Use: Remote Sensing Image Captioning |
| Remote Sensing Image-Text Match dataset (RSITMD) | Link | Paper Link | Size: 23,715 captions for 4,743 images <br> Number of Classes: 32 <br> Resolution: 500 x 500 <br> Platforms: RSICD and Google Earth <br> Use: Remote Sensing Image-Text Retrieval |
| PatternNet | Link | Paper Link | Size: 30,400 images <br> Number of Classes: 38 <br> Resolution: 256 x 256 <br> Platforms: Google Earth imagery and the Google Maps API <br> Use: Remote Sensing Image Retrieval |
| Dense Labeling Remote Sensing Dataset (DLRSD) | Link | Paper Link | Size: 2,100 images <br> Number of Classes: 21 <br> Resolution: 256 x 256 <br> Platforms: extension of UC Merced <br> Use: Remote Sensing Image Retrieval (RSIR), Classification and Semantic Segmentation |
| DIOR-Remote Sensing Visual Grounding Dataset (RSVGD) | Link | Paper Link | Size: 38,320 RS image-query pairs and 17,402 RS images <br> Number of Classes: 20 <br> Resolution: 800 x 800 <br> Platforms: DIOR dataset <br> Use: Remote Sensing Visual Grounding |
| OPT-RSVG Dataset | Link | Paper Link | Size: 25,452 images and 48,952 expressions in English and Chinese <br> Number of Classes: 14 <br> Resolution: 800 x 800 |
| Visual Grounding in Remote Sensing Images | Link | Paper Link | Size: 4,239 images including 5,994 object instances and 7,933 referring expressions <br> Resolution: 1024 x 1024 pixels <br> Platforms: multiple sensors and platforms (e.g., Google Earth) |
| Remote Sensing Image Scene Classification (NWPU-RESISC45) | Link | Paper Link | Size: 31,500 images <br> Number of Classes: 45 <br> Resolution: 256 x 256 pixels <br> Platforms: Google Earth <br> Use: Remote Sensing Image Scene Classification |
<!-- | High Resolution Remote Sensing Detection (HRRSD) | [Link](https://drive.google.com/open?id=1bffECWdpa0jg2Jnm7V0oCyFFh0N-EIkr) | [Paper Link](https://ieeexplore.ieee.org/document/8676107) | Size: 21,761 images and 55,740 object instances <br>Number of Classes: 13<br>Resolution : spatial resolution from 0.15-m to 1.2-m <br> Platforms: Google Earth and Baidu Map <br> Use: Remote Sensing Object Detection <br>| | Dior Dataset | [Link](https://drive.google.com/open?id=1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC) | [Paper Link](https://arxiv.org/abs/1909.00133) | Size: 23,463 images and 192,518 object instances <br>Number of Classes: 20<br>Resolution : 800 x 800 <br> Platforms: Technical University of Munich · Northwestern Polytechnical University · Zhengzhou Institute of Surveying and Mapping <br> Use: Remote Sensing Object Detection <br>| | Remote Sensing Object Detection (RSOD) |Each object has its own link: [aircraft](http://pan.baidu.com/s/1eRWFV5C), [playground](http://pan.baidu.com/s/1nuD4KLb), [overpass](http://pan.baidu.com/s/1kVKAFB5) and [oiltank](http://pan.baidu.com/s/1kUZn4zX) | [Paper Link](http://ieeexplore.ieee.org/abstract/document/7827088/) | Size: 976 images and 6,950 object instances<br>Number of Classes: 4<br>Resolution : range from 0.3m to 3m <br> Platforms: Google Earth and Tianditu <br> Use: Remote Sensing Object Detection <br>| | DOTA-v1.0 | [Training_Set](https://drive.google.com/drive/folders/1gmeE3D7R62UAtuIFOB9j2M5cUPTwtsxK?usp=sharing), [Validation_Set](https://drive.google.com/drive/folders/1n5w45suVOyaqY84hltJhIZdtVFD9B224?usp=sharing), and [Testing_set](https://drive.google.com/drive/folders/1mYOf5USMGNcJRPcvRVJVV1uHEalG5RPl?usp=sharing) | [Paper Link](https://arxiv.org/abs/1711.10398) | Size: 2,806 images and 188, 282 instances<br>Number of Classes: 15<br>Resolution : range from 800 × 800 to 20,000 × 20,000 pixels <br> Platforms: Google Earth, GF-2 and JL-1 satellite provided by the China Centre for Resources Satellite Data and Application, and aerial images provided by CycloMedia B.V <br> Use: object detection in aerial images <br>| | DOTA-v1.5 | [Training_Set](https://drive.google.com/drive/folders/1gmeE3D7R62UAtuIFOB9j2M5cUPTwtsxK?usp=sharing), [Validation_Set](https://drive.google.com/drive/folders/1n5w45suVOyaqY84hltJhIZdtVFD9B224?usp=sharing), and [Testing_set](https://drive.google.com/drive/folders/1mYOf5USMGNcJRPcvRVJVV1uHEalG5RPl?usp=sharing) | [Paper Link](https://arxiv.org/abs/1711.10398) | Size: 2,806 images with 403,318 instances in total<br>Number of Classes: 16<br>Resolution : range from 800 × 800 to 20,000 × 20,000 pixels <br> *uses the same images as DOTA-v1.0, but the extremely small instances (less than 10 pixels) are also annotated. Moreover, a new category, ”container crane” is added. <br> Use: object detection in aerial images <br>| | DOTA-v2.0 |You need to download DOTA-v1.0 images, and then download the extra images and annotations of [DOTA-v2.0](https://whueducn-my.sharepoint.com/:f:/g/personal/2014301200247_whu_edu_cn/EiJ3JsfWPqhPn2955rjdtxoBZUFYWCX2ZXOtbZ-GT0I7Qw?e=XjeBMB) | [Paper Link](https://arxiv.org/abs/1711.10398) | Size: 11,268 images and 1,793,658 instances<br>Number of Classes: 18<br>Resolution : range from 800 × 800 to 20,000 × 20,000 pixels <br> *Compared to DOTA-v1.5, it further adds the new categories of ”airport” and ”helipad”. 
<br> Use: object detection in aerial images <br>| | iSAID Dataset | [Training_Set](https://drive.google.com/drive/folders/19RPVhC0dWpLF9Y_DYjxjUrwLbKUBQZ2K?usp=sharing), [Validation_Set](https://drive.google.com/drive/folders/17MErPhWQrwr92Ca1Maf4mwiarPS5rcWM?usp=sharing), [Testing_Set](https://drive.google.com/drive/folders/1mYOf5USMGNcJRPcvRVJVV1uHEalG5RPl?usp=sharing), and [testing_images_info](https://drive.google.com/open?id=1nQokIxSy3DEHImJribSCODTRkWlPJLE3) | [Paper Link](http://openaccess.thecvf.com/content_CVPRW_2019/papers/DOAI/Zamir_iSAID_A_Large-scale_Dataset_for_Instance_Segmentation_in_Aerial_Images_CVPRW_2019_paper.pdf) | Size: 2,806 images with 655,451 object instances<br>Number of Classes: 15<br>Resolution : high resolution <br> Platforms: Dota Dataset <br> Use: semantic segmentation or object detection <br>| | WHU dataset |[link](https://www.kaggle.com/datasets/xiaoqian970429/whu-building-dataset) - http://gpcv.whu.edu.cn/data/building_dataset.html | [Paper Link](https://arxiv.org/pdf/2208.00657v1.pdf) | Size: more than 220, 000 independent buildings <br>Number of Classes: 1<br>Resolution : 0.075 m spatial resolution and 450 km2 covering in Christchurch, New Zealand <br> Platforms: QuickBird, Worldview series, IKONOS, ZY-3 and 6 neighboring satellite images covering 550 km2 on East Asia with 2.7 m ground resolution.<br> Use: Remote Sensing Building detection and change detection <br>| | Vaihingen/Enz, Germany dataset |[link](https://seafile.projekt.uni-hannover.de/f/6a06a837b1f349cfa749/) | [Paper Link](https://arxiv.org/pdf/2206.09731v2.pdf) | Size: The data set contains 33 patches (of different sizes), each consisting of a true orthophoto (TOP) extracted from a larger TOP mosaic <br>Number of Classes: five foreground classes and one background class <br>Resolution : 9 cm resolution <br> Platforms: Intergraph/ZI DMC block, Leica ALS50 system and digital aerial cameras carried out by the German Association of Photogrammetry and Remote Sensing (DGPF) <br> Use: Urban Classification, 3D Building Reconstruction and Semantic Labeling <br>| | Potsdam dataset |[link](https://seafile.projekt.uni-hannover.de/f/429be50cc79d423ab6c4/) | [Paper Link](https://arxiv.org/pdf/2206.09731v2.pdf) | Size: 38 patches (of the same size), each consisting of a true orthophoto (TOP) extracted from a larger TOP mosaic <br>Number of Classes: same category information as the Vaihingen dataset<br>Resolution : 6000x6000 pixels and 5cm resolution <br> Platforms: Google Maps and OSM (DGPF)<br> Use: Semantic Segmentation <br>| -->
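
Several of the captioning datasets above (UCM-Captions, Sydney-Captions, RSICD) distribute their annotations as a single Karpathy-style JSON file: an `images` list whose entries carry a `filename`, a `split`, and five `sentences`. Below is a hedged loader sketch assuming that layout; the file names, field names, and directory layout are assumptions to verify against the copy you download.

```python
import json
from pathlib import Path

def load_caption_pairs(annotation_file: str, image_dir: str, split: str = "train"):
    """Yield (image_path, caption) pairs from a Karpathy-style annotation JSON.

    Assumed layout (check against the dataset you downloaded):
      {"images": [{"filename": "...", "split": "train",
                   "sentences": [{"raw": "..."}, ...]}, ...]}
    """
    data = json.loads(Path(annotation_file).read_text())
    for entry in data["images"]:
        if entry.get("split") != split:
            continue
        path = Path(image_dir) / entry["filename"]
        for sentence in entry["sentences"]:
            yield path, sentence["raw"]

# Example usage (paths are placeholders):
# for path, caption in load_caption_pairs("dataset_rsicd.json", "RSICD_images"):
#     print(path, "->", caption)
```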

## Related Repositories & Libraries

<!-- - [awesome-satellite-imagery-datasets](https://github.com/chrieke/awesome-satellite-imagery-datasets) -->

---

Stay tuned for continuous updates and improvements! 🚀