This is the repository for Foundation Models for Remote Sensing and Earth Observation: A Survey, a comprehensive survey of recent progress in multimodal foundation models for remote sensing and Earth observation. For details, please refer to:
Foundation Models for Remote Sensing and Earth Observation: A Survey
[Paper]
Abstract
Remote Sensing (RS) is a crucial technology for observing, monitoring, and interpreting our planet, with broad applications across geoscience, economics, humanitarian fields, etc. While artificial intelligence (AI), particularly deep learning, has achieved significant advances in RS, unique challenges persist in developing more intelligent RS systems, including the complexity of Earth's environments, diverse sensor modalities, distinctive feature patterns, varying spatial and spectral resolutions, and temporal dynamics. Meanwhile, recent breakthroughs in large Foundation Models (FMs) have expanded AI’s potential across many domains due to their exceptional generalizability and zero-shot transfer capabilities. However, their success has largely been confined to natural data like images and video, with degraded performance and even failures for RS data of various non-optical modalities. This has inspired growing interest in developing Remote Sensing Foundation Models (RSFMs) to address the complex demands of Earth Observation (EO) tasks, spanning the surface, atmosphere, and oceans. This survey systematically reviews the emerging field of RSFMs. It begins with an outline of their motivation and background, followed by an introduction to their foundational concepts. It then categorizes and reviews existing RSFM studies, including their datasets and technical contributions, across Visual Foundation Models (VFMs), Vision-Language Models (VLMs), Large Language Models (LLMs), and beyond. In addition, we benchmark these models against publicly available datasets, discuss existing challenges, and propose future research directions in this rapidly evolving field.
Citation
If you find our work useful in your research, please consider citing:
@article{xiao2024foundation,
  title={Foundation Models for Remote Sensing and Earth Observation: A Survey},
  author={Xiao, Aoran and Xuan, Weihao and Wang, Junjue and Huang, Jiaxing and Tao, Dacheng and Lu, Shijian and Yokoya, Naoto},
  journal={arXiv preprint arXiv:2410.16602},
  year={2024}
}
Menu
- Visual Foundation Models (VFMs) for RS
- Vision-Language Models (VLMs) for RS
- Large Language Models (LLMs) for RS
- Generative Foundation Models for RS
- Other RSFMs
Visual Foundation Models for RS
VFM Datasets
Dataset | Date | #Samples | Modality | Annotations | Data Sources | GSD | Paper | Link |
---|---|---|---|---|---|---|---|---|
FMoW-RGB | 2018 | 363.6k | RGB | 62 classes | QuickBird-2, GeoEye-1, WorldView-2/3 | varying | paper | download |
BigEarthNet | 2019 | 1.2 million | MSI, SAR | 19 LULC classes | Sentinel-1/2 | 10,20,60m | paper | download |
SeCo | 2021 | 1 million | MSI | None | Sentinel-2; NAIP | 10,20,60m | paper | download |
FMoW-Sentinel | 2022 | 882,779 | MSI | None | Sentinel-2 | 10m | paper | download |
MillionAID | 2022 | 1 million | RGB | 51 LULC classes | SPOT, IKONOS, WorldView, Landsat, etc. | 0.5m-153m | paper | download |
GeoPile | 2023 | 600K | RGB | None | Sentinel-2, NAIP, etc. | 0.1m-30m | paper | download |
SSL4EO-L | 2023 | 5 million | MSI | None | Landsat 4–9 | 30m | paper | download |
SSL4EO-S12 | 2023 | 3 million | MSI, SAR | None | Sentinel-1/2 | 10m | paper | download |
SatlasPretrain | 2023 | 856K tiles | RGB, MSI, SAR | 137 classes of 7 types | Sentinel-1/2, NAIP, NOAA Lidar Scans | 0.5–2m, 10m | paper | download |
MMEarth | 2024 | 1.2 million | RGB, MSI, SAR, DSM | None | Sentinel-1/2, ASTER DEM, etc. | 10,20,60m | paper | download |
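Most of these corpora ship as collections of GeoTIFF patches. As a minimal sketch of how such data is typically consumed (the directory layout, band indices, and reflectance scaling below are hypothetical; check each dataset's documentation for the actual conventions):

```python
# Minimal sketch of a PyTorch Dataset over multispectral GeoTIFF patches.
# Paths, band selection, and the normalization constant are illustrative
# assumptions, not the layout of any specific dataset above.
import glob

import numpy as np
import rasterio
import torch
from torch.utils.data import Dataset


class MultispectralPatches(Dataset):
    def __init__(self, root: str, bands=(4, 3, 2)):  # e.g. Sentinel-2 B4/B3/B2
        self.files = sorted(glob.glob(f"{root}/*.tif"))
        self.bands = list(bands)  # rasterio band indices are 1-based

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        with rasterio.open(self.files[idx]) as src:
            patch = src.read(self.bands).astype(np.float32)  # (C, H, W)
        patch /= 10000.0  # typical Sentinel-2 L2A reflectance scaling
        return torch.from_numpy(patch)
```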
VFM Models
Pre-training studies
- An empirical study of remote sensing pretraining. TGRS2022. | paper | code |
- SatlasPretrain: A large-scale dataset for remote sensing image understanding. ICCV2023. | paper | code |
- Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. ICCV2021. | paper | code |
- Geography-aware self-supervised learning. ICCV2021. | paper | code |
- Self-supervised material and texture representation learning for remote sensing tasks. CVPR2022. | paper | code |
- Change-aware sampling and contrastive learning for satellite images. CVPR2023. | paper | code |
- CSP: Self-supervised contrastive spatial pre-training for geospatial-visual representations. ICML2023. | paper | code |
- SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. CVPR2024. | paper | code |
- Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery. ECCV2024. | paper | code |
- SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. NeurIPS2022. | paper | code |
- Towards geospatial foundation models via continual pretraining. ICCV2023. | paper | code |
- Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. ICCV2023. | paper | code |
- Bridging remote sensors with multisensor geospatial foundation models. CVPR2024. | paper | code |
- Rethinking transformers pre-training for multi-spectral satellite imagery. CVPR2024. | paper | code |
- Masked angle-aware autoencoder for remote sensing images. ECCV2024. | paper | code |
- MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning. ECCV2024. | paper | code |
- CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. NeurIPS2023. | paper | code |
- Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. NeurIPS2023. | paper | code |
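Many of the entries above build on masked image modeling. The sketch below shows only the generic random-masking step shared (with many variations) by MAE-style pipelines such as SatMAE, Scale-MAE, and CROMA; the shapes and the 75% mask ratio are illustrative defaults, not any single paper's configuration:

```python
# Generic MAE-style random masking over patch tokens; a sketch of the
# shared mechanism, not a specific paper's implementation.
import torch


def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) patch embeddings -> (kept tokens, restore indices, mask)."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)  # per-token random scores
    shuffle = torch.argsort(noise, dim=1)           # random permutation of tokens
    restore = torch.argsort(shuffle, dim=1)         # inverse permutation

    keep_idx = shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=tokens.device)   # 1 = masked, 0 = visible
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, restore)           # back to original token order
    return kept, restore, mask


kept, restore, mask = random_masking(torch.randn(2, 196, 768))
```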
SAM-based studies
- SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model. NeurIPS2023 (DB). | paper | code |
- SAM-assisted remote sensing imagery semantic segmentation with object and boundary constraints. TGRS2024. | paper | code |
- UV-SAM: Adapting segment anything model for urban village identification. AAAI2024. | paper | code |
- CS-WSCDNet: Class activation mapping and segment anything model-based framework for weakly supervised change detection. TGRS2023. | paper | code |
- Adapting segment anything model for change detection in VHR remote sensing images. TGRS2024. | paper | code |
- Segment any change. NeurIPS2024. | paper |
- RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. TGRS2024. | paper | code |
- RingMo-SAM: A foundation model for segment anything in multimodal remote-sensing images. TGRS2023. | paper |
- The segment anything model (SAM) for remote sensing applications: From zero to one shot. JSTAR2023. | paper | code |
- CAT-SAM: Conditional tuning for few-shot adaptation of segment anything model. ECCV2024 (oral). | paper | code |
- Segment anything with multiple modalities. arXiv2024. | paper | code |
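For the adaptation works above, the common entry point is Meta's `segment-anything` package. A minimal point-prompt sketch (the checkpoint path, image file, and prompt coordinates are placeholders; the RS-specific works typically fine-tune or wrap this interface):

```python
# Minimal point-prompt inference with Meta's segment-anything package
# (pip install segment-anything). Checkpoint path, image file, and
# coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("rs_scene.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # (x, y) pixel prompt
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)
best = masks[scores.argmax()]  # (H, W) boolean mask
```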
Vision-Language Models for RS
VLM Datasets
Task | Dataset | Image Size | GSD (m) | #Text | #Images | Content | Link |
---|---|---|---|---|---|---|---|
VQA | RSVQA-LR | 256 | 10 | 77K | 772 | Questions for existence judging, area estimation, object comparison, scene recognition | download |
VQA | RSVQA-HR | 512 | 0.15 | 955K | 10,659 | Questions for existence judging, area estimation, object comparison, scene recognition | download |
VQA | RSVQAxBen | 120 | 10--60 | 15M | 590,326 | Questions for existence judging, object comparison, scene recognition | download |
VQA | RSIVQA | 512--4,000 | 0.3--8 | 111K | 37,000 | Questions for existence judging, area estimation, object comparison, scene recognition | download |
VQA | HRVQA | 1,024 | 0.08 | 1,070K | 53,512 | Questions for existence judging, object comparison, scene recognition | download |
VQA | CDVQA | 512 | 0.5--3 | 122K | 2,968 | Questions for object changes | download |
VQA | FloodNet | 3,000--4,000 | - | 11K | 2,343 | Questions for building and road damage assessment in disaster scenes | download |
VQA | RescueNet-VQA | 3,000--4,000 | 0.15 | 103K | 4,375 | Questions for building and road damage assessment in disaster scenes | download |
VQA | EarthVQA | 1,024 | 0.3 | 208K | 6,000 | Questions for relational judging, relational counting, situation analysis, and comprehensive analysis | download |
Image-Text Pre-training | RemoteCLIP | varied | varied | not specified | not specified | Developed based on retrieval, detection and segmentation data | download |
Image-Text Pre-training | RS5M | not specified | varied | 5M | 5M | Filtered public datasets, captioned existing data | download |
Image-Text Pre-training | SkyScript | not specified | 0.1--30 | 2.6M | 2.6M | Earth Engine images linked with OpenStreetMap semantics | download |
Caption | RSICD | 224 | - | 24,333 | 10,921 | Urban scenes for object description | download |
Caption | UCM-Caption | 256 | 0.3 | 10,500 | 2,100 | Urban scenes for object description | download |
Caption | Sydney | 500 | 0.5 | 3,065 | 613 | Urban scenes for object description | download |
Caption | NWPU-Caption | 256 | 0.2--30 | 157,500 | 31,500 | Urban scenes for object description | download |
Caption | RSITMD | 224 | - | 4,743 | 4,743 | Urban scenes for object description | download |
Caption | RSICap | 512 | varied | 3,100 | 2,585 | Urban scenes for object description | download |
Caption | ChatEarthNet | 256 | 10 | 173,488 | 163,488 | Urban and rural scenes for object description | download |
Visual Grounding | GeoVG | 1,024 | 0.24--4.8 | 7,933 | 4,239 | Visual grounding based on object properties and relations | download |
Visual Grounding | DIOR-RSVG | 800 | 0.5--30 | 38,320 | 17,402 | Visual grounding based on object properties and relations | download |
Mixed Multi-task | MMRS-1M | varied | varied | 1M | 975,022 | Collections of RSICD, UCM-Captions, FloodNet, RSIVQA, UC Merced, DOTA, DIOR-RSVG, etc. | download |
Mixed Multi-task | GeoChat-Set | varied | varied | 318K | 141,246 | Developed based on DOTA, DIOR, FAIR1M, FloodNet, RSVQA and NWPU-RESISC45 | download |
Mixed Multi-task | LHRS-Align | 256 | 1.0 | 1.15M | 1.15M | Constructed from Google Maps and OSM properties | download |
Mixed Multi-task | VRSBench | 512 | varied | 205,307 | 29,614 | Developed based on DOTA-v2 and DIOR datasets | download |
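The image-text pre-training corpora above (RemoteCLIP, RS5M, SkyScript) are typically consumed with a CLIP-style symmetric contrastive objective. A minimal sketch of that loss follows; the image and text encoders producing the embeddings are assumed and not shown:

```python
# Symmetric InfoNCE loss over a batch of paired image/text embeddings,
# the standard objective behind CLIP-style RS pre-training. Encoders
# are assumed; the temperature value is an illustrative default.
import torch
import torch.nn.functional as F


def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```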
VLM Models
- RemoteCLIP: A vision language foundation model for remote sensing. TGRS2024. | paper | code |
- RS5M: A large-scale vision-language dataset for remote sensing vision-language foundation model. TGRS2024. | paper | code |
- SkyScript: A large and semantically diverse vision-language dataset for remote sensing. AAAI2024. | paper | code |
- Remote sensing vision-language foundation models without annotations via ground remote alignment. ICLR2024. | paper |
- CSP: Self-supervised contrastive spatial pre-training for geospatial-visual representations. ICML2023. | paper | code |
- GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. NeurIPS2023. | paper | code |
- SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv2023. | paper | code |
- Learning representations of satellite images from metadata supervision. ECCV2024. | paper |
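Most of the models above expose a CLIP-style interface. As an illustration, zero-shot RS scene classification with the generic Hugging Face CLIP API; the base OpenAI checkpoint here is a stand-in, and RS-tuned weights from the works above are loaded analogously where they follow the CLIP format:

```python
# Zero-shot scene classification with the generic Hugging Face CLIP API.
# "openai/clip-vit-base-patch32" is a stand-in checkpoint; class names,
# prompt template, and image file are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["airport", "forest", "harbor", "farmland"]
inputs = processor(
    text=[f"a satellite photo of a {c}" for c in classes],
    images=Image.open("scene.png"),
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```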
Large Language Models for RS
Generative Foundation Models for RS
- DiffusionSat: A generative foundation model for satellite imagery. ICLR2024. | paper | code |
- MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation. NeurIPS2024.
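Works like DiffusionSat build on latent diffusion. With the generic `diffusers` API, text-to-image sampling looks like the sketch below; the checkpoint name is a placeholder base model, not a released DiffusionSat artifact, and the satellite-metadata conditioning those works add is not shown:

```python
# Text-to-satellite-image sampling with the generic diffusers API.
# The checkpoint is a placeholder Stable Diffusion base; DiffusionSat-style
# models extend it with metadata conditioning not sketched here.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a Sentinel-2 style satellite image of farmland along a river",
    num_inference_steps=50,
).images[0]
image.save("generated_scene.png")
```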