
This is the repository for Foundation Models for Remote Sensing and Earth Observation: A Survey, a comprehensive survey of recent progress in multimodal foundation models for remote sensing and Earth observation. For details, please refer to:

Foundation Models for Remote Sensing and Earth Observation: A Survey
[Paper]



Abstract

Remote Sensing (RS) is a crucial technology for observing, monitoring, and interpreting our planet, with broad applications across geoscience, economics, and humanitarian fields. While artificial intelligence (AI), particularly deep learning, has achieved significant advances in RS, unique challenges persist in developing more intelligent RS systems, including the complexity of Earth's environments, diverse sensor modalities, distinctive feature patterns, varying spatial and spectral resolutions, and temporal dynamics. Meanwhile, recent breakthroughs in large Foundation Models (FMs) have expanded AI's potential across many domains due to their exceptional generalizability and zero-shot transfer capabilities. However, their success has largely been confined to natural data such as images and video, with degraded performance or outright failure on RS data from various non-optical modalities. This has inspired growing interest in developing Remote Sensing Foundation Models (RSFMs) to address the complex demands of Earth Observation (EO) tasks, spanning the surface, atmosphere, and oceans. This survey systematically reviews the emerging field of RSFMs. It begins with an outline of their motivation and background, followed by an introduction to their foundational concepts. It then categorizes and reviews existing RSFM studies, including their datasets and technical contributions, across Visual Foundation Models (VFMs), Vision-Language Models (VLMs), Large Language Models (LLMs), and beyond. In addition, we benchmark these models against publicly available datasets, discuss existing challenges, and propose future research directions in this rapidly evolving field.

Citation

If you find our work useful in your research, please consider citing:

```bibtex
@article{xiao2024foundation,
  title={Foundation Models for Remote Sensing and Earth Observation: A Survey},
  author={Xiao, Aoran and Xuan, Weihao and Wang, Junjue and Huang, Jiaxing and Tao, Dacheng and Lu, Shijian and Yokoya, Naoto},
  journal={arXiv preprint arXiv:2410.16602},
  year={2024}
}
```

Menu

Visual Foundation Models for RS

VFM Datasets

| Dataset | Date | #Samples | Modality | Annotations | Data Sources | GSD | Paper | Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FMoW-RGB | 2018 | 363.6K | RGB | 62 classes | QuickBird-2, GeoEye-1, WorldView-2/3 | varying | paper | download |
| BigEarthNet | 2019 | 1.2 million | MSI, SAR | 19 LULC classes | Sentinel-1/2 | 10, 20, 60 m | paper | download |
| SeCo | 2021 | 1 million | MSI | None | Sentinel-2; NAIP | 10, 20, 60 m | paper | download |
| FMoW-Sentinel | 2022 | 882,779 | MSI | None | Sentinel-2 | 10 m | paper | download |
| MillionAID | 2022 | 1 million | RGB | 51 LULC classes | SPOT, IKONOS, WorldView, Landsat, etc. | 0.5–153 m | paper | download |
| GeoPile | 2023 | 600K | RGB | None | Sentinel-2, NAIP, etc. | 0.1–30 m | paper | download |
| SSL4EO-L | 2023 | 5 million | MSI | None | Landsat 4–9 | 30 m | paper | download |
| SSL4EO-S12 | 2023 | 3 million | MSI, SAR | None | Sentinel-1/2 | 10 m | paper | download |
| SatlasPretrain | 2023 | 856K tiles | RGB, MSI, SAR | 137 classes of 7 types | Sentinel-1/2, NAIP, NOAA Lidar Scans | 0.5–2 m, 10 m | paper | download |
| MMEarth | 2024 | 1.2 million | RGB, MSI, SAR, DSM | None | Sentinel-1/2, Aster DEM, etc. | 10, 20, 60 m | paper | download |
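The GSD column gives meters per pixel; combined with tile size, it determines how much ground a single sample covers, which is worth keeping in mind when comparing these datasets. A trivial helper for that back-of-envelope check (the function name is ours, purely illustrative):

```python
def tile_footprint_km(tile_px: int, gsd_m: float) -> float:
    """Edge length (km) of the ground area covered by a square tile:
    pixels per side times meters per pixel."""
    return tile_px * gsd_m / 1000.0

# e.g. a 256-pixel Sentinel-2 tile at 10 m GSD spans 2.56 km per side,
# while a 1,024-pixel aerial tile at 0.3 m GSD spans only ~0.31 km.
```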

VFM Models

Pre-training studies

  1. An empirical study of remote sensing pretraining. TGRS2022. | paper | code |
  2. Satlaspretrain: A large-scale dataset for remote sensing image understanding. ICCV2023. | paper | code |
  3. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. ICCV2021. | paper | code |
  4. Geography-aware self-supervised learning. ICCV2021. | paper | code |
  5. Self-supervised material and texture representation learning for remote sensing tasks. CVPR2022. | paper | code |
  6. Change-aware sampling and contrastive learning for satellite images. CVPR2023. | paper | code |
  7. Csp: Self-supervised contrastive spatial pre-training for geospatial-visual representations. ICML2023. | paper | code |
  8. Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. CVPR2024. | paper | code |
  9. Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery. ECCV2024. | paper | code |
  10. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. NeurIPS2022. | paper | code |
  11. Towards geospatial foundation models via continual pretraining. ICCV2023. | paper | code |
  12. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. ICCV2023. | paper | code |
  13. Bridging remote sensors with multisensor geospatial foundation models. CVPR2024. | paper | code |
  14. Rethinking transformers pre-training for multi-spectral satellite imagery. CVPR2024. | paper | code |
  15. Masked angle-aware autoencoder for remote sensing images. ECCV2024. | paper | code |
  16. Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning. ECCV2024. | paper | code |
  17. Croma: Remote sensing representations with contrastive radar-optical masked autoencoders. NeurIPS2023. | paper | code |
  18. Cross-scale mae: A tale of multiscale exploitation in remote sensing. NeurIPS2023. | paper | code |
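Several of the entries above (SatMAE, Scale-MAE, CROMA, Cross-Scale MAE) build on masked image modeling: a large fraction of patch tokens is hidden and the encoder sees only the visible subset. A minimal NumPy sketch of the random-masking step under the common 75% ratio (shapes and names are illustrative, not any model's actual code):

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking: keep a random subset of patch tokens.

    patches: (num_patches, dim) array. Returns the visible patches,
    a boolean mask (True = masked), and the indices that were kept.
    """
    rng = np.random.default_rng(seed)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False  # visible positions are unmasked
    return patches[keep_idx], mask, keep_idx
```

The RS-specific variants listed above differ mainly in what they mask and reconstruct (angles, scales, multiple sensors), not in this basic mechanism.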

SAM-based studies

  1. SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model. NeurIPS2023 (DB). | paper | code |
  2. Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints. TGRS2024. | paper | code |
  3. Uv-sam: Adapting segment anything model for urban village identification. AAAI2024. | paper | code |
  4. Cs-wscdnet: Class activation mapping and segment anything model-based framework for weakly supervised change detection. TGRS2023. | paper | code |
  5. Adapting segment anything model for change detection in vhr remote sensing images. TGRS2024. | paper | code |
  6. Segment any change. NeurIPS2024. | paper |
  7. Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. TGRS2024. | paper | code |
  8. Ringmo-sam: A foundation model for segment anything in multimodal remote-sensing images. TGRS2023. | paper |
  9. The segment anything model (sam) for remote sensing applications: From zero to one shot. JSTAR2023. | paper | code |
  10. Cat-sam: Conditional tuning for few-shot adaptation of segmentation anything model. ECCV2024 (oral). | paper | code |
  11. Segment anything with multiple modalities. arXiv2024. | paper | code |

Vision-Language Models for RS

VLM Datasets

| Task | Dataset | Image Size | GSD (m) | #Text | #Images | Content | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA | RSVQA-LR | 256 | 10 | 77K | 772 | Questions on object existence, area estimation, object comparison, and scene recognition | download |
| VQA | RSVQA-HR | 512 | 0.15 | 955K | 10,659 | Questions on object existence, area estimation, object comparison, and scene recognition | download |
| VQA | RSVQAxBen | 120 | 10–60 | 15M | 590,326 | Questions on object existence, object comparison, and scene recognition | download |
| VQA | RSIVQA | 512–4,000 | 0.3–8 | 111K | 37,000 | Questions on object existence, area estimation, object comparison, and scene recognition | download |
| VQA | HRVQA | 1,024 | 0.08 | 1,070K | 53,512 | Questions on object existence, object comparison, and scene recognition | download |
| VQA | CDVQA | 512 | 0.5–3 | 122K | 2,968 | Questions on object changes | download |
| VQA | FloodNet | 3,000–4,000 | - | 11K | 2,343 | Questions on building and road damage assessment in disaster scenes | download |
| VQA | RescueNet-VQA | 3,000–4,000 | 0.15 | 103K | 4,375 | Questions on building and road damage assessment in disaster scenes | download |
| VQA | EarthVQA | 1,024 | 0.3 | 208K | 6,000 | Questions on relational judging, relational counting, situation analysis, and comprehensive analysis | download |
| Image-Text Pre-training | RemoteCLIP | varied | varied | not specified | not specified | Developed from retrieval, detection, and segmentation data | download |
| Image-Text Pre-training | RS5M | not specified | varied | 5M | 5M | Filtered public datasets, captioned existing data | download |
| Image-Text Pre-training | SkyScript | not specified | 0.1–30 | 2.6M | 2.6M | Earth Engine images linked with OpenStreetMap semantics | download |
| Caption | RSICD | 224 | - | 24,333 | 10,921 | Urban scenes for object description | download |
| Caption | UCM-Caption | 256 | 0.3 | 2,100 | 10,500 | Urban scenes for object description | download |
| Caption | Sydney | 500 | 0.5 | 613 | 3,065 | Urban scenes for object description | download |
| Caption | NWPU-Caption | 256 | 0.2–30 | 157,500 | 31,500 | Urban scenes for object description | download |
| Caption | RSITMD | 224 | - | 4,743 | 4,743 | Urban scenes for object description | download |
| Caption | RSICap | 512 | varied | 3,100 | 2,585 | Urban scenes for object description | download |
| Caption | ChatEarthNet | 256 | 10 | 173,488 | 163,488 | Urban and rural scenes for object description | download |
| Visual Grounding | GeoVG | 1,024 | 0.24–4.8 | 7,933 | 4,239 | Visual grounding based on object properties and relations | download |
| Visual Grounding | DIOR-RSVG | 800 | 0.5–30 | 38,320 | 17,402 | Visual grounding based on object properties and relations | download |
| Mixed Multi-task | MMRS-1M | varied | varied | 1M | 975,022 | Collection of RSICD, UCM-Captions, FloodNet, RSIVQA, UC Merced, DOTA, DIOR-RSVG, etc. | download |
| Mixed Multi-task | GeoChat-Set | varied | varied | 318K | 141,246 | Developed from DOTA, DIOR, FAIR1M, FloodNet, RSVQA, and NWPU-RESISC45 | download |
| Mixed Multi-task | LHRS-Align | 256 | 1.0 | 1.15M | 1.15M | Constructed from Google Map and OSM properties | download |
| Mixed Multi-task | VRSBench | 512 | varied | 205,307 | 29,614 | Developed from the DOTA-v2 and DIOR datasets | download |

VLM Models

  1. Remoteclip: A vision language foundation model for remote sensing. TGRS2024. | paper | code |
  2. Rs5m: A large scale vision-language dataset for remote sensing vision-language foundation model. TGRS2024. | paper | code |
  3. Skyscript: A large and semantically diverse vision-language dataset for remote sensing. AAAI2024. | paper | code |
  4. Remote sensing vision-language foundation models without annotations via ground remote alignment. ICLR2024. | paper |
  5. Csp: Self-supervised contrastive spatial pre-training for geospatial-visual representations. ICML2023. | paper | code |
  6. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. NeurIPS2024. | paper | code |
  7. Satclip: Global, general-purpose location embeddings with satellite imagery. arXiv2023. | paper | code |
  8. Learning representations of satellite images from metadata supervision. ECCV2024. | paper |
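RemoteCLIP, SkyScript, GeoCLIP, and SatCLIP all inherit CLIP's symmetric contrastive objective, which pulls matched image-text (or image-location) embeddings together and pushes mismatched pairs apart. A self-contained NumPy sketch of that loss on a toy batch (a sketch of the standard objective, not any listed model's training code):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B); matched pairs on the diagonal
    targets = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The geolocation variants (GeoCLIP, SatCLIP) swap the text encoder for a location encoder but keep this same batch-contrastive structure.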

Large Language Models for RS

  1. Geollm: Extracting geospatial knowledge from large language models. ICLR2024. | paper | code |

Generative Foundation Models for RS

  1. Diffusionsat: A generative foundation model for satellite imagery. ICLR2024. | paper | code |
  2. MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation. NeurIPS2024.
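DiffusionSat follows the standard denoising-diffusion recipe: the generator learns to invert a fixed forward process that gradually corrupts a clean image with Gaussian noise. A minimal sketch of that forward (noising) step under a linear beta schedule (schedule values are illustrative, not taken from the paper):

```python
import numpy as np

def forward_diffuse(x0, t, betas, seed=0):
    """Sample x_t ~ q(x_t | x_0) for the DDPM forward (noising) process."""
    rng = np.random.default_rng(seed)
    alpha_bar = np.cumprod(1.0 - betas)[t]  # cumulative signal retention at step t
    eps = rng.standard_normal(x0.shape)     # the noise the model learns to predict
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

# a linear schedule common in DDPM-style models (illustrative values)
betas = np.linspace(1e-4, 0.02, 1000)
```

At small t the sample stays close to the clean image; by the final step it is nearly pure noise, which is what makes conditional generation from noise possible at inference time.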

Other RSFMs

Weather forecasting

  1. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 2023. | paper | code |