This is the repository for Foundation Models for Remote Sensing and Earth Observation: A Survey, a comprehensive survey of recent progress in multimodal foundation models for remote sensing and Earth observation. For details, please refer to:
Foundation Models for Remote Sensing and Earth Observation: A Survey
[Paper]
Abstract
Remote Sensing (RS) is a crucial technology for observing, monitoring, and interpreting our planet, with broad applications across geoscience, economics, humanitarian fields, etc. While artificial intelligence (AI), particularly deep learning, has achieved significant advances in RS, unique challenges persist in developing more intelligent RS systems, including the complexity of Earth's environments, diverse sensor modalities, distinctive feature patterns, varying spatial and spectral resolutions, and temporal dynamics. Meanwhile, recent breakthroughs in large Foundation Models (FMs) have expanded AI’s potential across many domains due to their exceptional generalizability and zero-shot transfer capabilities. However, their success has largely been confined to natural data like images and video, with degraded performance and even failures for RS data of various non-optical modalities. This has inspired growing interest in developing Remote Sensing Foundation Models (RSFMs) to address the complex demands of Earth Observation (EO) tasks, spanning the surface, atmosphere, and oceans. This survey systematically reviews the emerging field of RSFMs. It begins with an outline of their motivation and background, followed by an introduction to their foundational concepts. It then categorizes and reviews existing RSFM studies, including their datasets and technical contributions, across Visual Foundation Models (VFMs), Vision-Language Models (VLMs), Large Language Models (LLMs), and beyond. In addition, we benchmark these models against publicly available datasets, discuss existing challenges, and propose future research directions in this rapidly evolving field.
Citation
If you find our work useful in your research, please consider citing:
@article{xiao2024foundation,
  title={Foundation Models for Remote Sensing and Earth Observation: A Survey},
  author={Xiao, Aoran and Xuan, Weihao and Wang, Junjue and Huang, Jiaxing and Tao, Dacheng and Lu, Shijian and Yokoya, Naoto},
  journal={arXiv preprint arXiv:2410.16602},
  year={2024}
}
Menu
- Visual Foundation Models (VFMs) for RS
- Vision-Language Models (VLMs) for RS
- Large Language Models (LLMs) for RS
- Generative Foundation Models for RS
- Other RSFMs
Visual Foundation Models for RS
VFM Datasets
Dataset | Date | #Samples | Modality | Annotations | Data Sources | GSD | Paper | Link |
---|---|---|---|---|---|---|---|---|
FMoW-RGB | 2018 | 363.6k | RGB | 62 classes | QuickBird-2, GeoEye-1, WorldView-2/3 | varying | paper | download |
BigEarthNet | 2019 | 1.2 million | MSI, SAR | 19 LULC classes | Sentinel-1/2 | 10,20,60m | paper | download |
SeCo | 2021 | 1 million | MSI | None | Sentinel-2; NAIP | 10,20,60m | paper | download |
FMoW-Sentinel | 2022 | 882,779 | MSI | None | Sentinel-2 | 10m | paper | download |
MillionAID | 2022 | 1 million | RGB | 51 LULC classes | SPOT, IKONOS, WorldView, Landsat, etc. | 0.5m-153m | paper | download |
GeoPile | 2023 | 600K | RGB | None | Sentinel-2, NAIP, etc. | 0.1m-30m | paper | download |
SSL4EO-L | 2023 | 5 million | MSI | None | Landsat 4–9 | 30m | paper | download |
SSL4EO-S12 | 2023 | 3 million | MSI, SAR | None | Sentinel-1/2 | 10m | paper | download |
SatlasPretrain | 2023 | 856K tiles | RGB, MSI, SAR | 137 classes of 7 types | Sentinel-1/2, NAIP, NOAA Lidar Scans | 0.5–2m, 10m | paper | download |
MMEarth | 2024 | 1.2 million | RGB, MSI, SAR, DSM | None | Sentinel-1/2, ASTER DEM, etc. | 10,20,60m | paper | download |
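Most of these corpora ship as collections of GeoTIFF patches. As a minimal sketch of how such data is typically consumed (the directory layout, band indices, and reflectance scaling below are hypothetical; check each dataset's documentation for the actual conventions):

```python
# Minimal sketch of a PyTorch Dataset over multispectral GeoTIFF patches.
# Paths, band selection, and the normalization constant are illustrative
# assumptions, not the layout of any specific dataset above.
import glob

import numpy as np
import rasterio
import torch
from torch.utils.data import Dataset


class MultispectralPatches(Dataset):
    def __init__(self, root: str, bands=(4, 3, 2)):  # e.g. Sentinel-2 B4/B3/B2
        self.files = sorted(glob.glob(f"{root}/*.tif"))
        self.bands = list(bands)  # rasterio band indices are 1-based

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        with rasterio.open(self.files[idx]) as src:
            patch = src.read(self.bands).astype(np.float32)  # (C, H, W)
        patch /= 10000.0  # typical Sentinel-2 L2A reflectance scaling
        return torch.from_numpy(patch)
```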
VFM Models
Pre-training studies
- An empirical study of remote sensing pretraining. TGRS2022. | paper | code |
- SatlasPretrain: A large-scale dataset for remote sensing image understanding. ICCV2023. | paper | code |
- Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. ICCV2021. | paper | code |
- Geography-aware self-supervised learning. ICCV2021. | paper | code |
- Self-supervised material and texture representation learning for remote sensing tasks. CVPR2022. | paper | code |
- Change-aware sampling and contrastive learning for satellite images. CVPR2023. | paper | code |
- CSP: Self-supervised contrastive spatial pre-training for geospatial-visual representations. ICML2023. | paper | code |
- SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. CVPR2024. | paper | code |
- Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery. ECCV2024. | paper | code |
- SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. NeurIPS2022. | paper | code |
- Towards geospatial foundation models via continual pretraining. ICCV2023. | paper | code |
- Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. ICCV2023. | paper | code |
- Bridging remote sensors with multisensor geospatial foundation models. CVPR2024. | paper | code |
- Rethinking transformers pre-training for multi-spectral satellite imagery. CVPR2024. | paper | code |
- Masked angle-aware autoencoder for remote sensing images. ECCV2024. | paper | code |
- MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning. ECCV2024. | paper | code |
- CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. NeurIPS2023. | paper | code |
- Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. NeurIPS2023. | paper | code |
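Many of the entries above build on masked image modeling. The sketch below shows only the generic random-masking step shared (with many variations) by MAE-style pipelines such as SatMAE, Scale-MAE, and CROMA; the shapes and the 75% mask ratio are illustrative defaults, not any single paper's configuration:

```python
# Generic MAE-style random masking over patch tokens; a sketch of the
# shared mechanism, not a specific paper's implementation.
import torch


def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) patch embeddings -> (kept tokens, restore indices, mask)."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)  # per-token random scores
    shuffle = torch.argsort(noise, dim=1)           # random permutation of tokens
    restore = torch.argsort(shuffle, dim=1)         # inverse permutation

    keep_idx = shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=tokens.device)   # 1 = masked, 0 = visible
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, restore)           # back to original token order
    return kept, restore, mask


kept, restore, mask = random_masking(torch.randn(2, 196, 768))
```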
SAM-based studies
- SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model. NeurIPS2023 (DB). | paper | code |
- SAM-assisted remote sensing imagery semantic segmentation with object and boundary constraints. TGRS2024. | paper | code |
- UV-SAM: Adapting segment anything model for urban village identification. AAAI2024. | paper | code |
- CS-WSCDNet: Class activation mapping and segment anything model-based framework for weakly supervised change detection. TGRS2023. | paper | code |
- Adapting segment anything model for change detection in VHR remote sensing images. TGRS2024. | paper | code |
- Segment any change. NeurIPS2024. | paper |
- RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. TGRS2024. | paper | code |
- RingMo-SAM: A foundation model for segment anything in multimodal remote-sensing images. TGRS2023. | paper |
- The segment anything model (SAM) for remote sensing applications: From zero to one shot. JSTAR2023. | paper | code |
- CAT-SAM: Conditional tuning for few-shot adaptation of segment anything model. ECCV2024 (oral). | paper | code |
- Segment anything with multiple modalities. arXiv2024. | paper | code |
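For the adaptation works above, the common entry point is Meta's `segment-anything` package. A minimal point-prompt sketch (the checkpoint path, image file, and prompt coordinates are placeholders; the RS-specific works typically fine-tune or wrap this interface):

```python
# Minimal point-prompt inference with Meta's segment-anything package
# (pip install segment-anything). Checkpoint path, image file, and
# coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("rs_scene.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),  # (x, y) pixel prompt
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)
best = masks[scores.argmax()]  # (H, W) boolean mask
```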
Vision-Language Models for RS
VLM Datasets
Task | Dataset | Image Size | GSD (m) | #Text | #Images | Content | Link |
---|---|---|---|---|---|---|---|
VQA | RSVQA-LR | 256 | 10 | 77K | 772 | Questions for existence judging, area estimation, object comparison, scene recognition | download |
VQA | RSVQA-HR | 512 | 0.15 | 955K | 10,659 | Questions for existence judging, area estimation, object comparison, scene recognition | download |
VQA | RSVQAxBen | 120 | 10--60 | 15M | 590,326 | Questions for existence judging, object comparison, scene recognition | download |
VQA | RSIVQA | 512--4,000 | 0.3--8 | 111K | 37,000 | Questions for existence judging, area estimation, object comparison, scene recognition | download |
VQA | HRVQA | 1,024 | 0.08 | 1,070K | 53,512 | Questions for existence judging, object comparison, scene recognition | download |
VQA | CDVQA | 512 | 0.5--3 | 122K | 2,968 | Questions for object changes | download |
VQA | FloodNet | 3,000--4,000 | - | 11K | 2,343 | Questions for building and road damage assessment in disaster scenes | download |
VQA | RescueNet-VQA | 3,000--4,000 | 0.15 | 103K | 4,375 | Questions for building and road damage assessment in disaster scenes | download |
VQA | EarthVQA | 1,024 | 0.3 | 208K | 6,000 | Questions for relational judging, relational counting, situation analysis, and comprehensive analysis | download |
Image-Text Pre-training | RemoteCLIP | varied | varied | not specified | not specified | Developed based on retrieval, detection and segmentation data | download |
Image-Text Pre-training | RS5M | not specified | varied | 5M | 5M | Filtered public datasets, captioned existing data | download |
Image-Text Pre-training | SkyScript | not specified | 0.1--30 | 2.6M | 2.6M | Earth Engine images linked with OpenStreetMap semantics | download |
Caption | RSICD | 224 | - | 24,333 | 10,921 | Urban scenes for object description | download |
Caption | UCM-Caption | 256 | 0.3 | 10,500 | 2,100 | Urban scenes for object description | download |
Caption | Sydney | 500 | 0.5 | 3,065 | 613 | Urban scenes for object description | download |
Caption | NWPU-Caption | 256 | 0.2--30 | 157,500 | 31,500 | Urban scenes for object description | download |
Caption | RSITMD | 224 | - | 4,743 | 4,743 | Urban scenes for object description | download |
Caption | RSICap | 512 | varied | 3,100 | 2,585 | Urban scenes for object description | download |
Caption | ChatEarthNet | 256 | 10 | 173,488 | 163,488 | Urban and rural scenes for object description | download |
Visual Grounding | GeoVG | 1,024 | 0.24--4.8 | 7,933 | 4,239 | Visual grounding based on object properties and relations | download |
Visual Grounding | DIOR-RSVG | 800 | 0.5--30 | 38,320 | 17,402 | Visual grounding based on object properties and relations | download |
Mixed Multi-task | MMRS-1M | varied | varied | 1M | 975,022 | Collections of RSICD, UCM-Captions, FloodNet, RSIVQA, UC Merced, DOTA, DIOR-RSVG, etc. | download |
Mixed Multi-task | GeoChat-Set | varied | varied | 318K | 141,246 | Developed based on DOTA, DIOR, FAIR1M, FloodNet, RSVQA and NWPU-RESISC45 | download |
Mixed Multi-task | LHRS-Align | 256 | 1.0 | 1.15M | 1.15M | Constructed from Google Maps and OSM properties | download |
Mixed Multi-task | VRSBench | 512 | varied | 205,307 | 29,614 | Developed based on DOTA-v2 and DIOR datasets | download |
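The image-text pre-training corpora above (RemoteCLIP, RS5M, SkyScript) are typically consumed with a CLIP-style symmetric contrastive objective. A minimal sketch of that loss follows; the image and text encoders producing the embeddings are assumed and not shown:

```python
# Symmetric InfoNCE loss over a batch of paired image/text embeddings,
# the standard objective behind CLIP-style RS pre-training. Encoders
# are assumed; the temperature value is an illustrative default.
import torch
import torch.nn.functional as F


def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```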
VLM Models
- RemoteCLIP: A vision language foundation model for remote sensing. TGRS2024. | paper | code |
- RS5M: A large-scale vision-language dataset for remote sensing vision-language foundation model. TGRS2024. | paper | code |
- SkyScript: A large and semantically diverse vision-language dataset for remote sensing. AAAI2024. | paper | code |
- Remote sensing vision-language foundation models without annotations via ground remote alignment. ICLR2024. | paper |
- CSP: Self-supervised contrastive spatial pre-training for geospatial-visual representations. ICML2023. | paper | code |
- GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. NeurIPS2023. | paper | code |
- SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv2023. | paper | code |
- Learning representations of satellite images from metadata supervision. ECCV2024. | paper |
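Most of the models above expose a CLIP-style interface. As an illustration, zero-shot RS scene classification with the generic Hugging Face CLIP API; the base OpenAI checkpoint here is a stand-in, and RS-tuned weights from the works above are loaded analogously where they follow the CLIP format:

```python
# Zero-shot scene classification with the generic Hugging Face CLIP API.
# "openai/clip-vit-base-patch32" is a stand-in checkpoint; class names,
# prompt template, and image file are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["airport", "forest", "harbor", "farmland"]
inputs = processor(
    text=[f"a satellite photo of a {c}" for c in classes],
    images=Image.open("scene.png"),
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```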
Large Language Models for RS
Generative Foundation Models for RS
- DiffusionSat: A generative foundation model for satellite imagery. ICLR2024. | paper | code |
- MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation. NeurIPS2024.
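Works like DiffusionSat build on latent diffusion. With the generic `diffusers` API, text-to-image sampling looks like the sketch below; the checkpoint name is a placeholder base model, not a released DiffusionSat artifact, and the satellite-metadata conditioning those works add is not shown:

```python
# Text-to-satellite-image sampling with the generic diffusers API.
# The checkpoint is a placeholder Stable Diffusion base; DiffusionSat-style
# models extend it with metadata conditioning not sketched here.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a Sentinel-2 style satellite image of farmland along a river",
    num_inference_steps=50,
).images[0]
image.save("generated_scene.png")
```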