
<p align="center"> <h1 align="center">Towards Vision-Language Geo-Foundation Models: A Survey</h1> <p align="center"> <b> arXiv, 2024 </b> <br /> <a href="https://zytx121.github.io/"><strong>Yue Zhou </strong></a> · <a href="https://scholar.google.com/citations?user=PnNAAasAAAAJ"><strong> Litong Feng </strong></a> <!-- · <a href="https://dr.ntu.edu.sg/cris/rp/rp00908/"><strong> Yiping Ke </strong></a> --> · <a href="https://sp.sjtu.edu.cn/"><strong> Xue Jiang </strong></a> · <a href="https://scholar.google.com/citations?user=ga230VoAAAAJ"><strong> Junchi Yan </strong></a> · <a href="https://yangxue0827.github.io/"><strong> Xue Yang </strong></a> · <a href="https://scholar.google.com/citations?user=5GtyVooAAAAJ"><strong> Wayne Zhang </strong></a> </p> <p align="center"> <a href='https://arxiv.org/abs/2406.09385'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <!-- <a href='https://ieeexplore.ieee.org/document/'> <img src='https://img.shields.io/badge/TPAMI-PDF-blue?style=flat&logo=IEEE&logoColor=green' alt='TPAMI PDF'> </a> --> </p> <br />

This repo is used for recording, tracking, and benchmarking several recent vision-language geo-foundation models (VLGFMs) to supplement our survey. If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.

🙌 Add Your Paper to Our Repo and Survey!

<!-- [-] **Our survey will be updated in 2024.3.** -->

🥳 New

🔥🔥🔥 Last Updated on 2024.07.19 🔥🔥🔥

✨ Highlight!!

📖 Introduction

This work presents the first detailed survey of remote sensing vision-language foundation models, covering contrastive, conversational, and generative VLGFMs.


📗 Summary of Contents

📚 Methods: A Survey

Keywords

The keyword column indicates the type of foundation model a method builds on (`clip`: CLIP-style contrastive model; `llm`: large language model; `sam`: Segment Anything Model) or, for datasets, the annotation modality (`i-t`: image-text pairs; `i-t-b`: image-text-box triplets; `i-t-m`: image-caption-mask triplets; `v-t`: video-text pairs).

Contrastive VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing | Code |
| 2024 | ICLR | clip | GRAFT: Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment | Project |
| 2024 | AAAI | clip | SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing | Code |
| 2024 | arXiv | clip | Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment | N/A |
| 2024 | TGRS | clip | RemoteCLIP: A Vision Language Foundation Model for Remote Sensing | Code |
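The contrastive models listed above all inherit CLIP's zero-shot mechanism: embed the query image and a set of class prompts with paired encoders, then rank classes by cosine similarity. A minimal sketch of that scoring step, using random placeholder embeddings in place of real encoder outputs (the prompt strings and the 512-d embedding size are illustrative assumptions, not taken from any particular model):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Score each class prompt by cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine similarity per class prompt
    return int(np.argmax(sims)), sims

# Random placeholder embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
prompts = ["an aerial image of a runway",
           "an aerial image of a forest",
           "an aerial image of a harbor"]
text_embs = rng.normal(size=(len(prompts), 512))
# Fake an image whose embedding lies near the "forest" prompt.
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)

best, sims = zero_shot_classify(image_emb, text_embs)
print(prompts[best])  # prints "an aerial image of a forest"
```

In a real VLGFM, `image_emb` and `text_embs` would come from the model's image and text encoders; the geometry of the scoring step is the same.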

Conversational VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | llm | Rsgpt: A remote sensing vision language model and benchmark | Code |
| 2024 | CVPR | llm | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | Code |
| 2024 | arXiv | llm | SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model | Code |
| 2024 | TGRS | llm | Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain | N/A |
| 2024 | ECCV | llm | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model | Code |
| 2024 | arXiv | llm | Large Language Models for Captioning and Retrieving Remote Sensing Images | N/A |
| 2024 | arXiv | llm | H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model | N/A |
| 2024 | RS | llm | RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery | Code |
| 2024 | arXiv | llm | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Code |
| 2024 | arXiv | llm | EarthMarker: A Visual Prompt Learning Framework for Region-level and Point-level Remote Sensing Imagery Comprehension | Code |

Generative VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2024 | ICLR | clip | DiffusionSat: A Generative Foundation Model for Satellite Imagery | Code |
| 2024 | arXiv | clip | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | N/A |

Datasets & Benchmark

| Year | Venue | Keywords | Name | Code/Project | Download |
|------|-------|----------|------|--------------|----------|
| 2016 | CITS | i-t | Sydney-Captions & UCM-Captions | N/A | link, link2 |
| 2017 | TGRS | i-t | RSICD | Project | link |
| 2020 | TGRS | i-t | RSVQA-LR & RSVQA-HR | Project | link1, link2 |
| 2021 | IGARSS | i-t | RSVQAxBEN | Project | link |
| 2021 | Access | i-t | FloodNet | Project | link |
| 2021 | TGRS | i-t | RSITMD | Code | link |
| 2021 | TGRS | i-t | RSIVQA | Code | link |
| 2022 | TGRS | i-t | NWPU-Captions | Project | link |
| 2022 | TGRS | i-t | CRSVQA | Project | link |
| 2022 | TGRS | i-t | LEVIR-CC | Project | link |
| 2022 | TGRS | i-t | CDVQA | Project | link |
| 2022 | TGRS | i-t | UAV-Captions | N/A | N/A |
| 2022 | MM | i-t-b | RSVG | Project | link |
| 2022 | RS | v-t | CapERA | Project | link |
| 2023 | TGRS | i-t-b | DIOR-RSVG | Project | link |
| 2023 | arXiv | i-t | RemoteCount | Code | N/A |
| 2023 | arXiv | i-t | RS5M | Code | link |
| 2023 | arXiv | i-t | RSICap & RSIEval | Code | N/A |
| 2023 | arXiv | i-t | LAION-EO | N/A | link |
| 2023 | ICCVW | i-t | SATIN | Project | link |
| 2024 | ICLR | i-t | NAIP-OSM | Project | N/A |
| 2024 | AAAI | i-t | SkyScript | Code | link |
| 2024 | AAAI | i-t-m | EarthVQA | Project | N/A |
| 2024 | TGRS | i-t-m | RRSIS | Code | link |
| 2024 | CVPR | i-t | GeoChat-Instruct & GeoChat-Bench | Code | link |
| 2024 | CVPR | i-t-m | RRSIS-D | Code | link |
| 2024 | arXiv | i-t | SkyEye-968k | Code | N/A |
| 2024 | arXiv | i-t | MMRS-1M | Project | N/A |
| 2024 | arXiv | i-t | LHRS-Align & LHRS-Instruct | Code | N/A |
| 2024 | arXiv | i-t-m | ChatEarthNet | Project | link |
| 2024 | arXiv | i-t | VLEO-Bench | Code | link |
| 2024 | arXiv | i-t | LuoJiaHOG | N/A | N/A |
| 2024 | arXiv | i-t-m | FineGrip | N/A | N/A |
| 2024 | arXiv | i-t | RS-GPT4V | N/A | N/A |
| 2024 | arXiv | i-t | VRSBench | N/A | N/A |

🕹️ Application

Captioning

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | TGRS | llm | A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning | Code |
| 2023 | JSEE | llm | VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning | N/A |

Retrieval

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | VT | llm | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | N/A |
| 2024 | arXiv | llm | Multi-Spectral Remote Sensing Image Retrieval Using Geospatial Foundation Models | Code |

Change Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | sam | Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection | Code |
| 2024 | JPRS | clip | ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning | Code |
| 2024 | TGRS | llm | A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection | Code |
| 2024 | arXiv | sam | Change Detection Between Optical Remote Sensing Imagery and Map Data via Segment Anything Model (SAM) | N/A |
| 2024 | arXiv | sam | Segment Any Change | N/A |

Scene Classification

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | IJAEOG | clip | RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision | Code |

Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | sam, clip | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Vision Foundation Models | Code |
| 2024 | TGRS | | RRSIS: Referring Remote Sensing Image Segmentation | Code |
| 2024 | CVPR | | Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | Code |
| 2024 | WACV | | CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting | N/A |

Visual Question Answering

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | CVPRW | | Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering | N/A |
| 2024 | AAAI | | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Project |

Geospatial Localization

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization | Code |
| 2023 | ICML | clip | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | Code |
| 2023 | NeurIPS | clip | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | Code |
| 2023 | arXiv | clip | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Code |

Object Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Stable Diffusion For Aerial Object Detection | N/A |

Super-Resolution

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | arXiv | clip | Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing | Code |

📊 Exploration

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2022 | TGRS | | An Empirical Study of Remote Sensing Pretraining | Code |
| 2023 | arXiv | | Autonomous GIS: the next-generation AI-powered GIS | N/A |
| 2023 | arXiv | | GPT4GEO: How a Language Model Sees the World's Geography | Code |
| 2023 | arXiv | | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Code |
| 2023 | arXiv | | The Potential of Visual ChatGPT For Remote Sensing | N/A |

👨‍🏫 Survey

| Year | Venue | Keywords | Paper Title | Code/Project |
|------|-------|----------|-------------|--------------|
| 2023 | IGARSS | | An Agenda for Multimodal Foundation Models for Earth Observation | N/A |
| 2023 | TGRS | | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works | N/A |
| 2023 | GISWU | | Large Remote Sensing Model: Progress and Prospects | N/A |
| 2023 | JSTARS | | Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey | N/A |
| 2023 | arXiv | | On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications | N/A |
| 2024 | GRSM | | Vision-Language Models in Remote Sensing: Current Progress and Future Trends | N/A |
| 2024 | arXiv | | On the Foundations of Earth and Climate Foundation Models | N/A |

🖊️ Citation

If you find our survey and repository useful for your research project, please consider citing our paper:

```bibtex
@article{zhou2024vlgfm,
  title={Towards Vision-Language Geo-Foundation Models: A Survey},
  author={Yue Zhou and Litong Feng and Yiping Ke and Xue Jiang and Junchi Yan and Xue Yang and Wayne Zhang},
  journal={arXiv preprint arXiv:2406.09385},
  year={2024}
}
```

🐲 Contact

yue.zhou@ntu.edu.sg