<p align="center"> <h1 align="center">Towards Vision-Language Geo-Foundation Models: A Survey</h1> <p align="center"> <b> arXiv, 2024 </b> <br /> <a href="https://zytx121.github.io/"><strong>Yue Zhou </strong></a> · <a href="https://scholar.google.com/citations?user=PnNAAasAAAAJ"><strong> Litong Feng </strong></a> <!-- · <a href="https://dr.ntu.edu.sg/cris/rp/rp00908/"><strong> Yiping Ke </strong></a> --> · <a href="https://sp.sjtu.edu.cn/"><strong> Xue Jiang </strong></a> · <a href="https://scholar.google.com/citations?user=ga230VoAAAAJ"><strong> Junchi Yan </strong></a> · <a href="https://yangxue0827.github.io/"><strong> Xue Yang </strong></a> · <a href="https://scholar.google.com/citations?user=5GtyVooAAAAJ"><strong> Wayne Zhang </strong></a> </p> <p align="center"> <a href='https://arxiv.org/abs/2406.09385'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <!-- <a href='https://ieeexplore.ieee.org/document/'> <img src='https://img.shields.io/badge/TPAMI-PDF-blue?style=flat&logo=IEEE&logoColor=green' alt='TPAMI PDF'> </a> --> </p> <br />This repo records, tracks, and benchmarks recent vision-language geo-foundation models (VLGFMs) to supplement our survey. If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request. We will add the missing papers to this repo as soon as possible.
## 🙌 Add Your Paper to Our Repo and Survey!

- You are welcome to open an issue or PR for your VLGFM work!
- Note: given the huge number of papers on arXiv, we are sorry that we cannot cover them all in our survey. You can open a PR to this repo directly, and we will record your work in the next version of our survey.
## 🥳 News
🔥🔥🔥 Last Updated on 2024.10.11 🔥🔥🔥
- 2024.10.11: Update TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data.
- 2024.8.31: Update MME-RealWorld, which contains 3,738 high-resolution remote sensing VQA samples.
- 2024.8.28: Update RSTeller.
- 2024.7.24: RS5M accepted by TGRS 2024.
- 2024.7.19: Update EarthMarker.
## ✨ Highlights

- The first survey of vision-language geo-foundation models, covering contrastive, conversational, and generative geo-foundation models.
- It also covers several related works, including the exploration and application of VLGFMs on downstream tasks.
- We list detailed results for the most representative works and give a fairer and clearer comparison of different approaches.
## 📖 Introduction

This is the first detailed survey of remote sensing vision-language foundation models, covering contrastive, conversational, and generative VLGFMs.
## 📗 Summary of Contents

- 📖 Introduction
- 📗 Summary of Contents
- 📚 Methods: A Survey
- 🕹️ Application
- 📊 Exploration
- 👨‍🏫 Survey
- 🖊️ Citation
- 🐲 Contact
## 📚 Methods: A Survey
### Keywords

- `clip`: Use CLIP
- `llm`: Use LLM (Large Language Model)
- `sam`: Use SAM (Segment Anything Model)
- `i-t`: Annotate using image-text tuples
- `v-t`: Annotate using video-text tuples
- `i-t-b`: Annotate using image-text-box triplets
- `i-t-m`: Annotate using image-text-mask triplets
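For concreteness, the annotation formats above might look like the following records. This is a purely illustrative sketch: the field names and file paths are hypothetical, not a standard schema used by any particular dataset in the tables below.

```python
# Hypothetical annotation records for the keyword formats above.
# Field names ("image", "caption", "box", "mask") are illustrative only.

# i-t: image-text tuple
it_sample = {
    "image": "images/airport_001.jpg",
    "caption": "An airport with several parked airplanes.",
}

# i-t-b: image-text-box triplet
# (box as [x_min, y_min, x_max, y_max] in pixel coordinates)
itb_sample = {
    "image": "images/airport_001.jpg",
    "caption": "the airplane near the terminal",
    "box": [120, 84, 310, 196],
}

# i-t-m: image-text-mask triplet
# (mask stored here as a path to a binary segmentation image)
itm_sample = {
    "image": "images/airport_001.jpg",
    "caption": "the runway",
    "mask": "masks/airport_001_runway.png",
}

# Every format pairs an image with free-form text; box/mask add localization.
for sample in (it_sample, itb_sample, itm_sample):
    assert "image" in sample and "caption" in sample
```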
### Contrastive VLGFMs

### Conversational VLGFMs

### Generative VLGFMs

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2024 | ICLR | `clip` | DiffusionSat: A Generative Foundation Model for Satellite Imagery | Code |
| 2024 | arXiv | `clip` | CRS-Diff: Controllable Generative Remote Sensing Foundation Model | N/A |
### Datasets & Benchmarks
## 🕹️ Application

### Captioning

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2023 | TGRS | `llm` | A Decoupling Paradigm With Prompt Learning for Remote Sensing Image Change Captioning | Code |
| 2023 | JSEE | `llm` | VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning | N/A |
### Retrieval

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2022 | VT | `llm` | CLIP-RS: A Cross-modal Remote Sensing Image Retrieval Based on CLIP, a Northern Virginia Case Study | N/A |
| 2024 | arXiv | `llm` | Multi-Spectral Remote Sensing Image Retrieval Using Geospatial Foundation Models | Code |
### Change Detection

### Scene Classification

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2023 | IJAEOG | `clip` | RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision | Code |
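As background, the zero-shot recipe shared by CLIP-based classifiers such as RS-CLIP can be sketched as follows. This is a minimal illustration, not any paper's implementation: the embeddings are random stand-ins for the outputs of a real vision-language encoder, and the class names, prompt template, and temperature are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere, as CLIP does before scoring."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for encoder outputs: in a real system these would come from a
# CLIP-style text encoder applied to prompts like
# "a satellite photo of a {class}" and an image encoder applied to the scene.
class_names = ["airport", "forest", "harbor", "residential"]
text_embeds = l2_normalize(rng.normal(size=(len(class_names), 512)))
image_embed = l2_normalize(rng.normal(size=(512,)))

# Zero-shot prediction: temperature-scaled cosine similarity + softmax
# over the class prompts; no task-specific training is involved.
logits = 100.0 * text_embeds @ image_embed
probs = np.exp(logits - logits.max())
probs /= probs.sum()

pred = class_names[int(np.argmax(probs))]
```

Because classes are expressed as text prompts, swapping in a new label set requires no retraining, which is what makes the zero-shot setting attractive for remote sensing scenes.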
### Segmentation

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2023 | arXiv | `sam` `clip` | Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Vision Foundation Models | Code |
| 2024 | TGRS | | RRSIS: Referring Remote Sensing Image Segmentation | Code |
| 2024 | CVPR | | Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation | Code |
| 2024 | WACV | | CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting | N/A |
### Visual Question Answering

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2022 | CVPRW | | Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering | N/A |
| 2024 | AAAI | | EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering | Project |
### Geospatial Localization

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2023 | arXiv | `clip` | Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization | Code |
| 2023 | ICML | `clip` | CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations | Code |
| 2023 | NeurIPS | `clip` | GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization | Code |
| 2023 | arXiv | `clip` | SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery | Code |
### Object Detection

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2023 | arXiv | `clip` | Stable Diffusion For Aerial Object Detection | N/A |
### Super-Resolution

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2023 | arXiv | `clip` | Zooming Out on Zooming In: Advancing Super-Resolution for Remote Sensing | Code |
## 📊 Exploration

| Year | Venue | Keywords | Paper Title | Code/Project |
|---|---|---|---|---|
| 2022 | TGRS | | An Empirical Study of Remote Sensing Pretraining | Code |
| 2023 | arXiv | | Autonomous GIS: the next-generation AI-powered GIS | N/A |
| 2023 | arXiv | | GPT4GEO: How a Language Model Sees the World's Geography | Code |
| 2023 | arXiv | | Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs | Code |
| 2023 | arXiv | | The Potential of Visual ChatGPT For Remote Sensing | N/A |
## 👨‍🏫 Survey
## 🖊️ Citation

If you find our survey and repository useful for your research project, please consider citing our paper:

```bibtex
@article{zhou2024vlgfm,
  title={Towards Vision-Language Geo-Foundation Models: A Survey},
  author={Yue Zhou and Litong Feng and Yiping Ke and Xue Jiang and Junchi Yan and Xue Yang and Wayne Zhang},
  journal={arXiv preprint arXiv:2406.09385},
  year={2024}
}
```
## 🐲 Contact

yue.zhou@ntu.edu.sg