
<p align="center"> <h1 align="center">Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey</h1> <p align="center"> <br /> <a href="https://chen-yang-liu.github.io/"><strong>Chenyang Liu </strong></a> · <a href="https://levir.buaa.edu.cn/members/index.html"><strong> Jiafan Zhang </strong></a> · <a href="https://chenkeyan.top/"><strong> Keyan Chen </strong></a> · <a href="https://levir.buaa.edu.cn/members/index.html"><strong> Man Wang </strong></a> · <a href="https://scholar.google.com/citations?user=DzwoyZsAAAAJ"><strong> Zhengxia Zou </strong></a> · <a href="https://scholar.google.com/citations?user=kNhFWQIAAAAJ"><strong> Zhenwei Shi </strong></a> </p> <p align="center"> <a href='https://arxiv.org/abs/2412.02573'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <!-- <a href='https://ieeexplore.ieee.org/document/'> <img src='https://img.shields.io/badge/TPAMI-PDF-blue?style=flat&logo=IEEE&logoColor=green' alt='TPAMI PDF'> </a> --> </p> <br />

This repo records and tracks recent Remote Sensing Temporal Vision-Language Models (RS-TVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.

:star: Give us a :star:

Give us a :star: if you're interested in this repo. We will continue to track relevant progress and update this repository.

🙌 Add Your Paper to Our Repo and Survey!

🥳 New

🔥🔥🔥 Updated on 2024.12.04 🔥🔥🔥

✨ Highlight!!

📖 Introduction

Figure: Timeline of representative RS-TVLMs.

📖 Table of Contents

📚 Methods: A Survey <a id="methods-a-survey"></a>

Change Captioning

| Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
| --- | --- | --- | --- | --- |
| CNN-RNN | Captioning changes in bi-temporal remote sensing images | VGG-16 | RNN | N/A |
| CC-RNN/SVM | Change captioning: A new paradigm for multitemporal remote sensing image analysis | VGG-16 | RNN, SVM | N/A |
| RSICCformer | Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset | ResNet-101 | Transformer Decoder | code |
| PSNet | Progressive Scale-aware Network for Remote sensing Image Change Captioning | ViT-B/32 | Transformer Decoder | code |
| PromptCC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | ViT-B/32 | GPT-2 | code |
| Chg2Cap | Changes to Captions: An Attentive Network for Remote Sensing Change Captioning | ResNet-101 | Transformer Decoder | code |
| ICT-Net | Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | code |
| SITS-CC | Change Caption for Satellite Images Time Series | ResNet-101 | Transformer Decoder | code |
| RSCaMa | RSCaMa: Remote Sensing Image Change Captioning with State Space Model | ViT-B/32 | Mamba, Transformer Decoder, GPT-2 | code |
| SparseFocus | A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | code |
| SEN | Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning | ResNet with 6-channel | Transformer Decoder | code |
| Diffusion-RSCC | Diffusion model for learning cross-modal data distribution | ResNet-101 | Diffusion | code |
| CARD | Context-aware Difference Distilling for Multi-change Captioning | ResNet-101 | Transformer Decoder | code |
| ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | ResNet-101 | Transformer Decoder | code |
| Intelli-Change | Intelli-Change Remote Sensing - A Novel Transformer Approach | ResNet-101 | Transformer Decoder | N/A |
| ChangeExp | Towards Temporal Change Explanations from Bi-Temporal Satellite Images | LLaVA-1.5 | LLaVA-1.5 | N/A |
| MAF-Net | Multi-scale Attentive Fusion Network for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | N/A |
| SFEN | Scale-wised feature enhancement network for change captioning of remote sensing images | WideResNet | Transformer Decoder | N/A |
| MfrNet | MfrNet: A New Multi-Scale Feature Refining Method for Remote Sensing Image Change Captioning | ResNet-18 | Transformer Decoder | N/A |
| SEIFNet | Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning | ResNet-101 | Transformer Decoder | code |
| MV-CC | MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption | InternVideo2 | Transformer Decoder | code |
| Chareption | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | CLIP ViT-L/14 | LLaMA-7B | N/A |
| MADiffCC | Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model | Diffusion | Transformer Decoder | N/A |
| CCExpert | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | Diffusion | Transformer Decoder | code |
| ...... | | | | |

Multitask Learning of Change Detection and Captioning

| Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
| --- | --- | --- | --- | --- |
| Pix4Cap | Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning | ViT-B/32 | Transformer Decoder | code |
| Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | ViT-B/32 | Transformer Decoder | code |
| Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | N/A |
| DetACC | Detection Assisted Change Captioning for Remote Sensing Image | ResNet-101 | Transformer Decoder | N/A |
| KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | code |
| ChangeMinds | ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing | Swin Transformer | Transformer Decoder | code |
| CTMTNet | A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images | ResNet-101 | Transformer Decoder | N/A |
| ...... | | | | |

Change Visual Question Answering

| Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
| --- | --- | --- | --- | --- |
| change-aware VQA | Change-Aware Visual Question Answering | CNN | RNN | N/A |
| CDVQA-Net | Change Detection Meets Visual Question Answering | CNN | RNN | code |
| ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | code |
| CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP ViT-L/14 | Vicuna-v1.5 | code |
| TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | code |
| GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Video encoder | LLaVA-NeXT and Video-LLaVA | code |
| CDQAG | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | CLIP Image Encoder | CLIP Text Encoder | code |
| ...... | | | | |

Text2Change Retrieval

| Model Name | Paper Title | Code/Project |
| --- | --- | --- |
| ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | code |
| ...... | | |

Change Grounding

| Model Name | Paper Title | Code/Project |
| --- | --- | --- |
| ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | code |
| CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | code |
| TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | code |
| CDQAG | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | code |
| ...... | | |

Large Language Models Meet Temporal Images

| Method | Release Time | LLM | Fine-tuning | Task | Paper Title | Code/Project |
| --- | --- | --- | --- | --- | --- | --- |
| PromptCC | 2023.06 | GPT-2 | Prompt Learning | CC | A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning | code |
| Change-Agent | 2024.07 | ChatGPT | -- | CC, CD | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | code |
| Semantic-CC | 2024.07 | Vicuna | LoRA | CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | N/A |
| ChangeChat | 2024.09 | Vicuna-v1.5 | LoRA | CVQA, CG | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | code |
| KCFI | 2024.09 | Qwen | Prompt | CC | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | code |
| CDChat | 2024.09 | Vicuna-v1.5 | LoRA | CVQA | CDChat: A Large Multimodal Model for Remote Sensing Change Description | code |
| TEOChat | 2024.10 | LLaMA-2 | LoRA | CVQA, CG | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | code |
| GeoLLaVA | 2024.10 | LLaVA-NeXT | LoRA | CVQA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | code |
| Chareption | 2024.10 | LLaMA-7B | Adapter | CC | Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning | N/A |
| CCExpert | 2024.11 | Qwen-2 | LoRA | CC | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | code |
| ...... | | | | | | |

📊 Dataset <a id="Dataset"></a>

| Dataset | Image Size/Resolution | Image Pairs | Captions | Annotation | Download Link |
| --- | --- | --- | --- | --- | --- |
| DUBAI CCD | 50×50 (30m) | 500 | 2,500 | Manual | Link |
| LEVIR CCD | 256×256 (0.5m) | 500 | 2,500 | Manual | Link |
| LEVIR-CC | 256×256 (0.5m) | 10,077 | 50,385 | Manual | Link |
| WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | Manual | Link |

| Dataset | Image Size/Resolution | Image Pairs | Captions | Pixel-level Masks | Annotation | Download Link |
| --- | --- | --- | --- | --- | --- | --- |
| LEVIR-MCI | 256×256 (0.5m) | 10,077 | 50,385 | 44,380 (building, road) | Manual | Link |
| LEVIR-CDC | 256×256 (0.5m) | 10,077 | 50,385 | -- (building) | Manual | Link |
| WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | -- (building) | Manual | Link |

| Dataset | Temporal Images | Image Resolution | Instruction Samples | Change-related Task | Annotation | Download Link |
| --- | --- | --- | --- | --- | --- | --- |
| CDVQA | 2,968 pairs (bi-temporal) | 0.5m~3m | 122,000 | CVQA | Manual | Link |
| ChangeChat-87k | 10,077 pairs (bi-temporal) | 0.5m | 87,195 | CVQA, Grounding | Automated | Link |
| GeoLLaVA | 100,000 pairs (bi-temporal) | -- | 100,000 | CVQA | Automated | Link |
| TEOChatlas | -- (variable temporal length) | -- | 554,071 | Classification, CVQA, Grounding | Automated | Link |
| QVG-360K | 6,810 pairs (bi-temporal) | 0.1m~3m | 360,000 | CVQA, Grounding | Automated | Link |

......

👨‍🏫 Other Survey <a id="Other-Survey"></a>

| Year | Paper Title |
| --- | --- |
| 2023 | An Agenda for Multimodal Foundation Models for Earth Observation |
| 2023 | Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works |
| 2023 | Large Remote Sensing Model: Progress and Prospects |
| 2023 | Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey |
| 2023 | On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications |
| 2024 | Vision-Language Models in Remote Sensing: Current Progress and Future Trends |
| 2024 | On the Foundations of Earth and Climate Foundation Models |
| 2024 | Towards Vision-Language Geo-Foundation Model: A Survey |
| 2024 | Language Integration in Remote Sensing: Tasks, datasets, and future directions |
| 2024 | Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques |
| 2024 | An LLM Agent for Automatic Geospatial Data Analysis |
| 2024 | COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models |

🖊️ Citation <a id="Citation"></a>

If you find our survey and repository useful for your research, please consider citing our paper:

@misc{liu2024remotesensingtemporalvisionlanguage,
      title={Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey}, 
      author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
      year={2024},
      eprint={2412.02573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02573}, 
}

🐲 Contact <a id="Contact"></a>

liuchenyang@buaa.edu.cn