
<h1 align="center">Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey</h1> <a href="https://chen-yang-liu.github.io/">Chenyang Liu </a> · <a href="https://levir.buaa.edu.cn/members/index.html"> Jiafan Zhang </a> · <a href="https://chenkeyan.top/"> Keyan Chen </a> · <a href="https://levir.buaa.edu.cn/members/index.html"> Man Wang </a> · <a href="https://scholar.google.com/citations?user=DzwoyZsAAAAJ"> Zhengxia Zou </a> · <a href="https://scholar.google.com/citations?user=kNhFWQIAAAAJ"> Zhenwei Shi </a> <a href='https://arxiv.org/abs/2412.02573'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a>

This repo records and tracks recent Remote Sensing Temporal Vision-Language Models (RS-TVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.

:star: Give us a :star:

Give us a :star: if you're interested in this repo. We will continue to track relevant progress and update this repository.

🙌 Add Your Paper to Our Repo and Survey!

You are welcome to open an issue or PR for your RS-TVLM work! We will record it for the next version of our survey.

🥳 New

🔥🔥🔥 Updated on 2024.12.04 🔥🔥🔥

2024.12.04: The first version is available.

✨ Highlight!!

The first survey of Remote Sensing Temporal Vision-Language Models.
Links to some public datasets and code are provided.

📖 Introduction

Timeline of representative RS-TVLMs:

[Figure: timeline of representative RS-TVLMs]

📚 Methods: A Survey <a id="methods-a-survey"></a>

Change Captioning

Model Name	Paper Title	Visual Encoder	Language Decoder	Code/Project
CNN-RNN	Captioning changes in bi-temporal remote sensing images	VGG-16	RNN	N/A
CC-RNN/SVM	Change captioning: A new paradigm for multitemporal remote sensing image analysis	VGG-16	RNN,SVM	N/A
RSICCformer	Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset	ResNet-101	Transformer Decoder	code
PSNet	Progressive Scale-aware Network for Remote sensing Image Change Captioning	ViT-B/32	Transformer Decoder	code
PromptCC	A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning	ViT-B/32	GPT-2	code
Chg2Cap	Changes to Captions: An Attentive Network for Remote Sensing Change Captioning	ResNet-101	Transformer Decoder	code
ICT-Net	Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning	ResNet-101	Transformer Decoder	code
SITS-CC	Change Caption for Satellite Images Time Series	ResNet-101	Transformer Decoder	code
RSCaMa	RSCaMa: Remote Sensing Image Change Captioning with State Space Model	ViT-B/32	Mamba, Transformer Decoder, GPT-2	code
SparseFocus	A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning	ResNet-101	Transformer Decoder	code
SEN	Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning	ResNet with 6-channel	Transformer Decoder	code
Diffusion-RSCC	Diffusion model for learning cross-modal data distribution	ResNet-101	Diffusion	code
CARD	Context-aware Difference Distilling for Multi-change Captioning	ResNet-101	Transformer Decoder	code
ChangeRetCap	Towards a multimodal framework for remote sensing image change retrieval and captioning	ResNet-101	Transformer Decoder	code
Intelli-Change	Intelli-Change Remote Sensing - A Novel Transformer Approach	ResNet-101	Transformer Decoder	N/A
ChangeExp	Towards Temporal Change Explanations from Bi-Temporal Satellite Images	LLaVA-1.5	LLaVA-1.5	N/A
MAF-Net	Multi-scale Attentive Fusion Network for Remote Sensing Image Change Captioning	ResNet-101	Transformer Decoder	N/A
SFEN	Scale-wised feature enhancement network for change captioning of remote sensing images	WideResNet	Transformer Decoder	N/A
MfrNet	MfrNet: A New Multi-Scale Feature Refining Method for Remote Sensing Image Change Captioning	ResNet-18	Transformer Decoder	N/A
SEIFNet	Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning	ResNet-101	Transformer Decoder	code
MV-CC	MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption	InternVideo2	Transformer Decoder	code
Chareption	Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning	CLIP ViT-L/14	LLaMA-7B	N/A
MADiffCC	Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model	Diffusion	Transformer Decoder	N/A
CCExpert	CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset	SigLIP	Qwen-2	code
......

Multitask Learning of Change Detection and Captioning

Model Name	Paper Title	Visual Encoder	Language Decoder	Code/Project
Pix4Cap	Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning	ViT-B/32	Transformer Decoder	code
Change-Agent	Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis	ViT-B/32	Transformer Decoder	code
Semantic-CC	Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance	SAM	Vicuna	N/A
DetACC	Detection Assisted Change Captioning for Remote Sensing Image	ResNet-101	Transformer Decoder	N/A
KCFI	Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning	ViT	Qwen	code
ChangeMinds	ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing	Swin Transformer	Transformer Decoder	code
CTMTNet	A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images	ResNet-101	Transformer Decoder	N/A
......

Change Visual Question Answering

Model Name	Paper Title	Visual Encoder	Language Decoder	Code/Project
change-aware VQA	Change-Aware Visual Question Answering	CNN	RNN	N/A
CDVQA-Net	Change Detection Meets Visual Question Answering	CNN	RNN	code
ChangeChat	ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning	CLIP-ViT	Vicuna-v1.5	code
CDChat	CDChat: A Large Multimodal Model for Remote Sensing Change Description	CLIP ViT-L/14	Vicuna-v1.5	code
TEOChat	TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data	CLIP ViT-L/14	LLaMA-2	code
GeoLLaVA	GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing	Video encoder	LLaVA-NeXT and Video-LLaVA	code
CDQAG	Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection	CLIP image Encoder	CLIP Text Encoder	code
......

Text2Change Retrieval

Model Name	Paper Title	Code/Project
ChangeRetCap	Towards a multimodal framework for remote sensing image change retrieval and captioning	code
......

Change Grounding

Model Name	Paper Title	Code/Project
ChangeChat	ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning	code
CDChat	CDChat: A Large Multimodal Model for Remote Sensing Change Description	code
TEOChat	TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data	code
CDQAG	Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection	code
......

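Large Language Models Meet Temporal Images

Task abbreviations: CC = change captioning, CD = change detection, CVQA = change visual question answering, CG = change grounding.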

Method	Release Time	LLM	Fine-tuning	Task	Paper Title	Code/Project
PromptCC	2023.06	GPT-2	Prompt Learning	CC	A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning	code
Change-Agent	2024.07	ChatGPT	--	CC, CD	Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis	code
Semantic-CC	2024.07	Vicuna	LoRA	CC	Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance	N/A
ChangeChat	2024.09	Vicuna-v1.5	LoRA	CVQA, CG	ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning	code
KCFI	2024.09	Qwen	Prompt	CC	Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning	code
CDChat	2024.09	Vicuna-v1.5	LoRA	CVQA	CDChat: A Large Multimodal Model for Remote Sensing Change Description	code
TEOChat	2024.10	LLaMA-2	LoRA	CVQA, CG	TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data	code
GeoLLaVA	2024.10	LLaVA-NeXT	LoRA	CVQA	GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing	code
Chareption	2024.10	LLaMA-7B	Adapter	CC	Chareption: Change-Aware Adaption Empowers Large Language Model for Effective Remote Sensing Image Change Captioning	N/A
CCExpert	2024.11	Qwen-2	LoRA	CC	CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset	code
......
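
The table above can be summarized by counting how often each fine-tuning strategy appears. The following is a minimal sketch that assumes the rows are transcribed by hand into a Python list; the names `METHODS` and `STRATEGY_COUNTS` are illustrative and do not come from the survey or any of the listed codebases.

```python
from collections import Counter

# (method, fine-tuning strategy) pairs copied from the table above.
# "--" marks a method for which no fine-tuning strategy is listed.
METHODS = [
    ("PromptCC", "Prompt Learning"),
    ("Change-Agent", "--"),
    ("Semantic-CC", "LoRA"),
    ("ChangeChat", "LoRA"),
    ("KCFI", "Prompt"),
    ("CDChat", "LoRA"),
    ("TEOChat", "LoRA"),
    ("GeoLLaVA", "LoRA"),
    ("Chareption", "Adapter"),
    ("CCExpert", "LoRA"),
]

# Tally how many listed methods use each strategy.
STRATEGY_COUNTS = Counter(strategy for _, strategy in METHODS)

if __name__ == "__main__":
    for strategy, count in STRATEGY_COUNTS.most_common():
        print(f"{strategy}: {count}")
```

Among the ten methods listed so far, LoRA is the most frequently used strategy. The tally will change as new rows are added.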

📊 Dataset <a id="Dataset"></a>

Dataset Matching Temporal Images and Text: <a id="Matching-Temporal-Images-and-Text"></a>

Dataset	Image Size/Resolution	Image pairs	Captions	Annotation	Download Link
DUBAI CCD	50×50 (30m)	500	2,500	Manual	Link
LEVIR CCD	256×256 (0.5m)	500	2,500	Manual	Link
LEVIR-CC	256×256 (0.5m)	10,077	50,385	Manual	Link
WHU-CDC	256×256 (0.075m)	7,434	37,170	Manual	Link
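
As a quick consistency check on the table above, each dataset's caption count is five times its image-pair count. The sketch below recomputes that ratio from the listed figures; the dictionary layout and function name are illustrative assumptions and not part of any dataset toolkit.

```python
# Image-pair and caption counts copied from the table above.
# The structure and names here are illustrative only.
DATASETS = {
    "DUBAI CCD": (500, 2_500),
    "LEVIR CCD": (500, 2_500),
    "LEVIR-CC": (10_077, 50_385),
    "WHU-CDC": (7_434, 37_170),
}


def captions_per_pair(pairs: int, captions: int) -> float:
    """Return the average number of captions per image pair."""
    return captions / pairs


RATIOS = {name: captions_per_pair(*counts) for name, counts in DATASETS.items()}

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: {ratio:.1f} captions per image pair")
```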

Dataset Matching Temporal Images, Text, and Masks: <a id="Matching-Temporal-Images,-Text,-and-Masks"></a>

Dataset	Image Size/Resolution	Image pairs	Captions	Pixel-level Masks	Annotation	Download Link
LEVIR-MCI	256×256 (0.5m)	10,077	50,385	44,380 (building, road)	Manual	Link
LEVIR-CDC	256×256 (0.5m)	10,077	50,385	-- (building)	Manual	Link
WHU-CDC	256×256 (0.075m)	7,434	37,170	-- (building)	Manual	Link

Dataset Matching Temporal Images and Question-Answer Instructions: <a id="Matching-Temporal-Images-and-Question-Answer-Instructions"></a>

Dataset	Temporal Images	Image Resolution	Instruction Samples	Change-related Task	Annotation	Download Link
CDVQA	2,968 pairs (bi-temporal)	0.5m~3m	122,000	CVQA	Manual	Link
ChangeChat-87k	10,077 pairs (bi-temporal)	0.5m	87,195	CVQA, Grounding	Automated	Link
GeoLLaVA	100,000 pairs (bi-temporal)	--	100,000	CVQA	Automated	Link
TEOChatlas	-- (variable temporal length)	--	554,071	Classification, CVQA, Grounding	Automated	Link
QVG-360K	6,810 pairs (bi-temporal)	0.1m~3m	360,000	CVQA, Grounding	Automated	Link

......

👨‍🏫 Other Surveys <a id="Other-Survey"></a>

Year	Paper Title
2023	An Agenda for Multimodal Foundation Models for Earth Observation
2023	Self-Supervised Remote Sensing Feature Learning: Learning Paradigms, Challenges, and Future Works
2023	Large Remote Sensing Model: Progress and Prospects
2023	Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey
2023	On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications
2024	Vision-Language Models in Remote Sensing: Current Progress and Future Trends
2024	On the Foundations of Earth and Climate Foundation Models
2024	Towards Vision-Language Geo-Foundation Model: A Survey
2024	Language Integration in Remote Sensing: Tasks, datasets, and future directions
2024	Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
2024	An LLM Agent for Automatic Geospatial Data Analysis
2024	COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models

🖊️ Citation <a id="Citation"></a>

If you find our survey and repository useful for your research, please consider citing our paper:

@misc{liu2024remotesensingtemporalvisionlanguage,
      title={Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey}, 
      author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
      year={2024},
      eprint={2412.02573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02573}, 
}

🐲 Contact <a id="Contact"></a>

liuchenyang@buaa.edu.cn