<p align="center"> <h1 align="center">Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey</h1> <p align="center"> <br /> <a href="https://chen-yang-liu.github.io/"><strong>Chenyang Liu </strong></a> · <a href="https://levir.buaa.edu.cn/members/index.html"><strong> Jiafan Zhang </strong></a> · <a href="https://chenkeyan.top/"><strong> Keyan Chen </strong></a> · <a href="https://levir.buaa.edu.cn/members/index.html"><strong> Man Wang </strong></a> · <a href="https://scholar.google.com/citations?user=DzwoyZsAAAAJ"><strong> Zhengxia Zou </strong></a> · <a href="https://scholar.google.com/citations?user=kNhFWQIAAAAJ"><strong> Zhenwei Shi </strong></a> </p> <p align="center"> <a href='https://arxiv.org/abs/2412.02573'> <img src='https://img.shields.io/badge/arXiv-PDF-green?style=flat&logo=arXiv&logoColor=green' alt='arXiv PDF'> </a> <!-- <a href='https://ieeexplore.ieee.org/document/'> <img src='https://img.shields.io/badge/TPAMI-PDF-blue?style=flat&logo=IEEE&logoColor=green' alt='TPAMI PDF'> </a> --> </p> <br />This repo records and tracks recent Remote Sensing Temporal Vision-Language Models (RS-TVLMs). If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to open a pull request.
:star: Give us a :star:
If you're interested in this repo, please give it a :star:. We will continue to track relevant progress and update this repository.
🙌 Add Your Paper in our Repo and Survey!
- You are welcome to open an issue or PR for your RS-TVLM work! We will include it in the next version of our survey.
🥳 New
🔥🔥🔥 Updated on 2024.12.04 🔥🔥🔥
- 2024.12.04: The first version is available.
✨ Highlights
- The first survey of Remote Sensing Temporal Vision-Language Models (RS-TVLMs).
- Links to public datasets and code are provided.
📖 Introduction
Timeline of representative RS-TVLMs:
📖 Table of Contents
📚 Methods: A Survey <a id="methods-a-survey"></a>
Change Captioning
Multitask Learning of Change Detection and Captioning
Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
---|---|---|---|---|
Pix4Cap | Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning | ViT-B/32 | Transformer Decoder | code |
Change-Agent | Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis | ViT-B/32 | Transformer Decoder | code |
Semantic-CC | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | SAM | Vicuna | N/A |
DetACC | Detection Assisted Change Captioning for Remote Sensing Image | ResNet-101 | Transformer Decoder | N/A |
KCFI | Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning | ViT | Qwen | code |
ChangeMinds | ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing | Swin Transformer | Transformer Decoder | code |
CTMTNet | A Multi-Task Network and Two Large Scale Datasets for Change Detection and Captioning in Remote Sensing Images | ResNet-101 | Transformer Decoder | N/A |
...... |
Change Visual Question Answering
Model Name | Paper Title | Visual Encoder | Language Decoder | Code/Project |
---|---|---|---|---|
change-aware VQA | Change-Aware Visual Question Answering | CNN | RNN | N/A |
CDVQA-Net | Change Detection Meets Visual Question Answering | CNN | RNN | code |
ChangeChat | ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning | CLIP-ViT | Vicuna-v1.5 | code |
CDChat | CDChat: A Large Multimodal Model for Remote Sensing Change Description | CLIP ViT-L/14 | Vicuna-v1.5 | code |
TEOChat | TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data | CLIP ViT-L/14 | LLaMA-2 | code |
GeoLLaVA | GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing | Video encoder | LLaVA-NeXT and Video-LLaVA | code |
CDQAG | Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection | CLIP image Encoder | CLIP Text Encoder | code |
...... |
Text2Change Retrieval
Model Name | Paper Title | Code/Project |
---|---|---|
ChangeRetCap | Towards a multimodal framework for remote sensing image change retrieval and captioning | code |
...... |
Change Grounding
Large Language Models Meet Temporal Images
📊 Dataset <a id="Dataset"></a>
- Datasets Matching Temporal Images and Text: <a id="Matching-Temporal-Images-and-Text"></a>
Dataset | Image Size/Resolution | Image pairs | Captions | Annotation | Download Link |
---|---|---|---|---|---|
DUBAI CCD | 50×50 (30m) | 500 | 2,500 | Manual | Link |
LEVIR CCD | 256×256 (0.5m) | 500 | 2,500 | Manual | Link |
LEVIR-CC | 256×256 (0.5m) | 10,077 | 50,385 | Manual | Link |
WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | Manual | Link |
- Datasets Matching Temporal Images, Text, and Masks: <a id="Matching-Temporal-Images,-Text,-and-Masks"></a>
Dataset | Image Size/Resolution | Image pairs | Captions | Pixel-level Masks | Annotation | Download Link |
---|---|---|---|---|---|---|
LEVIR-MCI | 256×256 (0.5m) | 10,077 | 50,385 | 44,380 (building, road) | Manual | Link |
LEVIR-CDC | 256×256 (0.5m) | 10,077 | 50,385 | -- (building) | Manual | Link |
WHU-CDC | 256×256 (0.075m) | 7,434 | 37,170 | -- (building) | Manual | Link |
- Datasets Matching Temporal Images and Question-Answer Instructions: <a id="Matching-Temporal-Images-and-Question-Answer-Instructions"></a>
Dataset | Temporal Images | Image Resolution | Instruction Samples | Change-related Task | Annotation | Download Link |
---|---|---|---|---|---|---|
CDVQA | 2,968 pairs (bi-temporal) | 0.5m~3m | 122,000 | CVQA | Manual | Link |
ChangeChat-87k | 10,077 pairs (bi-temporal) | 0.5m | 87,195 | CVQA, Grounding | Automated | Link |
GeoLLaVA | 100,000 pairs (bi-temporal) | -- | 100,000 | CVQA | Automated | Link |
TEOChatlas | -- (variable temporal length) | -- | 554,071 | Classification, CVQA, Grounding | Automated | Link |
QVG-360K | 6,810 pairs (bi-temporal) | 0.1m~3m | 360,000 | CVQA, Grounding | Automated | Link |
......
👨🏫 Other Surveys <a id="Other-Survey"></a>
🖊️ Citation <a id="Citation"></a>
If you find our survey and repository useful for your research, please consider citing our paper:
```bibtex
@misc{liu2024remotesensingtemporalvisionlanguage,
      title={Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey},
      author={Chenyang Liu and Jiafan Zhang and Keyan Chen and Man Wang and Zhengxia Zou and Zhenwei Shi},
      year={2024},
      eprint={2412.02573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.02573},
}
```
🐲 Contact <a id="Contact"></a>
liuchenyang@buaa.edu.cn