<font size='5'>VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding</font>

Xiang Li, Jian Ding, Mohamed Elhoseiny

<a href='https://vrsbench.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2406.12384'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/datasets/xiang709/VRSBench'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a>

Related Projects

<font size='5'>RSGPT: A Remote Sensing Vision Language Model and Benchmark</font>

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li☨

<a href='https://github.com/Lavender105/RSGPT'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2307.15266'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>

<font size='5'>Vision-language models in remote sensing: Current progress and future trends</font>

Xiang Li☨, Congcong Wen, Yuan Hu, Zhenghang Yuan, Xiao Xiang Zhu

<a href='https://ieeexplore.ieee.org/abstract/document/10506064/'><img src='https://img.shields.io/badge/Paper-IEEE-red'></a>

VRSBench

<center> <img src="fig_example.png" alt="VRSBench is a Versatile Vision-Language Benchmark for Remote Sensing Image Understanding."> </center>

VRSBench is a Versatile Vision-Language Benchmark for Remote Sensing Image Understanding. It consists of 29,614 remote sensing images with detailed captions, 52,472 object references, and 123,221 visual question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks.

🗓️ TODO

Using datasets

The dataset can be downloaded from https://huggingface.co/datasets/xiang709/VRSBench and used via the Hugging Face datasets library. To load the dataset, you can use the following code snippet:

```python
from datasets import load_dataset

# Stream the dataset so samples are fetched on demand instead of
# downloading the full archive up front.
ds = load_dataset("xiang709/VRSBench", streaming=True)
```
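
As a quick sanity check, you can peek at one streamed record. The split name "train" and the fields printed below are assumptions based on the standard datasets API, not the exact VRSBench schema:

```python
# Pull a single record from the streaming iterator and list its fields.
# NOTE: the split name "train" is an assumption; use the splits listed on
# the Hugging Face dataset page.
sample = next(iter(ds["train"]))
print(sample.keys())
```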

Dataset curation

To construct our VRSBench dataset, we employed multiple data engineering steps, including attribute extraction, prompt engineering, GPT-4 inference, and human verification, as sketched below.
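
The following is a hypothetical, simplified outline of how such a pipeline can be wired together; every helper here is a placeholder for a stage of the actual (unreleased) tooling, not code shipped with VRSBench:

```python
# Hypothetical outline of the curation pipeline described above.

def extract_attributes(image):
    """Stub: pull object boxes/categories from the source detection labels."""
    return image.get("objects", [])

def build_prompt(image, attributes):
    """Stub: prompt engineering - pack image metadata into a GPT-4 instruction."""
    return f"Describe this remote sensing image containing: {attributes}"

def call_gpt4(prompt):
    """Stub: GPT-4 inference producing a draft caption / referring / QA record."""
    return {"caption": "...", "refers": [], "qa_pairs": []}

def human_verify(image, draft):
    """Stub: annotators correct the GPT-4 draft or reject it (return None)."""
    return draft

def curate(images):
    verified = []
    for image in images:
        attributes = extract_attributes(image)
        draft = call_gpt4(build_prompt(image, attributes))
        record = human_verify(image, draft)
        if record is not None:
            verified.append(record)
    return verified
```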

Model Training

For the above three tasks, we benchmark state-of-the-art models, including LLaVA-1.5, MiniGPT-v2, Mini-Gemini, and GeoChat, to demonstrate the potential of large vision-language models for remote sensing image understanding. To ensure a fair comparison, we load models that were initially pretrained on large-scale image-text alignment datasets and then finetune each method on the training set of our VRSBench dataset for 5 epochs. Following GeoChat, we use LoRA finetuning for all compared methods, with a rank of 64.
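
For reference, a LoRA setup of this kind can be expressed with the Hugging Face peft library as in the sketch below. Only the rank of 64 comes from the setting above; the alpha, dropout, and target projection layers are illustrative assumptions, not the exact hyper-parameters used by each baseline:

```python
from peft import LoraConfig, get_peft_model

# Rank 64 follows the setup described above; the remaining values are
# illustrative assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, lora_config)  # wrap the pretrained LVLM
```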

Use the prepare_geochat_eval_all.ipynb notebook to prepare the VRSBench evaluation files for the image captioning, visual grounding, and VQA tasks.

Benchmark Results

The code and checkpoints of baseline models can be found at GDrive.

Image Captioning Performance

| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE_L | CIDEr | CLAIR | Avg_L |
|---|---|---|---|---|---|---|---|---|---|
| GeoChat w/o ft | 13.9 | 6.6 | 3.0 | 1.4 | 7.8 | 13.2 | 0.4 | 0.42 | 36 |
| GPT-4V | 37.2 | 22.5 | 13.7 | 8.6 | 20.9 | 30.1 | 19.1 | 0.83 | 67 |
| MiniGPT-v2 | 36.8 | 22.4 | 13.9 | 8.7 | 17.1 | 30.8 | 21.4 | 0.73 | 37 |
| LLaVA-1.5 | 48.1 | 31.5 | 21.2 | 14.7 | 21.9 | 36.9 | 33.9 | 0.78 | 49 |
| GeoChat | 46.7 | 30.2 | 20.1 | 13.8 | 21.1 | 35.2 | 28.2 | 0.77 | 52 |
| Mini-Gemini | 47.6 | 31.1 | 20.9 | 14.3 | 21.5 | 36.8 | 33.5 | 0.77 | 47 |

Caption: Detailed image caption performance on the VRSBench dataset. Avg_L denotes the average length, in words, of the generated captions.
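
BLEU, METEOR, ROUGE_L, and CIDEr are standard captioning metrics; Avg_L can be computed directly as the mean number of words per generated caption. A minimal sketch of the Avg_L computation:

```python
def average_caption_length(captions):
    """Mean number of whitespace-separated words per generated caption (Avg_L)."""
    return sum(len(c.split()) for c in captions) / len(captions)
```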

Visual Grounding Performance

| Method | Acc@0.5 (Unique) | Acc@0.7 (Unique) | Acc@0.5 (Non Unique) | Acc@0.7 (Non Unique) | Acc@0.5 (All) | Acc@0.7 (All) |
|---|---|---|---|---|---|---|
| GeoChat w/o ft | 20.7 | 5.4 | 7.3 | 1.7 | 12.9 | 3.2 |
| GPT-4V | 8.6 | 2.2 | 2.5 | 0.4 | 5.1 | 1.1 |
| MiniGPT-v2 | 40.7 | 18.9 | 32.4 | 15.2 | 35.8 | 16.8 |
| LLaVA-1.5 | 51.1 | 16.4 | 34.8 | 11.5 | 41.6 | 13.6 |
| GeoChat | **57.4** | **22.6** | **44.5** | **18.0** | **49.8** | **19.9** |
| Mini-Gemini | 41.1 | 9.6 | 22.3 | 4.9 | 30.1 | 6.8 |

Caption: Visual grounding performance on the VRSBench dataset. Boldface indicates the best performance.
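
Acc@0.5 and Acc@0.7 count a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5 or 0.7, respectively. A minimal sketch of this metric, assuming axis-aligned boxes in [x1, y1, x2, y2] format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def acc_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```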

Visual Question Answering Performance

| Method | Category | Presence | Quantity | Color | Shape | Size | Position | Direction | Scene | Reasoning | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| # VQAs | 5435 | 7789 | 6374 | 3550 | 1422 | 1011 | 5829 | 477 | 4620 | 902 | 37409 |
| GeoChat w/o ft | 48.5 | 85.9 | 19.2 | 17.0 | 18.3 | 32.0 | 43.4 | 42.1 | 44.2 | 57.4 | 40.8 |
| GPT-4V | 67.0 | 87.6 | 45.6 | 71.0 | 70.8 | 54.3 | 67.2 | 50.7 | 69.8 | 72.4 | 65.6 |
| MiniGPT-v2 | 61.3 | 26.0 | 46.1 | 51.0 | 41.8 | 11.2 | 17.1 | 12.4 | 49.3 | 21.9 | 38.2 |
| LLaVA-1.5 | 86.9 | 91.8 | 58.2 | 69.9 | 72.2 | **61.5** | **69.5** | **56.7** | **83.9** | 73.4 | 76.4 |
| GeoChat | 86.5 | **92.1** | 56.3 | 70.1 | 73.8 | 60.4 | 69.3 | 53.5 | 83.7 | 73.5 | 76.0 |
| Mini-Gemini | **87.8** | **92.1** | **58.8** | **74.0** | **75.3** | 58.0 | 68.0 | **56.7** | 83.2 | **74.4** | **77.8** |

Caption: Visual question answering performance on the VRSBench dataset. Boldface indicates the best performance. Note that, unlike our initial submission, we use a GPT-based evaluation protocol in the final version; GPT-based evaluation better accounts for synonyms in open-set VQA.
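
A GPT-based protocol of this kind can be sketched as follows. The model name, prompt wording, and yes/no judging criterion here are illustrative assumptions, not the exact script behind the reported numbers:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_judge(question, reference, prediction, model="gpt-4"):
    """Ask a GPT model whether the prediction matches the reference answer,
    treating synonyms and paraphrases as correct. Illustrative only."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Does the predicted answer convey the same meaning as the reference? "
        "Reply with exactly '1' for yes or '0' for no."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip() == "1"
```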

Licensing Information

The dataset is released under the CC-BY-4.0 license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

📜 Citation

@article{li2024vrsbench,
  title={VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding},
  author={Li, Xiang and Ding, Jian and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2406.12384},
  year={2024}
}

🙏 Acknowledgement

Our VRSBench dataset is built on top of the DOTA-v2 and DIOR datasets.

We are thankful to LLaVA-1.5, MiniGPT-v2, Mini-Gemini, and GeoChat for releasing their models and code as open-source contributions.