GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

<p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

Muhammad Sohail Danish*, Muhammad Akhtar Munir*, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste and Salman Khan

* Equally contributing first authors

Mohamed bin Zayed University of AI, University College London, Linköping University, IBM Research Europe, UK, ServiceNow Research, Australian National University

Paper | Website | HuggingFace

Official GitHub repository for GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks.

📢 Latest Updates


🛠️ Code and Leaderboard Coming Soon!

The code and leaderboard will be released shortly. Follow this repository for updates!


💡 Overview

<p align="center"> <img src="images/teaser_bench.jpg" width="1200"></a> </p> <p align="justify"> <b> Figure</b>: Examples of tasks from the GEOBench-VLM benchmark. Our benchmark is designed to evaluate VLMs on a diverse range of remote sensing applications. The benchmark includes over 10,000 questions spanning a range of tasks essential for Earth Observation, such as Temporal Understanding, Referring Segmentation, Visual Grounding, Scene Understanding, Counting, Detailed Image Captioning, and Relational Reasoning. Each task is tailored to capture unique domain-specific challenges, featuring varied visual conditions and object scales, and requiring nuanced understanding for applications like disaster assessment, urban planning, and environmental monitoring.

<p align="justify"> Abstract: While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique demands of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, which is critical for applications such as environmental monitoring, urban planning, and disaster management. Some of the unique challenges in geospatial domain include temporal analysis for changes, counting objects in large quantities, detecting tiny objects, and understanding relationships between entities occurring in Remote Sensing imagery. To address this gap in the geospatial domain, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. Our benchmark features over 10,000 manually verified instructions and covers a diverse set of variations in visual conditions, object type, and scale. We evaluate several state-of-the-art VLMs to assess their accuracy within the geospatial context. The results indicate that although existing VLMs demonstrate potential, they face challenges when dealing with geospatial-specific examples, highlighting the room for further improvements. Specifically, the best-performing GPT4o achieves only 40% accuracy on MCQs, which is only double the random guess performance. </p>

🏆 Contributions

<hr />

🗂️ Benchmarks Comparison

<p align="center"> <img src="images/benchmark-table.png" width="1200" alt="Dataset Comparison table"></a> </p>
<p align="justify"> <b> <span style="color: blue;">Table</span></b>: Overview of Generic and Geospatial-specific Datasets & Benchmarks, detailing modalities (O=Optical, PAN=Panchromatic, MS=Multi-spectral, IR=Infrared, SAR=Synthetic Aperture Radar, V=Video, MI=Multi-image, BT=Bi-Temporal, MT=Multi-temporal), data sources (DRSD=Diverse RS Datasets, OSM=OpenStreetMap, GE=Google Earth, answer types (MCQ=Multiple Choice, SC=Single Choice, FF=Free-Form, BBox=Bounding Box, Seg=Segmentation Mask), and annotation types (A=Automatic, M=Manual). </p>
<hr />

🔍 Dataset Annotation Pipeline

<p align="justify"> Our pipeline integrates diverse datasets, automated tools, and manual annotation. Tasks such as scene understanding, object classification, and non-optical analysis are based on classification datasets, while GPT-4o generates unique MCQs with five options: one correct answer, one semantically similar "closest" option, and three plausible alternatives. Spatial relationship tasks rely on manually annotated object pair relationships, ensuring consistency through cross-verification. Caption generation leverages GPT-4o, combining image, object details, and spatial interactions with manual refinement for high precision. </p> <p align="center"> <img src="images/pipeline7.jpg" width="1200"></a> </p> <hr />

📊 Results

Performance Summary of VLMs Across Geospatial Tasks. GPT-4o achieves higher accuracy on relatively easy tasks such as Aircraft Type Classification, Disaster Type Classification, Scene Classification, and Land Use Classification. However, averaged over the diverse geospatial MCQ tasks, the best-performing GPT-4o reaches only 40% accuracy, about double random-guess performance. These results showcase the varying strengths of VLMs in addressing diverse geospatial tasks.

<p align="center"> <img src="images/benchmark_heatmap1.png" width="1200" alt="Results Heatmap"></a> </p>

Temporal Understanding Results

Results highlight the strengths and limitations of VLMs in handling temporal geospatial challenges. We evaluate five tasks: Crop Type Classification, Disaster Type Classification, Farm Pond Change Detection, Land Use Classification, and Damaged Building Count. GPT-4o achieves the highest overall accuracy on the classification and counting tasks.

<div align="center">
ModelCrop Type ClassificationDisaster Type ClassificationFarm Pond Change DetectionLand Use ClassificationDamaged Building Count
LLaVA-OneV0.12730.44930.15790.56720.2139
Qwen2-VL0.12730.59030.09210.58690.2270
GPT-4o0.18180.63440.14470.62300.2420
</div>

Referring Expression Detection

Referring expression detection results. We report precision at IoU thresholds of 0.5 and 0.25.

<div align="center">
ModelPrecision@0.5 IoUPrecision@0.25 IoU
Sphinx0.34080.5289
GeoChat0.11510.2100
Ferret0.09430.2003
Qwen2-VL0.15180.2524
GPT-4o0.00870.0386
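
For context, a minimal sketch of how Precision@IoU can be computed for this task is shown below, assuming one predicted axis-aligned box (x1, y1, x2, y2) per referring expression; it is illustrative only, not the benchmark's official evaluation script.

```python
# Illustrative Precision@IoU sketch: a prediction counts as correct when its
# IoU with the ground-truth box meets the threshold (0.5 or 0.25 above).
def iou(a: tuple, b: tuple) -> float:
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(preds: list, gts: list, thr: float = 0.5) -> float:
    hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)


preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 12, 48, 52), (30, 30, 60, 60)]
print(precision_at_iou(preds, gts, thr=0.5))   # 0.5
print(precision_at_iou(preds, gts, thr=0.25))  # 0.5
```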
<hr />

🤖 Qualitative Results

<p align="justify"> <b> <span style="color: blue;">Scene Understanding</span></b>: This illustrates model performance on geospatial scene understanding tasks, highlighting successes in clear contexts and challenges in ambiguous scenes. The results emphasize the importance of contextual reasoning and addressing overlapping visual cues for accurate classification.
<p align="center"> <img src="images/results-scene.jpg" width="1200" alt="Scene Understanding"> </p>
<p align="justify"> <b> <span style="color: blue;">Counting</span></b>: The figure showcases model performance on counting tasks, where Qwen 2-VL, GPT-4o and LLaVA-One have better performance in identifying objects. Other models, such as Ferret, struggled with overestimation, highlighting challenges in object differentiation and spatial reasoning.
<p align="center"> <img src="images/results-counting.jpg" width="1200" alt="Counting"> </p>
<p align="justify"> <b> <span style="color: blue;">Object Classification</span></b>: The figure highlights model performance on object classification, showing success with familiar objects like the "atago-class destroyer" and "small civil transport/utility" aircraft. However, models struggled with rarer objects like the ``murasame-class destroyer" and ``garibaldi aircraft carrier" indicating a need for improvement on less common classes and fine-grained recognition.
<p align="center"> <img src="images/results-object.jpg" width="1200" alt="Object Classification"> </p>
<p align="justify"> <b> <span style="color: blue;">Event Detection</span></b>: Model performance on disaster assessment tasks, with success in scenarios like 'fire' and 'flooding' but challenges in ambiguous cases like 'tsunami' and 'seismic activity'. Misclassifications highlight limitations in contextual reasoning and insufficient exposure on overlapping disaster features.
<p align="center"> <img src="images/results-event.jpg" width="1200" alt="Event Detection"> </p>
<p align="justify"> <b> <span style="color: blue;">Spatial Relations</span></b>: The figure demonstrates model performance on spatial relationship tasks, with success in close-object scenarios and struggles in cluttered environments with distant objects.
<p align="center"> <img src="images/results-relations.jpg" width="1200" alt="Spatial Relations"> </p> <hr />

📜 Citation

If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:

@article{danish2024geobenchvlm,
      title={GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks}, 
      author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Kartik Kuckreja and Fahad Shahbaz Khan and Paolo Fraccaro and Alexandre Lacoste and Salman Khan},
      year={2024},
      eprint={2411.19325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19325}, 
}

📨 Contact

If you have any questions, please create an issue on this repository or contact us at muhammad.sohail@mbzuai.ac.ae.


<img src="images/MBZUAI_logo.png" width="290" height="85"> <img src="images/IVAL_logo.png" width="160" height="100"> <img src="images/ibm-logo.jpg" width="270"> <img src="images/ServiceNow_logo.png" width="270"> <img src="images/aialliance.png" width="270">