GeoChat <img src="images/logo_geochat.png" height="40">: Grounded Large Vision-Language Model for Remote Sensing [CVPR-2024]

<p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

Kartik Kuckreja*, Muhammad Sohail Danish*, Muzammal Naseer, Abhijit Das, Salman Khan and Fahad Khan

* Equally contributing first authors

Mohamed bin Zayed University of AI, Birla Institute of Technology & Science, Australian National University, Linkoping University



<img src="images/logo_geochat.png" height="40"> Overview

GeoChat is the first grounded Large Vision-Language Model specifically tailored to Remote Sensing (RS) scenarios. Unlike general-domain models, GeoChat handles high-resolution RS imagery and uses region-level reasoning for comprehensive scene interpretation. It is fine-tuned on a newly created RS multimodal dataset using the LLaVA-1.5 architecture. This results in robust zero-shot performance across various RS tasks, including image and region captioning, visual question answering, scene classification, visually grounded conversations, and referring object detection.


Install

  1. Clone this repository and navigate to the GeoChat folder
git clone https://github.com/mbzuai-oryx/GeoChat.git
cd GeoChat
  2. Install the package
conda create -n geochat python=3.10 -y
conda activate geochat
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install ninja
pip install flash-attn --no-build-isolation

Upgrade to the latest code base

git pull
pip uninstall transformers
pip install -e .

GeoChat Weights and Demo

Please check out our Model Zoo for all public GeoChat checkpoints, and check LoRA.md for instructions on how to run the demo and training.

Train

GeoChat training consists of visual instruction tuning on the GeoChat_Instruct dataset: 318k Vicuna-generated multimodal instruction-following samples, used to finetune the pretrained weights of LLaVA-v1.5.

We train GeoChat on 3 A100 GPUs with 40GB of memory each. To train on fewer GPUs, reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
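As a minimal sketch of that arithmetic, the helper below computes the gradient_accumulation_steps needed to hold the global batch size at 144 (the value listed under Hyperparameters). The function name and the per-device batch sizes are hypothetical and chosen only for illustration; they are not taken from finetune_lora.sh.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                per_device_batch_size: int,
                                num_gpus: int) -> int:
    """Return the accumulation steps that keep the global batch size fixed."""
    samples_per_step = per_device_batch_size * num_gpus
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"{per_device_batch_size} x {num_gpus} does not divide "
            f"{global_batch_size}; pick another per-device batch size"
        )
    return global_batch_size // samples_per_step


GLOBAL_BATCH_SIZE = 144  # from the hyperparameter table

# Hypothetical (num_gpus, per_device_train_batch_size) combinations.
for num_gpus, per_device in [(3, 24), (2, 24), (1, 16)]:
    steps = gradient_accumulation_steps(GLOBAL_BATCH_SIZE, per_device, num_gpus)
    print(f"{num_gpus} GPU(s), per-device batch {per_device}: "
          f"gradient_accumulation_steps = {steps}")
```

If the per-device batch size does not divide the target evenly, the helper raises instead of silently changing the global batch size.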

Hyperparameters

We use a set of hyperparameters similar to Vicuna's in finetuning. The hyperparameters used are provided below.

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| GeoChat-7B | 144 | 2e-5 | 1 | 2048 | 0 |

Pretrain (feature alignment)

We use the pretrained projector from LLaVA-v1.5, which is trained on a 558K subset of the LAION-CC-SBU dataset with BLIP captions. It takes around 3.5 hours for LLaVA-v1.5-7B.

Visual Instruction Tuning

  1. Prepare data

Please download the annotation file for the final mixture of our instruction tuning data, GeoChat_Instruct.json, and download the split image zips from Hugging Face. Save the image zip parts in a single folder and run the following command to merge them:

cat images_parta* > images.zip

Unzip the images.zip file to a folder and set that folder's path in finetune_lora.sh.
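The merge and unzip steps above can also be scripted, for example on systems without cat. The sketch below is one possible way to do it with the Python standard library; the function names and default arguments are hypothetical, and it assumes the parts sort correctly by file name, as the shell glob does.

```python
import shutil
import zipfile
from pathlib import Path


def merge_parts(parts_dir: str, pattern: str = "images_parta*",
                merged_name: str = "images.zip") -> Path:
    """Concatenate the split archive parts in sorted file-name order."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern} in {parts_dir}")
    merged = Path(parts_dir) / merged_name
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams without loading into RAM
    return merged


def extract(archive: Path, target_dir: str) -> list[str]:
    """Check the archive for corrupt members, then extract it."""
    with zipfile.ZipFile(archive) as zf:
        bad_member = zf.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        zf.extractall(target_dir)
        return zf.namelist()


if __name__ == "__main__":
    # Hypothetical paths; adjust to where the parts were downloaded.
    archive = merge_parts("downloads")
    extract(archive, "downloads/images")
```

The integrity check catches a missing or truncated part before training starts.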

  2. Start training!

Visual instruction tuning takes more time because the CLIP input resolution is increased to 504×504. It takes around 25 hours to finetune GeoChat-7B on 3x A100 (40G).
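To give a rough sense of why the higher resolution costs more, the sketch below counts visual patch tokens per image. It assumes the CLIP ViT-L/14 backbone of LLaVA-1.5 with a patch size of 14 and a base resolution of 336; neither number is stated in this README, so treat both as assumptions.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side * side


base_tokens = num_patch_tokens(336)     # assumed LLaVA-1.5 resolution
geochat_tokens = num_patch_tokens(504)  # resolution stated above

print(base_tokens, geochat_tokens, geochat_tokens / base_tokens)
# 576 1296 2.25
```

Under these assumptions each image contributes 2.25 times as many tokens to the language model's input sequence.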

Training script with DeepSpeed ZeRO-3: finetune_lora.sh.

Evaluation

We evaluate GeoChat on a diverse set of 7 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that inference stays consistent with the real-time outputs of the chat demo. See Evaluation.md.


👁️💬 GeoChat : Grounded Large Vision-Language Model for Remote Sensing

GeoChat can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability.

<p align="center"> <img src="images/overview2.png" alt="GeoChat Overview"> </p>

🛰️ GeoChat : Architecture

An overview of GeoChat, the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone first encodes patch-level tokens at a higher resolution by interpolating its positional encodings. A multi-layer perceptron (MLP) then adapts the vision tokens to the language space, making them suitable as input to a Large Language Model (Vicuna 1.5). Besides visual inputs, region locations can also be passed to the model together with task-specific prompts that specify the task the user wants. Given this context, the LLM generates natural language responses interleaved with the corresponding object locations. GeoChat can perform multiple tasks, as shown on top, e.g., scene classification, image/region captioning, VQA and grounded conversations.

<p align="center"> <img src="images/architecture.png" alt="GeoChat Architectural"> </p>
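The positional-encoding interpolation mentioned above can be illustrated with a small, dependency-free sketch. It resizes one channel of a 24×24 positional grid to 36×36 with bilinear interpolation. The grid sizes assume a patch size of 14 at 336 and 504 pixels, and bilinear resizing with aligned corners is an assumption too; the actual GeoChat code works on tensors and may use a different interpolation mode.

```python
def interpolate_grid(grid, new_size):
    """Bilinearly resize a square grid (list of rows) to new_size x new_size.

    Corner values are preserved (align-corners style sampling).
    """
    old_size = len(grid)
    if old_size < 2 or new_size < 2:
        raise ValueError("both grids need at least 2 points per side")
    scale = (old_size - 1) / (new_size - 1)
    out = []
    for i in range(new_size):
        y = i * scale
        y0 = min(int(y), old_size - 2)  # top neighbour row
        ty = y - y0                     # vertical blend weight
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = min(int(x), old_size - 2)  # left neighbour column
            tx = x - x0                     # horizontal blend weight
            top = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
            bottom = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out


# Toy single-channel "positional embedding" on the assumed 24x24 grid.
old_grid = [[float(i + j) for j in range(24)] for i in range(24)]
new_grid = interpolate_grid(old_grid, 36)
print(len(new_grid), len(new_grid[0]))  # 36 36
```

A real embedding has one such grid per feature channel, so the same resize would be applied to every channel.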

🔍 RS Multimodal Instruction Dataset

Types of annotations available in the GeoChat instruction-set. For a given RS image, we obtain object attribute and relationship information, referring expressions and region captions along with their corresponding region annotations (shown over the image). This structured information is used to create the rich instruction-set with a total of 318k image-instruction pairs.

<p align="center"> <img src="images/dataset.png" alt="Dataset Annotation Pipeline"> </p>

🤖 Qualitative results of GeoChat

Qualitative results of GeoChat. (<em>left-right</em>) Results are shown on grounding, referring object detection, and disaster/damage detection. The user can provide task-specific tokens (e.g., <strong>[grounding]</strong>) to shape model responses according to the desired behavior. The model can generate textual responses (<em>right</em>), only visual grounding (<em>center</em>) and both text and object groundings interleaved together (<em>left</em>). The model can also specify object types, object counts, object attributes and object relationships.

<p align="center"> <img src="images/examples.png" alt="Results_GCG"> </p>

🤖 Visual Question Answering

Qualitative examples for Visual Question Answering tasks. GeoChat can hold multi-turn conversations covering various types of questions, including presence, count, and complex comparisons. It can also detect objects and hold conversations about low-resolution images.

<p align="center"> <img src="images/vqa.jpg" alt="Visual Question Answering"> </p>

🤖 Scene Classification

Qualitative examples for scene classification. We give the model all the classes from the dataset and ask it to choose exactly one.

<p align="center"> <img src="images/scene.jpg" alt="Scene Classification"> </p>
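A minimal sketch of this setup is shown below: one helper lists every class in the prompt, another maps the model's reply back to a known class. The prompt wording, the function names and the example classes are hypothetical and are not taken from the GeoChat evaluation scripts.

```python
def build_scene_prompt(class_names):
    """List all candidate classes and ask for exactly one of them."""
    if not class_names:
        raise ValueError("at least one class name is required")
    options = ", ".join(class_names)
    return ("Classify the image into exactly one of the following classes: "
            f"{options}. Answer with a single class name.")


def parse_answer(answer, class_names):
    """Return the matching class name, or None if the reply matches none."""
    normalized = answer.strip().strip(".").lower()
    for name in class_names:
        if name.lower() == normalized:
            return name
    return None


classes = ["airport", "beach", "forest"]  # hypothetical class list
print(build_scene_prompt(classes))
print(parse_answer(" Beach. ", classes))  # beach
```

Returning None for unmatched replies keeps invalid answers from being counted as a class.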

🤖 Grounded Description

When asked to describe the image with the special token '[grounding]', GeoChat outputs both a description of the image and the bounding boxes of all detected objects.

<p align="center"> <img src="images/grounded.jpg" alt="Grounded Description"> </p>
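As a small illustration of how a task token might be attached to a query, the sketch below prepends the token to the user text. Only '[grounding]' comes from this README; the helper name, the token pattern and the choice to put the token first are assumptions.

```python
import re

# Assumed token shape: lowercase letters inside square brackets.
TASK_TOKEN_PATTERN = re.compile(r"^\[[a-z]+\]$")


def with_task_token(task_token: str, query: str) -> str:
    """Prefix a user query with a task-specific token."""
    if not TASK_TOKEN_PATTERN.match(task_token):
        raise ValueError(f"unexpected task token: {task_token!r}")
    return f"{task_token} {query.strip()}"


print(with_task_token("[grounding]", "Describe the image."))
# [grounding] Describe the image.
```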

🤖 Referring Expression

When given a referring expression for an object, GeoChat locates the object and draws a rotated bounding box around it.

<p align="center"> <img src="images/ref1.jpg" alt="Referring Expression"> </p> <p align="center"> <img src="images/ref_2.jpg" alt="Referring Expression"> </p>
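For readers who want to draw such boxes themselves, the sketch below turns an axis-aligned box plus a rotation angle into four corner points. This parameterization (corner coordinates and an angle in degrees about the box center) is an assumption made for illustration; the README does not specify the model's output format.

```python
import math


def rotated_box_corners(x_min, y_min, x_max, y_max, angle_deg):
    """Rotate an axis-aligned box about its center; return 4 (x, y) corners.

    In image coordinates (y pointing down) a positive angle appears
    clockwise on screen.
    """
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]
    rotated = []
    for x, y in corners:
        dx, dy = x - cx, y - cy  # offset from the box center
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated


# Hypothetical box, rotated by 30 degrees.
for point in rotated_box_corners(10, 20, 50, 40, 30):
    print(tuple(round(v, 2) for v in point))
```

The four points can be passed to any polygon-drawing routine.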

🤖 Region Caption

Qualitative examples for region-based captioning. Given a bounding box, GeoChat provides a brief description of the area or object covered by the box.

<p align="center"> <img src="images/iden.jpg" alt="Region Caption"> </p>

📜 Citation

  @article{kuckreja2023geochat,
          title={GeoChat: Grounded Large Vision-Language Model for Remote Sensing},
          author={Kuckreja, Kartik and Danish, Muhammad S. and Naseer, Muzammal and Das, Abhijit and Khan, Salman and Khan, Fahad S.},
          journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2024}
  }

🙏 Acknowledgement

We are thankful to LLaVA and Vicuna for releasing their models and code as open-source contributions.


<img src="images/IVAL_logo.png" width="200" height="100"> <img src="images/Oryx_logo.png" width="100" height="100"> <img src="images/MBZUAI_logo.png" width="360" height="85">