GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing

GeoLLaVA is designed to enhance vision-language models (VLMs) for detecting temporal changes in remote sensing data. By leveraging fine-tuning techniques like LoRA and QLoRA, it significantly improves model performance in tasks such as environmental monitoring and urban planning, especially in detecting geographical landscape evolution over time.

Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Yasser Ashraf and Mohsen Guizani

Mohamed bin Zayed University of AI (MBZUAI)


<p align='center'> <img src="assets/Overview.jpg" height="400"> </p>

Setup

  1. Clone this repository:

    git clone https://github.com/HosamGen/GeoLLaVA.git
    cd GeoLLaVA
    
  2. Install the necessary dependencies:

    conda create -n geollava python=3.10
    conda activate geollava
    pip install -r requirements.txt
    

GeoLLaVA Custom Dataset

[OPTIONAL] Please refer to the fMoW dataset for the original remote sensing dataset. We provide cleaned annotations in the Annotations section below.

[!NOTE] The full 100K annotations are too large for direct download and can be accessed via Drive.

The videos used in this project can also be found on Drive and unzipped using the following commands:

unzip updated_train_videos.zip
unzip updated_val_videos.zip
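
If the `unzip` command is not available on your system, the archives can also be extracted with Python's standard `zipfile` module. The following is a minimal sketch, assuming both archives sit in the current directory; the helper name `extract_archives` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def extract_archives(archive_names, dest="."):
    """Extract each archive that exists into dest; return the names extracted."""
    extracted = []
    for name in archive_names:
        path = Path(name)
        if not path.is_file():
            # Report missing archives instead of raising, so one missing
            # download does not block the other.
            print(f"Skipping {name}: file not found")
            continue
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest)
        extracted.append(name)
    return extracted


if __name__ == "__main__":
    extract_archives(["updated_train_videos.zip", "updated_val_videos.zip"])
```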

Your directory structure should look like this:

GeoLLaVA
├── annotations
|    ├── updated_train_annotations.json
|    ├── updated_val_annotations.json
├── updated_train_videos
|    ├── airport_hangar_0_4-airport_hangar_0_2.mp4
|    |   .....
├── updated_val_videos
|    ├── airport_hangar_0_4-airport_hangar_0_1.mp4
|    |   .....
ā”œā”€ā”€ llavanext_eval.py
ā”œā”€ā”€ llavanext_finetune.py
ā”œā”€ā”€ videollava_finetune.py
ā”œā”€ā”€ videollava_test.py
...
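
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch that checks only the annotation files and video folders named in the tree; the helper `missing_paths` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the repository root.
EXPECTED_PATHS = [
    "annotations/updated_train_annotations.json",
    "annotations/updated_val_annotations.json",
    "updated_train_videos",
    "updated_val_videos",
]


def missing_paths(root: str = ".") -> list[str]:
    """Return the expected files and folders that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```

Run it from the `GeoLLaVA` folder; an empty result means all four entries were found.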

Training

To fine-tune a model on the dataset, run videollava_finetune.py or llavanext_finetune.py, depending on which model you want to train.

For Video-LLaVA:

python videollava_finetune.py

For LLaVA-NeXT:

python llavanext_finetune.py

Before running, adjust the parameters in the script as needed, for example:

MAX_LENGTH = 256
USE_LORA = False
USE_QLORA = True
USE_8BIT = False
PRUNE = False
prune_amount = 0.05
MODEL_TYPE = "sample"  # for the 10K sample dataset
# MODEL_TYPE = "full"  # for the full 100K dataset
batch_size = 2

# LoRA parameters
lora_r = 64
lora_alpha = 128
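
As a rough guide to what `lora_r` and `lora_alpha` control: in the standard LoRA formulation, a frozen weight matrix of shape d_out x d_in receives a low-rank update with `lora_r * (d_in + d_out)` trainable parameters, scaled by `lora_alpha / lora_r`. The sketch below works through that arithmetic. It assumes the scripts follow this standard formulation, and the 4096 x 4096 layer size is hypothetical, chosen only for illustration.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    LoRA learns two factors: A (lora_r x d_in) and B (d_out x lora_r).
    """
    return lora_r * (d_in + d_out)


lora_r = 64
lora_alpha = 128
# Hypothetical layer size, not taken from the GeoLLaVA scripts.
d_in, d_out = 4096, 4096

added = lora_trainable_params(d_in, d_out, lora_r)
print(lora_scaling(lora_alpha, lora_r))  # 2.0
print(added)                             # 524288
print(added / (d_in * d_out))            # 0.03125
```

With these values, the adapter trains about 3% as many parameters as the full matrix for that layer.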

Evaluation

To evaluate the fine-tuned models on the test dataset, use the following commands:

For Video-LLaVA:

python videollava_test.py

For LLaVA-NeXT:

python llavanext_eval.py

[!IMPORTANT] Before running evaluation, set MODEL_PATH to point to the model you fine-tuned.

These commands will run the evaluation on the specified test dataset and generate performance metrics, including ROUGE, BLEU, and BERT scores. The results will help assess the model's performance in detecting temporal changes in remote sensing data.
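
To give a sense of what a ROUGE-1 score measures, the sketch below computes a simplified unigram-overlap F1 between a reference description and a generated one. It is an illustration only: it uses plain whitespace tokenization, is not the scoring code used by the evaluation scripts, and the two captions are made up.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate caption."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up captions, for illustration only.
reference = "a new hangar was built next to the runway"
candidate = "a new building was built near the runway"
print(round(rouge1_f1(reference, candidate), 3))  # 0.706
```

Six of the candidate's eight words appear in the nine-word reference, giving a precision of 0.75, a recall of about 0.667, and an F1 of about 0.706.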

Results

We evaluated GeoLLaVA using ROUGE, BLEU, and BERT scores. Fine-tuning substantially improved the descriptions of temporal changes in geographical landscapes: with LoRA on the full 100K dataset, ROUGE-1 rose from 0.211 to 0.576 for Video-LLaVA and from 0.197 to 0.562 for LLaVA-NeXT.

To calculate the scores after evaluating the models, please check the steps in the Results.ipynb notebook.

Video-LLaVA Results

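| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERT |
|---|---|---|---|---|---|
| Base | 0.211 | 0.041 | 0.122 | 0.039 | 0.456 |
| 10K LoRA | 0.563 | 0.214 | 0.313 | 0.243 | 0.849 |
| 100K LoRA | 0.576 | 0.226 | 0.325 | 0.250 | 0.863 |
| 10K QLoRA | 0.565 | 0.212 | 0.310 | 0.243 | 0.845 |
| 100K QLoRA | 0.571 | 0.220 | 0.316 | 0.250 | 0.854 |
| 10K Pruning 5% | 0.031 | 0.007 | 0.024 | 0.010 | 0.265 |
| 100K Pruning 5% | 0.125 | 0.034 | 0.110 | 0.043 | 0.359 |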

LLaVA-NeXT Results

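| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERT |
|---|---|---|---|---|---|
| Base | 0.197 | 0.037 | 0.113 | 0.042 | 0.404 |
| 10K LoRA | 0.554 | 0.198 | 0.300 | 0.232 | 0.856 |
| 100K LoRA | 0.562 | 0.199 | 0.300 | 0.239 | 0.864 |
| 10K QLoRA | 0.543 | 0.193 | 0.283 | 0.213 | 0.836 |
| 100K QLoRA | 0.561 | 0.202 | 0.302 | 0.229 | 0.858 |
| 10K Pruning 5% | 0.532 | 0.178 | 0.278 | 0.209 | 0.829 |
| 100K Pruning 5% | 0.541 | 0.183 | 0.284 | 0.210 | 0.840 |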

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERT |
|---|---|---|---|---|---|
| Final Model (100K LoRA) | 0.556 | 0.202 | 0.290 | 0.227 | 0.850 |

These metrics show how well each model describes temporal changes in remote sensing data. LoRA and QLoRA fine-tuning improved every metric over the base models, whereas 5% pruning pushed Video-LLaVA below its base model on most metrics.

Acknowledgement

Citation

Please cite using this BibTeX:

    @misc{elgendy2024geollava,
      title={GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing}, 
      author={Hosam Elgendy and Ahmed Sharshar and Ahmed Aboeitta and Yasser Ashraf and Mohsen Guizani},
      year={2024},
      eprint={2410.19552},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.19552}, 
    }