<img src="docs/images/h2rsvlm_logo-removebg-preview.png" style="vertical-align: -10px;" height="50px" width="50px"> VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

[Project Page] [Paper] [Model🤗] [Dataset🤗]

<div style="display: flex; justify-content: center;"> <img src="docs/images/h2rsvlm_logo.png" style="height: 300px;"> </div>

News

TODO

Contents

Install

Use the following commands to install VHM.

git clone git@github.com:opendatalab/VHM.git
cd VHM
conda create -n vhm 
conda activate vhm
pip install -r requirment.txt

Data

Follow the instructions in Data.md to manage the datasets. If you need to train our model from scratch, complete the data download and preparation first.

Models

VHM consists of a visual encoder, a projector layer, and a large language model (LLM). The visual encoder uses a pretrained CLIP-14-336px, the projector layer is composed of two MLP layers, and the LLM is based on the pretrained Vicuna-7B. The model is trained in two stages, as shown in the diagram below.
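As a rough illustration of how the three components connect, the sketch below works out the tensor shapes that flow from the visual encoder through a two-layer MLP projector into the LLM. It assumes that "CLIP-14-336px" means a patch size of 14 on a 336-pixel square input and that only patch tokens are passed on. The feature widths (1024 for the vision encoder, 4096 for the LLM) are assumptions typical of ViT-L/14 and 7B LLaMA-family models, not values stated in this README, and the function names are hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches cut from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def projector_shapes(num_tokens: int, vision_dim: int, llm_dim: int):
    """Shapes through a two-layer MLP projector (linear, activation, linear).

    The hidden width is assumed to equal llm_dim; this is an assumption.
    """
    return [
        ("vision features", (num_tokens, vision_dim)),
        ("after first linear layer", (num_tokens, llm_dim)),
        ("after second linear layer", (num_tokens, llm_dim)),
    ]


if __name__ == "__main__":
    tokens = num_patch_tokens(image_size=336, patch_size=14)  # 24 * 24 = 576
    # vision_dim and llm_dim below are assumed values, not taken from the README.
    for name, shape in projector_shapes(tokens, vision_dim=1024, llm_dim=4096):
        print(f"{name}: {shape}")
```

The projected tokens share the LLM's embedding width, so they can be placed alongside the text token embeddings in the LLM input sequence.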

We provide both the weights after the SFT stage and the pretrained weights.

| Name | Description |
| --- | --- |
| VHM_sft | The LLM and MLP weights obtained from the SFT stage. |
| VHM_pretrain | The LLM and MLP weights obtained from the Pretraining stage. |
| CLIP_pretrain | The CLIP weights obtained from the Pretraining stage. |

Train

VHM model training consists of two stages: (1) Pretrain stage: use our VersaD dataset with 1.4M image-text pairs to finetune the vision encoder, projector, and LLM to align the textual and visual modalities; (2) Supervised Fine-Tuning (SFT) stage: finetune the projector and LLM to teach the model to follow multimodal instructions.

Pretrain

First, download the MLP projector pretrained by LLaVA-1.5, because a rough modality alignment is beneficial before aligning the modalities with high-quality detailed captions.

You can run sh scripts/rs/slurm_pretrain.sh to pretrain the model. Remember to specify the projector path in the script. In this stage, we fine-tuned the second half of the vision encoder's blocks, projector, and LLM.
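The per-stage freezing policy described above can be summarized in a small sketch. This is not the repository's training code: the component names and helper functions are hypothetical, and the block count of 24 used in the example is an assumption about the vision encoder rather than a value given here.

```python
from typing import List

# Which components are updated in each stage, following the description above.
# The component names are illustrative labels, not identifiers from the repo.
STAGE_TRAINABLE = {
    "pretrain": {"vision_encoder_second_half", "projector", "llm"},
    "sft": {"projector", "llm"},
}


def trainable_vision_blocks(num_blocks: int) -> List[int]:
    """0-based indices of the second half of the encoder blocks.

    For an odd block count the middle block is included in the second half.
    """
    return list(range(num_blocks // 2, num_blocks))


def is_trainable(stage: str, component: str) -> bool:
    """True if the component receives gradient updates in the given stage."""
    return component in STAGE_TRAINABLE[stage]


if __name__ == "__main__":
    # 24 blocks is an assumed encoder depth used only for illustration.
    print(trainable_vision_blocks(24))
    print(is_trainable("sft", "vision_encoder_second_half"))
```

In a real training loop, a table like this would typically drive which parameters have gradients enabled before the optimizer is built.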

In our setup we used 16 A100 (80G) GPUs and the whole pre-training process lasted about 10 hours. You can adjust the number of gradient accumulation steps to reduce the number of GPUs.
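To see how gradient accumulation trades GPUs for steps, the sketch below keeps the effective batch size constant while the GPU count changes. The per-GPU batch size of 8 is a hypothetical value chosen for illustration; check the training script for the actual setting.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Samples contributing to each optimizer update."""
    return per_gpu_batch * num_gpus * grad_accum_steps


def accum_steps_for(target_effective_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach the target effective batch size."""
    per_step = per_gpu_batch * num_gpus
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch is not a multiple of per_gpu_batch * num_gpus")
    return target_effective_batch // per_step


if __name__ == "__main__":
    # Hypothetical reference setup: 16 GPUs, 8 samples per GPU, no accumulation.
    target = effective_batch_size(per_gpu_batch=8, num_gpus=16, grad_accum_steps=1)
    # Same effective batch on 4 GPUs requires more accumulation steps.
    print(accum_steps_for(target, per_gpu_batch=8, num_gpus=4))
```

Fewer GPUs with more accumulation steps keeps the optimization behavior similar but lengthens wall-clock training time roughly in proportion.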

In scripts/rs/slurm_pretrain.sh, you need to revise the following paths:

DATA_DIR=pretrain_base # directory of VersaD dataset
export LIST_FILE=${DATA_DIR}/list_pretrain.json # json file of VersaD data  
export CKPT_PATH=weight_path # llava-1.5 MLP weight path
export SAVE_PATH=vhm-7b_pretrained # file save path
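Before submitting the job, it can help to confirm that the input paths set above actually exist. The helper below is a hypothetical convenience check, not part of the repository; it only mirrors the variable names from the script.

```python
import os
from typing import List


def missing_paths(data_dir: str, list_file: str, ckpt_path: str) -> List[str]:
    """Return the names of the settings whose paths do not exist."""
    checks = {
        "DATA_DIR": os.path.isdir(data_dir),    # dataset directory
        "LIST_FILE": os.path.isfile(list_file),  # JSON list file
        "CKPT_PATH": os.path.exists(ckpt_path),  # projector weights (file or dir)
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    data_dir = "pretrain_base"
    problems = missing_paths(
        data_dir=data_dir,
        list_file=os.path.join(data_dir, "list_pretrain.json"),
        ckpt_path="weight_path",
    )
    if problems:
        print("Fix these settings before training:", ", ".join(problems))
    else:
        print("All input paths found.")
```

SAVE_PATH is deliberately not checked, since it is an output location that the training run is expected to create.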

Supervised Fine-Tuning

In this stage, we finetune the projector and LLM with our VHM_SFT dataset.

In our setup we used 8 A100 (80G) GPUs and the whole SFT process lasted about 4 hours. You can adjust the number of gradient accumulation steps to reduce the number of GPUs.

You can run sh scripts/rs/slurm_finetune.sh to finetune the model, and you need to revise three paths:

DATA_DIR=sft_base # directory of vhm-sft dataset
export LIST_FILE=${DATA_DIR}/list_sft.json # json file of sft data  
CKPT=vhm-7b_pretrained # pretrain weight path
export SAVE_PATH=vhm-7b_sft # file save path

Evaluation

To facilitate the use of large vision-language models for remote sensing, we have developed RSEvalKit, a specialized evaluation project for such models. Use the following commands to install it.

git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval
conda activate rseval
pip install -r requirements.txt

All evaluation tasks for this paper are implemented in RSEvalKit and can be evaluated with one click. First, download our model weights and the VHM_Eval data, then follow the instructions to complete the evaluation.

Citation

@misc{pang2024vhmversatilehonestvision,
      title={VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis}, 
      author={Chao Pang and Xingxing Weng and Jiang Wu and Jiayu Li and Yi Liu and Jiaxing Sun and Weijia Li and Shuai Wang and Litong Feng and Gui-Song Xia and Conghui He},
      year={2024},
      eprint={2403.20213},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2403.20213}, 
}

Acknowledgement

We gratefully acknowledge these wonderful works:

License

Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna and Gemini. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.