BITA
This is the official code for "Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning"
Dependencies
The project was developed locally with PyTorch 2.0. Install the dependencies with:
pip install -r requirements.txt
Dataset
This paper utilizes the NWPU-Caption, RSICD, and UCM-Caption datasets. During the pre-training phase, only the training sets of these three datasets are used. For the final fine-tuning stage, please uncomment the val and test fields in the three dataset configs under the BITA/configs/datasets/ directory.
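For orientation, a dataset config under BITA/configs/datasets/ might be structured roughly as in the sketch below. This follows LAVIS-style config conventions; the dataset key, annotation keys, and storage paths are illustrative assumptions, not the actual BITA files.

datasets:
  rsicd_caption:                      # hypothetical dataset name
    build_info:
      annotations:
        train:
          storage: rsicd/annotations/train.json
        # Uncomment val/test entries like these for the fine-tuning stage:
        # val:
        #   storage: rsicd/annotations/val.json
        # test:
        #   storage: rsicd/annotations/test.json
      images:
        storage: rsicd/images/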
Weights
The download links for the weights from the two-stage pre-training and the final fine-tuning stage are available here. The 'Caption' folder contains the fine-tuned model weights that achieved the best accuracy on the validation set.
Pre-training (stage1)
In the first pre-training stage, the visual encoder is ViT-L/14 from CLIP. Please ensure the 'pretrained' field in the 'BITA/configs/models/bita/bita_pretrain_vitL.yaml' file is set to 'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vitL.pth'. Then, run the following script:
bash ./scripts/bita/train/pretrain_stage1.sh
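For reference, the relevant portion of 'bita_pretrain_vitL.yaml' might look like the sketch below. Only the 'pretrained' URL comes from this README; the surrounding keys are assumptions based on LAVIS-style model configs.

model:
  arch: bita                 # assumed architecture name
  vit_model: clip_L          # illustrative; stage 1 uses CLIP ViT-L/14
  pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_vitL.pth"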
Pre-training (stage2)
"In the second stage of pre-training, please replace the value of the 'pretrained' field in the 'BITA/configs/models/bita/bita_pretrain_opt2.7b.yaml' file with the weights from the completion of the first stage of pre-training, located at '/usr/code/BITA/BITA_weights/Stage1/checkpoint_best.pth'. Then, run the following script:"
bash ./scripts/bita/train/pretrain_stage2.sh
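As a sketch (only the 'pretrained' path is from this README; the 'model' key is assumed from LAVIS conventions), the edited field in 'bita_pretrain_opt2.7b.yaml' would read:

model:
  pretrained: "/usr/code/BITA/BITA_weights/Stage1/checkpoint_best.pth"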
Fine-tune & Evaluation
In the final fine-tuning stage, please replace the value of the 'pretrained' field in the 'BITA/configs/models/bita/bita_caption_opt2.7b.yaml' file with the checkpoint produced by the second pre-training stage, located at '/usr/code/BITA/BITA_weights/Stage2/checkpoint_best.pth'. Then, run the following script:
bash ./scripts/bita/train/train_caption.sh
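Likewise, the edited field in 'bita_caption_opt2.7b.yaml' would read (same caveat as above):

model:
  pretrained: "/usr/code/BITA/BITA_weights/Stage2/checkpoint_best.pth"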
Evaluating Only
bash ./scripts/bita/eval/eval_caption.sh
Acknowledgments
This implementation is largely based on the code of LAVIS - A Library for Language-Vision Intelligence. Many thanks to its authors.
Citation
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@article{10415446,
  author={Yang, Cong and Li, Zuchao and Zhang, Lefei},
  journal={IEEE Transactions on Geoscience and Remote Sensing},
  title={Bootstrapping Interactive Image–Text Alignment for Remote Sensing Image Captioning},
  year={2024},
  volume={62},
  number={},
  pages={1-12},
  doi={10.1109/TGRS.2024.3359316}
}