RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Vision-Language Foundation Model for Remote Sensing

RS5M Dataset

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to use existing large-scale pre-trained VLMs, which are trained on common objects, to perform domain-specific transfer for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and by captioning label-only RS datasets with a pre-trained VLM. Together these constitute the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by 3% ~ 20% in Zero-shot Classification (ZSC) tasks, 3% ~ 6% in Remote Sensing Cross-Modal Text–Image Retrieval (RSCTIR) and 4% ~ 5% in Semantic Localization (SeLo) tasks.

GeoRSCLIP Model

Installation

  pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  pip install pillow pandas scikit-learn ftfy tqdm matplotlib transformers adapter-transformers open_clip_torch pycocotools timm clip-benchmark torch-rs

Usage

  git clone https://huggingface.co/Zilun/GeoRSCLIP
  cd GeoRSCLIP
  unzip data/rs5m_test_data.zip
  python codebase/inference.py --ckpt-path /your/local/path/to/RS5M_ViT-B-32.pt --test-dataset-dir /your/local/path/to/rs5m_test_data

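  # GeoRSCLIP ViT-B-32: build the architecture, then overwrite the weights with the RS5M checkpoint
  import open_clip
  import torch
  from inference_tool import get_preprocess

  ckpt_path = "/your/local/path/to/RS5M_ViT-B-32.pt"
  model, _, _ = open_clip.create_model_and_transforms("ViT-B/32", pretrained="openai")
  checkpoint = torch.load(ckpt_path, map_location="cpu")
  # strict=False tolerates key mismatches; msg lists any missing or unexpected keys
  msg = model.load_state_dict(checkpoint, strict=False)
  model = model.to("cuda")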
  img_preprocess = get_preprocess(
        image_resolution=224,
  )

  # GeoRSCLIP ViT-H-14: same procedure, starting from the LAION-2B pre-trained weights
  import open_clip
  import torch
  from inference_tool import get_preprocess

  ckpt_path = "/your/local/path/to/RS5M_ViT-H-14.pt"
  model, _, _ = open_clip.create_model_and_transforms("ViT-H/14", pretrained="laion2b_s32b_b79k")
  checkpoint = torch.load(ckpt_path, map_location="cpu")
  # strict=False tolerates key mismatches; msg lists any missing or unexpected keys
  msg = model.load_state_dict(checkpoint, strict=False)
  model = model.to("cuda")
  img_preprocess = get_preprocess(
        image_resolution=224,
  )
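
Once a checkpoint is loaded, zero-shot classification comes down to comparing one image embedding against the text embeddings of a set of class prompts. The sketch below shows only that scoring step on toy vectors, using the standard library. It is an illustration, not the repository's inference code: the class names, the 3-d embeddings, and the `logit_scale` value of 100 are assumptions made for the example. In practice the embeddings would come from the model's `encode_image` and `encode_text` methods.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    img = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for real encoder outputs (assumed values).
class_names = ["airport", "forest", "harbor"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = class_names[probs.index(max(probs))]
print(best)  # forest
```

The class whose prompt embedding has the highest cosine similarity to the image embedding wins; the scale factor only sharpens the softmax and does not change the ranking.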

Experiment Result

GeoRSSD

RS5M Dataset Download (About 500GB, 128 webdataset tars)

RS5M

Geometa

MetaFile

How to use this dataset

Option 1 (Recommended)

  1. Download the webdataset files from the link provided above. The dataset directory should look like this:
        /nas/zilun/RS5M_v5/webdataset                                                       
        ├── train                        
            ├── pub11-train-0000.tar                                                         
            ├── pub11-train-0001.tar
            ├── ......
            ├── pub11-train-0030.tar                                         
            ├── pub11-train-0031.tar
            ├── rs3-train-0000.tar                                              
            ├── rs3-train-0001.tar
            ├── ......
            ├── rs3-train-0030.tar                                              
            ├── rs3-train-0031.tar
        ├── val                        
            ├── pub11-val-0000.tar                                                         
            ├── pub11-val-0001.tar
            ├── ......
            ├── pub11-val-0030.tar                                         
            ├── pub11-val-0031.tar
            ├── rs3-val-0000.tar                                              
            ├── rs3-val-0001.tar
            ├── ......
            ├── rs3-val-0030.tar                                              
            ├── rs3-val-0031.tar
    
    
  2. An example data I/O pipeline using the webdataset files is provided in "dataloader.py". Throughput is about 1800 images per second on a Ryzen 3950X CPU with dual-channel 3200 MHz DDR4 RAM.
  3. Run the following to try it out:
    python dataloader.py --train_dir /media/zilun/mx500/RS5M/data/train --val_dir /media/zilun/mx500/RS5M/data/val --num_worker 16 --batch_size 400 --num_shuffle 10000
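
Before starting a long training run, it can help to confirm that every shard was downloaded. The following sketch rebuilds the file names implied by the directory listing in step 1 and reports any that are absent. It assumes that the elided entries ("......") are the contiguous indices 0000 to 0031 for each of `pub11` and `rs3`; the helper names `expected_shards` and `missing_shards` are hypothetical and are not part of "dataloader.py".

```python
from pathlib import Path

def expected_shards(split, sources=("pub11", "rs3"), shards_per_source=32):
    """Return the shard file names the directory listing implies for one split."""
    return [
        f"{src}-{split}-{idx:04d}.tar"
        for src in sources
        for idx in range(shards_per_source)
    ]

def missing_shards(root, splits=("train", "val")):
    """List expected shard paths that are not present under root."""
    root = Path(root)
    return [
        root / split / name
        for split in splits
        for name in expected_shards(split)
        if not (root / split / name).is_file()
    ]

total = sum(len(expected_shards(s)) for s in ("train", "val"))
print(total)  # 128 = 2 sources x 32 shards x 2 splits
```

Calling `missing_shards("/path/to/webdataset")` with your own download directory returns an empty list when all shards are present.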
    

Option 2

  1. Download the files from the Dropbox link or Baidu disk link provided. The dataset directory should look like this:
        /nas/zilun/RS5M_v5/img_only                                                      
        ├── pub11                        
            ├── pub11.tar.gz_aa                                                       
            ├── pub11.tar.gz_ab
            ├── ......
            ├── pub11.tar.gz_ba                                              
            ├── pub11.tar.gz_bc
        ├── rs3                        
            ├── ben
                ├── ben.tar.gz_aa                                       
            ├── fmow
                ├── fmow.tar.gz_aa
                ├── fmow.tar.gz_ab
                ├── ......
                ├── fmow.tar.gz_ap
                ├── fmow.tar.gz_aq
            ├── millionaid
                ├── millionaid.tar.gz_aa
                ├── millionaid.tar.gz_ab
                ├── ......
                ├── millionaid.tar.gz_ap
                ├── millionaid.tar.gz_aq                                   
    
  2. Combine and untar the files. You will then have the image files.
     # optional: how to split and compress the dataset (only needed to re-create the parts)
     tar -I pigz -cvf - pub11 | split --bytes=500MB - pub11.tar.gz_
    
     # combine different parts into one
     cat pub11.tar.gz_* > pub11.tar
    
     # extract
     pigz -dc pub11.tar | tar -xvf - -C /data/zilun/RS5M_v5/img_only/
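
Where `pigz` or `cat` is not available, the combine-and-extract step can be approximated with the Python standard library. This is a minimal sketch, not a tool shipped with the repository: the function names and the example paths are hypothetical, and it assumes the parts carry `split`'s default alphabetical suffixes (`aa`, `ab`, ...), so that sorting the file names restores the original order.

```python
import shutil
import tarfile
from pathlib import Path

def combine_parts(part_dir, prefix, combined_path):
    """Concatenate split parts (prefix + 'aa', 'ab', ...) in lexicographic order."""
    parts = sorted(Path(part_dir).glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {prefix}* in {part_dir}")
    with open(combined_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, so large parts fit in memory
    return parts

def extract_archive(combined_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(combined_path, "r:gz") as tar:
        tar.extractall(dest_dir)

# Hypothetical paths; adjust to where the parts were downloaded.
# combine_parts("img_only/pub11", "pub11.tar.gz_", "pub11.tar.gz")
# extract_archive("pub11.tar.gz", "img_only/")
```

This single-threaded version will be slower than `pigz` on archives of this size, so the shell commands remain the better choice when they are available.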
    

Statistics

PUB11 Subset

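| Name | Amount | After Keyword Filtering | Download Image | Invalid Image (Removed) | Duplicate Image (Removed) | Outlier Images (Removed by VLM and RS Detector) | Remain |
|---|---|---|---|---|---|---|---|
| LAION2B | 2.3B | 1,980,978 | 1,737,584 | 102 | 343,017 | 333,686 | 1,060,779 |
| COYO700M | 746M | 680,089 | 566,076 | 28 | 245,650 | 94,329 | 226,069 |
| LAIONCOCO | 662M | 3,014,283 | 2,549,738 | 80 | 417,689 | 527,941 | 1,604,028 |
| LAION400M | 413M | 286,102 | 241,324 | 25 | 141,658 | 23,860 | 75,781 |
| WIT | 37M | 98,540 | 93,754 | 0 | 74,081 | 9,299 | 10,374 |
| YFCC15M | 15M | 27,166 | 25,020 | 0 | 265 | 15,126 | 9,629 |
| CC12M | 12M | 18,892 | 16,234 | 0 | 1,870 | 4,330 | 10,034 |
| Redcaps | 12M | 2,842 | 2,686 | 0 | 228 | 972 | 1,486 |
| CC3M | 3.3M | 12,563 | 11,718 | 1 | 328 | 1,817 | 9,572 |
| SBU | 1M | 102 | 91 | 0 | 4 | 36 | 51 |
| VG | 0.1M | 26 | 26 | 0 | 0 | 20 | 6 |
| Total | 4.2B | 6,121,583 | 5,244,251 | 236 | 1,224,790 | 1,011,416 | 3,007,809 |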
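
The PUB11 figures read as a filtering funnel: for each source, the "Remain" count appears to equal the downloaded images minus the invalid, duplicate, and outlier images. The sketch below checks that relation on the numbers from the table and computes the overall retention rate. The relation itself is inferred from the figures rather than stated in the source.

```python
# (name, downloaded, invalid, duplicate, outlier, remain), copied from the PUB11 table
rows = [
    ("LAION2B",   1_737_584, 102, 343_017, 333_686, 1_060_779),
    ("COYO700M",    566_076,  28, 245_650,  94_329,   226_069),
    ("LAIONCOCO", 2_549_738,  80, 417_689, 527_941, 1_604_028),
    ("LAION400M",   241_324,  25, 141_658,  23_860,    75_781),
    ("WIT",          93_754,   0,  74_081,   9_299,    10_374),
    ("YFCC15M",      25_020,   0,     265,  15_126,     9_629),
    ("CC12M",        16_234,   0,   1_870,   4_330,    10_034),
    ("Redcaps",       2_686,   0,     228,     972,     1_486),
    ("CC3M",         11_718,   1,     328,   1_817,     9_572),
    ("SBU",              91,   0,       4,      36,        51),
    ("VG",               26,   0,       0,      20,         6),
]

# Each row should satisfy: remain = downloaded - invalid - duplicate - outlier
for name, downloaded, invalid, duplicate, outlier, remain in rows:
    assert downloaded - invalid - duplicate - outlier == remain, name

total_downloaded = sum(r[1] for r in rows)
total_remain = sum(r[5] for r in rows)
print(total_downloaded, total_remain)  # 5244251 3007809
print(f"retained: {total_remain / total_downloaded:.1%}")
```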

RS3 Subset

| Name | Amount | Original Split | Has Class Label |
|---|---|---|---|
| FMoW | 727,144 | Train | Yes |
| BigEarthNet | 344,385 | Train | Yes |
| MillionAID | 990,848 | Test | No |
| Total | 2,062,377 | - | - |
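
Adding the RS3 total to the PUB11 "Remain" total gives the roughly 5 million pairs that the name RS5M refers to. The snippet below is plain arithmetic on the two statistics tables; whether the released shards contain exactly this many pairs is not stated here.

```python
rs3 = {"FMoW": 727_144, "BigEarthNet": 344_385, "MillionAID": 990_848}
pub11_remain = 3_007_809  # "Remain" total from the PUB11 table

rs3_total = sum(rs3.values())
rs5m_total = pub11_remain + rs3_total
print(rs3_total, rs5m_total)  # 2062377 5070186
```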

Geo-Statistics

BLIP2 fine-tuned with RSITMD dataset

Image-Text Pair Rating Tool

Awesome Remote Sensing Vision-Language Models & Papers

Contact

Email: zilun.zhang@zju.edu.cn

WeChat: zilun960822

Slack Group: https://join.slack.com/t/visionlanguag-fks1990/shared_invite/zt-290vxhx5y-SUkCzf2aH3G9eu3lye2YvQ

Acknowledgement

We thank Delong Chen and his ITRA framework for helping us fine-tune the CLIP-like models. https://itra.readthedocs.io/en/latest/Contents/introduction/overview.html

BibTeX Citation

If you use RS5M or GeoRSCLIP in a research paper, we would appreciate it if you cited the following:

@ARTICLE{10679571,
  author={Zhang, Zilun and Zhao, Tiancheng and Guo, Yulong and Yin, Jianwei},
  journal={IEEE Transactions on Geoscience and Remote Sensing}, 
  title={RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing}, 
  year={2024},
  volume={},
  number={},
  pages={1-1},
  keywords={Remote sensing;Data models;Visualization;Semantics;Tuning;Location awareness;Computational modeling;Image-text Paired Dataset;Remote Sensing;Vision-Language Model;Parameter Efficient Tuning;General Vision-Language Model;Domain Vision-Language Model;Remote Sensing Cross-Modal Text–Image Retrieval;Zero-shot Classification;Semantic Localization},
  doi={10.1109/TGRS.2024.3449154}
}

Some other citations:

@article{Long2021DiRS,
  title={On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances and Million-AID},
  author={Yang Long and Gui-Song Xia and Shengyang Li and Wen Yang and Michael Ying Yang and Xiao Xiang Zhu and Liangpei Zhang and Deren Li},
  journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
  year={2021},
  volume={14},
  pages={4205-4230}
}

@inproceedings{Sumbul_2019,
  title={Bigearthnet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding},
  url={http://dx.doi.org/10.1109/IGARSS.2019.8900532},
  DOI={10.1109/igarss.2019.8900532},
  booktitle={IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium},
  publisher={IEEE},
  author={Sumbul, Gencer and Charfuelan, Marcela and Demir, Begum and Markl, Volker},
  year={2019},
  month=jul
}

@inproceedings{fmow2018,
  title={Functional Map of the World},
  author={Christie, Gordon and Fendley, Neil and Wilson, James and Mukherjee, Ryan},
  booktitle={CVPR},
  year={2018}
}