MMEarth - Data Downloading

Project Website | Paper | Code - Models

This repository contains the scripts used to download the data presented in the paper MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning. The scripts download large-scale satellite data from different sensors and satellites (Sentinel-2, Sentinel-1, ERA5 temperature and precipitation, ASTER GDEM, etc.), which we call modalities. The data is downloaded from Google Earth Engine (GEE).

📢 Latest Updates

:fire::fire::fire: Last Updated on 2024.11.07 :fire::fire::fire:

Table of contents

  1. Data Download
  2. Data Loading
  3. Getting Started
  4. Data Stacks
  5. Code Structure
  6. Slurm Execution
  7. Citation

Data Download

The MMEarth data can be downloaded using the links below. To make development with multi-modal data easier, we also provide two additional "taster" datasets alongside the original MMEarth data. The data is licensed under CC BY 4.0.

:bangbang: UPDATE: The new Version 001 data is ready to download.

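| Dataset | Image Size | Number of Tiles | Dataset Size | Data Link | Bash Script |
| --- | --- | --- | --- | --- | --- |
| MMEarth | 128x128 | 1.2M | 597GB | download | bash |
| MMEarth64 | 64x64 | 1.2M | 152GB | download | bash |
| MMEarth100k | 128x128 | 100k | 48GB | download | bash |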

All three datasets share a similar structure, shown below:

.
├── data_1M_v001/                      # root data directory
│   ├── data_1M_v001.h5                # h5 file containing the data
│   ├── data_1M_v001_band_stats.json   # json file containing information about the bands present in the h5 file for each data stack
│   ├── data_1M_v001_splits.json       # json file containing information for train, val, test splits
│   └── data_1M_v001_tile_info.json    # json file containing additional meta information of each tile that was downloaded
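After downloading, it can be useful to confirm that a dataset directory is complete. The following is a minimal sketch that assumes every file name is the directory name plus one of the four suffixes in the listing above. That naming rule is inferred from the `data_1M_v001` example and may not hold exactly for the other two datasets. The helper names `expected_files` and `missing_files` are hypothetical and not part of this repository.

```python
from __future__ import annotations

from pathlib import Path

# Suffixes taken from the directory listing above (assumed to apply to all datasets).
EXPECTED_SUFFIXES = (".h5", "_band_stats.json", "_splits.json", "_tile_info.json")


def expected_files(root: str | Path) -> list[Path]:
    """Return the four files expected inside a dataset root such as data_1M_v001/."""
    root = Path(root)
    return [root / f"{root.name}{suffix}" for suffix in EXPECTED_SUFFIXES]


def missing_files(root: str | Path) -> list[Path]:
    """Return the expected files that are not present on disk."""
    return [path for path in expected_files(root) if not path.is_file()]
```

Calling `missing_files("data_1M_v001")` returns an empty list when all four files are present.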

Data Loading

A sample Jupyter notebook that shows how to load the data with PyTorch is available here. Alternatively, the dataloader has also been added to TorchGeo.
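Before setting up a full dataloader, a quick look at the splits file can confirm that the download is usable. The sketch below uses only the standard library and assumes the splits JSON is an object whose top-level keys (for example train, val, test) each map to a list of tile entries. That schema is an assumption and is not documented here, and `summarize_splits` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_splits(splits_path):
    """Map each top-level key that holds a list to the number of entries in it."""
    with Path(splits_path).open() as f:
        splits = json.load(f)
    if not isinstance(splits, dict):
        # Assumed schema: a JSON object keyed by split name.
        raise ValueError("expected a JSON object at the top level")
    return {name: len(items) for name, items in splits.items() if isinstance(items, list)}
```

For example, `summarize_splits("data_1M_v001/data_1M_v001_splits.json")` would report the number of entries per split if the file follows the assumed layout.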

Getting Started

To get started with this repository, install the dependencies with the following command:

pip install -r requirements.txt

Once this is done, you need to set up gcloud and earthengine for the code to work. Follow the steps below:

Data Stacks

This repository allows downloading data from various sensors. Currently the code is written to download the following sensors/modalities:

Code Structure

Data can only be downloaded once you have a GeoJSON file containing all the tiles you want. Here, a tile is the region of interest (ROI), given as a polygon, for one location. Once you have the tiles, the data stacks (the data for each modality) are downloaded for every tile in the GeoJSON. The download follows the broad structure below, and each of these steps is explained further below:

Creating Tiles

Downloading Data Stacks

Post Processing

Redownload

(NOTE: The scripts are run using SLURM. More information is provided in the Slurm Execution section.)
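To illustrate what "one polygon per tile" means in practice, the sketch below reads a tile file and computes a bounding box for each tile. It assumes the file is a standard GeoJSON FeatureCollection whose features have Polygon geometries. The exact layout and properties of the tile files produced by this repository are not shown here, so `iter_tiles` and `bounding_box` are hypothetical helpers rather than the repository's own code.

```python
import json


def iter_tiles(geojson_path):
    """Yield (index, geometry) for each feature in a GeoJSON FeatureCollection."""
    with open(geojson_path) as f:
        collection = json.load(f)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    for index, feature in enumerate(collection["features"]):
        yield index, feature["geometry"]


def bounding_box(geometry):
    """Return (min_lon, min_lat, max_lon, max_lat) for a Polygon geometry."""
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    exterior = geometry["coordinates"][0]  # first ring is the outer boundary
    lons = [point[0] for point in exterior]
    lats = [point[1] for point in exterior]
    return min(lons), min(lats), max(lons), max(lats)
```

A loop such as `for i, geom in iter_tiles("tiles.geojson"): print(i, bounding_box(geom))` lists the extent of every tile, where `tiles.geojson` is a placeholder file name.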

Slurm Execution

<img width="815" alt="MMEarth-data" src="https://github.com/vishalned/MMEarth-data/assets/27778126/02764bda-7384-4359-bdae-01c4456239a0">

Downloading Data Stacks: GEE provides a function called getDownloadUrl() that allows you to export images as GeoTIFF files. We extend this by merging all modalities for a single location into one image and exporting it as a single GeoTIFF file. To further speed up the download, we parallelize it using SLURM. The figure above gives an idea of how this is done. The tile information (tile GeoJSON) contains the location and other details of the N tiles we need to download. Each of the 40 Slurm jobs downloads N/40 tiles (we cap the number of jobs at 40 because this is the maximum number of concurrent requests allowed by the GEE API).
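The N/40 split can be made concrete with a little arithmetic. The sketch below shows one way to assign a contiguous, near-equal range of tile indices to each job, including the case where N is not divisible by 40. It is an illustration only: how the Slurm scripts in this repository actually partition tiles is not described here, and `tile_range` and `job_index` are hypothetical names.

```python
MAX_JOBS = 40  # number of parallel Slurm jobs stated in the text


def tile_range(num_tiles, job_index, num_jobs=MAX_JOBS):
    """Return the half-open [start, stop) range of tile indices for one job."""
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index must be in [0, num_jobs)")
    base, extra = divmod(num_tiles, num_jobs)
    # The first `extra` jobs each take one additional tile.
    start = job_index * base + min(job_index, extra)
    stop = start + base + (1 if job_index < extra else 0)
    return start, stop
```

With 1.2M tiles, every job receives exactly 30,000 tiles. With 100 tiles, the first 20 jobs receive 3 tiles each and the remaining 20 receive 2.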

To run the parallel Slurm download, execute the following command:

sbatch slurm_scripts/slurm_download_parallel.sh

Citation

Please cite our paper if you use this code or any of the provided data.

Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, & Nico Lang (2024). MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning.

@misc{nedungadi2024mmearth,
      title={MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning},
      author={Vishal Nedungadi and Ankit Kariryaa and Stefan Oehmcke and Serge Belongie and Christian Igel and Nico Lang},
      year={2024},
      eprint={2405.02771},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}