
MOSAIKS

This repository provides the code required to produce the figures appearing in the main text and Supplementary Materials of:

E. Rolf, J. Proctor, T. Carleton, I. Bolliger, V. Shankar, M. Ishihara, B. Recht, and S. Hsiang. "A Generalizable and Accessible Approach to Machine Learning with Global Satellite Imagery," Nature Communications, 2021.

This repository also provides examples of the generalizability of the MOSAIKS approach, which enables users with access to limited computing resources, and without expertise in machine learning or remote sensing, to generate observations of new variables in new locations. While this repository reproduces the results from the paper cited above, it also describes how users can generate observations of new variables in new research and policy contexts with only minimal changes to the existing code base.

If you are viewing this repository on Github, please check out our Code Ocean capsule, where you will find a mirror of this repository along with data and a computing environment set up to run all of the analyses in our paper. You may interact with the code via this platform or simply download the data for use on your own platform.

Additional material related to this repository and the corresponding paper are available at http://www.globalpolicy.science/mosaiks.

1. Structure of the repository

1.1. Code

The code folder is organized into an analysis pipeline and a package containing tools necessary to enable that pipeline.

1.2. Data

Data is hosted within our Code Ocean capsule. Due to a variety of license and data use agreement restrictions across data sources, as well as storage size constraints, we cannot host the raw label data nor the imagery used in our analysis. Instead, we provide detailed instructions on how to obtain and preprocess this data in the Installation section. Preprocessing scripts are available in code/analysis/1_feature_extraction (for image feature extraction) and code/analysis/2_label_creation (for label creation). For data sources that can be downloaded programmatically, scripts in the latter folder also contain code that initiates a download of the corresponding labels. Note that over time, URLs and access instructions may change. If you are not able to obtain any of this data, please file an issue in our GitHub repository and we will do our best to update the instructions accordingly.

The data folder is organized as follows:

Obtaining Raw Data

All data used in this analysis comes from free, publicly available sources and, with the exception of the house price data (see below), is available for download. While freely available, the source data is under a variety of licenses that limit re-distribution in various ways. To accommodate these agreements, we do not host any of the raw data that falls under these restrictions. Instructions are provided below for obtaining each of the input datasets.

Imagery

The .npz grid files in data/int/grids (produced by Step 0 of our analysis pipeline) provide the lat/lon centroids of the image tiles used in our analysis. You will need to acquire ~1x1km, 256x256 pixel RGB images centered at these locations. The function centroidsToSquareVertices can help by mapping these centroids to exact grid cell boundaries (use the zoom=16, numPix=640 arguments to achieve the appropriately sized boundaries).
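If you need to construct such boundaries yourself, the mapping from a centroid to tile corners under standard Web Mercator tiling can be sketched as below. This is a simplified illustration only; the centroidsToSquareVertices function mentioned above is the authoritative implementation, and this sketch is not its actual code.

```python
import math

R = 6378137.0  # Web Mercator sphere radius (meters)

def centroid_to_square(lat, lon, zoom=16, num_pix=640):
    """Return the four (lat, lon) corners of a square tile of num_pix
    pixels at the given Web Mercator zoom, centered on a centroid."""
    # Mercator meters per pixel at this zoom (constant across latitude
    # in projected coordinates)
    res = 2 * math.pi * R / (256 * 2 ** zoom)
    # Project the centroid to Web Mercator meters
    x = math.radians(lon) * R
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    half = num_pix * res / 2
    corners_m = [(x - half, y - half), (x - half, y + half),
                 (x + half, y + half), (x + half, y - half)]
    # Unproject each corner back to lat/lon degrees
    return [(math.degrees(2 * math.atan(math.exp(cy / R)) - math.pi / 2),
             math.degrees(cx / R)) for cx, cy in corners_m]
```

At zoom=16 with num_pix=640, each square spans 640 * 360 / (256 * 2^16) ≈ 0.0137 degrees of longitude, i.e. roughly 1.5 km at the equator and closer to ~1 km at mid-latitudes.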

Downloading imagery may be the most difficult piece of reproducing our analysis. Thus, to facilitate reproducibility of our article's results, we provide all of the pre-featurized images necessary to conduct our analysis within data/int/feature_matrices.

Labels

1.3. Results

Results, such as paper figures and tables, are produced within a results/ folder at the top level of this repository. On Code Ocean, these are archived after each "Reproducible Run".

2. Installation

The easiest way to replicate and/or interact with our analysis is to use this Code Ocean capsule, which provides a cloud platform containing our code, data, and computing environment together. An alternative approach is to separately obtain these three items for use on an alternative computing platform. Both approaches are described below. In either case, if you wish to begin the analysis/replication from raw data (rather than preprocessed data), you must obtain additional data directly from various data providers (see Obtaining Raw Data).

2.1. Using Code Ocean

To work with the interactive notebooks we have written to (a) walk through a typical user's experience with MOSAIKS, or (b) reproduce the analyses in the associated paper, you will likely want to use the Launch Cloud Workstation-->JupyterLab functionality of Code Ocean. After doing so, you will want to run the following code snippet to install the package needed to execute our analyses:

pip install -e code

Note 1: The Code Ocean capsule contains ~50 GB of data, which takes ~10 minutes to load when launching a cloud workstation or running the non-interactive Reproducible Run script. The Reproducible Run itself takes ~10 hours.

Note 2: In both the Reproducible Run and Interactive Cloud Workstation environments, our code is configured to use the GPU provided on Code Ocean's workstations. For some figures and tables, slightly better performance may be observed when replicating with a GPU than is reported in our manuscript. This is because the solver we use for ridge regression on a CPU (where the majority of our analysis was run) raises a warning when the $X^TX + \lambda I$ matrix is ill-conditioned, and we discard those runs during hyperparameter selection. On a GPU, a different solver is used (from cupy rather than scipy) that does not raise these warnings, so those hyperparameters are not discarded. In rare cases, a hyperparameter that would raise an "ill-conditioned" warning gives better out-of-sample performance and will be selected when replicating our analysis on a GPU.
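The CPU-side behavior described above can be sketched as follows. This is a simplified illustration using scipy's LinAlgWarning, not the package's actual solver code.

```python
import warnings

import numpy as np
from scipy import linalg

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y, flagging ill-conditioned systems.

    scipy.linalg.solve emits a LinAlgWarning when the matrix is
    ill-conditioned; hyperparameters that trigger it can be excluded
    from selection, mirroring the CPU behavior described above.
    """
    A = X.T @ X + lam * np.eye(X.shape[1])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        w = linalg.solve(A, X.T @ y)
        ill_conditioned = any(
            issubclass(c.category, linalg.LinAlgWarning) for c in caught
        )
    return w, ill_conditioned

# Keep only candidate penalties whose systems solved cleanly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10)
candidates = [lam for lam in [1e-4, 1e-2, 1.0] if not ridge_fit(X, y, lam)[1]]
```

A GPU solver that never emits such warnings would leave every candidate in the list, which is exactly how the two replication paths can diverge.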

2.2. Using an Alternative Computing Platform

If you choose option two, you will need to separately obtain three things: code, data, and an appropriate computing environment.

2.2.1. Code

You should clone this repository, which is mirrored on Github and Code Ocean. Either source is appropriate to clone, as both contain the same code.

2.2.2. Data

When viewing our Code Ocean capsule, hover over data and click the caret that appears. You will see an option to download this folder. Place this downloaded data folder in the root directory of this repository (i.e. at the same level as the code/ folder). Alternatively, you may place a symlink at that location that points to this data folder.
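The symlink alternative can be sketched in Python as below; the directories here are temporary placeholders standing in for your downloaded data folder and this repository's root.

```python
import tempfile
from pathlib import Path

# Placeholder locations: an external folder holding the downloaded data,
# and the root of a cloned copy of this repository
external_data = Path(tempfile.mkdtemp())
repo_root = Path(tempfile.mkdtemp())

# Create a symlink named `data` in the repo root pointing at the
# externally stored data folder, instead of moving the folder itself
(repo_root / "data").symlink_to(external_data)
```

On most systems the equivalent shell command is a single `ln -s` call; either way, the scripts see a `data/` folder at the repository root.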

2.2.3. Computing Environment

You will need to install and activate our mosaiks-env conda environment. Once you have conda installed on your machine, from the root directory of this repository, run:

conda env create -f environment/environment.yml
conda activate mosaiks-env

Note that depending on the operating system and GPU capability of your machine, you may need to change the pytorch, torchvision, and/or cudatoolkit packages in environment.yml. Please see https://pytorch.org/get-started/locally/ for instructions. Additionally, to reduce the size of the environment, you may wish to comment out some of the "infrastructure" packages (e.g. jupyterlab) if you already have them installed and are able to access the R and Python kernels that get installed in the mosaiks-env environment. If you're not sure, do not comment these out.

Finally, you will also need to install our mosaiks package. From the root directory of the repo, call

pip install -e code

2.4. Alternative locations for code, data, and results folders

Using the above instructions, our code should correctly identify the locations of the code, data, and results folders as residing within the root directory of this repository. If, for whatever reason, you need to direct our scripts to look in different locations, you can set the following environment variables:

MOSAIKS_HOME is overridden by the other variables, such that if you had MOSAIKS_HOME=path1/path2 and MOSAIKS_DATA=path3, then our code would search for a code folder at path1/path2/code, a results folder at path1/path2/results, and a data folder at path3.
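The precedence rule above can be sketched as follows. MOSAIKS_HOME and MOSAIKS_DATA appear in the text; MOSAIKS_CODE and MOSAIKS_RESULTS are assumed analogues for illustration only, so check the scripts for the authoritative variable names.

```python
import os
from pathlib import Path

def resolve_dirs(repo_root="."):
    """Resolve code/data/results folders: a folder-specific variable
    wins; otherwise fall back to MOSAIKS_HOME (or the repo root)."""
    home = Path(os.environ.get("MOSAIKS_HOME", repo_root))
    # MOSAIKS_CODE and MOSAIKS_RESULTS are hypothetical names here
    folders = {"code": "MOSAIKS_CODE", "data": "MOSAIKS_DATA",
               "results": "MOSAIKS_RESULTS"}
    return {name: Path(os.environ[var]) if var in os.environ else home / name
            for name, var in folders.items()}
```

With MOSAIKS_HOME=path1/path2 and MOSAIKS_DATA=path3, this yields path1/path2/code, path1/path2/results, and path3, matching the example above.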

3. Details on the contents of each subfolder within code

3.1. analysis/

We recommend that new users begin with the notebook MOSAIKS_example.ipynb, which can be found in the code/analysis/ folder. For users who want to replicate the entire MOSAIKS pipeline, the scripts and subfolders contained in the analysis folder are named according to their stage in the pipeline. Users who wish to start at the beginning should begin with 0_grid_creation/, proceed to 1_feature_extraction/, then 2_label_creation/, and finally 3_figures_and_tables/. A key component of the MOSAIKS framework is that feature extraction only needs to be performed once per image, no matter how many tasks are being predicted. A user who wanted to predict a new task, for example, would skip straight to step 2. See the accompanying manuscript for further details.
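The "featurize once, predict many tasks" idea can be illustrated with a toy sketch using synthetic data; this is not the package's API.

```python
import numpy as np

# Stand-in for the precomputed feature matrix from step 1:
# one row per image, computed a single time regardless of task count
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))

def ridge(X, y, lam=1.0):
    """Cheap per-task step: closed-form ridge regression on fixed features."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Two different label tasks (synthetic here) reuse the identical
# features; only the label creation and regression steps are re-run
w_task_a = ridge(features, features @ rng.normal(size=64))
w_task_b = ridge(features, features @ rng.normal(size=64))
```

The expensive imagery and featurization steps (0 and 1) are shared, which is why a new task can enter the pipeline at step 2.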

Note: Some of the analysis subdirectories containing multiple scripts and/or notebooks contain an additional README providing further instructions.

3.2. mosaiks/

This package contains all functions called by the analysis scripts in analysis/ and subfolders therein.

4. Use of code and data

Our code can be used, modified, and distributed freely for educational, research, and not-for-profit uses. For all other cases, please contact us. Further details are available in the code license. All data products created through our work that are not covered under upstream licensing agreements are available via a CC BY 4.0 license (see the data license available within the Code Ocean capsule). All upstream data use restrictions take precedence over this license.