<!-- README.md is generated from README.Rmd. Please edit that file -->

ABDtools

<!-- badges: start --> <!-- badges: end -->

This is a companion package for the Africa Bird Data packages ABAP and CWAC. These packages allow us to pull African bird citizen science data from the projects of the same name into our R session. For now, the ABDtools package adds the functionality necessary to annotate these data with environmental information from the Google Earth Engine (GEE) data catalog. This should make the analysis of ABAP and CWAC data easier and more reproducible.

However, there is nothing preventing you from using the ABDtools package to annotate other types of data. As long as the data comes in a spatial format, as either points or polygons, we can use ABDtools to annotate them with data from the GEE catalog.

HOW TO ANNOTATE DATA WITH ABDtools

In the instructions below we assume that you have a basic understanding of how the ABAP and CWAC packages work and the type of data they provide. If you don’t, please visit their repository pages, and read the use cases.

The only difference between ABAP and CWAC data with respect to how ABDtools works is that each ABAP data point refers to a polygon (pentad) and each CWAC data point refers to a point in space. All GEE data comes in the form of images (pixels), so each spatial point from CWAC corresponds to a single pixel in a GEE image, but an ABAP pentad contains multiple image pixels. Therefore, we need a way to summarise the values of the different pixels contained in a polygon. We will use functions, which in GEE jargon are called “reducers”, such as “mean” or “count”. Keep this in mind when reading through the next sections.
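To make the idea of a reducer concrete, here is a small base R sketch that does not touch GEE at all. The pixel values and the polygon mask are made up for illustration; GEE performs an equivalent operation on its servers.

```r
# A toy 4 x 4 "image": each cell is one pixel value (made-up numbers)
img <- matrix(c(  0,  10,  20,  30,
                 40,  50,  60,  70,
                 80,  90, 100, 110,
                120, 130, 140, 150),
              nrow = 4, byrow = TRUE)

# A point (e.g. a CWAC site) falls in exactly one pixel
point_value <- img[2, 3]

# A polygon (e.g. an ABAP pentad) covers several pixels; here a 2 x 2 block
in_polygon <- matrix(FALSE, nrow = 4, ncol = 4)
in_polygon[2:3, 2:3] <- TRUE

# "Reducers" collapse the covered pixels into a single value
polygon_mean  <- mean(img[in_polygon])   # analogous to a "mean" reducer
polygon_count <- sum(in_polygon)         # analogous to a "count" reducer
```

The point needs no reducer because there is only one value to return, whereas the polygon needs us to decide how its four pixels should be combined.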

Installation

Install the latest version of ABDtools, which for now is only present on GitHub.

install.packages("remotes") # In case this is not already installed
remotes::install_github("AfricaBirdData/ABDtools")

The package ABDtools builds upon rgee, which translates R code into Python code using reticulate, and allows us to use the Google Earth Engine (GEE) Python client libraries from R! You can find extensive documentation about rgee here.

But first, we need to create a GEE account, install rgee (and dependencies) and configure our Google Drive to upload and download data to and from GEE. There are other ways to upload and download data, but we recommend Google Drive because it is free, simple and effective. Configuration is a bit of a process, but you will have to do this only once.

The above steps are outside the scope of ABDtools, so if you get stuck along the way, please search the web or contact the developers of these packages directly.

Well done if you managed that! With configuration out of the way, let’s see how to annotate some data. We’ve coded some wrappers around basic functions from the rgee package to provide some basic functionality that requires almost no knowledge of GEE. However, if you want more complicated workflows, we totally recommend learning how to use GEE and rgee to exploit their enormous power.

Initialize

Most image processing and handling of spatial features happens on GEE servers. Therefore, there will be a constant flow of information between our computer (client) and GEE servers (server). This information flow happens through Google Drive. So when we start our session we need to initialize a connection with GEE and Google Drive.

# Initialize Earth Engine
library(rgee)

# Check installation
ee_check()
#> ◉  Python version
#> ✔ [Ok] /home/pachorris/.virtualenvs/rgee/bin/python v3.8
#> ◉  Python packages:
#> ✔ [Ok] numpy
#> ✔ [Ok] earthengine-api

# Initialize rgee and Google Drive
ee_Initialize(drive = TRUE)
#> ── rgee 1.1.5 ─────────────────────────────────────── earthengine-api 0.1.323 ── 
#>  ✔ user: not_defined
#>  ✔ Google Drive credentials: ✔ Google Drive credentials:  FOUND
#>  ✔ Initializing Google Earth Engine: ✔ Initializing Google Earth Engine:  DONE!
#>  ✔ Earth Engine account: users/ee-assets 
#> ────────────────────────────────────────────────────────────────────────────────

Make sure that all tests and checks are passed. If so, you are good to go!

Uploading data to GEE

Firstly, you will need to upload the data you want to annotate to GEE. These data will go to your ‘assets’ directory on the GEE server and they will stay there until you remove them. So if you have uploaded some data, you don’t have to upload them again.

GEE-related functions from the ABDtools package work with spatial data and therefore our detection data must be uploaded as spatial objects.

For example, to upload all ABAP pentads in the North West province of South Africa (these are already in sf format out of the box!), we can use:

library(ABAP)
library(sf)
library(dplyr, warn.conflicts = FALSE)
library(ABDtools)

# Load ABAP pentads
pentads <- getRegionPentads(.region_type = "province", .region = "North West")

# Set an ID for your remote asset (data in GEE)
assetId <- file.path(ee_get_assethome(), 'pentads')

# Upload to GEE (if not done already - do this only once)
uploadFeaturesToEE(feats = pentads,
                   asset_id = assetId,
                   load = FALSE)

# Create a local handle to the remote asset so that we can work with it
ee_pentads <- ee$FeatureCollection(assetId)

Now, the object pentads lives on your machine, but the object ee_pentads lives on the GEE server. You only have a “handle” to it on your machine to manipulate it. This might seem a bit confusing at first but you will get used to it.

We are now going to also upload some CWAC count data to see how annotating these data compares to ABAP data.

GEE can be temperamental with the type of data you upload, so we recommend leaving in the data frame only those variables that you will need for your GEE analysis. Remember to always include an identifier field that allows you to join the results back to your original data. In this case, we will leave only the start date of the survey and the aforementioned identifier. GEE likes dates in a character format, so let’s transform the variable before proceeding.


# Select variables and format them
# (bd_counts is assumed to be an sf point object of CWAC counts with an ID column)
counts_to_upload <- bd_counts %>%
  dplyr::select(ID, StartDate) %>%
  mutate(StartDate = as.character(StartDate))

# Set an ID for your remote asset (data in GEE)
assetId <- file.path(ee_get_assethome(), 'my_cwac_counts')

# Upload to GEE (if not done already - do this only once)
uploadFeaturesToEE(feats = counts_to_upload,
                   asset_id = assetId,
                   load = FALSE)

# Create a local handle to the remote asset so that we can work with it
ee_counts <- ee$FeatureCollection(assetId)
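The identifier field matters once results come back from GEE. As a rough sketch in base R, the example below trims a data frame before upload and joins annotated values back afterwards. All values are made up, and the `annotated` data frame is a hypothetical stand-in for what an annotation function might return, not actual ABDtools output.

```r
# Original data with more columns than we want to send to GEE (made-up values)
counts <- data.frame(
  ID        = 1:3,
  StartDate = as.Date(c("2010-01-15", "2010-02-20", "2010-03-05")),
  Count     = c(12, 7, 30),
  Notes     = c("windy", "", "flooded")
)

# Keep only what GEE needs, with the date stored as character
to_upload <- counts[, c("ID", "StartDate")]
to_upload$StartDate <- as.character(to_upload$StartDate)

# Hypothetical annotated result, returned in a different row order
annotated <- data.frame(ID = c(3, 1, 2), occurrence = c(55.2, 3.1, 18.7))

# Join the new variable back to the original data using the identifier
counts_annotated <- merge(counts, annotated, by = "ID", all.x = TRUE)
```

Because the join is done on ID, it does not matter in which order the annotated rows come back or which columns were left out of the upload.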

Annotate data with a GEE image

An image in GEE jargon is the same thing as a raster in R. There are also image collections which are like raster stacks (we’ll see more about these later). You can find a full catalog of what is available in GEE here. If you want to use data from a single image you can use the function addVarEEimage().

For example, let’s annotate our ABAP pentads with surface water occurrence, which is the frequency with which water was present in each pixel. We’ll need the name of the layer in GEE, which is given in the field “Earth Engine Snippet”. Images can have multiple bands – in this case, we select “occurrence”.

Finally, images have pixels but our spatial objects are polygons, so we need to define some type of summarizing function, just as raster::extract() would need. In GEE this is called a “reducer”. In this case, we will select the “mean” function (i.e., mean water occurrence per pixel within each pentad).


pentads_water <- addVarEEimage(ee_feats = ee_pentads,                   # Note that we need our remote asset here
                               image = "JRC/GSW1_3/GlobalSurfaceWater",   # You can find this in the code snippet
                               reducer = "mean",
                               bands = "occurrence")

If we were to annotate CWAC counts, everything would be a bit easier, because counts are not associated with polygons but with specific locations in space (i.e., points).

counts_water <- addVarEEimage(ee_feats = ee_counts,                   # Note that we need our remote asset here
                              image = "JRC/GSW1_3/GlobalSurfaceWater",   # You can find this in the code snippet
                              bands = "occurrence")

Note that now we didn’t need to include a reducer argument, because we don’t need to summarise multiple pixels enclosed by a polygon.

Annotate data with a GEE collection

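Sometimes environmental data don’t come in a single image, but in multiple images. For example, we might have one image for each day, week, month, etc. Again, we can check all available data in the GEE catalog. When we want to annotate data with a collection, we have two options:

1. Summarise the collection over time with a temporal reducer and annotate our data with the resulting single image, using addVarEEcollection().
2. Annotate each data point with the image in the collection that is closest in time to it, using addVarEEclosestImage().

We demonstrate both options below by annotating data with the TerraClimate dataset, which provides monthly climate data. If we were interested in annotating pentad data with the mean minimum temperature across the year 2010, we could use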


pentads_tmmn <- addVarEEcollection(ee_feats = ee_pentads,                    # Note that we need our remote asset here
                                   collection = "IDAHO_EPSCOR/TERRACLIMATE",   # You can find this in the code snippet
                                   dates = c("2010-01-01", "2011-01-01"),
                                   temp_reducer = "mean",
                                   spt_reducer = "mean",
                                   bands = "tmmn")

Note that in this case, we had to specify a temporal reducer temp_reducer to summarize pixel values over time. A reducer in GEE is a function that summarizes temporal or spatial data. The dates argument subsets the whole TerraClimate dataset to those images between dates[1] (inclusive) and dates[2] (exclusive). Effectively, the function computes a summary of the values of each pixel (a mean, in this case) across all images (i.e., months), to create a single image. Then, it uses this new image to annotate our data.

We then use the argument spt_reducer to specify a function to summarise pixel values contained in each pentad (a spatial summary, as opposed to the previous temporal summary). As explained earlier, this spatial summary is not necessary when working with point data, such as CWAC counts.
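The order of operations described above (filter by date, reduce over time, then reduce over space) can be pictured with a toy example in base R. This is only a sketch with made-up numbers and a shorter date window than the one used above; it is not how ABDtools or GEE implement the computation.

```r
# Twelve monthly "images" of 3 x 3 pixels stacked in an array (made-up values):
# every pixel in month m has the value m
months <- seq(as.Date("2010-01-01"), by = "month", length.out = 12)
stack  <- array(rep(1:12, each = 9), dim = c(3, 3, 12))

# Date filter: first date inclusive, second date exclusive
dates <- as.Date(c("2010-01-01", "2010-07-01"))
keep  <- months >= dates[1] & months < dates[2]   # January to June

# Temporal reducer: one value per pixel across the selected images
temporal_mean <- apply(stack[, , keep], c(1, 2), mean)

# Spatial reducer: one value per polygon across the pixels it contains
in_polygon <- matrix(c(TRUE,  TRUE,  FALSE,
                       TRUE,  TRUE,  FALSE,
                       FALSE, FALSE, FALSE),
                     nrow = 3, byrow = TRUE)
polygon_value <- mean(temporal_mean[in_polygon])
```

Note that the image dated 2010-07-01 is excluded because the second date is exclusive, so only six images contribute to the temporal mean.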

If we wanted to annotate data with the closest image in the collection, instead of with a temporal summary, then we would need to upload to GEE data with an associated date. Dates must be in a character format (“yyyy-mm-dd”) and the variable must be called ‘Date’ (case sensitive). We already did some of this at the beginning (please check the section ‘Uploading data to GEE’ if you can’t remember), but we will need to adjust our data slightly to work with TerraClimate data.

TerraClimate offers monthly data and the date associated with each image is always the first day of the month. This means that if we have data corresponding to a date after the 15th of the month they will be matched against the next month, because that’s the one closest in time.

Each image collection has its own convention, and we must check what is appropriate in each case. Here, for illustration purposes, we will change all dates in our data to be on the first of the month to match TerraClimate.
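The effect of this convention is easy to check in base R. The sketch below uses a hypothetical helper, closest_image(), which is not part of ABDtools, and assumes that the maximum allowed time difference is measured in days.

```r
# TerraClimate-like image dates: the first day of each month in 2008
image_dates <- seq(as.Date("2008-01-01"), by = "month", length.out = 12)

# Hypothetical helper: index of the image closest in time to `date`,
# or NA if the closest image is more than `maxdiff` days away
closest_image <- function(date, image_dates, maxdiff = 15) {
  diffs <- abs(as.numeric(image_dates - date))
  i <- which.min(diffs)
  if (diffs[i] > maxdiff) NA_integer_ else i
}

obs <- as.Date("2008-03-20")

# A visit on 20 March is closer to 1 April than to 1 March
matched_raw <- image_dates[closest_image(obs, image_dates)]

# Moving the date to the first of the month forces a match with March
obs_floor     <- as.Date(format(obs, "%Y-%m-01"))
matched_floor <- image_dates[closest_image(obs_floor, image_dates)]
```

Here matched_raw is the April image and matched_floor is the March image, which is why we floor the dates before uploading them.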

As an example, let’s download ABAP data for the Maccoa Duck in 2008 and annotate these data with TerraClimate’s minimum temperature data.


# Load ABAP pentads
pentads <- getRegionPentads(.region_type = "country", .region = "South Africa")

# Download Maccoa Duck
id <- searchAbapSpecies("Duck") %>% 
    filter(Common_species == "Maccoa") %>% 
    pull(SAFRING_No)

visit <- getAbapData(.spp_code = id,
                     .region_type = "country",
                     .region = "South Africa",
                     .years = 2008)

# Make spatial object
visit <- visit %>% 
  left_join(pentads, by = c("Pentad" = "Name")) %>% 
  st_sf() %>% 
  filter(!st_is_empty(.))   # Remove rows without geometry

# NOTE: TerraClimate offers monthly data. The date of each image is the beginning of
# the month, which means that dates after the 15th will be matched against the
# next month. I will change all dates to be on the first of the month for the
# analysis
visit <- visit %>% 
    dplyr::select(CardNo, StartDate, Pentad, TotalHours, Spp) %>% 
    mutate(Date = lubridate::floor_date(StartDate, "month"))

# Load to EE (if not done already)
assetId <- file.path(ee_get_assethome(), 'visit2008')

# Format date and upload to GEE
visit %>%
    dplyr::select(CardNo, Pentad, Date) %>%
    mutate(Date = as.character(Date)) %>%   # GEE doesn't like dates
    sf_as_ee(assetId = assetId,
             via = "getInfo_to_asset")

# Load the remote data asset
ee_visit <- ee$FeatureCollection(assetId)

# Annotate with GEE TerraClimate
visit_new <- addVarEEclosestImage(ee_feats = ee_visit,
                                  collection = "IDAHO_EPSCOR/TERRACLIMATE",
                                  reducer = "mean",                          # We only need spatial reducer
                                  maxdiff = 15,                              # This is the maximum time difference that GEE checks
                                  bands = c("tmmn"))

Convert an image collection to a multi-band image

Lastly, we have made a convenience function that converts an image collection into a multi-band image. This is useful because you can only annotate one image at a time, but all the bands in the image get annotated. So if you want to add several variables to your data, you can first create a multi-band image and then annotate with all bands at once. In this way you minimize the traffic between your machine and GEE servers saving precious time and bandwidth.

Here we show how to find the mean NDVI for each year between 2008 and 2019, create a multi-band image and annotate our data with these bands.


# Create a multi-band image with mean NDVI for each year
multiband <- EEcollectionToMultiband(collection = "MODIS/006/MOD13A2",
                                     dates = c("2008-01-01", "2020-01-01"),
                                     band = "NDVI",                       # You can find what bands are available from GEE catalog
                                     group_type = "year",
                                     groups = 2008:2019,
                                     reducer = "mean",
                                     unmask = FALSE)

# Find mean NDVI for each pentad and year
pentads_ndvi <- addVarEEimage(ee_feats = ee_pentads,
                              image = multiband,
                              reducer = "mean")

CWAC count data would be annotated in the exact same way but without any reducer (reducer = NULL).
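Conceptually, grouping a collection by year and reducing each group gives one value per pixel per year, and each of those becomes a band. A minimal base R sketch for a single pixel is shown below; the NDVI values are made up and the band names are hypothetical, so they may not match what EEcollectionToMultiband() produces.

```r
# Made-up monthly NDVI values for one pixel, 2008 to 2010
dates <- seq(as.Date("2008-01-01"), by = "month", length.out = 36)
ndvi  <- rep(c(0.25, 0.5, 0.75), each = 12)

# Group the images by year and reduce each group with the mean
years       <- as.integer(format(dates, "%Y"))
yearly_mean <- tapply(ndvi, years, mean)

# One "band" per year (hypothetical band names)
bands <- setNames(as.numeric(yearly_mean),
                  paste0("NDVI_", names(yearly_mean)))
```

Annotating with the multi-band image then adds one column per band in a single pass, instead of one request per year.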

Use reference polygons to annotate CWAC data

Up until now, we have been using the centroid of the CWAC site as a reference to annotate our counts with. However, we might be interested in taking a broader reference area. For example, we might want to use the boundaries of the CWAC site where the counts were collected. We can use any other polygon we want, like the catchment the wetland belongs to, for example. Which polygon to use will depend on the objectives of the study.

Here we will focus on the boundaries of the CWAC site, which can be downloaded from the CWAC server, but the workflow will be exactly the same for any other polygons.

We will use the same function uploadFeaturesToEE() to upload our counts, but now we will have polygons associated with these counts. The workflow is almost exactly the same as the one we saw for pentads. Now, we will need to specify a spatial reducer in some of the functions. This is because our pixel information will now need to be summarized to a single value for each polygon.

The first thing we need to do is to download the polygons corresponding to the sites we are interested in. We can do this with the function getCwacSiteBoundary().


# Download our counts for the Black Duck just in case they got lost
counts <- listCwacSpp() %>% 
  filter(Common_species == "African Black",
         Common_group == "Duck") %>% 
  pull("SppRef") %>% 
  getCwacSppCounts()

# Then let's extract the boundaries of the CWAC sites in our data
# At the moment getCwacSiteBoundary() can only retrieve boundaries from sites of
# one province/country at a time

# First identify the countries in our data
unique(counts$Country)

# It's only South Africa at the moment, so we can pull all the sites at once. Otherwise,
# we would just repeat the process for the different countries.
sites <- unique(counts$LocationCode)

boundaries <- getCwacSiteBoundary(loc_code = sites,
                                  region_type = "country",
                                  region = "South Africa")

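# Add boundaries to the count data and convert to a spatial (sf) object,
# so that the geometry stays attached when we select variables later
counts_bd <- counts %>% 
  dplyr::left_join(boundaries, by = "LocationCode") %>% 
  sf::st_sf()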

You may notice that some of the site boundaries are not yet available in the database. This will hopefully be fixed soon! For the purpose of this tutorial we will focus on those sites for which boundaries are available, but you probably shouldn’t do this in your own analysis!

# Filter counts to those with boundary geometry
counts_bd <- counts_bd %>% 
  dplyr::filter(!sf::st_is_empty(geometry))
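Before dropping sites without boundaries in a real analysis, it is worth checking how much data would be lost. The base R sketch below uses a made-up data frame in which has_boundary stands in for !sf::st_is_empty(geometry); the site codes are hypothetical.

```r
# Made-up counts: three sites, one of them without a boundary polygon
toy_counts <- data.frame(
  LocationCode = c("A", "A", "B", "C", "C", "C"),
  has_boundary = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# How many counts, and which sites, would the filter remove?
n_total       <- nrow(toy_counts)
n_missing     <- sum(!toy_counts$has_boundary)
sites_missing <- unique(toy_counts$LocationCode[!toy_counts$has_boundary])
prop_lost     <- n_missing / n_total
```

If a large share of the counts would be lost, falling back to the site centroid for those sites may be preferable to removing them.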

After this we just need to follow a very similar procedure as before to annotate with layers from the Google Earth Engine catalog.


# Select variables and format them
counts_to_upload <- counts_bd %>%
  dplyr::select(ID, StartDate) %>%
  mutate(StartDate = as.character(StartDate))

# Set an ID for your remote asset (data in GEE)
assetId <- file.path(ee_get_assethome(), 'cwac_counts_bd')

# Upload to GEE (if not done already - do this only once)
ee_counts <- uploadFeaturesToEE(feats = counts_to_upload,
                                asset_id = assetId,
                                load = TRUE)

# Annotate with surface water occurrence. Now we need to specify a reducer function to
# summarise our variable per polygon. We will use the mean in this case
counts_water <- addVarEEimage(ee_feats = ee_counts,
                              image = "JRC/GSW1_3/GlobalSurfaceWater",
                              bands = "occurrence",
                              reducer = "mean") # Note this reducer to summarize per polygon

# We could similarly annotate with an image collection. In this case, we use
# the minimum temperature from TerraClimate and we will have to specify a
# spatial reducer, in addition to the temporal reducer specified earlier. We will
# use the spatial minimum of the temporal mean of minimum temperature.
counts_tmmn <- addVarEEcollection(ee_feats = ee_counts, 
                                  collection = "IDAHO_EPSCOR/TERRACLIMATE",
                                  dates = c("2010-01-01", "2011-01-01"),
                                  temp_reducer = "mean",
                                  spt_reducer = "min",
                                  bands = "tmmn")

The other annotating functions will work similarly. We just need to think of using the appropriate spatial reducer.

INSTRUCTIONS TO CONTRIBUTE CODE

First clone the repository to your local machine:

For site owners:

There is the danger of multiple people working simultaneously on the project code. If you make changes locally on your computer and, before you push your changes, others push theirs, there might be conflicts. This is because the HEAD pointer in the main branch has moved since you started working.

To deal with these lurking issues, I would suggest opening and working on a topic branch. This is just a regular branch that has a short lifespan. In steps:

Opening branches is quick and easy, so there is no harm in opening multiple branches a day. However, it is important to merge and delete them often to keep things tidy. Git provides functionality to deal with conflicting branches. More about branches here:

https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell

Another idea is to use the ‘issues’ tab that you find in the project header. There, we can identify issues with the package, assign tasks and warn other contributors that we will be working on the code.