Awesome
saqgetr <a href='https://github.com/skgrange/saqgetr'><img src='man/figures/logo.png' align="right" height="131.5" /></a>
saqgetr is an R package to import air quality monitoring data in a fast and easy way. Currently, only European data are available, but the package is generic and therefore data from other areas may be included in the future. For documentation on what data sources are accessible, please see saqgetr's technical note.
saqgetr has been made possible with the help of Ricardo Energy & Environment.
Retirement note
saqgetr will be retired in mid-2024. There are several reasons for the retirement, but the main points are that I no longer have the scope to ensure I catch all issues when they arise, the access to the remote servers used for saqgetr has become progressively more difficult due to my relocation and stricter security policies, and the near-real-time (E2a) data flow contains far more unreliable observations that in the past that are not being fixed or updated but the member states. Therefore, the database underlying saqgetr requires more maintenance than I can provide. The final update of observations was conducted on 2024-02-17
.
Installation
saqgetr is available on CRAN and can be installed in the normal way:
# Install saqgetr package
install.packages("saqgetr")
If desired, the development version can be installed with the help of devtools or remotes like this:
# Install development version of saqgetr
remotes::install_github("skgrange/saqgetr")
Framework
saqgetr acts as an interface to pre-prepared data files located on a web server. For each monitoring site serviced, there is a single file containing all observations for each year. There are a collection of metadata tables too which enable users to further understand the location and type of observations are available. The data files are compressed text files (.csv.gz
) which allows for simple and fast importing and if other interfaces wish to be developed, this should be simple.
Usage
Sites
To import data with saqgetr, functions with the get_saq_*
prefix are used. A monitoring site must be supplied to get observations. To find what sites are available use get_saq_sites
:
# Load packages
library(dplyr)
library(saqgetr)
# Import site information
data_sites <- get_saq_sites()
# Glimpse tibble
glimpse(data_sites)
#> Observations: 9,016
#> Variables: 16
#> $ site <chr> "ad0942a", "ad0944a", "ad0945a", "al0201a", "a…
#> $ site_name <chr> "Fixa", "Fixa oz", "Estacional oz Envalira", "…
#> $ latitude <dbl> 42.50969, 42.51694, 42.53488, 41.33027, 41.345…
#> $ longitude <dbl> 1.539138, 1.565250, 1.716986, 19.821772, 19.85…
#> $ elevation <dbl> 1080, 1637, 2515, 162, 207, 848, 25, 1, 13, 15…
#> $ country <chr> "andorra", "andorra", "andorra", "albania", "a…
#> $ country_iso_code <chr> "AD", "AD", "AD", "AL", "AL", "AL", "AL", "AL"…
#> $ site_type <chr> "background", "background", "background", NA, …
#> $ site_area <chr> "urban", "rural", "rural", NA, NA, "suburban",…
#> $ date_start <dttm> 2013-12-31 23:00:00, 2013-12-31 23:00:00, 201…
#> $ date_end <dttm> 2019-04-27 14:00:00, 2019-04-27 14:00:00, 201…
#> $ network <chr> "NET-AD001A", "NET-AD001A", "NET-AD001A", NA, …
#> $ eu_code <chr> "STA-AD0942A", "STA-AD0944A", "STA-AD0945A", N…
#> $ eoi_code <chr> "AD0942A", "AD0944A", "AD0945A", NA, NA, "AL02…
#> $ observation_count <dbl> 309037, 45174, 18268, 168983, 140812, 247037, …
#> $ data_source <chr> "aqer:e1a; aqer:e2a", "aqer:e1a; aqer:e2a", "a…
Observations
Sites are represented by a code which is prefixed with the country's ISO code, for example, a site in York, England, United Kingdom is identified as gb0919a
(the ISO code for the United Kingdom is non-standard and GB is for Great Britain). To get observations this site, use get_saq_observations
:
# Get air quality monitoring data for a York site
data_york <- get_saq_observations(site = "gb0919a", start = 2005)
# Glimpse tibble
glimpse(data_york)
#> Observations: 370,235
#> Variables: 10
#> $ date <dttm> 2008-01-01, 2008-01-02, 2008-01-03, 2008-01-04, 2008-…
#> $ date_end <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ site <chr> "gb0919a", "gb0919a", "gb0919a", "gb0919a", "gb0919a",…
#> $ variable <chr> "pm10", "pm10", "pm10", "pm10", "pm10", "pm10", "pm10"…
#> $ process <int> 62392, 62392, 62392, 62392, 62392, 62392, 62392, 62392…
#> $ summary <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20…
#> $ validity <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, …
#> $ unit <chr> "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", …
#> $ value <dbl> 21.625, 22.708, 24.667, 21.833, 24.000, 29.875, 16.833…
get_saq_observations
takes a vector of sites to import many sites at once. Beware that if a user stacks the sites, a lot of data can be returned. For example, using the two sites below returns a tibble/data frame/table with over 10 million observations.
# Get 10 million observations, verbose is used to give an indication on
# what is occuring
data_large_ish <- get_saq_observations(
site = c("gb0036r", "gb0682a"),
start = 1960,
verbose = TRUE
)
# Glimpse tibble
glimpse(data_large_ish)
#> Observations: 9,981,977
#> Variables: 9
#> $ date <dttm> 1995-09-11, 1995-09-12, 1995-09-13, 1995-09-14, 1995-…
#> $ date_end <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ site <chr> "gb0036r", "gb0036r", "gb0036r", "gb0036r", "gb0036r",…
#> $ variable <chr> "so2", "so2", "so2", "so2", "so2", "so2", "so2", "so2"…
#> $ process <int> 57295, 57295, 57295, 57295, 57295, 57295, 57295, 57295…
#> $ summary <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20…
#> $ validity <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ unit <chr> "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", "µg/m3", …
#> $ value <dbl> 0.983, 0.792, 1.362, 0.483, 14.633, 1.171, 0.821, 15.2…
Cleaning observations
Once a data are imported, valid data for a certain averaging period/summary can be isolated with saq_clean_observations
. saq_clean_observations
can also "spread" data where the variable/pollutants become columns:
# Get only valid hourly data and reshape (spread)
data_york_spread <- data_york %>%
saq_clean_observations(summary = "hour", valid_only = TRUE, spread = TRUE)
# Glimpse tibble
glimpse(data_york_spread)
Processes
Information on the specific time series/processes can also be retrieved.
# Get processes
data_processes <- get_saq_processes()
# Glimpse tibble
glimpse(data_processes)
#> Observations: 171,992
#> Variables: 15
#> $ process <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ site <chr> "al0201a", "al0201a", "al0201a", "al0201a", "a…
#> $ variable <chr> "so2", "so2", "pm10", "pm10", "o3", "o3", "o3"…
#> $ variable_long <chr> "Sulphur dioxide (air)", "Sulphur dioxide (air…
#> $ period <chr> "day", "hour", "day", "hour", "day", "dymax", …
#> $ unit <chr> "ug.m-3", "ug.m-3", "ug.m-3", "ug.m-3", "ug.m-…
#> $ date_start <dttm> NA, 2011-01-01 00:00:00, 2011-01-01 00:00:00,…
#> $ date_end <dttm> NA, 2011-12-31 23:00:00, 2012-12-30 00:00:00,…
#> $ sample <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ sampling_point <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ sampling_process <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ observed_property <int> 1, 1, 5, 5, 7, 7, 7, 7, 8, 8, 9, 9, 10, 10, 10…
#> $ group_code <int> 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
#> $ data_source <chr> "airbase", "airbase", "airbase", "airbase", "a…
#> $ observation_count <dbl> 0, 6806, 729, 17336, 352, 352, 16413, 8358, 69…
Other metadata
Other helper tables are also available:
# Get other helper tables
# Summary integers
data_summary_integers <- get_saq_summaries() %>%
print(n = Inf)
#> # A tibble: 20 x 2
#> averaging_period summary
#> <chr> <int>
#> 1 hour 1
#> 2 day 20
#> 3 week 90
#> 4 var 91
#> 5 month 92
#> 6 fortnight 93
#> 7 3month 94
#> 8 2month 95
#> 9 2day 96
#> 10 3day 97
#> 11 2week 98
#> 12 4week 99
#> 13 3hour 100
#> 14 8hour 101
#> 15 hour8 101
#> 16 year 102
#> 17 dymax 21
#> 18 quarter 103
#> 19 other 91
#> 20 n-hour 104
# Validity integers
data_validity_integers <- get_saq_validity() %>%
print(n = Inf)
#> # A tibble: 6 x 4
#> validity valid description notes
#> <int> <lgl> <chr> <chr>
#> 1 NA FALSE data is considered to be invalid due to the… from aqer
#> 2 -1 FALSE invalid due to other circumstances or data … from aqer
#> 3 0 FALSE invalid smonitor nom…
#> 4 1 TRUE <NA> from aqer
#> 5 2 TRUE valid but below detection limit measurement… from aqer
#> 6 3 TRUE valid but below detection limit and number … from aqer
Simple annual and monthly means of observations
Simple annual and monthly means of the daily and hourly processes have also been generated. These summaries are often useful for trend analysis or mapping.
# Get annual means
data_annual <- get_saq_simple_summaries(summary = "annual_mean")
# Glimpse tibble
glimpse(data_annual)
#> Observations: 655,362
#> Variables: 8
#> $ date <dttm> 2013-01-01, 2014-01-01, 2015-01-01, 2016-01-01, …
#> $ date_end <dttm> 2013-12-31 23:59:59, 2014-12-31 23:59:59, 2015-1…
#> $ site <chr> "ad0942a", "ad0942a", "ad0942a", "ad0942a", "ad09…
#> $ variable <chr> "co", "co", "co", "co", "co", "co", "co", "no", "…
#> $ summary_source <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ summary <int> 102, 102, 102, 102, 102, 102, 102, 102, 102, 102,…
#> $ count <dbl> 1, 8438, 8385, 8171, 8441, 8217, 5990, 1, 8310, 8…
#> $ value <dbl> 0.5000000, 0.3224579, 0.3582230, 0.3168768, 0.259…
# What was York Fishergate's (hourly) PM10 concentraion in 2017?
data_annual %>%
filter(site == "gb0682a",
lubridate::year(date) == 2017L,
variable == "pm10",
summary_source == 1L) %>%
select(date,
site,
variable,
count,
value)
#> # A tibble: 1 x 5
#> date site variable count value
#> <dttm> <chr> <chr> <dbl> <dbl>
#> 1 2017-01-01 00:00:00 gb0682a pm10 8442 23.8