Awesome
<a href="https://github.com/datazoompuc/datazoom.amazonia"><img src="https://raw.githubusercontent.com/datazoompuc/datazoom.amazonia/master/logo.png?token=AU72KNQCGY4ZR6XLARTNSXLBXCPNG" align="left" width="100" hspace="10" vspace="6"></a>
<!-- README.md is generated from README.Rmd. Please edit that file -->datazoom.amazonia
<!-- badges: start --> <!-- badges: end -->The datazoom.amazonia package facilitates access to official Brazilian Amazon data, including agriculture, deforestation, production. The package provides functions that download and pre-process selected datasets.
Installation
You can install the released version of datazoom.amazonia
from
CRAN with:
install.packages("datazoom.amazonia")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("datazoompuc/datazoom.amazonia")
<table>
<tr>
<td>
PRODES | Yearly deforestation |
DETER | Alerts on forest cover changes |
DEGRAD | Forest degradation |
Imazon | Deforestation pressure in the Amazon |
IBAMA | Environmental fines |
MapBiomas | Land cover and land use |
TerraClimate | Climate data |
SEEG | Greenhouse gas emission estimates |
CENSOAGRO | Agriculture activities |
IPS | Amazon Social Progress Index |
DATASUS | Causes of mortality and availability of hospital beds |
IEMA | Access to electricity in the Amazon region |
Population | Population |
COMEX | Brazilian international trade |
BACI | Global international trade |
PIB-Munic | Municipal GDP |
CEMPRE | Central register of companies |
PAM | Agricultural production |
PEVS | Forestry and extraction |
PPM | Livestock farming |
SIGMINE | Mining |
ANEEL | Energy development |
EPE | Energy consumption |
Legal Amazon Municipalities | Dataset with brazilian cities and whether they belong to the Legal Amazon |
The ‘googledrive’ package | Troubleshooting and information for downloads from Google Drive |
Environmental Data
PRODES
The PRODES project uses satellites to monitor deforestation in Brazil’s Legal Amazon. The raw data reports total and incremental (year-by-year) low-cut deforested area at the municipality level, going back to the year 2000.
Data is collected based on the PRODES-year, which starts at August 1st and ends on July 31st. Accordingly, 2018 deforestation data covers the period from 01/08/2017 to 31/07/2018.
INPE’s most recent data is now published at TerraBrasilis. We have refrained from updating to this new source, as it only contains detailed spatial data, rather than agregated, municipality-level data.
Options:
-
dataset:
"deforestation"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated data (raw_data = FALSE)
# in portuguese (language = 'pt').
data <- load_prodes(
raw_data = FALSE,
language = "pt"
)
DETER
DETER uses satellite surveillance to detect and report changes in forest cover across the Legal Amazon and the Cerrado biome. Each data point consists of a warning, describing which type of change has affected a certain area of forest at a given date. Broadly speaking, it makes a distinction between events of deforestation, degradation and logging. The data extracted here spans from 2016 onward in the Amazon, and from 2018 onward in the Cerrado.
The raw DETER data shows one warning per row, with each row also
containing a municipality. However, many warnings actually overlap with
2 or up to 4 municipalities, which are not shown in the original data.
Therefore, when the option raw_data = FALSE
is selected, the original
spatial information is intersected with a municipalities map of Brazil,
and each warning can be split into more than one row, with each row
corresponding to a municipality.
Options:
- dataset: there are two options:
"deter_amz"
for data from the Amazon"deter_cerrado"
for data from the Cerrado
- raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
- language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated data (raw_data = FALSE) from Amazonia (dataset = "deter_amz")
deter_amz <- load_deter(
dataset = "deter_amz",
raw_data = FALSE
)
DEGRAD
The DEGRAD project uses satellites to monitor degradation of forest areas. Raw data is available as simple features (sf) objects, read from shapefiles. The project was substituted in 2016 by DETER-B. Accordingly, data is available from 2007 up to 2016.
Original documentation for this data is very scarce, users beware. Some things to keep in mind are:
Event data is organized through yearly editions (DEGRAD 2007-2016). Inside a given edition however, there may be data from different years (events that happened in 2015 inside DEGRAD 2016 for example).
This package provides degradation data with municipality identification.
It does this by intersecting DEGRAD geometries with IBGE’s municipality
geometries from the year 2019. CRS metadata however is missing from the
original data source. A best effort approach is used and a CRS is
assumed
(proj4string: "+proj=longlat +ellps=aust_SA +towgs84=-66.8700,4.3700,-38.5200,0.0,0.0,0.0,0.0 +no_defs")
.
Options:
-
dataset:
"degrad"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated data (raw_data = TRUE) related to forest degradation
# from 2010 to 2012 (time_period = 2010:2012).
data <- load_degrad(
dataset = "degrad",
raw_data = FALSE,
time_period = 2010:2012
)
Imazon
Loads data categorizing each municipality by the level of deforestation pressure it faces. The categories used by Imazon have three levels, ranging from 0 to 3.
Options:
-
dataset:
"imazon_shp"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated data
data <- load_imazon(raw_data = FALSE)
🔴 This function uses the googledrive
package to download data. In
case of authentication errors, see googledrive.
IBAMA
The dataset is originally from the Brazilian Institute of Environment and Renewable Natural Resources (Ibama), documenting environmental embargoes and fines at the individual level from 2005 to the present day. In addition, it is possible to download distributed and collected fines from 1994 until the present day.
The function returns either the raw data or a data frame with aggregates considering, for each time-location period, counts for total the number of infractions, infractions that already went to trial, and number of unique perpetrators of infractions. There are also two data frames regarding distributed and collected fines across municipalities
Options:
-
dataset: there are three possible choices.
"embargoed_areas"
: embargoed areas"distributed_fines"
: fines that have not been paid by individuals or corporations"collected_fines"
: fines that have been paid by individuals or corporations
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
states: specifies for which states to download the data. It is “all” by default, but can be a single state such as
"AC"
or any vector such asc("AC", "AM")
. Does not apply to the"embargoed_areas"
dataset. -
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
library(datazoom.amazonia)
# Download treated embargoes data (raw_data = FALSE) in english (language = "eng")
data <- load_ibama(
dataset = "embargoed_areas", raw_data = FALSE,
language = "eng"
)
# Download treated collected fines data from "BA"
data <- load_ibama(
dataset = "collected_fines", raw_data = FALSE,
states = "BA", language = "pt"
)
MapBiomas
The MapBiomas project gathers data reporting the type of land covering each year by area, that is, for example, the area used for a temporary crop of soybeans. It also reports the transitions in land coverage between years.
The data has yearly frequency and is available starting from the year 1989.
For all datasets, areas are measured in hectares (ha).
Options:
-
dataset: there are 7 possible choices:
"mapbiomas_cover"
: types of land cover."mapbiomas_transition"
: changes in land cover."mapbiomas_deforestation_regeneration"
: deforestation and forest regeneration."mapbiomas_irrigation"
: irrigated areas."mapbiomas_mining"
: areas used for mining."mapbiomas_water"
: areas of water surface."mapbiomas_fire"
: areas of wildfire burn scars.
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
- For dataset
"mapbiomas_cover"
, can be"municipality"
or"state"
(faster download) - For dataset
"mapbiomas_transition"
, can be"municipality"
or"state"
(faster download) - For dataset
"mapbiomas_deforestation_regeneration"
, can only be"municipality"
- For dataset
"mapbiomas_mining"
, can be"indigenous_land"
or"municipality"
- For dataset
"mapbiomas_irrigation"
, can be"state"
or"biome"
- For dataset
"mapbiomas_water"
, can be"municipality"
,"state"
or"biome"
- For dataset
"mapbiomas_fire"
, can only be"state"
- For dataset
-
language: you can choose between Portuguese
("pt")
and English("eng")
.
Examples:
# download treated Mapbiomas Cover data in English
data <- load_mapbiomas(
dataset = "mapbiomas_cover",
raw_data = FALSE,
geo_level = "municipality",
language = "eng"
)
# download treated data on mining on indigenous lands
data <- load_mapbiomas("mapbiomas_mining",
raw_data = FALSE,
geo_level = "indigenous_land"
)
TerraClimate
Spatial data on several climate variables, extracted from Climatology Lab’s TerraClimate. The table below shows all possible variables to be extracted, which are chosen through the “dataset” parameter. Data ranges from 1958 to 2020.
Netcdf files are downloaded from the THREDDS web server, as recommended for rectangular subsets of the global data.
<details> <summary> Click to see all dataset options </summary>Dataset | Code | Description | Units |
---|---|---|---|
max_temperature | tmax | Maximum 2-m Temperature | degC |
min_temperature | tmin | Minimum 2-m Temperature | degC |
wind_speed | ws | Wind Speed at 10-m | m/s |
vapor_pressure_deficit | vpd | Vapor Pressure Deficit | kPa |
vapor_pressure | vap | 2-m Vapor Pressure | kPa |
snow_water_equivalent | swe | Snow Water Equivalent at End of Month | mm |
shortwave_radiation_flux | srad | Downward Shortwave Radiation Flux at the Surface | W/m^2 |
soil_moisture | soil | Soil Moisture at End of Month | mm |
runoff | q | Runoff | mm |
precipitation | ppt | Accumulated Precipitation | mm |
potential_evaporation | pet | Reference Evapotranspiration | mm |
climatic_water_deficit | def | Climatic Water Deficit | mm |
water_evaporation | aet | Actual Evapotranspiration | mm |
palmer_drought_severity_index | PDSI | Palmer Drought Severity Index | unitless |
Options:
-
dataset: picks the variable to be read. Possible options are shown in the table above.
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
-
legal_amazon_only: if set to
TRUE
, only downloads data from the Legal Amazon region
OBS: A good internet connection is needed, because the data is heavy.
Examples:
# Downloading maximum temperature data from 2000 to 2001
max_temp <- load_climate(dataset = "max_temperature", time_period = 2000:2001)
# Downloading precipitation data only for the legal Amazon in 2010
amz_precipitation <- load_climate(
dataset = "precipitation",
time_period = 2010,
legal_amazon_only = TRUE
)
SEEG
Loads estimates of emission of greenhouse gases of Brazilian cities and states from SEEG. SEEG is the System of Estimates of Emissions and Removals of Greenhouse Gases (SEEG), an initiative of the Observatório do Clima, a network of institutions focused on climate change research in Brazil.
The data provided in SEEG’s Collection 9 is a series covering the period from 1970 to 2020, except for the Land Use Change Sector that has the series from 1990 to 2020.
Using data collected from government entities, institutes, research centers, NGOs and other institutions, the estimates are created using the methodology of the Brazilian Inventory of Anthropic Emissions and Removals of Greenhouse Gases, assembled by the Ministry of Science, Technology and Innovation (MCTI), and the directives of Intergovernmental Panel on Climate Change (IPCC)
Emissions are divided in five main sources: Agricultural and Cattle Raising, Energy, Changes in Use of Land, Industrial Processes and Residues. All greenhouse gases contained in the national inventory are considered, encompassing CO2, CH4, N2O and the HFCs, with the conversion to carbon equivalence (CO2e) also included, both in the metric of GWP (Global Warming Potential) and GTP (Global Temperature Potential).
The data is downloaded from the SEEG website in the form of one single file, so the option to select a certain range of years is not available. Also, due to the size of the file, a stable internet connection is necessary, and the function may take time to run.
Options:
-
dataset: there are six choices:
"seeg"
: provides all sectors in a same dataframe. Only works withraw_data = TRUE
"seeg_farming"
"seeg_industry"
"seeg_energy"
"seeg_land"
"seeg_residuals"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"state"
, or"municipality"
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download raw data (raw_data = TRUE) of greenhouse gases (dataset = "seeg")
# by state (geo_level = "state")
data <- load_seeg(
dataset = "seeg",
raw_data = TRUE,
geo_level = "state"
)
# Download treated data (raw_data = FALSE) of industry greenhouse gases (dataset = "seeg_industry")
data <- load_seeg(
dataset = "seeg_industry",
raw_data = FALSE,
geo_level = "state"
)
🔴 This function uses the googledrive
package to download data at the
municipality level. In case of authentication errors, see
googledrive.
CENSOAGRO
The census of agriculture collects information about agricultural establishments and the agricultural activities carried out there, covering characteristics of the producer and establishment, economy and employment in rural areas, livestock, farming and agroindustry.
Data is collected by IBGE and is available at country, state and municipality level.
Options:
-
dataset:there are 10 possible choices:
"agricultural_land_area"
: area and number of agricultural properties"agricultural_area_use"
: area of agricultural properties by use"agricultural_employees_tractors"
: number of employees and tractors in agricultural properties"agricultural_producer_condition"
: condition of agricultural producer, whether they own the land"animal_production"
: number of animals farmed, by species"animal_products"
: amount of animal products, by product type"vegetable_production_area"
: area and amount produced, by vegetable product"vegetable_production_temporary"
: amount produced, by temporary crop"vegetable_production_permanent"
: amount produced, by permanent crop"livestock_production"
: amount of bovine cattle, and number of agricultural properties
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
or"state"
. For dataset"livestock_production"
, it can also be"municipality"
-
time_period: picks the years for which the data will be downloaded:
- For datasets
"agricultural_land_area"
,"agricultural_producer_condition"
,"animal_products"
, and"vegetable_production_area"
, it can be one of 1920, 1940, 1950, 1960, 1970, 1975, 1980, 1985, 1995, or 2006. - For datasets
"vegetable_production_permanent"
and"vegetable_production_permanent"
, it can only be from 1940 onwards - For datasets
"agricultural_area_use"
,"agricultural_employees_tractors"
,"animal_production"
, it can only be from 1970 onwards - For dataset
"livestock_production"
, it can only be 2017
- For datasets
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download total land area data at the country level in year 2006
data <- load_censoagro(
dataset = "agricultural_land_area",
raw_data = TRUE,
geo_level = "country",
time_period = 2006
)
# Download temporary production crops data by state (geo_level = "state") in year 2006
# in portuguese (language = "pt")
data <- load_censoagro(
dataset = "vegetable_production_temporary",
raw_data = FALSE,
geo_level = "state",
time_period = 1996,
language = "pt"
)
Social Data
IPS
Loads information on the social and environmental performance of the Legal Amazon.
Data from the Amazon Social Progress Index, an initiative from Imazon with support from the Social Progress Imperative that measures the social and environmental progress of its locations. Namely, the 772 municipalities in the Amazon region. Survey is done at the municipal level.
Options:
-
dataset:
"all"
,"life_quality"
,"sanit_habit"
,"violence"
,"educ"
,"communic"
,"mortality"
, or"deforest"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
time_period: can be 2014, 2018, 2021, 2023 or a vector with some combination thereof
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download raw data from 2014
data <- load_ips(dataset = "all", raw_data = TRUE, time_period = 2014)
# Download treated deforest data from 2018 in portuguese
data <- load_ips(
dataset = "deforest", raw_data = FALSE,
time_period = 2018, language = "pt"
)
DATASUS
DATASUS is the IT department of SUS – the Brazilian Unified Health System. They provide data on health establishments, mortality, access to health services and several health indicators nationwide. This function allows for an easy download of several DATASUS raw datasets, and also cleans the data in a couple of datasets. The sections below explains each avaliable dataset.
Options:
-
dataset:
"datasus_sim_do"
has SIM-DO mortality data- Possible subsets of SIM-DO are
"datasus_sim_dofet"
(Fetal),"datasus_sim_doext"
(External causes),"datasus_sim_doinf"
(Children),"datasus_sim_domat"
(Maternal) "datasus_sih"
has SIH hospitalization data."datasus_cnes_lt"
has data on the number of hospital beds.- further subsets of CNES are listed later, but those only allow for the download of raw data.
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data. Only effective for SIM-DO and subsets, SIH, and CNES-LT.
-
keep_all: only applies when raw_data is FALSE. There are two options:
TRUE
: keeps all original variables, adding variable labels and possibly constructing extra variables.FALSE
: aggregates data at the municipality, thereby losing individual-level data, and only keeping aggregate measures.
-
time_period: picks the years for which the data will be downloaded
-
states: a vector of states by which to filter the data. Only works for datasets whose data is provided in separate files by state.
-
language: you can choose between Portuguese
("pt")
and English("eng")
DATASUS - SIM (System of Mortality Information)
Each original SIM data file contains rows corresponding to a declaration
of death (DO), and columns with several characteristics of the person,
the place of death, and the cause of death. The data comes from the main
SIM-DO (Declarations of Death) dataset, which goes by the option
"datasus_sim_do"
. There are also 4 subsets of SIM-DO, namely SIM-DOFET
(Fetal), SIM-DOMAT (Maternal), SIM-DOINF (Children), and SIM-DOEXT
(External Causes), with corresponding dataset options
"datasus_sim_dofet", "datasus_sim_domat", "datasus_sim_doinf", "datasus_sim_doext"
.
Note that only SIM-DO provides separate files for each state, so all
other dataset options always contain data from the whole country.
Below is an example of downloading the raw data, and also using the
raw_data = FALSE
option to obtain treated data. When this option is
selected, we create several variables for deaths from each cause, which
are encoded by their CID-10 codes. The function then returns, by
default, the aggregated data of mortality sources at the municipality
level. In this process, all the individual information such as age, sex,
race, and schooling are lost, so we also offer the option of
keep_all = TRUE
, which creates all the indicator variables for cause
of death, adds variable labels, and does not aggregate, thereby keeping
all individual-level variables.
Examples:
library(datazoom.amazonia)
# download raw data for the year 2010 in the state of AM.
data <- load_datasus(
dataset = "datasus_sim_do",
time_period = 2010,
states = "AM",
raw_data = TRUE
)
# download treated data with the number of deaths by cause in AM and PA.
data <- load_datasus(
dataset = "datasus_sim_do",
time_period = 2010,
states = c("AM", "PA"),
raw_data = FALSE
)
# download treated data with the number of deaths by cause in AM and PA
# keeping all individual variables.
data <- load_datasus(
dataset = "datasus_sim_do",
time_period = 2010,
states = c("AM", "PA"),
raw_data = FALSE,
keep_all = TRUE
)
DATASUS - CNES (National Register of Health Establishments)
Provides information on health establishments, avaliable hospital beds, and active physicians. The data is split into 13 datasets: LT (Beds), ST (Establishments), DC (Complimentary data), EQ (Equipment), SR (Specialized services), HB (License), PF (Practitioner), EP (Teams), RC (Contractual Rules), IN (Incentives), EE (Teaching establishments), EF (Philanthropic establishments), and GM (Management and goals).
Raw data is avaliable for all of them using the dataset option
datasus_cnes_lt, datasus_cnes_st
, and so on, and treated data is only
avaliable for CNES - LT. When raw_data = FALSE
is chosen, we return
data on the number of total hospital beds and the ones avaliable through
SUS, which can be aggregated by municipality (with option
keep_all = FALSE
) or keeping all original variables
(keep_all = TRUE
).
Examples:
library(datazoom.amazonia)
# download treated data with the number of avaliable beds in AM and PA.
data <- load_datasus(
dataset = "datasus_cnes_lt",
time_period = 2010,
states = c("AM", "PA"),
raw_data = FALSE
)
DATASUS - SIH (System of Hospital Information)
Contains data on hospitalizations. Treated data only gains variable labels, with no extra manipulation. Beware that this is a much heavier dataset.
Examples:
library(datazoom.amazonia)
# download raw data
data <- load_datasus(
dataset = "datasus_sih",
time_period = 2010,
states = "AM",
raw_data = TRUE
)
# download data in a single tibble, with variable labels
data <- load_datasus(
dataset = "datasus_sih",
time_period = 2010,
states = "AM",
raw_data = FALSE
)
IEMA
Data from the Institute of Environment and Water Resources (IEMA), documenting the number of people without access to eletric energy throughout the Amazon region in the year 2018.
Options:
-
dataset:
"iema"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated data
data <- load_iema(raw_data = FALSE)
🔴 This function uses the googledrive
package to download data. In
case of authentication errors, see googledrive.
Population
Loads IBGE information on estimated population (2001-2006, 2008-2009, 2011-2021) or population (2007 and 2010) data. Data is available at country, state and municipality level and from 2001 to 2021.
Options:
-
dataset:
"population"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"state"
, or"municipality"
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated population data at the state level for 2010 to 2012
data <- load_population(
raw_data = FALSE,
geo_level = "state",
time_period = 2010:2012
)
Economic Data
COMEX
The Comex dataset gathers data extracted from Siscomex (Integrated System of Foreign Trade), which is a database containing information from all products that are imported to or exported from Brazil. Using data reported from the companies which are responsible for the process of transporting the products, the system adheres to internationally standardized nomenclatures, such as the Harmonized System and the Mercosul Common Nomenclature (which pertains to members of the Mercosul organization).
The data has a monthly frequency and is available starting from the year 1989. From 1989 to 1996, a different system of nomenclatures was adopted, but all conversions are available on a dictionary in the Comex website (https://www.gov.br/produtividade-e-comercio-exterior/pt-br/assuntos/comercio-exterior/estatisticas/base-de-dados-bruta/). Systems of nomenclature vary in the degree of detail in terms of the product involved, as well as other characteristics, such as unit and granularity of location.
Options:
-
dataset: there are four choices:
"comex_export_mun"
: selects exports data by municipality"comex_import_mun"
: selects imports data by municipality"comex_export_prod"
: selects exports data by producer"comex_import_prod"
: selects imports data by producer
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated (raw_data = FALSE) exports data by municipality (dataset = "comex_export_mun")
# from 2020 to 2021 (time_period = 2020:2021)
data <- load_br_trade(
dataset = "comex_export_mun",
raw_data = FALSE,
time_period = 2020:2021
)
# download treated(raw_data = FALSE) imports data by municipality (dataset = "comex_import_mun")
# from 2020 to 2021 (time_period = 2020:2021)
data <- load_br_trade(
dataset = "comex_import_mun",
raw_data = FALSE,
time_period = 2020:2021
)
BACI
Loads disaggregated data on bilateral trade flows for more than 5000 products and 200 countries. The data is from the CEPII and is built from data directly reported by each country to the United Nations Statistical Division (Comtrade).
As all of the data is packed into one single .zip file in the website, data on all years must be downloaded, even if not all of it is used. Therefore, downloading the data can take a long time.
Options:
-
dataset: there is one choice:
"HS92"
which follows the Harmonized System method
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated data for 2016 (takes a long time to download)
clean_baci <- load_baci(
raw_data = FALSE,
time_period = 2016
)
PIB-Munic
Loads IBGE information on gross domestic product at current prices, taxes, net of subsidies, on products at current prices and gross value added at current prices, total and by economic activity, and respective shares. Data is available at country, state and municipality level and from 2002 to 2018.
Options:
-
dataset:
"pibmunic"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"state"
, or"municipality"
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated municipal GDP data at the state level for 2010 to 2012
data <- load_pibmunic(
raw_data = FALSE,
geo_level = "state",
time_period = 2010:2012
)
CEMPRE
Employment, salary and firm data from IBGE’s Cadastro Central de Empresas (CEMPRE). Loads information on companies and other organizations and their respective formally constituted local units, registered with the CNPJ - National Register of Legal Entities. Data is available between 2006 and 2019.
Options:
-
dataset:
"cempre"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"state"
or"municipality"
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
-
sectors: defines if the data will be return separated by sectors (
sectors = TRUE
) or not (sectors = FALSE
)
Examples:
# Download raw data (raw_data = TRUE) at the country level
# from 2008 to 2010 (time_period = 2008:2010).
data <- load_cempre(
raw_data = TRUE,
geo_level = "country",
time_period = 2008:2010
)
# Download treted data (raw_data = FALSE) by state (geo_level = "state")
# from 2008 to 2010 (time_period = 2008:2010) in portuguese (language = "pt").
# In this example, data is split by sector (sectors = TRUE)
data <- load_cempre(
raw_data = FALSE,
geo_level = "state",
time_period = 2008:2010,
language = "pt",
sectors = TRUE
)
PAM
Municipal Agricultural Production (PAM, in Portuguese) is a nationwide annual survey conducted by IBGE (Brazilian Institute of Geography and Statistics) which provides information on agricultural products, such as quantity produced, area planted and harvested, average quantity of output and monetary value of such output. The products are divided in permanent and temporary farmed land, as well as dedicated surveys to the four products that yield multiple harvests a year (beans, potato, peanut and corn), which all sum to a total survey of 64 agricultural products (31 of temporary tillage and 33 of permanent tillage). Output, however, is only included in the dataset if the planted area occupies over 1 acre or if output exceeds one tonne.
Permanent farming is characterized by a cycle of long duration, whose harvests may be done multiple times across the years without the need of planting seeds again. Temporary farming, on the other hand, consists of cycles of short and medium duration, which after harvesting require planting seeds again.
The data also has multiple aggregation levels, such as nationwide, by region, mesoregion and microregion, as well as state and municipality.
The data available has a yearly frequency and is available from 1974 to the present, with the exception of the four multiple-harvest products, which are only available from 2003. More information can be found on this link (only in Portuguese).
Options:
-
dataset: See tables below
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"region"
,"state"
, or"municipality"
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
The datasets supported are shown in the tables below, made up of both the original databases and their narrower subsets. Note that downloading only specific crops is considerably faster.
<details> <summary> Full datasets provided by IBGE: </summary>dataset |
---|
all_crops |
temporary_crops |
permanent_crops |
corn |
potato |
peanut |
beans |
dataset | Name (pt) | Name (eng) |
---|---|---|
pineapple | Abacaxi | Pineapple |
alfafa | Alfafa Fenada | Alfafa Fenada |
cotton_herbaceous | Algodao Herbaceo (em Caroco) | Herbaceous Cotton (in Caroco) |
garlic | Alho | Garlic |
peanut_temporary | Amendoim (em Casca) | Peanuts (in Shell) |
rice | Arroz (em Casca) | Rice (in husk) |
oats | Aveia (em Grao) | Oats (in grain) |
sweet_potato | Batata Doce | Sweet potato |
potato_temporary | Batata Inglesa | English potato |
sugar_cane | Cana de Acucar | Sugar cane |
forage_cane | Cana para Forragem | Forage cane |
onion | Cebola | Onion |
rye | Centeio (em Grao) | Rye (in grain) |
barley | Cevada (em Grao) | Barley (in Grain) |
pea | Ervilha (em Grao) | Pea (in Grain) |
broad_bean | Fava (em Grao) | Broad Bean (in Grain) |
beans_temporary | Feijao (em Grao) | Beans (in Grain) |
tobacco | Fumo (em Folha) | Smoke (in Sheet) |
sunflower_seeds | Girassol (em Grao) | Sunflower (in Grain) |
jute_fiber | Juta (Fibra) | Jute (Fiber) |
linen_seeds | Linho (Semente) | Linen (Seed) |
malva_fiber | Malva (Fibra) | Malva (Fiber) |
castor_bean | Mamona (Baga) | Castor bean (Berry) |
cassava | Mandioca | Cassava |
watermelon | Melancia | watermelon |
melon | Melao | Melon |
corn_temporary | Milho (em Grao) | corn (in grain) |
ramie_fiber | Rami (Fibra) | Ramie (Fiber) |
soybean | Soja (em Grao) | Soybean (in grain) |
sorghum | Sorgo (em Grao) | Sorghum (in Grain) |
tomato | Tomate | Tomato |
wheat | Trigo (em Grao) | Wheat in grain) |
triticale | Triticale (em Grao) | Triticale (in grain) |
temporary_total | Total | Total |
dataset | Name (pt) | Name (eng) |
---|---|---|
avocado | Abacate | Avocado |
cotton_arboreo | Algodao Arboreo (em Caroco) | Arboreo cotton (in Caroco) |
acai | Acai | Acai |
olive | Azeitona | Olive |
banana | Banana (Cacho) | Banana (Bunch) |
rubber_coagulated_latex | Borracha (Latex Coagulado) | Rubber (Coagulated Latex) |
rubber_liquid_latex | Borracha (Latex Liquido) | Rubber (Liquid Latex) |
cocoa_beans | Cacau (em Amendoa) | Cocoa (in Almonds) |
coffee_total | Cafe (em Grao) Total | Coffee (in Grain) Total |
coffee_arabica | Cafe (em Grao) Arabica | Cafe (in Grao) Arabica |
coffee_canephora | Cafe (em Grao) Canephora | Cafe (in Grain) Canephora |
cashew | Caju | Cashew |
khaki | Caqui | Khaki |
cashew_nut | Castanha de Caju | Cashew Nuts |
india_tea | Cha da India (Folha Verde) | India Tea (Leaf) |
coconut | Coco da Baia | Coconut |
coconut_bunch | Dende (Cacho de Coco) | Coconut Bunch |
yerba_mate | Erva Mate (Folha Verde) | Mate Herb (Leaf) |
fig | Figo | Fig |
guava | Goiaba | Guava |
guarana_seeds | Guarana (Semente) | Guarana (Seed) |
orange | Laranja | Orange |
lemon | Limao | Lemon |
apple | Maca | Apple |
papaya | Mamao | Papaya |
mango | Manga | Mango |
passion_fruit | Maracuja | Passion fruit |
quince | Marmelo | Quince |
walnut | Noz (Fruto Seco) | Walnut (Dry Fruit) |
heart_of_palm | Palmito | Palm heart |
pear | Pera | Pear |
peach | Pessego | Peach |
black_pepper | Pimenta do Reino | Black pepper |
sisal_or_agave | Sisal ou Agave (Fibra) | Sisal or Agave (Fiber) |
tangerine | Tangerina | Tangerine |
tung | Tungue (Fruto Seco) | Tung (Dry Fruit) |
annatto_seeds | Urucum (Semente) | Annatto (Seed) |
grape | Uva | Grape |
permanent_total | Total | Total |
Examples:
# download treated data at the state level from 2010 to 2011 for all crops
data <- load_pam(
dataset = "all_crops",
raw_data = FALSE,
geo_level = "state",
time_period = 2010:2011,
language = "eng"
)
PEVS
Loads information on the amount and value of the production of the exploitation of native plant resources and planted forest massifs, as well as existing total and harvested areas of forest crops.
Data is from the Silviculture and Forestry Extraction Production (PEVS, in Portuguese), a nationwide annual survey conducted by IBGE (Brazilian Institute of Geography and Statistics). The data also has multiple aggregation levels, such as nationwide, by region, mesoregion and microregion, as well as state and municipality.
The data available has a yearly frequency and is available from 1986 to the present, with the exception of the data on total area for production, which are only available from 2013 onwards. More information can be found in this link.
Options:
-
dataset: there are three choices:
"pevs_forest_crops"
: provides data related to both quantity and value of the forestry activities. The data goes from 1986 to 2019 and it is divided by type of product."pevs_silviculture"
: provides data related to both quantity and value of the silviculture. The data goes from 1986 to 2019 and it is divided by type of product."pevs_silviculture_area"
: total existing area used for silviculture in 12/31.The data goes from 2013 to 2019 and it is divided by forestry species.
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"region"
,"state"
, or"municipality"
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated (raw_data = FALSE) silviculture data (dataset = 'pevs_silviculture')
# by state (geo_level = 'state') from 2012 (time_period = 2012)
# in portuguese (language = "pt")
data <- load_pevs(
dataset = "pevs_silviculture",
raw_data = FALSE,
geo_level = "state",
time_period = 2012,
language = "pt"
)
# Download raw (raw_data = TRUE) forest crops data by region
# from 2012 to 2013 in english
data <- load_pevs(
dataset = "pevs_forest_crops",
raw_data = TRUE,
geo_level = "region",
time_period = 2012:2013
)
PPM
Data on livestock inventories (e.g:cattle, pigs and hogs) in Brazilian Municipalities, as well as amount and value of animal products (e.g:output of milk, hen eggs, quail eggs, honey).
The periodicity of the survey is annual. The geographic coverage is national, with results released for Brazil, Major Regions, Federation Units, Mesoregions, Microregions and Municipalities.
The data available has a yearly frequency and is available from 1974 to the present. More information can be found in this link.
Options:
-
dataset: there are five possible choices:
-
"ppm_livestock_inventory"
: livestock herds, with the number of animals by species. -
"ppm_sheep_farming"
: total sheep reared and specific shearing data. -
"ppm_animal_origin_production"
: animal-origin products like milk, eggs, and honey, as well as wool and other derivatives. -
"ppm_cow_farming"
: dairy cow farming, with the number of milked cows, their geographic distribution, and productivity rates. -
"ppm_aquaculture"
: aquaculture activities, including fish farming, shrimp farming, and mollusk farming.
-
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
geo_level:
"country"
,"region"
,"state"
, or"municipality"
-
time_period: picks the years for which the data will be downloaded
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated data (raw_data = FALSE) about aquaculture (dataset = "ppm_aquaculture")
# from 2013 to 2015 (time_period = 2013:2015) in english
# with the level of aggregation being the country (geo_level = "country").
data <- load_ppm(
dataset = "ppm_aquaculture",
raw_data = FALSE,
geo_level = "country",
time_period = 2013:2015
)
# Download raw data about sheep farming by state from 1980 to 1995 in portuguese (language = "pt")
data <- load_ppm(
dataset = "ppm_sheep_farming",
raw_data = TRUE,
geo_level = "state",
time_period = 1980:1995,
language = "pt"
)
SIGMINE
Loads information the mines being explored legally in Brazil, including their location, status, product being mined and area in square meters etc. Survey is done at municipal and state level. The National Mining Agency (ANM) is responsible for this survey.
Options:
-
dataset:
"sigmine_active"
-
raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
-
language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# Download treated data (raw_data = FALSE) in portuguese (language = "pt").
data <- load_sigmine(
dataset = "sigmine_active",
raw_data = FALSE,
language = "pt"
)
ANEEL
Loads data from the National Electrical Energy Agency (ANEEL), a Brazilian independent federal agency linked to the Ministry of Mines and Energy (MME). ANEEL works to provide favorable conditions for the Electrical Energy Market to develop with balance and for the benefit of society.
As for now, there are three different datasets available for download: the Energy Development Budget and the Energy Generation.
Energy Development Budget
The Energy Development Budget dataset showcases the Energy Development Account’s (CDE) anual budget expenses. The CDE is designed to promote the Brazilian energy development and is managed by the Electrical Energy Commercialization Chamber (CCEE).
The dataset makes available the year of the observation – from 2013 to 2022 –, the type of expense, its value in R$ (Reais) and its share over the total amount of CDE budget expenses on the year*.
*Note that ‘share_of_total’ values sum to 1 for each year available.
Energy Generation
The Energy Generation dataset showcases information about ANEEL’s Generation Informations System (SIGA). SIGA provides information about the Brazilian electrical energy generation installed capacity.
The dataset provides information at the individual venture/entity level. It contains information about the power, source, stage, type of permission, origin and final fuel with which each venture/entity operates, as well as other legal, technical and geographical information.* Operation start dates contained in the dataset go as far back as 1924 up to 2022.
* For more details on each variable, access This link and select “Manual do Usuario”.
Energy Enterprises
The Energy Enterprises dataset showcases information about distributed micro and mini generators, covered by the Regulatory Resolution nº 482/2012. The list of projects is classified by variables that make up their identification, namely: connected distributor, project code, numerical nucleus of the project code, owner name, production class, subgroup, name of the owner, number of consumer units that receive credits, connection date, type of generating unit, source, installed power, municipality, and federative unit where it is located.
The data is expressed in quantities and installed power in kW (kilowatt). The quantity corresponds to the number of distributed micro or mini generators installed in the specified period. The installed power is defined by the sum of the nominal active electric power of the generating units.
* For more details on each variable, access This link and select “Dicionário de dados”.
Options:
- dataset: there are three choices:
"energy_development_budget"
: government spending towards energy sources"energy_generation"
: energy generation by entity/corporation"energy_enterprises_distributed"
: distributed micro and mini generators
- raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
- language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated data about energy generation
clean_aneel <- load_aneel(
dataset = "energy generation",
raw_data = FALSE
)
EPE
Loads data from the Energy Research Company (EPE), a Brazilian public company that works closely with the Brazilian Ministry of Mines and Energy (MME) and other agencies to ensure the sustainable development of Brazil’s energy infrastructure. EPE’s duty on that mission is to support MME with quality research and studies in order to aid Brazil’s energy infrastructure planning.
As for now, there are two different datasets available for download: the Energy Consumption Per Class and the National Energy Balance. Both of them were obtained from the EPE website.
Energy Consumption Per Class
The Energy Consumption Per Class dataset provides monthly data about energy consumption and consumers from 2004 to 2022, for each class of energy consumption.
The different classes are Total consumption (and consumers), Industrial consumption (and consumers), Residential consumption (and consumers), Commercial consumption (and consumers), Captive consumption* and Other consumption (and consumers).**
*Note that there is no consumer data for ‘Captive’ class at all.
**There is also no consumer data for ‘Industrial’, ‘Commercial’ and ‘Other’ classes when the geographical level is ‘Subsystem’ or ‘Region’.
There are three different aggregation levels: The Region level encompasses the five Brazilian geographical regions (North, Northeast, Midwest, Southeast and South). The Subsystem level encompasses the five Brazilian Electric Subsystems (North, Northeast, Southeast/Midwest, South, Isolated Systems). The State level encompasses the 26 Brazilian States and the Federal District.
National Energy Balance
The National Energy Balance is a thorough and extensive research developed and published by EPE that contains useful data about energy consumption, generation, exportation and many more subjects.
As for now, the National Energy Balance dataset provides yearly data about energy generation per source of production. The sources can be divided into two groups: the renewable sources (hydro, wind, solar, nuclear, thermal, sugar_cane_bagasse, firewood, black_liquor) and the non-renewable sources (steam_coal, natural_gas, coke_oven_gas, fuel_oil, diesel).
The dataset has information at the Brazilian state level, including the Federal District, from 2011 to 2021 and also indicates whether the state is in the Legal Amazon or not.
Options:
- dataset: there are two choices:
"energy_consumption_per_class"
: monthly energy consumption and consumers by State, Region or Electric Subsystem"national_energy_balance"
: yearly energy generation per source, by State
- raw_data: there are two options:
TRUE
: if you want the data as it is originally.FALSE
: if you want the treated version of the data.
- geo_level: only applies to the
"energy_consumption_per_class"
dataset."state"
"subsystem"
- language: you can choose between Portuguese
("pt")
and English("eng")
Examples:
# download treated data about energy consumption at the state level
clean_epe <- load_epe(
dataset = "energy_consumption_per_class",
geo_level = "state",
raw_data = FALSE
)
Other tools
Legal Amazon Municipalities
Many of our functions use a dataset with Brazilian municipalities, their
municipality codes, whether they belong to the Legal Amazon, their
state, and some more variables. It was constructed from the IBGE
spreadsheet
with Legal Amazon municipalities, along with a data frame from the
‘geobr’ package. For more information on the columns, run
??datazoom.amazonia::municipalities
.
# load Brazilian municipalities dataset
data <- datazoom.amazonia::municipalities
<a name="googledrive"></a> The ‘googledrive’ package
For some of our functions, the original data is stored in Google Drive
and exceeds the file size limit for which direct downloads are possible.
As a result, the googledrive
package is required to download the data
though the Google Drive API and run the function.
The first time the package is called, it requires you to link your Google account and grant permissions to be able to download data through the Google Drive API.
You must tick all boxes when the permissions page opens, or else the following error will occur:
# Error in `gargle_abort_request_failed()`:
# ! Client error: (403) Forbidden
# Insufficient Permission: Request had insufficient authentication scopes.
# • domain: global
# • reason: insufficientPermissions
# • message: Insufficient Permission: Request had insufficient authentication
# scopes.
# Run `rlang::last_error()` to see where the error occurred.
For further information, click here to access the official package page.
Credits
DataZoom is developed by a team at Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Department of Economics. Our official website is at: https://www.econ.puc-rio.br/datazoom/.
To cite package datazoom.amazonia
in publications use:
Data Zoom (2023). Data Zoom: Simplifying Access To Brazilian Microdata.
https://www.econ.puc-rio.br/datazoom/english/index.html
A BibTeX entry for LaTeX users is:
@Unpublished{DataZoom2023,
author = {Data Zoom},
title = {Data Zoom: Simplifying Access To Brazilian Microdata},
url = {https://www.econ.puc-rio.br/datazoom/english/index.html},
year = {2023},
}