Official Site
Please refer to the official site for this repository for visualizations and other relevant information: https://health.google.com/covid-19/open-data/
Repository No Longer Updated
As of September 15, 2022, we will be turning off real-time updates in this repository, and converting the repository to a retrospective one. The data will continue to be available without interruption for the foreseeable future at the existing location, but it will not be updated further. Users who wish to continue to receive updates are encouraged to inspect our data sources, or clone the code and run the data pipelines locally.
COVID-19 Open-Data
This repository attempts to assemble the largest COVID-19 epidemiological database in addition to a powerful set of expansive covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more. Moreover, the data merges daily time series from more than 20,000 global sources at a fine spatial resolution, using a consistent set of region keys. All regions are assigned a unique location key, which resolves discrepancies between ISO / NUTS / FIPS codes, etc. The different aggregation levels are:
- 0: Country
- 1: Province, state, or local equivalent
- 2: Municipality, county, or local equivalent
- 3: Locality which may not follow strict hierarchical order, such as "city" or "nursing homes in X location"
There are multiple types of data:
- Outcome data Y(i,t), such as cases, tests, hospitalizations, deaths and recoveries, for region i and time t
- Static covariate data X(i), such as population size, health statistics, economic indicators, geographic boundaries
- Dynamic covariate data X(i,t), such as mobility, search trends, weather, and government interventions
The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context, which can be easily merged due to the use of consistent geographic (and temporal) keys as it is done for the aggregated table.
Table | Keys<sup>1</sup> | Content | URL | Source<sup>2</sup> |
---|---|---|---|---|
Aggregated | [key][date] | Flat, compressed table with records from (almost) all other tables joined by date and/or key; see below for more details | aggregated.csv | All tables below |
Index | [key] | Various names and codes, useful for joining with other datasets | index.csv, index.json | Wikidata, DataCommons, Eurostat |
Demographics | [key] | Various (current<sup>3</sup>) population statistics | demographics.csv, demographics.json | Wikidata, DataCommons, WorldBank, WorldPop, Eurostat |
Economy | [key] | Various (current<sup>3</sup>) economic indicators | economy.csv, economy.json | Wikidata, DataCommons, Eurostat |
Epidemiology | [key][date] | COVID-19 cases, deaths, recoveries and tests | epidemiology.csv, epidemiology.json | Various<sup>2</sup> |
Emergency Declarations | [key][date] | Government emergency declarations and mitigation policies | lawatlas-emergency-declarations.csv | LawAtlas Project |
Geography | [key] | Geographical information about the region | geography.csv, geography.json | Wikidata |
Health | [key] | Health indicators for the region | health.csv, health.json | Wikidata, WorldBank, Eurostat |
Hospitalizations | [key][date] | Information related to patients of COVID-19 and hospitals | hospitalizations.csv, hospitalizations.json | Various<sup>2</sup> |
Mobility | [key][date] | Various metrics related to the movement of people.<br/><br/>To download or use the data, you must agree to the Google Terms of Service. | mobility.csv, mobility.json | |
Search Trends | [key][date] | Trends in symptom search volumes due to COVID-19.<br/><br/>To download or use the data, you must agree to the Google Terms of Service. | google-search-trends.csv | |
Vaccination Access | [place_id] | Metrics quantifying access to COVID-19 vaccination sites.<br/><br/>To download or use the data, you must agree to the Google Terms of Service. | facility-boundary-us-all.csv | |
Vaccination Search | [key][date] | Trends in Google searches for COVID-19 vaccination information. <br/><br/> To download or use the data, you must agree to the Google Terms of Service. | Global-vaccination-search-insights.csv | |
Vaccinations | [key][date] | Trends in persons vaccinated and population vaccination rate regarding various COVID-19 vaccines. | vaccinations.csv | |
Government Response | [key][date] | Government interventions and their relative stringency | oxford-government-response.csv, oxford-government-response.json | University of Oxford |
Weather | [key][date] | Dated meteorological information for each region | weather.csv | NOAA |
WorldBank | [key] | Latest record for each indicator from WorldBank for all reporting countries | worldbank.csv, worldbank.json | WorldBank |
By Age | [key][date] | Epidemiology and hospitalizations data stratified by age | by-age.csv, by-age.json | Various<sup>2</sup> |
By Sex | [key][date] | Epidemiology and hospitalizations data stratified by sex | by-sex.csv, by-sex.json | Various<sup>2</sup> |
<sup>1</sup> key is a unique string for the specific geographical region built from a combination of codes such as ISO 3166, NUTS, FIPS and other local equivalents.
<sup>2</sup> Refer to the data sources for specifics about each data source and the associated terms of use.
<sup>3</sup> Datasets without a date column contain the most recently reported information for each datapoint to date.
For more information about how to use these files see the section about using the data, and for more details about each dataset see the section about understanding the data.
Why another dataset?
There are many other public COVID-19 datasets. However, we believe this dataset is unique in the way that it merges multiple global sources, at a fine spatial resolution, using a consistent set of region keys, in a way that we hope makes it easy to use. Most importantly, we are committed to transparency regarding open, public, and licensed data sources. Lastly, the code for ingesting and merging the data is easy to understand and modify.
Explore the data
A simple visualization tool was built to explore the Open COVID-19 datasets, the Open COVID-19 Explorer: <img src="https://github.com/open-covid-19/explorer/raw/master/screenshots/explorer.png" alt="drawing" width="200"/> <br> A variety of other community contributed visualization tools are listed below.
- See the COVID19 Data Block made by the Looker team.
- If you want to see interactive charts with a unique UX, don't miss what @Mahks built using the Open COVID-19 dataset.
- You can also check out the great work of @quixote79, a MapBox-powered interactive map site.
- Experience clean, clear graphs with smooth animations thanks to the work of @jmullo.
- Become an armchair epidemiologist with the COVID-19 timeline simulation tool built by @LeviticusMB.
- Whether you want an interactive map, compare stats or look at charts, @saadmas has you covered with a COVID-19 Daily Tracking site.
- Compare per-million data at Omnimodel thanks to @OmarJay1.
- Look at responsive, comprehensive charts thanks to the work of @davidjohnstone.
- Reproduction Live lets you track COVID-19 outbreaks in your region and visualise the spread of the virus over time.
Use the data
The data is available as CSV and JSON files, which are published in Google Cloud Storage so they can be served directly to Javascript applications without the need of a proxy to set the correct headers for CORS and content type.
To make the data as easy to use as possible, there is an aggregated table which contains the columns of all other tables joined by key and date. However, performance-wise, it may be better to download the data separately and join the tables locally.
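As a rough illustration of joining tables locally, the sketch below left-joins a static table onto a dated table using the shared region key. It uses only the Python standard library and small in-memory samples with made-up values; the population column name and the join_on_key helper are assumptions for illustration, not part of this repository.

```python
import csv
import io

# Made-up, in-memory stand-ins for epidemiology.csv and demographics.csv.
# The "population" column name is an assumption for illustration.
EPIDEMIOLOGY_CSV = """key,date,cumulative_confirmed
AU,2020-03-01,27
AU,2020-03-02,30
US,2020-03-01,75
"""
DEMOGRAPHICS_CSV = """key,population
AU,25000000
US,330000000
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_key(dated_rows, static_rows, key_column="key"):
    """Left-join static rows onto dated rows using the shared region key."""
    static_by_key = {row[key_column]: row for row in static_rows}
    joined = []
    for row in dated_rows:
        merged = dict(row)
        # Regions missing from the static table keep only their dated columns.
        merged.update(static_by_key.get(row[key_column], {}))
        joined.append(merged)
    return joined


joined = join_on_key(read_rows(EPIDEMIOLOGY_CSV), read_rows(DEMOGRAPHICS_CSV))
for row in joined:
    print(row)
```

The csv module returns every field as a string, so numeric columns still need converting before analysis.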
Each region has its own version of the aggregated table, so you can pull all the data for a specific region using a single endpoint. The URL for each region is:
- Data for key in CSV format: https://storage.googleapis.com/covid19-open-data/v3/location/${key}.csv
- Data for key in JSON format: https://storage.googleapis.com/covid19-open-data/v3/location/${key}.json
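The URL pattern above can be wrapped in a small helper. This is a minimal sketch: the location_url function name is hypothetical, and it only builds the string without checking that the region exists.

```python
BASE_URL = "https://storage.googleapis.com/covid19-open-data/v3"


def location_url(key, fmt="csv"):
    """Build the per-region aggregated table URL for a location key."""
    if fmt not in ("csv", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}/location/{key}.{fmt}"


print(location_url("AU"))
print(location_url("AU", fmt="json"))
```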
Each table has a full version as well as subsets with only the last day of data.
The full version is accessible at the URL described in the table above.
The subsets can be found by inserting latest into the path. For example, the subsets of the epidemiology table are available at the following locations:
- Time series: https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv
- Latest: https://storage.googleapis.com/covid19-open-data/v3/latest/epidemiology.csv
Please note that the aggregated table is not compressed for the latest subset, so the URL is https://storage.googleapis.com/covid19-open-data/v3/latest/aggregated.csv.
Note that the latest version contains the last non-null record for each key. All of the above listed tables have a corresponding JSON version; simply replace csv with json in the link.
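The path rules described above (insert latest for the subset, swap csv for json) could be captured as follows. The table_url helper is a hypothetical name, and the sketch deliberately does not handle the compressed full aggregated table, whose file name may differ.

```python
BASE_URL = "https://storage.googleapis.com/covid19-open-data/v3"


def table_url(table, latest=False, fmt="csv"):
    """Build the URL of a table, optionally for the latest subset."""
    if fmt not in ("csv", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    # The latest subset lives under an extra "latest" path segment.
    prefix = f"{BASE_URL}/latest" if latest else BASE_URL
    return f"{prefix}/{table}.{fmt}"


print(table_url("epidemiology"))
print(table_url("epidemiology", latest=True))
print(table_url("epidemiology", fmt="json"))
```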
If you are trying to use this data alongside your own datasets, then you can use the Index table to get access to the ISO 3166 / NUTS / FIPS code, although administrative subdivisions are not consistent among all reporting regions. For example, for the intra-country reporting, some EU countries use NUTS2, others NUTS3 and many ISO 3166-2 codes.
You can find several examples in the examples subfolder with code showcasing how to load and analyze the data for several programming environments. If you want the short version, here are a few snippets to get started.
BigQuery
This dataset is part of the BigQuery Public Datasets Program, so you may use BigQuery to run SQL queries directly from the online query editor free of charge.
Google Colab
You can use Google Colab if you want to run your analysis without having to install anything on your computer. Simply go to this URL: https://colab.research.google.com/github/GoogleCloudPlatform/covid-19-open-data.
Google Sheets
You can import the data directly into Google Sheets, as long as you stay within the size limits. For instance, the following formula loads the latest epidemiology data into the current sheet:
=IMPORTDATA("https://storage.googleapis.com/covid19-open-data/v3/latest/epidemiology.csv")
Note that Google Sheets has a size limitation, so only data from the latest subfolder can be imported automatically. To work around that, simply download the file and import it via the File menu.
R
If you prefer R, then this is all you need to do to load the epidemiology data:
data <- read.csv("https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv")
Python
In Python, you need to have the package pandas installed to get started:
import pandas
data = pandas.read_csv("https://storage.googleapis.com/covid19-open-data/v3/epidemiology.csv")
jQuery
Loading the JSON file using jQuery can be done directly from the output folder. This code snippet loads the epidemiology table into the data variable:
$.getJSON("https://storage.googleapis.com/covid19-open-data/v3/epidemiology.json", data => { ... });
Powershell
You can also use Powershell to get the latest data for a country directly from the command line, for example to query the latest epidemiology data for Australia:
Invoke-WebRequest 'https://storage.googleapis.com/covid19-open-data/v3/latest/epidemiology.csv' | ConvertFrom-Csv | `
where key -eq 'AU' | select date,cumulative_confirmed,cumulative_deceased,cumulative_recovered
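For comparison, the same kind of query (filter by key, then select a few columns) can be sketched in Python with the standard library. The sample rows below are made up so the snippet runs offline, and select_region is a hypothetical helper name.

```python
import csv
import io

# Made-up rows shaped like the latest epidemiology table.
SAMPLE_LATEST_CSV = """key,date,cumulative_confirmed,cumulative_deceased,cumulative_recovered
AU,2020-09-01,25819,657,21469
US,2020-09-01,6000000,183000,
"""

COLUMNS = ("date", "cumulative_confirmed", "cumulative_deceased", "cumulative_recovered")


def select_region(csv_text, key, columns=COLUMNS, key_column="key"):
    """Return the requested columns for every row matching the region key."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {column: row[column] for column in columns}
        for row in reader
        if row[key_column] == key
    ]


australia = select_region(SAMPLE_LATEST_CSV, "AU")
print(australia)
```

To run this against the published file, one option is to fetch the text with urllib.request.urlopen(url).read().decode("utf-8") and pass it in place of the sample.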
Understand the data
Make sure that you are using the URL linked in the table above and not the raw GitHub file. The latter is subject to change at any moment in incompatible ways, and due to the configuration of GitHub's raw file server you may run into caching issues.
Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported.
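When parsing the CSV files yourself, it helps to keep that distinction intact. The sketch below assumes that a null appears as an empty CSV field, which is an assumption about the file encoding rather than something stated here; the sample values are made up and parse_count is a hypothetical helper.

```python
import csv
import io

# Made-up sample: a reported zero, a missing value, and a reported count.
SAMPLE_CSV = """key,date,cumulative_deceased
AU,2020-03-01,0
AU,2020-03-02,
AU,2020-03-03,1
"""


def parse_count(text):
    """Return None for an empty (missing) field, otherwise the integer value."""
    return None if text == "" else int(text)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
values = [parse_count(row["cumulative_deceased"]) for row in rows]

# Keep only reported values; a reported zero is data, a missing value is not.
reported = [value for value in values if value is not None]
missing_count = len(values) - len(reported)
print(values, reported, missing_count)
```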
For information about each table, see the corresponding documentation linked above.
Aggregated table
Flat table with records from all other tables joined by key and date. See above for links to the documentation for each individual table. Due to technical limitations, not all tables can be included as part of this aggregated table.
Notes about the data
For countries where both country-level and subregion-level data is available, the entry which has a null value for the subregion level columns in the index table indicates upper-level aggregation. For example, if a data point has values {country_code: US, subregion1_code: CA, subregion2_code: null, ...} then that record will have data aggregated at the subregion1 (i.e. state/province) level. If subregion1_code were null, then it would be data aggregated at the country level.
Another way to tell the level of aggregation is the aggregation_level of the index table; see the schema documentation for more details about how to interpret it.
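The null-based rule above can be written as a small function. This is a sketch under the assumption that nulls are represented as None after loading; infer_aggregation_level is a hypothetical name, the county code in the usage lines is illustrative only, and level 3 localities cannot be detected from these columns alone.

```python
def infer_aggregation_level(record):
    """Infer the aggregation level from which subregion codes are null."""
    if record.get("subregion1_code") is None:
        return 0  # country level
    if record.get("subregion2_code") is None:
        return 1  # province, state, or local equivalent
    return 2  # municipality, county, or local equivalent


state = {"country_code": "US", "subregion1_code": "CA", "subregion2_code": None}
country = {"country_code": "US", "subregion1_code": None, "subregion2_code": None}
# The county code below is an illustrative value, not taken from the index table.
county = {"country_code": "US", "subregion1_code": "CA", "subregion2_code": "06085"}

print(infer_aggregation_level(state))
print(infer_aggregation_level(country))
print(infer_aggregation_level(county))
```

Where the aggregation_level column is available, reading it directly is likely the simpler option.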
Please note that the country-level data and the region-level data sometimes come from different sources, so adding up all region-level values may not exactly equal the reported country-level value. See the data loading tutorial for more information.
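One way to check for such a discrepancy is to compare the country-level value with the sum of its level 1 regions. The sketch below assumes the rows already combine index and epidemiology columns for a single date; the keys XX, XX_A and XX_B, the values, and the region_sum_gap helper are all hypothetical.

```python
# Made-up rows for one date, combining index and epidemiology columns.
ROWS = [
    {"key": "XX", "country_code": "XX", "aggregation_level": 0, "cumulative_confirmed": 100},
    {"key": "XX_A", "country_code": "XX", "aggregation_level": 1, "cumulative_confirmed": 60},
    {"key": "XX_B", "country_code": "XX", "aggregation_level": 1, "cumulative_confirmed": 38},
]


def region_sum_gap(rows, country_code, column):
    """Return country-level value minus the sum of level 1 regions, or None."""
    country_total = None
    region_total = 0
    for row in rows:
        if row["country_code"] != country_code or row[column] is None:
            continue
        if row["aggregation_level"] == 0:
            country_total = row[column]
        elif row["aggregation_level"] == 1:
            region_total += row[column]
    if country_total is None:
        return None  # no country-level record to compare against
    return country_total - region_total


gap = region_sum_gap(ROWS, "XX", "cumulative_confirmed")
print(gap)
```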
Data updates
The data for each table is updated at least daily. Individual tables, for example Epidemiology, have fresher data than the aggregated table and are updated multiple times a day. Each individual data source has its own update schedule and some are not updated at regular intervals; the data tables hosted here only reflect the latest data published by the sources.
Contribute
Technical contributions to the data extraction pipeline are welcome; take a look at the source directory for more information.
If you spot an error in the data, feel free to open an issue on this repository and we will review it.
If you do something with this data, for example a research paper or work related to visualization or analysis, please let us know!
For Data Owners
We have carefully checked the license and attribution information on each data source included in this repository, and in many cases have contacted the data owners directly to ask how they would like to be attributed.
If you are the owner of a data source included here and would like us to remove data, add or alter an attribution, or add or alter license information, please open an issue on this repository and we will happily consider your request.
Licensing
The output data files are published under the CC BY license. All data is subject to the terms of agreement individual to each data source; refer to the sources of data table for more details. All other code and assets are published under the Apache License 2.0.
Sources of data
All data in this repository is retrieved automatically. When possible, data is retrieved directly from the relevant authorities, like a country's ministry of health. For a list of individual data sources, please see the documentation for the individual tables linked at the top of this page.
Running the data extraction pipeline
See the source documentation for more technical details.
Acknowledgments and collaborations
This project has been done in collaboration with FinMango, which provided great insights about the impact of the pandemic on the local economies and also helped with research and manual curation of data sources for many regions including South Africa and US states.
Stratified mortality data for US states is provided by Imperial College London. Please refer to this list of maintainers and contributors for the individual acknowledgements.
The following persons have made significant contributions to this project:
- Oscar Wahltinez
- Kevin Murphy
- Michael Brenner
- Matt Lee
- Anthony Erlinger
- Mayank Daswani
- Pranali Yawalkar
- Zack Ontiveros
- Ruth Alcantara
- Donny Cheung
- Aurora Cheung
- Chandan Nath
- Paula Le
- Ofir Picazo Navarro
Recommended citation
Please use the following when citing this project as a source of data:
@article{Wahltinez2020,
author = "O. Wahltinez and others",
year = 2020,
title = "COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2",
note = "Work in progress",
url = {https://goo.gle/covid-19-open-data},
}