COVID-19 Database for Research and Analysis
This repository contains tools to generate a COVID-19 database for research and analysis, and links to a pre-generated database. The database is a self-contained SQLite database that can be used on any platform.
The program in this library can be run on your machine to download data from the Internet and assemble your own database. The process takes approximately two minutes, and you can run it as often as you like to obtain the latest data. Alternatively, you can download a database that is generated daily.
Download the database
You can download a compressed database for yourself here: covid19db.zip.
This file is automatically regenerated daily.
Example uses
This data is used in the COVID-19 in Kansas project. It has graphs automatically updated daily with a unique perspective on various data.
Using the data
Besides the SQLite command-line tools, here are some other tips on using the data:
- The DB Browser for SQLite is a nice graphical explorer for SQLite.
- You can use SQLite in Microsoft Excel and LibreOffice. Search for information on doing so.
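The database can also be opened from most programming languages. As a minimal sketch, the Python standard library's sqlite3 module can list the tables and views a SQLite file contains. The demo table and view below are made-up stand-ins so the sketch runs anywhere; to explore the real data, connect to your downloaded or generated database file instead of ":memory:".

```python
import sqlite3


def list_tables_and_views(conn):
    """Return (type, name) pairs for every table and view, sorted by name."""
    rows = conn.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    )
    return [(kind, name) for kind, name in rows]


# Stand-in database (hypothetical objects, not the real schema).
# For the real data, use e.g. sqlite3.connect("covid19.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_raw (id INTEGER, value INTEGER)")
conn.execute("CREATE VIEW demo AS SELECT id, value FROM demo_raw")

for kind, name in list_tables_and_views(conn):
    print(kind, name)
```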
Please note that various included data sources request or require attribution. Please give credit to the original sources of data (e.g., The New York Times) and aggregators in your work.
Included data and sources
You can find a complete database schema in dbschema.rs. The views defined there are intended to be the primary way to access the database. A Rust API for sqlx is also provided for select tables. Direct source data download URLs are in loader.rs.
Here are the sources:
nytcounties is from the New York Times dataset. Julian dates and YYYY-MM-DD dates are added to the CSV source, along with population and a calculation of new cases/deaths. The raw data is in nytcounties_raw.
cdataset is from the COVID-19 derived datasets project, which includes data from Johns Hopkins University, the New York Times, and ECDC. This integrates the "combined" set, so you will almost certainly want to use a WHERE dataset='foo' in every query so that you use only a single dataset. Running select distinct dataset from cdataset order by dataset; will show you the available datasets. Please see the derived datasets link above for a description of the sources and the augmentation done there. Additional augmentation is done on reading in to this system:
- Counties are cross-referenced with their FIPS code, which is added to the cdataset view.
- A Julian date field is added for ease of computation. It simply increases by 1 for each day, and makes date-based arithmetic simpler in many cases.
- County populations were not previously provided, and are now populated in the factbook_population column using the Johns Hopkins data (see below).
- Counties that did not previously have a population present have the per-100k people calculations performed and added.
- The source data eliminated rows for a given dataset and location on days on which there were no new cases/deaths (all the delta values would be zero). For ease of tabulation, those rows are added back in, so a given data series for a given location should have a row present for every day.
- The source data used NULL instead of 0 for deltas. This has been corrected to 0 in these tables.
- Location data is pulled into a separate table, and brought back in via a view, cutting the size of the table on disk in half.
- Please note that this source is no longer updating daily.
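The advice about restricting every cdataset query to one dataset, and the convenience of a day-count date column, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the table cdataset_demo, its columns (dataset, location, date_julian, delta_confirmed), and the dataset names "jhu" and "nytimes" are hypothetical, and the day numbers come from SQLite's julianday() function, which may not match the numbering the real database uses. Check dbschema.rs for the actual names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the cdataset view; the real columns may differ.
conn.execute(
    "CREATE TABLE cdataset_demo ("
    " dataset TEXT, location TEXT, date_julian INTEGER, delta_confirmed INTEGER)"
)


def jd(iso_date):
    """Whole-number day count for an ISO date, via SQLite's julianday()."""
    return conn.execute(
        "SELECT CAST(julianday(?) AS INTEGER)", (iso_date,)
    ).fetchone()[0]


rows = [
    ("jhu", "Kansas", jd("2020-04-01"), 5),
    ("jhu", "Kansas", jd("2020-04-02"), 7),
    ("jhu", "Kansas", jd("2020-04-03"), 0),
    ("nytimes", "Kansas", jd("2020-04-01"), 6),
    ("nytimes", "Kansas", jd("2020-04-02"), 7),
]
conn.executemany("INSERT INTO cdataset_demo VALUES (?, ?, ?, ?)", rows)

# Which datasets are present in the combined table?
datasets = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT dataset FROM cdataset_demo ORDER BY dataset"
    )
]


def new_cases_in_window(dataset, location, end_date, days):
    """Sum deltas over the `days` days ending on end_date, for ONE dataset."""
    end = jd(end_date)
    return conn.execute(
        "SELECT COALESCE(SUM(delta_confirmed), 0) FROM cdataset_demo "
        "WHERE dataset = ? AND location = ? AND date_julian BETWEEN ? AND ?",
        (dataset, location, end - days + 1, end),
    ).fetchone()[0]


print(datasets)
print(new_cases_in_window("jhu", "Kansas", "2020-04-03", 3))
```

Because the date column is a plain integer, the window bounds are simple subtraction. Dropping the dataset filter from the query would add the rows of both sources together and double-count cases.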
loc_lookup is from the Johns Hopkins dataset, the bulk of which is already included above in cdataset. This table represents the UID_ISO_FIPS_LookUp_Table.csv file, which contains county-level population data that is integrated into cdataset or can be queried separately.
rtlive is from rt.live. Julian dates and YYYY-MM-DD dates are added to the CSV source; no other changes were made.
covidtracking is from the COVID Tracking Project data downloads. Julian dates and Y/M/D dates are added to the CSV source; no other changes were made.
- The COVID Tracking Project makes a separate US file available, which aggregates data to have one row per day across the entire USA. Instead of parsing another file, there is a view covidtracking_us that uses the data in covidtracking to present the same kind of view.
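The idea behind a view like covidtracking_us, deriving national rows from per-state rows instead of loading a second file, can be sketched as follows. This is not the project's actual view definition (see dbschema.rs for that); the table covidtracking_demo, the view covidtracking_us_demo, and the columns date, state, positive, and death are hypothetical names chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical per-state table: one row per state per day.
    CREATE TABLE covidtracking_demo (
        date TEXT, state TEXT, positive INTEGER, death INTEGER
    );
    INSERT INTO covidtracking_demo VALUES
        ('2020-04-01', 'KS', 10, 1),
        ('2020-04-01', 'MO', 20, 2),
        ('2020-04-02', 'KS', 15, 1),
        ('2020-04-02', 'MO', 30, 3);

    -- The view collapses the state rows into one row per day.
    CREATE VIEW covidtracking_us_demo AS
        SELECT date, SUM(positive) AS positive, SUM(death) AS death
        FROM covidtracking_demo
        GROUP BY date;
    """
)

us_rows = conn.execute(
    "SELECT date, positive, death FROM covidtracking_us_demo ORDER BY date"
).fetchall()
for row in us_rows:
    print(row)
```

A view stores no data of its own, so it always reflects whatever is currently in the underlying table.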
owid is from the Our World in Data COVID-19 dataset. Julian dates and Y/M/D dates are added to the CSV source.
Additional Resources
- https://source.opennews.org/articles/comparison-four-major-covid-19-data-sources/ has an overview of sources.
These are potential future integrations:
- https://www.cdc.gov/nchs/nvss/vsrr/covid_weekly/index.htm
- https://catalog.data.gov/dataset/covid-19-cases-summarized-by-age-group-and-gender
- https://aws.amazon.com/data-exchange/covid-19/?cards.sort-by=item.additionalFields.order&cards.sort-order=asc
- https://duckduckgo.com/?t=ffab&q=covid-19+data+set+by+age&ia=web
- https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
- https://docs.safegraph.com/docs/social-distancing-metrics
- https://www.cuebiq.com/visitation-insights-mobility-index/
- https://www.google.com/covid19/mobility/
- https://www.cdc.gov/covid-data-tracker/index.html
- https://ourworldindata.org/coronavirus-testing
Building your own database
Commands like these should do it:
git clone https://github.com/jgoerzen/covid19db
cd covid19db
cargo run --release
You will then get a file named covid19.db in the working directory. Just use this with SQLite.
With these commands, you can verify these results for yourself. If you don't already have Rust installed, see the Rust installation page.
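After a build or download, a quick sanity check can confirm that a file really is a readable SQLite database. The sketch below is one possible approach, not part of this project: it checks the 16-byte header that SQLite 3 files start with and runs SQLite's built-in PRAGMA integrity_check. The helper names are made up for this example, and it exercises a throwaway database in a temporary directory; point the helpers at covid19.db to check your own file.

```python
import os
import sqlite3
import tempfile

# SQLite 3 database files begin with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Return True if the file at `path` begins with the SQLite 3 header."""
    with open(path, "rb") as f:
        return f.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


def integrity_ok(path):
    """Run SQLite's consistency check; a healthy file yields one 'ok' row."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    # Throwaway database standing in for covid19.db.
    db_path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()
    demo_db_ok = looks_like_sqlite(db_path) and integrity_ok(db_path)

    # A plain text file fails the header check.
    text_path = os.path.join(tmp, "not_a_db.txt")
    with open(text_path, "w") as f:
        f.write("this is not a database")
    text_file_ok = looks_like_sqlite(text_path)

print(demo_db_ok, text_file_ok)
```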
The Rust library
It is pretty skeletal at the moment, but you can browse the docs.
Database and API stability
This is a rapidly-changing field and the data providers change their schemas on a fairly frequent basis. I attempt to mitigate impacts. If you avoid things like SELECT * and instead name your columns explicitly, you will minimize the impact on yourself in the event of API changes.
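A small sketch shows why naming columns helps. It uses a made-up table (demo, with columns date, cases, and later deaths) and simulates a schema change by adding a column; none of these names come from the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for any table or view in the database.
conn.execute("CREATE TABLE demo (date TEXT, cases INTEGER)")
conn.execute("INSERT INTO demo VALUES ('2020-04-01', 5)")

star_before = conn.execute("SELECT * FROM demo").fetchone()
named_before = conn.execute("SELECT date, cases FROM demo").fetchone()

# Simulate an upstream schema change: a new column appears.
conn.execute("ALTER TABLE demo ADD COLUMN deaths INTEGER DEFAULT 0")

star_after = conn.execute("SELECT * FROM demo").fetchone()
named_after = conn.execute("SELECT date, cases FROM demo").fetchone()

# The SELECT * row changed shape; the explicitly named row did not.
print(len(star_before), len(star_after))
print(named_before == named_after)
```

Code that unpacks the SELECT * result positionally (for example, date, cases = row) would break after the change, while the query with named columns keeps returning the same two values.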
Users
This data is used by the Kansas COVID-19 Charts project and perhaps others.
Copyright and Acknowledgments
This code is Copyright (c) 2019-2020 John Goerzen
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
This repository contains only tools for obtaining data and no data itself, though the data itself may be available elsewhere on GitHub. If you use the data accumulated by this program, or download it, you may be required to acknowledge the source. Here are some details:
nytcounties and cdataset - New York Times
In general, we are making this data publicly available for broad, noncommercial public use including by medical and public health researchers, policymakers, analysts and local news media.
If you use this data, you must attribute it to “The New York Times” in any publication. If you would like a more expanded description of the data, you could say “Data from The New York Times, based on reports from state and local health agencies.”
If you use it in an online presentation, we would appreciate it if you would link to our U.S. tracking page at https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html.
If you use this data, please let us know at covid-data@nytimes.com.
See our LICENSE for the full terms of use for this data.
This license is co-extensive with the Creative Commons Attribution-NonCommercial 4.0 International license, and licensees should refer to that license (CC BY-NC) if they have questions about the scope of the license.
cdataset and loc_lookup - Johns Hopkins
- This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
- Attribute the data as the "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University" or "JHU CSSE COVID-19 Data" for short, and the url: https://github.com/CSSEGISandData/COVID-19.
- For publications that use the data, please cite the following publication: "Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1"
rtlive - rt.live
We just ask that you cite Rt.live as the source and link where appropriate.
covidtracking - COVID Tracking Project
You are welcome to copy, distribute, and develop data and website content from The COVID Tracking Project at The Atlantic for all healthcare, medical, journalistic and non-commercial uses, including any personal, editorial, academic, or research purposes.
The COVID Tracking Project at The Atlantic data and website content is published under a Creative Commons CC BY-NC-4.0 license, which requires users to attribute the source and license type (CC BY-NC-4.0) when sharing our data or website content. The COVID Tracking Project at The Atlantic also grants permission for any derivative use of this data and website content that supports healthcare or medical research (including institutional use by public health and for-profit organizations), or journalistic usage (by nonprofit or for-profit organizations). All other commercial uses are not permitted under the Creative Commons license, and will require permission from The COVID Tracking Project at The Atlantic.
owid - Our World In Data
"All our research and visualizations are free to use by everyone for all purposes." source
Visualizations and text: All our charts, maps, and text is licensed under a very permissive ‘Creative Commons’ (CC) license: The CC-BY license. The BY stands for ‘by attribution’ and this means you are free to take whatever is useful for your work. You just need to provide credit to Our World in Data and our underlying sources (see below).
Harvey County Testing Data
This data is a manual import from the Kansas Department of Health and Environment and the Harvey County Health Department.