
Age-specific COVID-19 mortality data in the United States

Updates

2021-04-02 This repository is no longer maintained. What does this mean?

Some of the states, which required manual extraction, will no longer be updated.
The automatic extraction for the rest of the states will continue to run daily, with results pushed to the update-data branch. However, we do not guarantee the accuracy of these data, as they are no longer checked.
The processed data, gathering all sources across the states, is no longer updated.

2021-01-28 Version 1 Release This release accompanies our upcoming peer-reviewed age paper, in which we use age-specific mobility data to estimate the epidemic in the USA while accounting for age-specific heterogeneity.

One may directly get:

the age-specific mortality data used in the paper here
the crude estimates of the COVID-19 cases and mortality across common age strata here

Data

The latest age-specific mortality data by date, age and location are in

data/processed/latest/DeathsByAge_US.csv

We aim to update the data at least once a week. The data set currently includes 44 U.S. states and 2 metropolitan areas. The locations are listed in the table below.
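
As a minimal sketch of how this file might be consumed, the snippet below sums daily deaths by location and age band using only the Python standard library. It assumes the combined file has the same columns as the per-state files (age, date, daily.deaths and code). The function name and the sample rows are hypothetical, made up for illustration, and are not taken from the data set.

```python
import csv
import io
from collections import defaultdict


def total_deaths_by_age(csv_file):
    """Sum daily deaths per (code, age) pair from an open CSV file object.

    Assumed columns: code, age, date, daily.deaths.
    Rows with a missing count ("NA" or empty) are skipped.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        value = row["daily.deaths"].strip()
        if value in ("", "NA"):
            continue
        totals[(row["code"], row["age"])] += int(float(value))
    return dict(totals)


# Synthetic rows for illustration only (not real counts).
sample = io.StringIO(
    "code,age,date,daily.deaths\n"
    "CT,0-9,2020-05-01,0\n"
    "CT,0-9,2020-05-02,1\n"
    "CT,80+,2020-05-01,3\n"
    "CT,80+,2020-05-02,4\n"
    "CT,80+,2020-05-03,NA\n"
)
totals = total_deaths_by_age(sample)
print(totals)
```

To try it on the real file, pass open("data/processed/latest/DeathsByAge_US.csv", newline="") instead of the sample, after checking that the column names match.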

Usage

Docker

The easiest way to reproduce the environment is to use Docker. A Dockerfile is included in the repository.

Run:

# Install Docker first. On Debian/Ubuntu the engine is packaged as docker.io;
# on macOS you can use something like Homebrew.
sudo apt-get install docker.io
docker build -t usaage .
docker run --rm -t -d --name usaage_container -v "$(pwd)":/code usaage

This will keep a docker container running in the background, which you can inspect using docker ps.

All development can now be done in the container while you edit the code locally as usual: the -v flag shares the folder, so local changes are synced to the container. For convenience, you may want to use the Remote-SSH extension in VS Code. You can also attach a shell to the container using docker exec -it usaage_container /bin/bash

You can check that everything works by running make all in the container.

Structure Overview

The code is divided into two parts. First, the extraction of the COVID-19 mortality count data from Department of Health (DoH) websites. Second, the processing of the extracted data to create a complete time series of age-specific COVID-19 mortality counts for every location.

Dependencies

Data extraction

Python version >= 3.6.1
Python libraries:

PyMuPDF (imported as fitz)
pandas
pyjson
beautifulsoup4
requests
selenium

Data processing

R version >= 4.0.2
R libraries:

data.table
ggplot2 
scales
gridExtra
tidyverse
rjson
readxl
reshape2

1. Data extraction

To extract, run

$ make files

This will get you the latest data in data/$DATE.

2. Data processing

To process, run

$ Rscript scripts/process.data.R

This will get you a csv file for every state with variables age, date, daily.deaths and (state) code in data/processed/$DATE/.

More details about the data extraction

The main entry point is make files.

Scripts

make files executes the files task in the Makefile, which currently consists only of the script ./download_files.sh. This script performs the following steps:

Set a date, $date, in the local environment
Create new folders in data and pdfs for the $date.
Run the following scripts:
- scripts/age_extraction.py to extract data for the locations where they are available in CSV, XLSX or JSON format.
- a series of GET requests to the web API. They download CSVs made available by the DoH directly.
- scripts/extraction_try.py, which downloads data that are in webpage, XLSX or PDF format.
- scripts/get_nm.py to get New Mexico data.
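
The first two steps (setting a date and creating dated folders) can be sketched in Python as follows. This is an illustration only: the actual logic lives in ./download_files.sh, the function name make_dated_dirs is hypothetical, and the YYYY-MM-DD folder naming is an assumption.

```python
import tempfile
from datetime import date
from pathlib import Path


def make_dated_dirs(root, day):
    """Create <root>/data/<day> and <root>/pdfs/<day>; return both paths.

    The ISO date format (YYYY-MM-DD) for folder names is an assumption.
    """
    stamp = day.isoformat()
    created = []
    for parent in ("data", "pdfs"):
        folder = Path(root) / parent / stamp
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run on the same day
        created.append(folder)
    return created


# Demonstrate in a temporary directory so nothing is written to the repository.
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_dated_dirs(tmp, date(2020, 5, 13))
    names = [d.relative_to(tmp).as_posix() for d in dirs]
    all_exist = all(d.is_dir() for d in dirs)

print(names)
```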

General procedure

Depending on the data format made available by the DoH, we do the following:

PDFs: We use fitz in order to read data within PDFs and save them to JSON or CSV format.

CSVs, XLSX, JSON: We download the data directly.

Static Webpages (HTML): We save the HTML and extract the data using BeautifulSoup, and save them in JSON format.
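
The static-webpage case can be sketched as below. To stay self-contained, this example swaps BeautifulSoup for the standard-library html.parser module, so it is not the repository's code; the class name, function name and sample HTML are hypothetical. It reads the cells of a simple HTML table and serialises the rows to JSON.

```python
import json
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_json(html):
    """Treat the first row as the header and return the remaining rows as JSON."""
    parser = TableCellParser()
    parser.feed(html)
    header, *body = parser.rows
    return json.dumps([dict(zip(header, row)) for row in body])


# Made-up page fragment for illustration only.
sample_html = (
    "<table><tr><th>age</th><th>deaths</th></tr>"
    "<tr><td>0-17</td><td>1</td></tr>"
    "<tr><td>18-49</td><td>5</td></tr></table>"
)
records = json.loads(table_to_json(sample_html))
print(records)
```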

Dynamic Webpages (Dashboard): We use selenium to render a webpage and switch to the right page. Then, if the data are stored in the source code, we find their path or CSS selector, extract them and save them in JSON format. Otherwise, if the webpage can be saved as a PDF, we use BeautifulSoup to download the webpage in PDF format and fitz to extract the data from the PDF. If neither option is possible, we take a screenshot of the webpage and extract the data manually.

Screenshots/PNGs: We use these to record the data published on dynamic webpages.

More details about the data processing

Procedure

Pre-processing adjustments

We reconstruct a time series for every location and age band, so all extracted data must use the same age bands. If the DoH changes the reported age bands at time $t$ and

the old age bands can be used to find the new age bands, then we compute the mortality counts by the old age bands for all data from $t$ onwards before processing.
the old age bands cannot be used to find the new age bands, then we truncate the time series: $t$ becomes the first day of the time series and all data extracted before $t$ are ignored.
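
One possible reading of these two cases is sketched below. It assumes the bands are compatible when every newly reported band nests inside exactly one old band, in which case the counts are summed back into the old bands; otherwise the function returns None to signal that the series should be truncated at $t$. The function name, the band encoding as (lower, upper) tuples and the counts are all hypothetical.

```python
def map_to_old_bands(new_counts, old_bands):
    """Sum counts reported by new age bands into the old age bands.

    new_counts: dict mapping (lower, upper) inclusive bands to counts.
    old_bands:  list of (lower, upper) inclusive bands.
    Returns a dict keyed by old band, or None if some new band does not
    nest inside exactly one old band (time series must then be truncated).
    """
    totals = {band: 0 for band in old_bands}
    for (lower, upper), count in new_counts.items():
        parents = [b for b in old_bands if b[0] <= lower and upper <= b[1]]
        if len(parents) != 1:
            return None
        totals[parents[0]] += count
    return totals


old_bands = [(0, 19), (20, 49), (50, 120)]

# Made-up counts: finer bands that nest inside the old ones.
compatible = map_to_old_bands(
    {(0, 9): 1, (10, 19): 2, (20, 49): 7, (50, 64): 10, (65, 120): 30},
    old_bands,
)

# Made-up counts: (18, 49) straddles two old bands, so no mapping exists.
incompatible = map_to_old_bands(
    {(0, 17): 3, (18, 49): 7, (50, 120): 40},
    old_bands,
)
print(compatible, incompatible)
```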

Processing stages

1. Read the data
- If a complete time series of age-specific COVID-19 attributable death burden is available
  - Use only the latest data available
  - Every state has its own processing function depending on the data format
- If daily snapshots of age-specific COVID-19 attributable death burden are available
  - Use all data ever extracted
  - If CSV or XLSX: the state has its own processing function
  - If JSON: common processing function
2. Ensure that the cumulative mortality counts are non-decreasing
- Some DoH updates indicated a decreasing mortality count from one day to the next.
- In this case, we set the mortality count on the earlier day to match the mortality count on the later day.
3. Find daily deaths
- Some days had missing data, usually because no updates were reported, because the webpage failed or because the URL of the website had changed.
- The missing daily mortality counts were imputed, assuming a constant daily increase in the mortality count between days with data.
4. Check that the reconstructed cumulative deaths on the last day match the ones reported in the latest data.
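
The last three stages can be sketched in Python for a single location and age band. This is a hedged illustration, not a port of the R functions in utils/: the function names are hypothetical, the counts are made up, and rounding interpolated counts down to integers is an assumption.

```python
from datetime import date, timedelta


def enforce_nondecreasing(cum):
    """Lower any earlier cumulative count that exceeds a later one."""
    fixed, running_min = {}, None
    for day in sorted(cum, reverse=True):  # walk from most recent to earliest
        running_min = cum[day] if running_min is None else min(running_min, cum[day])
        fixed[day] = running_min
    return dict(sorted(fixed.items()))


def fill_missing_days(cum):
    """Linearly interpolate cumulative counts over days without data."""
    days = sorted(cum)
    filled = {}
    for d0, d1 in zip(days, days[1:]):
        gap = (d1 - d0).days
        for k in range(gap):
            # Integer interpolation (rounded down) keeps counts whole.
            filled[d0 + timedelta(days=k)] = cum[d0] + (cum[d1] - cum[d0]) * k // gap
    filled[days[-1]] = cum[days[-1]]
    return filled


def daily_deaths(cum):
    """Difference consecutive cumulative counts (defined from the second day)."""
    days = sorted(cum)
    return {d1: cum[d1] - cum[d0] for d0, d1 in zip(days, days[1:])}


# Made-up snapshots: a decrease on 3 June and no data on 4-5 June.
snapshots = {
    date(2020, 6, 1): 10,
    date(2020, 6, 2): 14,
    date(2020, 6, 3): 13,
    date(2020, 6, 6): 19,
}
monotone = enforce_nondecreasing(snapshots)
complete = fill_missing_days(monotone)
daily = daily_deaths(complete)

# Sanity check: first cumulative count plus all daily deaths
# must equal the latest reported cumulative count.
first_day, last_day = min(complete), max(complete)
check_ok = complete[first_day] + sum(daily.values()) == snapshots[last_day]
print(list(complete.values()), list(daily.values()), check_ok)
```

Walking backwards with a running minimum means a single pass also handles several consecutive decreases.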

The script that acts as a spine for those four stages is utils/obtain.data.R. Functions for stage 1 are in utils/read.daily-historical.data.R and utils/read.json.data.R. Functions for stages 2 and 3 are in utils/summary_functions.R. The function for stage 4 is in utils/sanity.check.processed.data.R.

Post-processing adjustments

After reconstructing the time series, we make final adjustments for analysis:

Modify the age band boundaries from the ones declared by the Department of Health, such that they reflect the closest age bands in the set A = { [0-4], [5-9], ..., [75-79], [80-84], [85+] }. For example, age band [0-17] becomes [0-19] and age band [61-65].
Keep only days that match closely with JHU overall mortality counts.
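
The age band adjustment can be illustrated with a simple nearest-boundary rule. This is only a sketch under that assumption: the repository's actual rule may differ, the function name snap_age_band is hypothetical, and open-ended bands such as [85+] are not handled here.

```python
# Boundaries implied by the set A = { [0-4], [5-9], ..., [80-84], [85+] }.
LOWER_BOUNDS = list(range(0, 85, 5))  # 0, 5, ..., 80 (closed bands only)
UPPER_BOUNDS = list(range(4, 85, 5))  # 4, 9, ..., 84


def snap_age_band(lower, upper):
    """Move both ends of a closed age band to the nearest boundary in A.

    Assumption: "closest" means smallest absolute distance for each end
    separately, with the upper end kept above the lower end.
    """
    new_lower = min(LOWER_BOUNDS, key=lambda b: abs(b - lower))
    candidates = [b for b in UPPER_BOUNDS if b > new_lower]
    new_upper = min(candidates, key=lambda b: abs(b - upper))
    return f"[{new_lower}-{new_upper}]"


adjusted = snap_age_band(0, 17)
print(adjusted)
```

With this rule an adjusted band such as [0-19] is a union of consecutive bands in A, which keeps it comparable across locations.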

Both data sets, adjusted and non-adjusted, are available as DeathsByAge_US_adj.csv and DeathsByAge_US.csv.

Data source

This table includes a complete list of all sources ever used in the data set. We acknowledge and are grateful to U.S. state Departments of Health for making the primary data available at the following sources:

State	Date record start	Link(s)	Notes
Alabama	2020-05-03	link	dashboard updated daily and replaced; no historical archive
Alaska	2020-06-09	link	metadata updated daily and replaced; no historical archive
Arizona	2020-05-13	link	dashboard updated daily and replaced; no historical archive
California	2020-05-13	link	dashboard updated daily and replaced; no historical archive
Colorado	2020-03-23	(1) link until 2020-08-20, (2) link since 2020-08-20	(1) metadata updated daily; full time series; discontinued on 2020-08-20; (2) dashboard updated daily and replaced; no historical archive
Connecticut	2020-04-05	link	metadata updated daily; full time series
Delaware	2020-05-12	link	dashboard updated daily and replaced; no historical archive
District of Columbia	2020-04-13	link	metadata updated daily; full time series
Florida	2020-03-27	link	daily report; with historical archive
Georgia	2020-04-27	link	metadata updated daily and replaced; no historical archive
Hawaii	2020-09-18	link	dashboard updated weekly and replaced
Idaho	2020-05-13	(1) link, (2) link	dashboard updated daily and replaced; no historical archive; (1) discontinued on 2020-09-04
Illinois	2020-05-14	link	dashboard updated daily and replaced; no historical archive
Indiana	2020-05-13	link	dashboard updated daily and replaced; no historical archive
Iowa	2020-05-13	link	dashboard updated daily and replaced; no historical archive
Kansas	2020-05-13	link	dashboard updated Monday, Wednesday and Friday, and replaced; no historical archive
Kentucky	2020-05-13	link	dashboard updated daily and replaced; no historical archive
Louisiana	2020-05-12	link	dashboard updated daily except on Saturday and replaced; no historical archive
Maine	2020-03-12	link	metadata updated daily; full time series
Maryland	2020-05-14	link	dashboard updated daily and replaced; no historical archive
Massachusetts	2020-04-20	(1) link until 2020-08-11 and (2) link since then	(1) daily report, with historical archive; (2) weekly report, with historical archive
Michigan	2020-03-21	(1) `data/req/michigan weekly.csv` and (2) link	(1) data requested from the DoH; (2) dashboard updated daily and replaced; no historical archive
Minnesota	2020-05-21	link	weekly report, with historical archive
Mississippi	2020-04-27	link	dashboard updated daily and replaced; no historical archive
Missouri	2020-05-13	(1) link and (2) link	dashboard updated daily and replaced; no historical archive
Nevada	2020-06-07	link	dashboard updated daily and replaced; no historical archive
New Hampshire	2020-06-07	(1) link until 2021-01-08, and (2) link since 2021-01-08	dashboard updated daily and replaced; no historical archive
New Jersey	2020-05-25	link	dashboard updated daily and replaced; no historical archive
New Mexico	2020-05-25	link	daily written report; with historical archive
New York City	2020-04-14	link, link since 2020-05-18, link since 2020-11-08	report / csv updated daily, with historical archive
North Carolina	2020-05-20	link	dashboard updated daily and replaced; no historical archive
North Dakota	2020-05-14	link	dashboard updated daily and replaced; no historical archive
Oklahoma	2020-05-13	link	dashboard updated daily and replaced; no historical archive
Oregon	2020-06-05	link	dashboard updated on Monday-Friday and sometimes on Saturday and replaced; no historical archive
Pennsylvania	2020-06-07	(1) link and (2) link	dashboard updated daily and replaced; no historical archive
Rhode Island	2020-06-01	link	metadata updated weekly and replaced; no historical archive
South Carolina	2020-05-14	link	dashboard updated on Tuesday and Friday; no historical archive
Tennessee	2020-04-09	link	metadata updated daily; full time series
Texas	2020-05-06	(1) link until 2020-09-24, (2) link since 2020-09-24	metadata updated daily and replaced; no historical archive
Utah	2020-06-17	link	dashboard updated daily and replaced; no historical archive
Vermont	2020-05-13	(1) link until 2020-09-03, (2) link since 2020-09-03	dashboard updated daily and replaced; no historical archive; (1) has not reported mortality by age since 2020-09-03
Virginia	2020-04-21	link	metadata updated daily; full time series
Washington	2020-06-08	link	dashboard updated daily and replaced; no historical archive
Wisconsin	2020-03-15	(1) link until 2020-10-19, (2) link since 2020-10-19	metadata updated daily; full time series
Wyoming	2020-09-22	link	dashboard updated daily and replaced; no historical archive

About

Maintainers and Contributors

Active maintainers (alphabetically)

Yu Chen - Department of Mathematics, Imperial College London
Michael Hutchinson - Department of Statistics, Oxford
Vidoushee Jogarah - Mary Lister McCammon Fellow, Department of Mathematics, Imperial College London
Mélodie Monod - Department of Mathematics, Imperial College London
Oliver Ratmann - Department of Mathematics, Imperial College London
Harrison Zhu - Department of Mathematics, Imperial College London

Contributors

Martin McManus - Department of Mathematics, Imperial College London

Licence

This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by Imperial College London on behalf of its COVID-19 Response Team. Copyright Imperial College London 2020.

Warranty

Imperial makes no representation or warranty about the accuracy or completeness of the data nor that the results will not constitute an infringement of third-party rights. Imperial accepts no liability or responsibility for any use which may be made of any results, for the results, nor for any reliance which may be placed on any such work or results.

Cite

Attribute the data as the "COVID-19 Age specific Mortality Data Repository by the Imperial College London COVID-19 Response Team", and the URLs specified above.

Acknowledgements

We acknowledge the support of the EPSRC through the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning at Imperial and Oxford.

Funding

This research was partly funded by The Imperial College COVID-19 Research Fund.