Age-specific COVID-19 mortality data in the United States

Updates

  1. Some of the states, which required manual extraction, will no longer be updated.
  2. The automatic extraction for the rest of the states will continue to run daily and be updated in the update-data branch. However, we do not guarantee the accuracy of these data, as they are no longer checked.
  3. The processed data, gathering all sources across the states, is no longer updated.

One may directly get:

  1. the age-specific mortality data used in the paper here
  2. the crude estimates of the COVID-19 cases and mortality across common age strata here

Data

The user may directly find the latest update of the age-specific mortality by date, age and location in

data/processed/latest/DeathsByAge_US.csv

We aim to update the data at least once a week. The data set currently includes 44 U.S. states and 2 metropolitan areas. The locations are listed in the table below.
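As a minimal sketch of how the file might be read, the snippet below sums daily deaths per location using only the Python standard library. It assumes that DeathsByAge_US.csv has the same columns as the per-state files described later (age, date, daily.deaths and code); the function name and the sample rows are made up for illustration and are not real data.

```python
import csv
from collections import defaultdict
from io import StringIO


def total_deaths_by_location(csv_file):
    """Sum daily.deaths per location code from an open CSV file."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        value = row["daily.deaths"]
        if value in ("", "NA"):  # skip missing entries, if any
            continue
        totals[row["code"]] += int(float(value))
    return dict(totals)


# Tiny made-up sample with the assumed column layout (not real data).
sample = StringIO(
    "age,date,daily.deaths,code\n"
    "0-19,2020-06-01,0,XX\n"
    "20-49,2020-06-01,2,XX\n"
    "20-49,2020-06-02,1,XX\n"
    "50+,2020-06-01,3,YY\n"
)
totals = total_deaths_by_location(sample)
print(totals)
```

With the real file, the same function could be called on open("data/processed/latest/DeathsByAge_US.csv", newline="").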

Usage

Docker

The easiest way to get a reproducible environment is to use Docker. A Dockerfile is provided in the repository.

Run:

sudo apt-get install docker.io # on Debian/Ubuntu; on macOS you can use something like brew.
# In any case, you need to install Docker on your machine first.
docker build -t usaage .
docker run --rm -t -d --name usaage_container -v "$(pwd)":/code usaage

This will keep a docker container running in the background, which you can inspect using docker ps.

All development can now be done in the container, and you can edit the code locally as usual: changes are synced to the container because the -v flag shares the folder. For convenience, you might want to use the Remote-SSH extension in VS Code. You can also attach a shell to the container with docker exec -it usaage_container /bin/bash

You can check that everything works by running make all in the container.

Structure Overview

The code is divided into 2 parts: First, the extraction of the COVID-19 mortality counts data from Department of Health websites. Second, the processing of the extracted data to create a complete time series of age-specific COVID-19 mortality counts for every location.

Dependencies

Data extraction

PyMuPDF (imported as fitz)
pandas
pyjson
beautifulsoup4
requests
selenium

Data processing

data.table
ggplot2 
scales
gridExtra
tidyverse
rjson
readxl
reshape2

1. Data extraction

To extract, run

$ make files

This will get you the latest data in data/$DATE.

2. Data processing

To process, run

$ Rscript scripts/process.data.R

This will get you a CSV file for every state, with the variables age, date, daily.deaths and (state) code, in data/processed/$DATE/.

More details about the data extraction

The main entry point is make files.

Scripts

make files executes the files task in the Makefile, which currently consists only of the script ./download_files.sh. This script performs the following steps:

  1. Set a date, $date, in the local environment
  2. Create new folders in data and pdfs for the $date.
  3. Run the following scripts:
    • scripts/age_extraction.py to extract the locations for which data are available in CSV, XLSX or JSON format.
    • a series of GET requests to the web API. They download CSVs made available by the DoH directly.
    • scripts/extraction_try.py, which downloads data that are in webpage, XLSX or PDF format.
    • scripts/get_nm.py to get New Mexico data.
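Steps 1 and 2 above can be sketched in Python as follows. This is only an illustration, not the contents of download_files.sh: the function name is hypothetical, and the YYYY-MM-DD folder naming is an assumption.

```python
from datetime import date
from pathlib import Path
import tempfile


def make_dated_folders(root, day=None):
    """Create root/data/<date> and root/pdfs/<date>; return both paths."""
    stamp = (day or date.today()).isoformat()  # assumed YYYY-MM-DD naming
    folders = [Path(root) / name / stamp for name in ("data", "pdfs")]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = make_dated_folders(tmp, date(2020, 5, 13))
    relative = [p.relative_to(tmp).as_posix() for p in created]
    all_exist = all(p.is_dir() for p in created)
print(relative, all_exist)
```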

General procedure

Depending on the data format made available by the DoH, we do the following:

PDFs: We use fitz in order to read data within PDFs and save them to JSON or CSV format.

CSVs, XLSX, JSON: We download the data directly.

Static Webpages (HTML): We save the HTML and extract the data using BeautifulSoup, and save them in JSON format.
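The idea for static pages can be sketched as below. The repository uses BeautifulSoup; this sketch swaps in the standard-library html.parser so that it runs without extra dependencies. The table layout, class name and column names are invented for illustration and do not correspond to any particular DoH page.

```python
import json
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up page fragment standing in for a saved HTML file.
html = """
<table>
  <tr><th>Age</th><th>Deaths</th></tr>
  <tr><td>0-17</td><td>1</td></tr>
  <tr><td>18-49</td><td>12</td></tr>
</table>
"""
parser = TableCells()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
as_json = json.dumps(records)  # the string that would be written to disk
print(as_json)
```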

Dynamic Webpages (Dashboard): We use selenium to render the webpage and switch to the right page. Then, if the data are stored in the page source, we find their path or CSS selector, extract them and save them in JSON format. Otherwise, if the webpage can be saved as a PDF, we use BeautifulSoup to download the webpage in PDF format and fitz to extract the data from the PDF. If neither option is possible, we take a screenshot of the webpage and extract the data manually.

Screenshots/PNGs: We use these to record the data published on dynamic webpages.

More details about the data processing

Procedure

Pre-processing adjustments

We reconstruct time series for every location and age band, therefore all extracted data need to have the same age bands. If the DoH changes the reported age bands at time $t$ and,

Processing stages

  1. Read the data

    • If a complete time series of the age-specific COVID-19 attributable death burden is available
      • Use only the latest data available
      • Every state has its own processing function depending on the data format
    • If daily snapshots of age-specific COVID-19 attributable death burden are available
      • Use all data ever extracted
      • if CSV or XLSX: the state has its own processing function
      • if JSON: common processing function
  2. Ensure that the cumulative mortality counts are non-decreasing

    • Some DoH updates indicated a decreasing mortality count from one day to the next.
    • In this case, we set the mortality count on the earlier day to match the mortality count on the later day.
  3. Find daily deaths

    • Some days had missing data, usually because no updates were reported, because the webpage failed, or because the URL of the website had changed.
    • The missing daily mortality counts were imputed, assuming a constant increase in daily mortality count between days with data.
  4. Check that the reconstructed cumulative deaths on the last day match the ones reported in the latest data.
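One way to read stage 2 is as a backward pass over the cumulative counts: whenever a day reports more deaths than the following day, the earlier value is lowered to match the later one. The Python sketch below illustrates that reading only; the function name is hypothetical, and the repository's own R implementation may differ in detail.

```python
def enforce_non_decreasing(cumulative):
    """Lower earlier counts so the series never decreases over time."""
    fixed = list(cumulative)
    # Walk from the second-to-last day back to the first day.
    for i in range(len(fixed) - 2, -1, -1):
        fixed[i] = min(fixed[i], fixed[i + 1])
    return fixed


reported = [10, 12, 15, 14, 14, 18]  # made-up counts; day 3 was revised down
corrected = enforce_non_decreasing(reported)
print(corrected)  # [10, 12, 14, 14, 14, 18]
```

The latest report is treated as the most reliable value, so corrections only ever propagate backwards in time.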

The script that acts as a spine for those four stages is utils/obtain.data.R. Functions for stage 1 are in utils/read.daily-historical.data.R and utils/read.json.data.R. Functions for stages 2 and 3 are in utils/summary_functions.R. The function for stage 4 is in utils/sanity.check.processed.data.R.
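Stages 3 and 4 can be illustrated with a short Python sketch. It assumes that imputation spreads the increase in cumulative deaths evenly over the days between two observations, which is one reading of the description above; the function name and the sample counts are made up, and the R functions in the repository may handle details such as rounding differently.

```python
from datetime import date, timedelta


def impute_daily_deaths(observed):
    """Spread the increase between consecutive observed days evenly.

    observed maps date -> cumulative deaths; returns date -> daily deaths
    for every day after the first observation.
    """
    days = sorted(observed)
    daily = {}
    for start, end in zip(days, days[1:]):
        gap = (end - start).days
        per_day = (observed[end] - observed[start]) / gap
        for offset in range(1, gap + 1):
            daily[start + timedelta(days=offset)] = per_day
    return daily


observed = {
    date(2020, 6, 1): 10,
    date(2020, 6, 2): 12,
    date(2020, 6, 5): 18,  # 3 and 4 June are missing
}
daily = impute_daily_deaths(observed)

# Stage 4 style check: the rebuilt cumulative total matches the last report.
rebuilt_total = observed[min(observed)] + sum(daily.values())
assert abs(rebuilt_total - observed[max(observed)]) < 1e-9
print(daily[date(2020, 6, 3)], rebuilt_total)
```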

Post-processing adjustments

After reconstructing the time series, we make final adjustments for analysis:

  1. Modify the age band boundaries from those declared by the Department of Health so that they reflect the closest age bands in the set A = { [0-4], [5-9], ..., [75-79], [80-84], [85+] }. For example, age band [0-17] becomes [0-19] and age band [61-65].

  2. Keep only days that match closely with JHU overall mortality counts.

Both data sets, adjusted and non-adjusted, are available as DeathsByAge_US_adj.csv and DeathsByAge_US.csv, respectively.
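The age band adjustment in step 1 can be sketched as rounding each boundary to the nearest 5-year boundary. This is an illustrative Python reading of "closest age bands", not the repository's R code; the function name and the string output format are hypothetical.

```python
def snap_age_band(lower, upper=None):
    """Snap a reported band to the nearest 5-year boundaries.

    upper=None marks an open-ended band such as 85+.
    """
    # Lower bounds in A are multiples of 5: 0, 5, ..., 85.
    snapped_lower = 5 * round(lower / 5)
    if upper is None:
        return f"{snapped_lower}+"
    # Upper bounds in A are one less than a multiple of 5: 4, 9, ..., 84.
    snapped_upper = 5 * round((upper + 1) / 5) - 1
    return f"{snapped_lower}-{snapped_upper}"


print(snap_age_band(0, 17))   # 0-19, the example given above
print(snap_age_band(18, 49))  # 20-49
print(snap_age_band(85))      # 85+
```

Ties cannot occur here, because an integer divided by 5 never ends in .5, so the result does not depend on the rounding mode.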

Data source

This table includes a complete list of all sources ever used in the data set. We acknowledge and are grateful to U.S. state Departments of Health for making the primary data available at the following sources:

| State | Date record start | Link(s) | Notes |
|---|---|---|---|
| Alabama | 2020-05-03 | link | dashboard updated daily and replaced; no historical archive |
| Alaska | 2020-06-09 | link | metadata updated daily and replaced; no historical archive |
| Arizona | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| California | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Colorado | 2020-03-23 | (1) link until 2020-08-20, (2) link since 2020-08-20 | (1) metadata updated daily; full time series; died in 2020-08-20; (2) dashboard updated daily and replaced; no historical archive |
| Connecticut | 2020-04-05 | link | metadata updated daily; full time series |
| Delaware | 2020-05-12 | link | dashboard updated daily and replaced; no historical archive |
| District of Columbia | 2020-04-13 | link | metadata updated daily; full time series |
| Florida | 2020-03-27 | link | daily report; with historical archive |
| Hawaii | 2020-09-18 | link | dashboard updated weekly and replaced |
| Georgia | 2020-04-27 | link | metadata updated daily and replaced; no historical archive |
| Idaho | 2020-05-13 | (1) link, (2) link | dashboard updated daily and replaced; no historical archive; (1) died on 2020-09-04 |
| Illinois | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
| Indiana | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Iowa | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Kansas | 2020-05-13 | link | dashboard updated Monday, Wednesday and Friday, and replaced; no historical archive |
| Kentucky | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Louisiana | 2020-05-12 | link | dashboard updated daily except on Saturday and replaced; no historical archive |
| Maine | 2020-03-12 | link | metadata updated daily; full time series |
| Maryland | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
| Massachusetts | 2020-04-20 | (1) link until 2020-08-11 and (2) link since | (1) daily report, with historical archive; (2) weekly report, with historical archive |
| Michigan | 2020-03-21 | (1) data/req/michigan weekly.csv and (2) link | (1) data requested to the DoH (2) dashboard updated daily and replaced; no historical archive |
| Minnesota | 2020-05-21 | link | weekly report, with historical archive |
| Mississippi | 2020-04-27 | link | dashboard updated daily and replaced; no historical archive |
| Missouri | 2020-05-13 | (1) link and (2) link | dashboard updated daily and replaced; no historical archive |
| Nevada | 2020-06-07 | link | dashboard updated daily and replaced; no historical archive |
| New Hampshire | 2020-06-07 | (1) link until 2021-01-08, and (2) link since 2021-01-08 | dashboard updated daily and replaced; no historical archive |
| New Jersey | 2020-05-25 | link | dashboard updated daily and replaced; no historical archive |
| New Mexico | 2020-05-25 | link | daily written report; with history archive |
| New York City | 2020-04-14 | link, link since 2020-05-18, link since 2020-11-08 | report / csv updated daily, with history archive |
| North Carolina | 2020-05-20 | link | dashboard updated daily and replaced; no historical archive |
| North Dakota | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
| Oklahoma | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Oregon | 2020-06-05 | link | dashboard updated on Monday-Friday and sometimes on Saturday and replaced; no historical archive |
| Pennsylvania | 2020-06-07 | (1) link and (2) link | dashboard updated daily and replaced; no historical archive |
| Rhode Island | 2020-06-01 | link | metadata updated weekly and replaced; no historical archive |
| South Carolina | 2020-05-14 | link | dashboard updated on Tuesday and Friday; no historical archive |
| Tennessee | 2020-04-09 | link | metadata updated daily; full time series |
| Texas | 2020-05-06 | (1) link until 2020-09-24, (2) link since 2020-09-24 | metadata updated daily and replaced; no historical archive |
| Utah | 2020-06-17 | link | dashboard updated daily and replaced; no historical archive |
| Vermont | 2020-05-13 | (1) link until 2020-09-03, (2) link since 2020-09-03 | dashboard updated daily and replaced; no historical archive; (1) does not report mortality by age since 2020-09-03 |
| Virginia | 2020-04-21 | link | metadata updated daily; full time series |
| Washington | 2020-06-08 | link | dashboard updated daily and replaced; no historical archive |
| Wisconsin | 2020-03-15 | (1) link until 2020-10-19, (2) link since 2020-10-19 | metadata updated daily; full time series |
| Wyoming | 2020-09-22 | link | dashboard updated daily and replaced; no historical archive |

About

Maintainers and Contributors

<p float="left"> <a href="https://www.imperial.ac.uk/"> <img src="logos/IMP_ML_1CS_4CP_CLEAR%20SPACE.svg" height="100" /> </a> <a href="https://www.ox.ac.uk/"> <img src="logos/ox_brand1_pos.gif" height="100" /> </a> <a href="https://statml.io/"> <img src="logos/cropped-LOGO_512_512.svg-270x270.png" height="100" /> </a> </p>

Active maintainers (alphabetically)

Contributors

Licence

This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by Imperial College London on behalf of its COVID-19 Response Team. Copyright Imperial College London 2020.

Warranty

Imperial makes no representation or warranty about the accuracy or completeness of the data nor that the results will not constitute an infringement of third-party rights. Imperial accepts no liability or responsibility for any use which may be made of any results, for the results, nor for any reliance which may be placed on any such work or results.

Cite

Attribute the data as the "COVID-19 Age specific Mortality Data Repository by the Imperial College London COVID-19 Response Team", and the URLs specified above.

Acknowledgements

We acknowledge the support of the EPSRC through the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning at Imperial and Oxford.

Funding

This research was partly funded by the Imperial College COVID-19 Research Fund.