Yahoo Knowledge Graph COVID-19 Datasets

Background

The Yahoo Knowledge Graph team at Verizon Media is responsible for providing critical COVID-19 data that feeds into Yahoo properties like Yahoo News, Yahoo Finance, and Yahoo Weather. The COVID-19 datasets include country, state, and county level information updated on a rolling basis, with updates occurring approximately hourly.

The COVID-19 datasets are constructed entirely from primary (government and public agency) sources with a clear attribution of the primary sources used for each geographical region. While other aggregations of COVID-19 data are already available, we believe ours to be the only open source COVID-19 dataset that is constructed entirely from primary sources with clear attribution back to those sources. Our hope is that additional transparency will enable more accurate analysis, aiding researchers who seek to understand and prevent further spread of the disease.

Released together with the COVID-19 dataset are two other open source projects:

Datasets

The data is organized by region and time. Time-based data comes in two forms: a snapshot of the latest updates received for all regions, and the updates reported by regions for a given date. The schema will evolve as the COVID-19 pandemic develops and as local governments and agencies improve their ability to collect and present their data to the public, so please check back regularly.

We welcome data feeds or links to web pages that you would like us to crawl, extract, and merge into the overall stats. Feel free to submit an issue.

region-metadata

Provides general information about the regions covered in the dataset, such as geographical location and links to other public data sources.

| Field | Type | Description |
|---|---|---|
| id | xsd:string | a unique identifier for the region |
| type | list of xsd:string | a list of type classifications for the region, for example: Country, StateAdminArea, CountyAdminArea, etc. |
| woeId | xsd:string | WhereOnEarth unique identifier for the region |
| wikiId | xsd:string | the main Wikipedia page name of the region; can be used as a unique key |
| countryCode | xsd:string | 2-letter country abbreviation code (ISO 3166) |
| stateCode | xsd:string | 2-letter state abbreviation code (FIPS 5-2) |
| countyCode | xsd:string | US county code (FIPS 6-4) |
| label | xsd:string | the English name of the region |
| latitude | xsd:float | latitude in decimal number format |
| longitude | xsd:float | longitude in decimal number format |
| population | xsd:integer | the population residing in the region |
| parentId | list of xsd:string | a list of parent geopolitical regions for the region; only direct parents as they exist in the dataset are listed, not the full possible hierarchy |
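
Because parentId lists only direct parents, recovering the full chain of ancestors for a region means following those links upward. The sketch below shows one way to do that in Python. It assumes the metadata has already been loaded into a dict keyed by id; the region ids and labels are hypothetical and do not come from the dataset.

```python
from collections import deque

# Hypothetical region-metadata records keyed by id (ids and labels are made up).
REGIONS = {
    "country-x": {"label": "Country X", "type": ["Country"], "parentId": []},
    "state-y": {"label": "State Y", "type": ["StateAdminArea"], "parentId": ["country-x"]},
    "county-z": {"label": "County Z", "type": ["CountyAdminArea"], "parentId": ["state-y"]},
}


def ancestors(region_id, regions):
    """Return every ancestor of region_id, nearest first, by walking parentId links."""
    found = []
    queue = deque(regions[region_id]["parentId"])
    while queue:
        parent = queue.popleft()
        if parent in found or parent not in regions:
            continue  # skip repeats and parents that are absent from the dataset
        found.append(parent)
        queue.extend(regions[parent]["parentId"])
    return found


print(ancestors("county-z", REGIONS))
```

The breadth-first walk tolerates regions with more than one direct parent and stops at parents that have no record of their own.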

by-region-[DATE]

Provides detailed case counts of COVID-19 in each region on [DATE] in local time for that region. Each entry (row) in the daily file represents a single region.

Please be aware that different sources release data at different and often unpredictable frequencies. The by-region-[DATE] numbers will be updated as sources release data for the given date for their region. In some cases, data for a given region is not released until many days after that calendar date has elapsed everywhere in the world. As a result, the same by-region-[DATE] file may show different aggregate statistics for the same date depending on when the by-region-[DATE] is accessed. Generally speaking, by-region-[DATE] data more than one week old is stable.
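
One way to act on this guidance is to filter out dates that may still change before running an analysis. A minimal sketch, assuming the one-week threshold is a rule of thumb taken from the paragraph above rather than a guarantee; the function name and constant are illustrative only:

```python
from datetime import date, timedelta

# Assumption: "more than one week old" is treated as strictly more than 7 days.
STABLE_AFTER = timedelta(days=7)


def is_probably_stable(reference_date, today):
    """Return True when a by-region-[DATE] file is old enough to be unlikely to change."""
    return today - reference_date > STABLE_AFTER


print(is_probably_stable(date(2020, 4, 1), date(2020, 4, 9)))
print(is_probably_stable(date(2020, 4, 1), date(2020, 4, 8)))
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check reproducible when an analysis is rerun later.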

| Field | Type | Description |
|---|---|---|
| regionId | xsd:string | see id in region-metadata above |
| label | xsd:string | see label in region-metadata above |
| totalConfirmed | xsd:integer | the total number of confirmed cases of COVID-19 in the region until the given date (aggregate) |
| totalDeaths | xsd:integer | the total number of fatalities from COVID-19 in the region |
| totalRecoveredCases | xsd:integer | the total number of people recovered from COVID-19 in the region (aggregate) |
| totalTestedCases | xsd:integer | the total number of people tested for COVID-19 in the region (aggregate) |
| numPositiveTests | xsd:integer | the daily count of people who tested positive for COVID-19 |
| numDeaths | xsd:integer | the daily count of fatalities as a result of COVID-19 |
| numRecoveredCases | xsd:integer | the daily count of people recovered from COVID-19 |
| diffNumPositiveTests | xsd:integer | the difference in the number of positive cases found between 2 consecutive days |
| diffNumDeaths | xsd:integer | the difference in the number of deaths between 2 consecutive days |
| avgWeeklyConfirmedCases | xsd:float | 7-day moving average of daily new confirmed cases |
| avgWeeklyDeaths | xsd:float | 7-day moving average of daily new deaths |
| referenceDate | xsd:date | the date associated with the COVID-19 data according to the local timezone of the region |
| lastUpdatedDate | xsd:datetime | last update time of the entry |
| dataSource | xsd:anyURI | the source attribution for the COVID-19 data in the current entry |
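
To illustrate how the diff* and avgWeekly* fields relate to the daily counts, the sketch below computes a day-over-day difference and a 7-day moving average from a list of daily counts. It assumes a trailing window that ends on the reference date and leaves the value undefined until seven days are available; how the dataset itself aligns the window or handles missing days is not specified here. The numbers are made up.

```python
def day_over_day_diff(daily):
    """Difference between each day's count and the previous day's (None for the first day)."""
    if not daily:
        return []
    return [None] + [today - prev for prev, today in zip(daily, daily[1:])]


def trailing_average(daily, window=7):
    """Mean of the last `window` daily counts; None until a full window exists."""
    averages = []
    for i in range(len(daily)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(daily[i + 1 - window : i + 1]) / window)
    return averages


# Hypothetical daily counts for eight consecutive days in one region.
daily_counts = [10, 12, 9, 15, 20, 18, 21, 28]
print(day_over_day_diff(daily_counts))
print(trailing_average(daily_counts))
```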

by-region-latest

Provides the latest figures for each region.

The schema for the latest file is similar to the by-region-[DATE] above. There are 2 main differences:

Note that because different regions report at different and often unpredictable frequencies, the latest figures for one region may be many days older than the latest figures for another region. For this reason, stable by-region-[DATE] numbers are required for an accurate comparison of growth rates in different regions. Generally speaking, by-region-[DATE] data more than one week old is stable.
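
The sketch below illustrates why the latest figures are not directly comparable and how a growth rate might instead be computed from totals taken on the same pair of dates. The rows, region ids, and helper names are hypothetical; only the field names come from the schema.

```python
from datetime import date

# Hypothetical by-region-latest rows: the two regions last reported on different days.
latest_rows = [
    {"regionId": "region-a", "referenceDate": date(2020, 5, 10), "totalConfirmed": 1200},
    {"regionId": "region-b", "referenceDate": date(2020, 5, 4), "totalConfirmed": 800},
]


def oldest_reference_date(rows):
    """Earliest referenceDate among the rows; figures after it are missing for some region."""
    return min(row["referenceDate"] for row in rows)


def growth_rate(total_start, total_end):
    """Relative growth between two cumulative totals; None when the start total is zero."""
    if total_start == 0:
        return None
    return (total_end - total_start) / total_start


print(oldest_reference_date(latest_rows))
# Totals for both regions should be read from the same two stable by-region-[DATE] files.
print(growth_rate(800, 1000))
```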

| Field | Type | Description |
|---|---|---|
| regionId | xsd:string | see id in region-metadata above |
| label | xsd:string | see label in region-metadata above |
| totalConfirmed | xsd:integer | the total number of confirmed cases of COVID-19 in the region until the given date (aggregate) |
| totalDeaths | xsd:integer | the total number of fatalities from COVID-19 in the region |
| totalRecoveredCases | xsd:integer | the total number of people recovered from COVID-19 in the region (aggregate) |
| totalTestedCases | xsd:integer | the total number of people tested for COVID-19 in the region (aggregate) |
| referenceDate | xsd:date | the date associated with the COVID-19 data according to the local timezone of the region |
| lastUpdatedDate | xsd:datetime | last update time of the entry |
| dataSource | xsd:anyURI | the source attribution for the COVID-19 data in the current entry |

Maintainers

Please contact yk-covid-19-os@verizonmedia.com with any questions.

Contributors

Thank you to everyone who contributed to this project!

License

The Yahoo Knowledge Graph COVID-19 Dataset is made available under a Creative Commons CC-BY-NC 4.0 license. No express permission from Verizon Media is required for noncommercial uses; they require only compliance with the CC-BY-NC 4.0 license, including attribution.

Verizon Media may consider granting royalty-free commercial licenses upon request. If you are interested in making commercial use of the Yahoo COVID-19 Dataset, please submit a request.