Covid-on-the-Web Dataset

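Covid-on-the-Web Dataset is an RDF dataset that provides two main knowledge graphs produced by analyzing the scholarly articles of the COVID-19 Open Research Dataset (CORD-19) [1], a resource of articles about COVID-19 and the coronavirus family of viruses:

- the CORD-19 Named Entities Knowledge Graph (CORD19-NEKG), which describes named entities identified in the articles;
- the CORD-19 Argumentative Knowledge Graph (CORD19-AKG), which describes argumentative components and PICO elements extracted from the articles.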

A description of the dataset in the Turtle format, as well as examples, is provided in the dataset directory.

Covid-on-the-Web Dataset is an initiative of the Wimmics team, I3S laboratory, University Côte d'Azur, Inria, CNRS.

Covid-on-the-Web Dataset v1.2 is based on CORD-19 v47.

Documentation

CORD-19 Named Entities Knowledge Graph (CORD19-NEKG)

To identify and disambiguate named entities, we used DBpedia Spotlight (links to DBpedia), Entity-fishing (links to Wikidata), and NCBO BioPortal annotator (links to ontologies in Bioportal).

Named entities were identified primarily in the articles' titles and abstracts. Entity-fishing was also used to process the articles' bodies.

The table below shows the total number of named entities extracted by each tool, as well as the corresponding number of unique URIs.

| | DBpedia | Wikidata | Bioportal | Total |
|---|---|---|---|---|
| No. named entities | 4,084,979 | 66,098,777 | 42,972,551 | 113,156,307 |
| No. unique URIs | 63,750 | 252,150 | 429,755 | 745,655 |

CORD-19 Argumentative Knowledge Graph (CORD19-AKG)

To extract argumentative components (claims and evidence) and PICO elements, we used the Argumentative Clinical Trial Analysis platform (ACTA) [2].

Argumentative components and PICO elements were extracted from the articles' abstracts.

| | ACTA |
|---|---|
| No. argumentative components | 119,053 |
| No. PICO elements linked to UMLS concepts | 515,590 |
| No. unique UMLS concepts | 31,841 |

URIs naming scheme

The Covid-on-the-Web namespace is http://ns.inria.fr/covid19/. All URIs are dereferenceable.
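Since the URIs are dereferenceable, an HTTP client can request an RDF description of a resource in the namespace. The sketch below only builds such a request with Python's standard library and does not send it. The helper name `build_deref_request` is hypothetical, and the `text/turtle` Accept header is an assumption about the serializations the server is willing to negotiate. The local name used in the example is that of the dataset URI.

```python
from urllib.request import Request

NAMESPACE = "http://ns.inria.fr/covid19/"


def build_deref_request(local_name: str, accept: str = "text/turtle") -> Request:
    """Build (but do not send) a GET request for a URI in the namespace.

    The Accept header is an assumption: the server may offer other RDF
    serializations, or ignore content negotiation altogether.
    """
    return Request(NAMESPACE + local_name, headers={"Accept": accept}, method="GET")


req = build_deref_request("covidontheweb-1-2")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would actually send the request.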

The dataset itself is identified by the URI http://ns.inria.fr/covid19/covidontheweb-1-2. It comes with DCAT and VoID descriptions. All articles, annotations and arguments are linked back to the dataset with the property rdfs:isDefinedBy.
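The rdfs:isDefinedBy link can be used to select the resources that belong to this version of the dataset. As a minimal sketch, the hypothetical helper below builds a SPARQL query string that counts them. It only produces the query text. On a store of this size, such a count may be slow or may hit a server-side timeout.

```python
DATASET_URI = "http://ns.inria.fr/covid19/covidontheweb-1-2"


def count_defined_by_query(dataset_uri: str = DATASET_URI) -> str:
    """Return a SPARQL query counting the resources linked back to the dataset."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT (COUNT(DISTINCT ?resource) AS ?n)\n"
        "WHERE {\n"
        f"  ?resource rdfs:isDefinedBy <{dataset_uri}> .\n"
        "}"
    )


query = count_defined_by_query()
print(query)
```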

Article URIs are formatted as http://ns.inria.fr/covid19/paper_id where paper_id may be either the article SHA hash or its PMC identifier. Parts of an article (title, abstract and body) are also identified by URIs so that annotations of named entities can link back to the part they belong to. These URIs are formatted as

Downloading and SPARQL Querying

The dataset is downloadable as a set of RDF dumps (in Turtle syntax) from Zenodo.

It can also be queried through our Virtuoso OS SPARQL endpoint https://covidontheweb.inria.fr/sparql.
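A SPARQL endpoint is usually queried over HTTP by passing the query text in a `query` parameter, as defined by the SPARQL 1.1 Protocol. The sketch below builds such a GET request with the standard library and does not send it. The helper name `build_sparql_request` is hypothetical, and it is an assumption that the endpoint honours the `application/sparql-results+json` Accept header.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

ENDPOINT = "https://covidontheweb.inria.fr/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the endpoint can return results as SPARQL JSON.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


example_query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
example_request = build_sparql_request(example_query)

# Round trip: the query can be recovered from the encoded URL.
decoded = parse_qs(urlsplit(example_request.full_url).query)["query"][0]
print(decoded == example_query)
```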

You may use the Faceted Browser to look up text or URIs. As an example, you can look up article http://ns.inria.fr/covid19/d53508d43264f59007fd5e4aa8b4af026edf0bfe. Further details about how named entities are represented in RDF are given in the Data Modeling section.
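The same lookup can be prepared programmatically. As a hedged sketch, the hypothetical helpers below build an article URI from a paper identifier and a SPARQL query for its title. The query uses dct:title, as in the example query of this README, and the `dct:` prefix is bound to the Dublin Core terms namespace, which is an assumption about the vocabulary actually used in the data.

```python
NAMESPACE = "http://ns.inria.fr/covid19/"


def article_uri(paper_id: str) -> str:
    """Return the URI of an article given its SHA hash or PMC identifier."""
    return NAMESPACE + paper_id


def title_query(paper_id: str) -> str:
    """Return a SPARQL query retrieving the title(s) of one article."""
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?title\n"
        f"WHERE {{ <{article_uri(paper_id)}> dct:title ?title . }}"
    )


paper_id = "d53508d43264f59007fd5e4aa8b4af026edf0bfe"
print(article_uri(paper_id))
print(title_query(paper_id))
```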

The following named graphs can be queried from our SPARQL endpoint:

| Named graph | Description | No. RDF triples |
|---|---|---|
| http://ns.inria.fr/covid19/graph/metadata | dataset description + definition of a few properties | 170 |
| http://ns.inria.fr/covid19/graph/articles | articles metadata (title, authors, DOIs, journal etc.) | 3,722,381 |
| http://ns.inria.fr/covid19/graph/entityfishing | named entities identified by Entity-fishing in articles titles/abstracts | 35,049,832 |
| http://ns.inria.fr/covid19/graph/entityfishing/body | named entities identified by Entity-fishing in articles bodies | 1,156,611,321 |
| http://ns.inria.fr/covid19/graph/bioportal-annotator | named entities identified by Bioportal Annotator in articles titles/abstracts | 104,430,547 |
| http://ns.inria.fr/covid19/graph/dbpedia-spotlight | named entities identified by DBpedia Spotlight in articles titles/abstracts | 65,359,664 |
| http://ns.inria.fr/covid19/graph/acta | argumentative components and PICO elements extracted by ACTA from articles titles/abstracts | 7,469,234 |
| Total | | 1,361,451,364 |
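To restrict a query to one of these named graphs, its URI goes in a GRAPH clause. The sketch below is a hypothetical helper that builds a triple-count query for a given graph. The graph names are taken from the list above, and the count query is only an illustration that may be expensive on the largest graphs.

```python
GRAPH_BASE = "http://ns.inria.fr/covid19/graph/"

# Local names of the named graphs listed above.
GRAPH_NAMES = [
    "metadata",
    "articles",
    "entityfishing",
    "entityfishing/body",
    "bioportal-annotator",
    "dbpedia-spotlight",
    "acta",
]


def graph_count_query(name: str) -> str:
    """Return a SPARQL query counting the triples of one named graph."""
    if name not in GRAPH_NAMES:
        raise ValueError(f"unknown named graph: {name}")
    return (
        "SELECT (COUNT(*) AS ?triples)\n"
        f"WHERE {{ GRAPH <{GRAPH_BASE}{name}> {{ ?s ?p ?o }} }}"
    )


print(graph_count_query("acta"))
```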

The example query below retrieves pairs of distinct articles that have been annotated with at least one common Wikidata entity, and returns at most 10 results.

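# Prefix declarations with the usual namespace IRIs of these vocabularies,
# in case the endpoint does not predefine them.
prefix fabio:  <http://purl.org/spar/fabio/>
prefix dct:    <http://purl.org/dc/terms/>
prefix oa:     <http://www.w3.org/ns/oa#>
prefix schema: <http://schema.org/>

select ?uri ?title1 ?title2
where {
  # Two distinct research papers and their titles
  graph <http://ns.inria.fr/covid19/graph/articles> {
    ?paper1 a fabio:ResearchPaper; dct:title ?title1.
    ?paper2 a fabio:ResearchPaper; dct:title ?title2.
    filter (?paper1 != ?paper2)
  }

  # One annotation per paper, both with the same entity ?uri as body
  graph <http://ns.inria.fr/covid19/graph/entityfishing> {
    ?a1 a oa:Annotation;
        schema:about ?paper1;
        oa:hasBody ?uri.
    ?a2 a oa:Annotation;
        schema:about ?paper2;
        oa:hasBody ?uri.
  }
} limit 10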
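When such a query is sent by a program, the endpoint can typically return its results in the SPARQL 1.1 Query Results JSON Format. The sketch below turns such a response into a list of plain dictionaries, one per result row. The helper name `rows_from_sparql_json` is hypothetical, and the sample payload is made up for illustration. It is not actual output of the endpoint.

```python
import json


def rows_from_sparql_json(payload: str) -> list[dict[str, str]]:
    """Convert a SPARQL JSON result document into a list of {variable: value} rows."""
    doc = json.loads(payload)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given row, so skip missing keys.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Illustrative payload only: the URI and titles below are placeholders.
SAMPLE = json.dumps({
    "head": {"vars": ["uri", "title1", "title2"]},
    "results": {"bindings": [{
        "uri": {"type": "uri", "value": "http://example.org/entity"},
        "title1": {"type": "literal", "value": "Title A"},
        "title2": {"type": "literal", "value": "Title B"},
    }]},
})

rows = rows_from_sparql_json(SAMPLE)
print(rows)
```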

License

See the LICENSE file.

Cite this work

When including Covid-on-the-Web data in a publication or redistribution, please cite this paper:

Franck Michel, Fabien Gandon, Valentin Ah-Kane, Anna Bobasheva, Elena Cabrio, Olivier Corby, Raphaël Gazzotti, Alain Giboin, Santiago Marro, Tobias Mayer, Mathieu Simon, Serena Villata, Marco Winckler. Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19 Research. International Semantic Web Conference (ISWC), Nov 2020, Athens, Greece.

References

[1] Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R.M., Liu, Z., Merrill, W., Mooney, P., Murdick, D.A., Rishi, D., Sheehan, J., Shen, Z., Stilson, B., Wade, A.D., Wang, K., Wilhelm, C., Xie, B., Raymond, D.M., Weld, D.S., Etzioni, O., & Kohlmeier, S. (2020). CORD-19: The Covid-19 Open Research Dataset. ArXiv, abs/2004.10706.

[2] T. Mayer, E. Cabrio, and S. Villata. ACTA a tool for argumentative clinical trial analysis. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 6551–6553, 2019.