Covid-on-the-Web Dataset

Covid-on-the-Web Dataset is an RDF dataset that provides two main knowledge graphs produced by analyzing the scholarly articles of the COVID-19 Open Research Dataset (CORD-19) [1], a resource of articles about COVID-19 and the coronavirus family of viruses:

the CORD-19 Named Entities Knowledge Graph describes named entities identified and disambiguated by NCBO BioPortal annotator, Entity-fishing and DBpedia Spotlight.
the CORD-19 Argumentative Knowledge Graph describes argumentative components and PICO elements (Patient/Population/Problem, Intervention, Comparison, Outcome) extracted from the articles by the Argumentative Clinical Trial Analysis platform (ACTA).

A description of the dataset in the Turtle format, as well as examples, is provided in the dataset directory.

Covid-on-the-Web Dataset is an initiative of the Wimmics team, I3S laboratory, University Côte d'Azur, Inria, CNRS.

Covid-on-the-Web Dataset v1.2 is based on CORD-19 v47.

Documentation

CORD-19 Named Entities Knowledge Graph (CORD19-NEKG)

To identify and disambiguate named entities, we used DBpedia Spotlight (links to DBpedia), Entity-fishing (links to Wikidata), and NCBO BioPortal annotator (links to ontologies in BioPortal).

Named entities were identified primarily in the articles' titles and abstracts. Entity-fishing was also used to process the articles' bodies.

The table below shows the total number of named entities extracted by each tool, as well as the corresponding number of unique URIs.

	DBpedia	Wikidata	Bioportal	Total
No. named entities	4,084,979	66,098,777	42,972,551	113,156,307
No. unique URIs	63,750	252,150	429,755	745,655

CORD-19 Argumentative Knowledge Graph (CORD19-AKG)

To extract argumentative components (claims and evidence) and PICO elements, we used the Argumentative Clinical Trial Analysis platform (ACTA) [2].

Argumentative components and PICO elements were extracted from the articles' abstracts.

	ACTA
No. argumentative components	119,053
No. PICO elements linked to UMLS concepts	515,590
No. unique UMLS concepts	31,841

URIs naming scheme

The Covid-on-the-Web namespace is http://ns.inria.fr/covid19/. All URIs are dereferenceable.

The dataset itself is identified by the URI http://ns.inria.fr/covid19/covidontheweb-1-2. It comes with DCAT and VoID descriptions. All articles, annotations and arguments are linked back to the dataset with the property rdfs:isDefinedBy.

Article URIs are formatted as http://ns.inria.fr/covid19/paper_id where paper_id may be either the article SHA hash or its PMC identifier. Parts of an article (title, abstract and body) are also identified by URIs so that annotations of named entities can link back to the part they belong to. These URIs are formatted as:

http://ns.inria.fr/covid19/paper_id#title
http://ns.inria.fr/covid19/paper_id#abstract
http://ns.inria.fr/covid19/paper_id#body_text
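
The naming scheme above can be sketched in Python as follows. This is only an illustration: the function name article_uris is hypothetical and not part of the dataset tooling, and the paper_id is assumed to be already known (here, the SHA hash of one article).

```python
NAMESPACE = "http://ns.inria.fr/covid19/"


def article_uris(paper_id: str) -> dict:
    """Return the URI of an article and the URIs of its title, abstract and body.

    paper_id is assumed to be the identifier used in the dataset
    (SHA hash or PMC identifier); it is not validated here.
    """
    article = NAMESPACE + paper_id
    return {
        "article": article,
        "title": article + "#title",
        "abstract": article + "#abstract",
        "body_text": article + "#body_text",
    }


uris = article_uris("d53508d43264f59007fd5e4aa8b4af026edf0bfe")
print(uris["abstract"])
# prints http://ns.inria.fr/covid19/d53508d43264f59007fd5e4aa8b4af026edf0bfe#abstract
```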

Downloading and SPARQL Querying

The dataset is downloadable as a set of RDF dumps (in Turtle syntax) from Zenodo:

It can also be queried through our Virtuoso OS SPARQL endpoint https://covidontheweb.inria.fr/sparql.

You may use the Faceted Browser to look up text or URIs. As an example, you can look up article http://ns.inria.fr/covid19/d53508d43264f59007fd5e4aa8b4af026edf0bfe. Further details about how named entities are represented in RDF are given in the Data Modeling section.

The following named graphs can be queried from our SPARQL endpoint:

Named graph	Description	No. RDF triples
http://ns.inria.fr/covid19/graph/metadata	dataset description + definition of a few properties	170
http://ns.inria.fr/covid19/graph/articles	articles metadata (title, authors, DOIs, journal etc.)	3,722,381
http://ns.inria.fr/covid19/graph/entityfishing	named entities identified by Entity-fishing in articles' titles/abstracts	35,049,832
http://ns.inria.fr/covid19/graph/entityfishing/body	named entities identified by Entity-fishing in articles' bodies	1,156,611,321
http://ns.inria.fr/covid19/graph/bioportal-annotator	named entities identified by BioPortal Annotator in articles' titles/abstracts	104,430,547
http://ns.inria.fr/covid19/graph/dbpedia-spotlight	named entities identified by DBpedia Spotlight in articles' titles/abstracts	65,359,664
http://ns.inria.fr/covid19/graph/acta	argumentative components and PICO elements extracted by ACTA from articles' titles/abstracts	7,469,234
Total		1,361,451,364
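
To restrict a query to one of the named graphs listed above, the graph URI goes in a graph clause. The Python sketch below builds such a query as a string, here a simple triple count. The helper name count_triples_query is hypothetical, and counting the largest graphs may be slow or time out depending on the endpoint configuration.

```python
GRAPH_BASE = "http://ns.inria.fr/covid19/graph/"


def count_triples_query(graph_suffix: str) -> str:
    """Build a SPARQL query that counts the triples of one named graph.

    graph_suffix is the last part of the graph URI, e.g. "metadata"
    or "entityfishing/body".
    """
    return (
        "select (count(*) as ?triples)\n"
        "where {\n"
        f"  graph <{GRAPH_BASE}{graph_suffix}> {{ ?s ?p ?o }}\n"
        "}"
    )


print(count_triples_query("metadata"))
```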

The example query below retrieves pairs of articles that have been annotated with at least one common Wikidata entity, limited to 10 results.

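# Prefix declarations, needed by clients or endpoints that do not predefine them
prefix dct:    <http://purl.org/dc/terms/>
prefix fabio:  <http://purl.org/spar/fabio/>
prefix oa:     <http://www.w3.org/ns/oa#>
prefix schema: <http://schema.org/>

select ?uri ?title1 ?title2
where {
  # Two distinct research papers and their titles
  graph <http://ns.inria.fr/covid19/graph/articles> {
    ?paper1 a fabio:ResearchPaper; dct:title ?title1.
    ?paper2 a fabio:ResearchPaper; dct:title ?title2.
    filter (?paper1 != ?paper2)
  }

  # Annotations of both papers whose body is the same entity URI
  graph <http://ns.inria.fr/covid19/graph/entityfishing> {
    ?a1 a oa:Annotation;
        schema:about ?paper1;
        oa:hasBody ?uri.
    ?a2 a oa:Annotation;
        schema:about ?paper2;
        oa:hasBody ?uri.
  }
} limit 10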
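
A query such as the one above can also be submitted programmatically. The following is a minimal Python sketch using only the standard library and the standard SPARQL Protocol (a GET request with a query parameter, asking for JSON results). The function names build_request and run_query are hypothetical, and the sketch assumes the endpoint accepts GET requests and serves the application/sparql-results+json format.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://covidontheweb.inria.fr/sparql"


def build_request(query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query: str, timeout: float = 60.0) -> list:
    """Send the query and return the list of result bindings.

    Each binding maps a variable name to a dict with "type" and "value" keys,
    as defined by the SPARQL JSON results format.
    """
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        results = json.load(resp)
    return results["results"]["bindings"]
```

With the query text stored in a string, run_query(query_text) returns one dictionary per result row, and a title can be read with row["title1"]["value"]. If the endpoint does not predefine the prefixes used in a query, they have to be declared explicitly at the top of the query text.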

License

See the LICENSE file.

Cite this work

When including Covid-on-the-Web data in a publication or redistribution, please cite this paper:

Franck Michel, Fabien Gandon, Valentin Ah-Kane, Anna Bobasheva, Elena Cabrio, Olivier Corby, Raphaël Gazzotti, Alain Giboin, Santiago Marro, Tobias Mayer, Mathieu Simon, Serena Villata, Marco Winckler. Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19 Research. International Semantic Web Conference (ISWC), Nov 2020, Athens, Greece.

References

[1] Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R.M., Liu, Z., Merrill, W., Mooney, P., Murdick, D.A., Rishi, D., Sheehan, J., Shen, Z., Stilson, B., Wade, A.D., Wang, K., Wilhelm, C., Xie, B., Raymond, D.M., Weld, D.S., Etzioni, O., & Kohlmeier, S. (2020). CORD-19: The Covid-19 Open Research Dataset. ArXiv, abs/2004.10706.

[2] T. Mayer, E. Cabrio, and S. Villata. ACTA a tool for argumentative clinical trial analysis. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 6551–6553, 2019.