DRSM-corpus

An annotated literature corpus for NLP studies of 'Disease Research State' based on different categories of research (DRSM stands for Disease Research State Model). This corpus reflects the current state of our manually curated research data for this task, combined with curation instructions and details about the curation process.

How was this dataset collected? How has it been processed?

The corpus was generated by manually curating titles and abstracts of primary research papers that were retrieved from the CZIF's knowledge graph by searching for disease names and synonyms. The classification scheme was devised in house in consultation with experts from external ontologies, rare disease organizations, and drug companies, as well as other CZI team members (and is undergoing revision as we progress with this work). Curation was performed by members of an internal CZIF biocuration team.

Status: Version 1 of this curation work is now complete. Note that this project is under development and should be considered Unstable (Early, active development, and may lack sufficient end-user documentation, assistance, etc., for anything other than the earliest adopters).

V1 Corpus

We provide access to a corpus of primary research articles expressed as several *.tsv files:

We use the service provided by Centaur Labs to scale up curation for these categories. This provides a dataset with the following columns.

At present, this dataset consists of 1,144 'Gold Standard' articles labeled by our in-house curation team and 16,951 articles labeled by Centaur Labs annotators. This provides a corpus of 18,174 rare-disease primary research articles labeled for relevance and the type of research.

Provenance / Additional Data Files

We perform in-house curation to define an 'initial_gold standard' set with the following columns:

The codes are intended to reflect the foci of the paper in terms of the primary research being performed.

See this wiki page for the latest categorization used to denote different classes of disease research paper.

Note: due to the complexity of this model, we are restricting ourselves to a subset of categories in our initial work; see this wiki page.

We include all available curated data for provenance and transparency

We provide access to all curated data being used. This includes data taken across multiple curators within a team, filtered for consensus, and then checked and edited by a senior curator. This data is available as a *.tsv file (labeled 'raw_data'), with the same columns as above with three additional data columns:

V2 Corpus - Specialized Subtypes of Paper

We developed a model to determine if a given research study belongs to a broader, specialized type of paper. The types of these papers include the following categories judged to be of high priority to rare disease research:

We are currently working through datasets for each of these categories to support the development of specialized classifiers that can recognize these types of papers from their titles and abstracts alone. We have completed the data for Quality of Life studies, as shown.

The annotation schema we use for these studies conforms to the following basic design:

| Code | Explanation |
|------|-------------|
| -1 | The paper is not a primary experimental study in disease |
| 0 | The study does not directly investigate the phenomena of interest |
| 1 | The study investigates the phenomena of interest but not as its primary contribution |
| 2 | The study's primary contribution centers on investigating the phenomena of interest |
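
As a rough illustration, the four codes above could be represented in Python as shown here. This is only a sketch: the names `StudyCode` and `investigates_phenomena` are hypothetical and are not part of the corpus or of any released tooling.

```python
from enum import IntEnum


class StudyCode(IntEnum):
    """Annotation codes from the schema table; member names are illustrative only."""

    NOT_PRIMARY_STUDY = -1  # not a primary experimental study in disease
    NOT_INVESTIGATED = 0    # does not directly investigate the phenomena of interest
    SECONDARY_FOCUS = 1     # investigates the phenomena, but not as its primary contribution
    PRIMARY_FOCUS = 2       # primary contribution centers on the phenomena of interest


def investigates_phenomena(code: int) -> bool:
    """Return True for codes 1 and 2, where the study investigates the phenomena at all."""
    return StudyCode(code) in (StudyCode.SECONDARY_FOCUS, StudyCode.PRIMARY_FOCUS)
```

Calling `StudyCode(value)` raises `ValueError` for anything outside the four documented codes, which may help catch malformed labels early.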

The structure of the data is as shown below:

| Column | Definition |
|--------|------------|
| PMID | The PubMed ID of the annotated paper |
| Labeling_State | Gold_Standard or Labeled, indicating whether the paper was annotated in-house by CZI staff or by Centaur Labs annotators |
| Correct_Label | The correct label for this document |
| Agreement | The agreement score generated by Centaur Labs curators |
| Title | The title of the paper |
| Abstract | The abstract of the paper |
| vector | A 4-value vector denoting the different weights for different categories generated by the Centaur Labs annotation process |
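
For readers who want to work with these columns, the following sketch shows one way to read such a tab-separated file with Python's standard `csv` module and split it by `Labeling_State`. The sample rows are made-up placeholders, not real corpus entries, and the helper names `read_rows` and `split_by_labeling_state` are hypothetical. The `vector` column is kept as an opaque string because its on-disk format is not specified here.

```python
import csv
import io

# Placeholder rows for illustration only; PMIDs, scores, and text are invented.
SAMPLE_TSV = (
    "PMID\tLabeling_State\tCorrect_Label\tAgreement\tTitle\tAbstract\tvector\n"
    "00000001\tGold_Standard\t2\t1.0\tExample title A\tExample abstract A\t[0, 0, 0, 1]\n"
    "00000002\tLabeled\t0\t0.8\tExample title B\tExample abstract B\t[0, 1, 0, 0]\n"
    "00000003\tLabeled\t-1\t0.6\tExample title C\tExample abstract C\t[1, 0, 0, 0]\n"
)


def read_rows(handle):
    """Read tab-separated rows into dicts, converting Correct_Label to int."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Correct_Label"] = int(row["Correct_Label"])
        rows.append(row)
    return rows


def split_by_labeling_state(rows):
    """Separate in-house gold-standard rows from externally labeled rows."""
    gold = [r for r in rows if r["Labeling_State"] == "Gold_Standard"]
    labeled = [r for r in rows if r["Labeling_State"] == "Labeled"]
    return gold, labeled


rows = read_rows(io.StringIO(SAMPLE_TSV))
gold, labeled = split_by_labeling_state(rows)
print(len(gold), len(labeled))  # prints "1 2" for the placeholder rows above
```

To use this with a downloaded file, replace the `io.StringIO` object with a file opened via `open(path, newline="", encoding="utf-8")`, where `path` points to the relevant `*.tsv` file.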

Code of Conduct

This project adheres to the Contributor Covenant code of conduct, described in more detail here: CODE_OF_CONDUCT.md. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.

Primary Contact

Please direct any questions or feedback for this work to Gully Burns (CZIF Research Scientist) at gully.burns@chanzuckerberg.com