BLUE, the Biomedical Language Understanding Evaluation benchmark

***** New Aug 13th, 2019: Change DDI metric from micro-F1 to macro-F1 *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the NCBI_BERT models.

***** New June 17th, 2019: data in BERT format *****

We uploaded some datasets that are ready to be used with the NCBI BlueBERT code.

Introduction

The BLUE benchmark consists of five biomedical text-mining tasks with ten corpora. We rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedical text-mining challenges.

Tasks

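| Corpus          | Train  | Dev   | Test  | Task                    | Metrics  | Domain     |
|-----------------|--------|-------|-------|-------------------------|----------|------------|
| MedSTS          | 675    | 75    | 318   | Sentence similarity     | Pearson  | Clinical   |
| BIOSSES         | 64     | 16    | 20    | Sentence similarity     | Pearson  | Biomedical |
| BC5CDR-disease  | 4182   | 4244  | 4424  | NER                     | F1       | Biomedical |
| BC5CDR-chemical | 5203   | 5347  | 5385  | NER                     | F1       | Biomedical |
| ShARe/CLEFE     | 4628   | 1075  | 5195  | NER                     | F1       | Clinical   |
| DDI             | 2937   | 1004  | 979   | Relation extraction     | macro F1 | Biomedical |
| ChemProt        | 4154   | 2416  | 3458  | Relation extraction     | micro F1 | Biomedical |
| i2b2-2010       | 3110   | 11    | 6293  | Relation extraction     | F1       | Clinical   |
| HoC             | 1108   | 157   | 315   | Document classification | F1       | Biomedical |
| MedNLI          | 11232  | 1395  | 1422  | Inference               | accuracy | Clinical   |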

Sentence similarity

BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain. We randomly select 80% for training and 20% for testing because there are no standard splits in the released data.
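A random 80/20 split like the one described for BIOSSES can be sketched as follows. This is a minimal illustration, not the script used to produce the benchmark split: the function name `split_train_test`, the seed, and the placeholder pair IDs are all assumptions, so the resulting partition will not match the released data.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and return (train, test).

    The seed is a hypothetical choice; the benchmark's own seed is not
    stated in this README.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder IDs standing in for the 100 BIOSSES sentence pairs.
pair_ids = list(range(100))
train, test = split_train_test(pair_ids)
print(len(train), len(test))  # 80 20
```

Fixing the seed keeps the split reproducible across runs, which matters when no standard split exists.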

MedSTS is a corpus of sentence pairs selected from Mayo Clinic's clinical data warehouse. Please visit the website to obtain a copy of the dataset. We use the standard training and testing sets in the shared task.
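Both sentence-similarity corpora are scored with Pearson correlation between gold and predicted similarity scores. The sketch below computes it with the standard library only; the function name and the example scores are made up for illustration and are not taken from either corpus.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold and predicted similarity scores for four sentence pairs.
gold_scores = [0.0, 1.5, 3.0, 4.5]
predicted_scores = [0.5, 1.0, 3.5, 4.0]
print(round(pearson(gold_scores, predicted_scores), 3))
```

Pearson correlation is invariant to the scale and offset of the predictions, so a model is rewarded for ranking and spacing pairs consistently rather than for matching the absolute score range.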

Named entity recognition

BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task. We use the standard training and test sets in the BC5CDR shared task.

The ShARe/CLEF eHealth Task 1 Corpus is a collection of 299 deidentified clinical free-text notes from the MIMIC II database. Please visit the website to obtain a copy of the dataset. We use the standard training and test sets in ShARe/CLEF eHealth Task 1.

Relation extraction

The DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and 233 other Medline abstracts. In our benchmark, we use 624 training files and 191 test files to evaluate performance and report the macro-average F1-score of the four DDI types.
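Macro-averaging computes an F1-score per DDI type and then takes the unweighted mean, so rare types count as much as frequent ones. The sketch below illustrates this under assumptions: the label strings (`advise`, `effect`, `mechanism`, `int`, and `none` for the negative class) and the toy predictions are chosen for illustration and may differ from the labels in the released files.

```python
def f1_for_label(gold, pred, label):
    """F1-score for one label, treating every other label as negative."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1-scores over `labels`."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Assumed label names for the four DDI types; "none" marks no interaction
# and is excluded from the average.
DDI_TYPES = ["advise", "effect", "mechanism", "int"]
gold = ["advise", "effect", "none", "mechanism", "int", "none"]
pred = ["advise", "none", "none", "mechanism", "int", "effect"]
print(macro_f1(gold, pred, DDI_TYPES))  # 0.75
```

In the toy data, three types are predicted perfectly and `effect` is missed entirely, so the macro average is (1 + 0 + 1 + 1) / 4.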

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions and was used in the BioCreative VI text mining chemical-protein interactions shared task. We use the standard training and test sets in the ChemProt shared task and evaluate the same five classes: CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9.
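The Tasks table lists micro F1 as the ChemProt metric. Micro-averaging pools true positives, predictions, and gold instances across the five evaluated classes before computing a single precision and recall. The sketch below is an illustration only: the function name, the `false` label for the negative class, and the toy predictions are assumptions.

```python
def micro_f1(gold, pred, labels):
    """Micro-averaged F1 over `labels`; other labels count as negatives."""
    labels = set(labels)
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g in labels)
    n_pred = sum(1 for p in pred if p in labels)  # tp + fp
    n_gold = sum(1 for g in gold if g in labels)  # tp + fn
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# The five evaluated classes named in the text above.
CPR_CLASSES = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]
# "false" is a hypothetical label for pairs with no evaluated relation.
gold = ["CPR:3", "CPR:4", "false", "CPR:9", "CPR:5"]
pred = ["CPR:3", "CPR:9", "CPR:4", "CPR:9", "false"]
print(micro_f1(gold, pred, CPR_CLASSES))  # 0.5
```

Here two of four positive predictions are correct and two of four gold relations are recovered, so precision, recall, and F1 are all 0.5.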

The i2b2 2010 shared task collection consists of 170 documents for training and 256 documents for testing, which is a subset of the original dataset. The dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments.

Document multilabel classification

HoC (the Hallmarks of Cancers corpus) consists of 1,580 PubMed abstracts annotated with ten currently known hallmarks of cancer. We use 315 (~20%) abstracts for testing and the remaining abstracts for training. For the HoC task, we follow common practice and report the example-based F1-score at the abstract level.
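Example-based metrics score each abstract's predicted label set against its gold label set and then average over abstracts. The sketch below shows one common formulation, which averages per-abstract precision and recall and then combines them into F1. This averaging order, the handling of empty label sets, and the placeholder label names are assumptions; the benchmark's own evaluation script may differ.

```python
def example_based_f1(gold_sets, pred_sets):
    """Example-based F1 for multilabel classification.

    Precision and recall are computed per example and averaged; F1 is the
    harmonic mean of the two averages. An empty predicted (or gold) set
    contributes 0.0 precision (or recall), which is an assumption.
    """
    precisions, recalls = [], []
    for gold, pred in zip(gold_sets, pred_sets):
        overlap = len(gold & pred)
        precisions.append(overlap / len(pred) if pred else 0.0)
        recalls.append(overlap / len(gold) if gold else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder hallmark labels for two abstracts.
gold_sets = [{"hallmark_A", "hallmark_B"}, {"hallmark_C"}]
pred_sets = [{"hallmark_A"}, {"hallmark_C", "hallmark_D"}]
print(example_based_f1(gold_sets, pred_sets))  # 0.75
```

An alternative convention averages the per-abstract F1-scores directly; the two can give different numbers, so it is worth checking which one a reported score uses.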

Inference task

MedNLI is a collection of sentence pairs selected from MIMIC-III. We use the same training, development, and test sets as Romanov and Shivade.

Datasets

Some datasets can be downloaded at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1

Baselines

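| Corpus          | Metrics | SOTA* | ELMo | BioBERT | NCBI_BERT(base) (P) | NCBI_BERT(base) (P+M) | NCBI_BERT(large) (P) | NCBI_BERT(large) (P+M) |
|-----------------|---------|-------|------|---------|---------------------|-----------------------|----------------------|------------------------|
| MedSTS          | Pearson | 83.6  | 68.6 | 84.5    | 84.5                | 84.8                  | 84.6                 | 83.2                   |
| BIOSSES         | Pearson | 84.8  | 60.2 | 82.7    | 89.3                | 91.6                  | 86.3                 | 75.1                   |
| BC5CDR-disease  | F       | 84.1  | 83.9 | 85.9    | 86.6                | 85.4                  | 82.9                 | 83.8                   |
| BC5CDR-chemical | F       | 93.3  | 91.5 | 93.0    | 93.5                | 92.4                  | 91.7                 | 91.1                   |
| ShARe/CLEFE     | F       | 70.0  | 75.6 | 72.8    | 75.4                | 77.1                  | 72.7                 | 74.4                   |
| DDI             | F       | 72.9  | 62.0 | 78.8    | 78.1                | 79.4                  | 79.9                 | 76.3                   |
| ChemProt        | F       | 64.1  | 66.6 | 71.3    | 72.5                | 69.2                  | 74.4                 | 65.1                   |
| i2b2 2010       | F       | 73.7  | 71.2 | 72.2    | 74.4                | 76.4                  | 73.3                 | 73.9                   |
| HoC             | F       | 81.5  | 80.0 | 82.9    | 85.3                | 83.1                  | 87.3                 | 85.3                   |
| MedNLI          | acc     | 73.5  | 71.4 | 80.5    | 82.2                | 84.0                  | 81.5                 | 83.8                   |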

P: PubMed, P+M: PubMed + MIMIC-III

*SOTA: state of the art as of April 2019, to the best of our knowledge

Fine-tuning with ELMo

We adopted the ELMo model pre-trained on PubMed abstracts to accomplish the BLUE tasks. The ELMo embedding of each token is used as input to the fine-tuning model. We retrieved the output states of both layers in ELMo and concatenated them into one vector for each word. We padded sequences to a maximum length of 128. The learning rate was set to 0.001 with an Adam optimizer. We trained for up to 20 epochs with batch size 64 and stopped early if the training loss did not decrease.
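The settings above can be collected into a small configuration, with helpers for the layer concatenation, padding, and early-stopping steps. This is a framework-free sketch, not the training code behind the reported results: every name is hypothetical, zero-padding and truncation of longer inputs are assumptions, and the early-stopping rule assumes a patience of one epoch, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElmoFineTuneConfig:
    """Hyperparameters as stated in the text; field names are hypothetical."""
    max_seq_length: int = 128
    learning_rate: float = 0.001
    optimizer: str = "adam"
    epochs: int = 20
    batch_size: int = 64


def concat_layers(layer1, layer2):
    """Join the two ELMo layer vectors of each token into one vector."""
    return [v1 + v2 for v1, v2 in zip(layer1, layer2)]


def pad_or_truncate(vectors, max_len, dim):
    """Zero-pad to `max_len` tokens; truncating longer inputs is assumed."""
    clipped = vectors[:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding


def should_stop_early(train_losses):
    """True if the latest training loss did not decrease (patience of 1)."""
    return len(train_losses) >= 2 and train_losses[-1] >= train_losses[-2]


config = ElmoFineTuneConfig()
# Toy 2-dimensional layer outputs for a 3-token sentence.
layer1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
layer2 = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]
token_vectors = concat_layers(layer1, layer2)
padded = pad_or_truncate(token_vectors, config.max_seq_length, dim=4)
print(len(padded), len(padded[0]))  # 128 4
```

In a real setup the per-token vectors would come from the pre-trained ELMo model and the padded batch would feed a task-specific network trained with the configured optimizer.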

Fine-tuning with BERT

Please see https://github.com/ncbi-nlp/ncbi_bluebert.

Citing BLUE

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: 
               An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
}

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. It was also supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making the data and code publicly available. We would like to thank Geeticka Chauhan for providing thoughtful comments.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.