
BLUE Benchmark with Transformers

***** New May 14th, 2020: ouBioBERT (full) is released *****
***** New April 15th, 2020: released *****


Thank you for your interest in our research!
The biomedical language understanding evaluation (BLUE) benchmark is a collection of resources for evaluating and analyzing biomedical natural language representation models (Peng et al., 2019).
This repository provides our implementation of fine-tuning for the BLUE benchmark with 🤗/Transformers.
Our demonstration models are available now.

Preparations

  1. Download the benchmark dataset from https://github.com/ncbi-nlp/BLUE_Benchmark
  2. Save the pre-trained models in your directory: for example, BioBERT, clinicalBERT, SciBERT, BlueBERT, and so on.
  3. Try our code in utils. Example commands can be found in scripts; a generic fine-tuning sketch also follows below.
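
The fine-tuning itself follows the standard Transformers sequence-classification recipe. The sketch below is only a generic illustration, not the repository's actual scripts: the model name, label mapping, and toy sentence pairs are placeholders, and it assumes a recent version of transformers and PyTorch.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Placeholder model name; substitute any of the pre-trained models mentioned above.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Toy premise/hypothesis pairs with made-up NLI-style labels.
pairs = [("The patient denies chest pain.", "The patient has chest pain."),
         ("The left atrium is dilated.", "The left atrium is enlarged.")]
labels = torch.tensor([0, 1])  # e.g. 0 = contradiction, 1 = entailment

batch = tokenizer([p for p, _ in pairs], [h for _, h in pairs],
                  padding=True, truncation=True, max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss  # one training step on the toy batch
loss.backward()
optimizer.step()
```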

Tips

If you download TensorFlow models, converting them into PyTorch checkpoints makes fine-tuning easier.
Converting Tensorflow Checkpoints

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

transformers-cli convert --model_type bert \
  --tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt \
  --config $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
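
A quick way to check that the conversion worked is to load the PyTorch weights back with Transformers. A minimal sketch (the paths follow the example above; since the TensorFlow release ships bert_config.json rather than the config.json that Transformers looks for, the config is loaded explicitly here):

```python
from transformers import BertConfig, BertModel

model_dir = "/path/to/bert/uncased_L-12_H-768_A-12"

# Load the original BERT config and the converted PyTorch weights.
config = BertConfig.from_json_file(f"{model_dir}/bert_config.json")
model = BertModel.from_pretrained(f"{model_dir}/pytorch_model.bin", config=config)

print(model.config.hidden_size)  # 768 for a BERT-Base model
```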

Our models

| Abbr. | Corpus | Words | Size | Domain |
|:-----:|:-------|------:|-----:|:-------|
| enW | English Wikipedia | 2,200M | 13GB | General |
| B | BooksCorpus | 850M | 5GB | General |
| sP | Small PubMed abstracts | 30M | 0.2GB | BioMedical |
| fP | Focused PubMed abstracts | 280M | 1.8GB | BioMedical |
| oP | Other PubMed abstracts | 2,800M | 18GB | BioMedical |

Table: List of the text corpora used for our models.

Results

| Model | Total | MedSTS | BIOSSES | BC5CDR disease | BC5CDR chemical | ShARe CLEFE | DDI | ChemProt | i2b2 | HoC | MedNLI |
|:------|------:|-------:|--------:|---------------:|----------------:|------------:|----:|---------:|-----:|----:|-------:|
| BERT (sP+B+enW) | 81.4 | 83.2 | 89.7 | 85.7 | 91.8 | 79.1 | 78.4 | 67.5 | 73.1 | 85.3 | 80.1 |
| BERT-BASE | 54.8 | 52.1 | 34.9 | 66.5 | 76.7 | 56.1 | 35.3 | 29.8 | 51.1 | 78.2 | 67.0 |
| BioBERT (v1.1) | 82.9 | 85.0 | 90.9 | 85.8 | 93.2 | 76.9 | 80.9 | 73.2 | 74.2 | 85.9 | 83.1 |
| clinicalBERT | 81.2 | 82.7 | 88.0 | 84.6 | 92.5 | 78.0 | 76.9 | 67.6 | 74.3 | 86.1 | 81.4 |
| SciBERT | 82.0 | 84.0 | 85.5 | 85.9 | 92.7 | 77.7 | 80.1 | 71.9 | 73.3 | 85.9 | 83.2 |
| BlueBERT (P) | 82.9 | 85.3 | 88.5 | 86.2 | 93.5 | 77.7 | 81.2 | 73.5 | 74.2 | 86.2 | 82.7 |
| BlueBERT (P+M) | 81.8 | 84.4 | 85.2 | 84.6 | 92.2 | 79.5 | 79.3 | 68.8 | 75.7 | 85.2 | 82.8 |

Table: BLUE scores of BERT (sP + B + enW) compared with those of all the BERT-Base variants for the biomedical domain as of April 2020.
Bold indicates the best result of all.

| Model | Total | MedSTS | BIOSSES | BC5CDR disease | BC5CDR chemical | ShARe CLEFE | DDI | ChemProt | i2b2 | HoC | MedNLI |
|:------|------:|-------:|--------:|---------------:|----------------:|------------:|----:|---------:|-----:|----:|-------:|
| ouBioBERT | 83.8<br>(0.3) | 84.9<br>(0.6) | 92.3<br>(0.8) | 87.4<br>(0.1) | 93.7<br>(0.2) | 80.1<br>(0.4) | 81.1<br>(1.5) | 75.0<br>(0.3) | 74.0<br>(0.8) | 86.4<br>(0.5) | 83.6<br>(0.7) |
| BioBERT (v1.1) | 82.8<br>(0.1) | 84.9<br>(0.5) | 89.3<br>(1.7) | 85.7<br>(0.4) | 93.3<br>(0.1) | 78.0<br>(0.8) | 80.4<br>(0.4) | 73.3<br>(0.4) | 74.5<br>(0.6) | 85.8<br>(0.6) | 82.9<br>(0.7) |
| BlueBERT (P) | 82.9<br>(0.1) | 84.8<br>(0.5) | 90.3<br>(2.0) | 86.2<br>(0.4) | 93.3<br>(0.3) | 78.3<br>(0.4) | 80.7<br>(0.6) | 73.5<br>(0.5) | 73.9<br>(0.8) | 86.3<br>(0.7) | 82.1<br>(0.8) |
| BlueBERT (P+M) | 81.6<br>(0.5) | 84.6<br>(0.8) | 82.0<br>(5.1) | 84.7<br>(0.3) | 92.3<br>(0.1) | 79.9<br>(0.4) | 78.8<br>(0.8) | 68.6<br>(0.5) | 75.8<br>(0.3) | 85.0<br>(0.4) | 83.9<br>(0.8) |

Table: Performance of ouBioBERT on the BLUE tasks.
The numbers are the mean (standard deviation) over five different random seeds.
The best scores are in bold.



BLUE Tasks

| Corpus | Train | Dev | Test | Task | Metrics | Domain |
|:-------|------:|----:|-----:|:-----|:--------|:-------|
| MedSTS | 675 | 75 | 318 | Sentence similarity | Pearson | Clinical |
| BIOSSES | 64 | 16 | 20 | Sentence similarity | Pearson | Biomedical |
| BC5CDR-disease | 4182 | 4244 | 4424 | Named-entity recognition | F1 | Biomedical |
| BC5CDR-chemical | 5203 | 5347 | 5385 | Named-entity recognition | F1 | Biomedical |
| ShARe/CLEFE | 4628 | 1065 | 5195 | Named-entity recognition | F1 | Clinical |
| DDI | 2937 | 1004 | 979 | Relation extraction | micro F1 | Biomedical |
| ChemProt | 4154 | 2416 | 3458 | Relation extraction | micro F1 | Biomedical |
| i2b2-2010 | 3110 | 10 | 6293 | Relation extraction | micro F1 | Clinical |
| HoC | 1108 | 157 | 315 | Document classification | F1 | Biomedical |
| MedNLI | 11232 | 1395 | 1422 | Inference | accuracy | Clinical |

Sentence similarity

MedSTS

[Figure: MedSTS_hist]
MedSTS is a corpus of sentence pairs selected from the clinical data warehouse of Mayo Clinic and was used in the BioCreative/OHNLP Challenge 2018 Task 2 as ClinicalSTS (Wang et al., 2018).
Please visit the website or contact the first author to obtain a copy of the dataset.

BIOSSES

[Figure: BIOSSES_hist]
BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain (Soğancıoğlu et al., 2017).

Known problems

The BIOSSES dataset is very small, so fine-tuning performance on it is unstable.
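
Both sentence-similarity tasks are scored with the Pearson correlation between predicted and gold similarity scores (the Metrics column in the task table above). A minimal sketch using SciPy, with illustrative values:

```python
from scipy.stats import pearsonr

# Gold similarity scores and model predictions for a few sentence pairs (illustrative values).
gold = [4.5, 2.0, 0.5, 3.8, 1.2]
pred = [4.1, 2.4, 0.9, 3.5, 1.0]

r, _ = pearsonr(gold, pred)
print(f"Pearson r = {r:.3f}")
```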

Named-entity recognition

Known problems

There are some irregular tag patterns in these corpora, such as phrases that start with an I tag and I tags that directly follow an O tag (see the per-corpus tables and linked examples below).

conlleval.py appears to count them as different phrases, so we handle this problem with the following method during evaluation:

  1. Example:

     | index  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
     |:------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
     | y_true | O | O | B | I | O |   | O | B | I | O | I | O |   | B | I | I |   | I | I | O |
     | y_pred | O | O | B | I | O |   | O | B | I | O | O | O |   | B | I | I |   | I | O | I |

     (the empty columns correspond to the blank lines between sentences)

  2. Skip the blank lines and concatenate all the tags into a one-dimensional array:

     | index  | 0 | 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 13 | 14 | 15 | 17 | 18 | 19 |
     |:------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
     | y_true | O | O | B | I | O | O | B | I | O | I | O | B | I | I | I | I | O |
     | y_pred | O | O | B | I | O | O | B | I | O | O | O | B | I | I | I | O | I |
  3. Get the token indices of the phrases that start with B: each phrase is recorded as its B index followed by the indices of the subsequent I tags (a sketch of this step follows the metric code below).
  4. Calculate the metrics with utils/metrics/ner.py:
# y_true and y_pred hold the phrase-index strings collected in step 3,
# e.g. y_true = ['2_3', '7_8_10', '13_14_15_17_18']
#      y_pred = ['2_3', '7_8', '13_14_15_17_19']
y_true = set(y_true)
y_pred = set(y_pred)

TP = len(y_true & y_pred)           # 1: {2_3}
FN = len(y_true) - TP               # 2: {7_8_10, 13_14_15_17_18}
FP = len(y_pred) - TP               # 2: {7_8, 13_14_15_17_19}
prec = TP / (TP + FP)               # 1 / (1 + 2) = 0.33
rec = TP / (TP + FN)                # 1 / (1 + 2) = 0.33
fb1 = 2 * rec * prec / (rec + prec) # = 0.33
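
For steps 2 and 3, the sketch below shows one way to build those phrase sets from the tags in the example; it reproduces the sets in the comments above, but it is only an illustration, not the exact code in utils.

```python
def tags_to_phrases(indices, tags):
    """Collect each phrase as its B-token index followed by the indices of all
    subsequent I tokens (up to the next B), joined with underscores."""
    phrases, current = [], None
    for idx, tag in zip(indices, tags):
        if tag == "B":
            if current:
                phrases.append("_".join(current))
            current = [str(idx)]
        elif tag == "I" and current:
            current.append(str(idx))
    if current:
        phrases.append("_".join(current))
    return set(phrases)

# Token indices after skipping the blank lines (step 2 in the example above).
indices = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19]
print(tags_to_phrases(indices, "OOBIOOBIOIOBIIIIO"))  # {'2_3', '7_8_10', '13_14_15_17_18'}
print(tags_to_phrases(indices, "OOBIOOBIOOOBIIIOI"))  # {'2_3', '7_8', '13_14_15_17_19'}
```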

BC5CDR-disease

| tag of tokens | Train | Dev | Test |
|:--------------|------:|----:|-----:|
| starting with B | 4182 | 4244 | 4424 |
| starting with I | 0 | 0 | 0 |
| I next to O | 0 | 0 | 0 |
| Total | 4182 | 4244 | 4424 |

BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task (Li et al., 2016).

BC5CDR-chemical

| tag of tokens | Train | Dev | Test |
|:--------------|------:|----:|-----:|
| starting with B | 5203 | 5347 | 5385 |
| starting with I | 2 | 0 | 1 |
| I next to O | 0 | 0 | 0 |
| Total | 5205 | 5347 | 5386 |

An example of starting with I: test.tsv#L78550-L78598<a id="startingwithi"></a>

  
Compound	10510854	553	O  
7e	-	562	O  
,	-	564	O  
5	-	566	B  
-	-	567	I  
{	-	568	I  
2	-	569	I  
-	-	570	I  
// -------------
1H	-	637	I  
-	-	639	I  
  
indol	10510854	641	I  
-	-	646	I  
2	-	647	I  
-	-	648	I  
one	-	649	I  
,	-	652	O  
// -------------

ShARe/CLEFE

| tag of tokens | Train | Dev | Test |
|:--------------|------:|----:|-----:|
| starting with B | 4628 | 1065 | 5195 |
| starting with I | 6 | 1 | 17 |
| I next to O | 517 | 110 | 411 |
| Total | 5151 | 1176 | 5623 |

ShARe/CLEFE eHealth Task 1 Corpus is a collection of 299 clinical free-text notes from the MIMIC II database (Suominen et al., 2013).
Please visit the website and sign up to obtain a copy of the dataset.
An example of I next to O: Test.tsv#L112-L118<a id="inexttoo"></a>
We recommend checking the original files as well:

The	00176-102920-ECHO_REPORT	426	O
left	-	430	B
atrium	-	435	I
is	-	442	O
moderately	-	445	O
dilated	-	456	I
.	-	463	O

Relation extraction

DDI

| class | Train | Dev | Test | note |
|:------|------:|----:|-----:|:-----|
| DDI-advise | 633 | 193 | 221 | a recommendation or advice regarding a drug interaction is given.<br>e.g. UROXATRAL should not be used in combination with other alpha-blockers. |
| DDI-effect | 1212 | 396 | 360 | DDIs describing an effect or a pharmacodynamic (PD) mechanism.<br>e.g. In uninfected volunteers, 46% developed rash while receiving SUSTIVA and clarithromycin.<br>Chlorthalidone may potentiate the action of other antihypertensive drugs. |
| DDI-int | 146 | 42 | 96 | a DDI appears in the text without providing any additional information.<br>e.g. The interaction of omeprazole and ketoconazole has been established. |
| DDI-mechanism | 946 | 373 | 302 | drug-drug interactions (DDIs) described by their pharmacokinetic (PK) mechanism.<br>e.g. Grepafloxacin may inhibit the metabolism of theobromine. |
| DDI-false | 15842 | 6240 | 4782 | |
| Total | 2937<br>+15842 | 1004<br>+6240 | 979<br>+4782 | |

The DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and 233 other Medline abstracts (Herrero-Zazo et al., 2013).
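
DDI, ChemProt, and i2b2 are evaluated with micro-F1 over the positive relation classes; the negative (false) pairs are still classified but excluded from the average, which is why they are listed separately from the Total rows above. A minimal sketch with scikit-learn, using illustrative labels and predictions:

```python
from sklearn.metrics import f1_score

# Positive DDI classes; "DDI-false" is excluded from the micro average.
positive = ["DDI-advise", "DDI-effect", "DDI-int", "DDI-mechanism"]

# Illustrative gold labels and predictions for a few candidate drug pairs.
y_true = ["DDI-effect", "DDI-false", "DDI-advise", "DDI-false", "DDI-mechanism"]
y_pred = ["DDI-effect", "DDI-advise", "DDI-advise", "DDI-false", "DDI-false"]

micro_f1 = f1_score(y_true, y_pred, labels=positive, average="micro")
print(f"micro F1 = {micro_f1:.2f}")  # 0.67 on this toy example
```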

ChemProt

| class | Train | Dev | Test | note |
|:------|------:|----:|-----:|:-----|
| CPR:3 | 768 | 550 | 665 | UPREGULATOR\|ACTIVATOR\|INDIRECT_UPREGULATOR |
| CPR:4 | 2251 | 1094 | 1661 | DOWNREGULATOR\|INHIBITOR\|INDIRECT_DOWNREGULATOR |
| CPR:5 | 173 | 116 | 195 | AGONIST\|AGONIST-ACTIVATOR\|AGONIST-INHIBITOR |
| CPR:6 | 235 | 199 | 293 | ANTAGONIST |
| CPR:9 | 727 | 457 | 644 | SUBSTRATE\|PRODUCT_OF\|SUBSTRATE_PRODUCT_OF |
| false | 15306 | 9404 | 13485 | |
| Total | 4154<br>+15306 | 2416<br>+9404 | 3458<br>+13485 | |

ChemProt comprises 1,820 PubMed abstracts with chemical–protein interactions and was used in the BioCreative VI text mining chemical–protein interactions shared task (Krallinger et al., 2017).

i2b2 2010

| class | Train | Dev | Test | note |
|:------|------:|----:|-----:|:-----|
| PIP | 755 | 0 | 1448 | Medical problem indicates medical problem. |
| TeCP | 158 | 8 | 338 | Test conducted to investigate medical problem. |
| TeRP | 993 | 0 | 2060 | Test reveals medical problem. |
| TrAP | 883 | 2 | 1732 | Treatment is administered for medical problem. |
| TrCP | 184 | 0 | 342 | Treatment causes medical problem. |
| TrIP | 51 | 0 | 152 | Treatment improves medical problem. |
| TrNAP | 62 | 0 | 112 | Treatment is not administered because of medical problem. |
| TrWP | 24 | 0 | 109 | Treatment worsens medical problem. |
| false | 19050 | 86 | 36707 | They are in the same sentence, but do not fit into one of the above defined relationships. |
| Total | 3110<br>+19050 | 10<br>+86 | 6293<br>+36707 | |

The i2b2 2010 shared task collection comprises 170 documents for training and 256 for testing (Uzuner et al., 2011).

Known problems

The development dataset is very small, so it is difficult to determine the best model.

Document multilabel classification

HoC

| label | Train | Dev | Test |
|:-----:|------:|----:|-----:|
| 0 | 458 | 71 | 138 |
| 1 | 148 | 33 | 45 |
| 2 | 164 | 14 | 35 |
| 3 | 213 | 30 | 52 |
| 4 | 264 | 34 | 70 |
| 5 | 563 | 58 | 150 |
| 6 | 238 | 39 | 80 |
| 7 | 596 | 92 | 145 |
| 8 | 723 | 86 | 184 |
| 9 | 346 | 55 | 119 |

Labels: (IM) Activating invasion & metastasis, (ID) Avoiding immune destruction, (CE) Deregulating cellular energetics, (RI) Enabling replicative immortality, (GS) Evading growth suppressors, (GI) Genome instability & mutation, (A) Inducing angiogenesis, (CD) Resisting cell death, (PS) Sustaining proliferative signaling, (TPI) tumor promoting inflammation
Note: This table shows the number of each label on the sentence level, rather than on the abstract level.

HoC (the Hallmarks of Cancers corpus) comprises 1,580 PubMed publication abstracts manually annotated using ten currently known hallmarks of cancer (Baker et al., 2016).
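
As noted above, these counts are at the sentence level, while the BLUE score for HoC is reported at the abstract level, so sentence-level predictions have to be merged per abstract before computing F1. A minimal, hypothetical sketch of that aggregation (the union rule and the data layout are assumptions for illustration, not the repository's exact code):

```python
from collections import defaultdict

def merge_to_abstracts(sentence_preds):
    """Union of the predicted label sets over all sentences of the same abstract.
    `sentence_preds` is an iterable of (abstract_id, set_of_label_ids) pairs."""
    merged = defaultdict(set)
    for abstract_id, labels in sentence_preds:
        merged[abstract_id] |= labels
    return dict(merged)

# Illustrative predictions for two abstracts, three sentences each.
preds = [("pmid_1", {5}), ("pmid_1", set()), ("pmid_1", {5, 8}),
         ("pmid_2", set()), ("pmid_2", {0}), ("pmid_2", set())]
print(merge_to_abstracts(preds))  # pmid_1 gets labels {5, 8}; pmid_2 gets {0}
```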

Inference task

MedNLI

| class | Train | Dev | Test |
|:------|------:|----:|-----:|
| contradiction | 3744 | 465 | 474 |
| entailment | 3744 | 465 | 474 |
| neutral | 3744 | 465 | 474 |
| Total | 11232 | 1395 | 1422 |

MedNLI is a collection of sentence pairs selected from MIMIC-III (Romanov and Shivade, 2018).
Please visit the website and sign up to obtain a copy of the dataset.

Total score

Following the practice of Peng et al. (2019), we use a macro-average of the Pearson scores and F1 scores to rank the pre-trained models.
The results are shown in the tables above.
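
The macro-average is simply the unweighted mean of the ten per-task scores. As a quick sanity check, averaging the rounded per-task means of the ouBioBERT row above reproduces its Total column up to rounding:

```python
# Per-task means for ouBioBERT, copied from the table above.
scores = {
    "MedSTS": 84.9, "BIOSSES": 92.3, "BC5CDR-disease": 87.4,
    "BC5CDR-chemical": 93.7, "ShARe/CLEFE": 80.1, "DDI": 81.1,
    "ChemProt": 75.0, "i2b2 2010": 74.0, "HoC": 86.4, "MedNLI": 83.6,
}

total = sum(scores.values()) / len(scores)
print(f"Total = {total:.2f}")  # ~83.85 here; the table reports 83.8 (small difference from rounding)
```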

Citing

If you use our work in your research, please kindly cite the following papers:

  - the original paper of the BLUE Benchmark (Peng et al., 2019)
  - our paper (BibTeX below)

Our research

@misc{2005.07202,
Author = {Shoya Wada and Toshihiro Takeda and Shiro Manabe and Shozo Konishi and Jun Kamohara and Yasushi Matsumura},
Title = {A pre-training technique to localize medical BERT and enhance BioBERT},
Year = {2020},
Eprint = {arXiv:2005.07202},
}

Funding

This work was supported by Council for Science, Technology and Innovation (CSTI), cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).

Acknowledgments

We are grateful to the authors of BERT for making their data and code publicly available. We thank the NVIDIA team, whose implementation of BERT for PyTorch enabled us to pre-train BERT models on our local machine. We would also like to thank Yifan Peng and the shared-task organizers for publishing the BLUE benchmark.

References