Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations

This repository provides the datasets, baselines, and results for the paper 'Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations'. The sections below describe the benchmarks, prompts, and results; we will actively update the repository.

Benchmarks

The datasets are biomedical natural language processing (BioNLP) benchmarks commonly used to evaluate BioNLP language models. Each benchmark consists of the following:

  1. The sampled test set: under each dataset, a sample file contains 200 examples drawn from the test set. These samples are used to evaluate the accuracy of BioNLP language models in this study. For instance, the HoC sample file provides the 200 samples drawn from the HoC test set.
  2. The original full dataset: the complete train, dev, and test sets, as prepared by existing studies, under the full_set folder.
    1. The train and dev files are used to fine-tune a PubMedBERT model as a baseline.
    2. The train file is used to randomly select the examples for one-shot learning (see the sketch after this list).
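
The snippet below is a minimal sketch of how the 200-example sampled test set and the one-shot demonstration could be drawn; the file names, the tab-separated layout, and the fixed seed are assumptions for illustration, not the repository's actual preprocessing script.

```python
# Minimal sketch (not the repository's actual script) of drawing the
# 200-example sampled test set and a one-shot demonstration.
# File names, the "text<TAB>label" layout, and the seed are assumptions.
import random
from pathlib import Path

random.seed(42)  # fixed seed so the sampled subset is reproducible

def load_tsv(path):
    """Read one example per line as a (text, label) pair; the layout is hypothetical."""
    rows = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            text, label = line.split("\t", maxsplit=1)
            rows.append((text, label))
    return rows

test_rows = load_tsv("full_set/test.tsv")     # full test split
train_rows = load_tsv("full_set/train.tsv")   # full train split

sampled_test = random.sample(test_rows, k=200)   # 200-example evaluation subset
one_shot_example = random.choice(train_rows)     # demonstration for one-shot prompts

print(f"sampled {len(sampled_test)} test examples; one-shot label: {one_shot_example[1]}")
```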

Prompts

A sample prompt is also provided under each benchmark.
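
For illustration only, the sketch below shows one way a zero-shot or one-shot prompt could be assembled for a classification benchmark such as HoC; the instruction wording, label names, and example texts are hypothetical and do not reproduce the prompt files shipped with the benchmarks.

```python
# Hypothetical prompt assembly for a classification benchmark such as HoC.
# The instruction, labels, and example texts are placeholders, not the
# repository's actual prompts.
INSTRUCTION = (
    "Classify the following biomedical abstract into the relevant "
    "Hallmarks of Cancer categories. Answer with the category names only."
)

def build_prompt(test_text, demonstration=None):
    """Return a zero-shot prompt, or a one-shot prompt when a (text, label) demonstration is given."""
    parts = [INSTRUCTION]
    if demonstration is not None:
        demo_text, demo_label = demonstration
        parts.append(f"Example:\nAbstract: {demo_text}\nCategories: {demo_label}")
    parts.append(f"Abstract: {test_text}\nCategories:")
    return "\n\n".join(parts)

# Zero-shot prompt
print(build_prompt("Aberrant angiogenesis sustains tumour growth ..."))

# One-shot prompt, reusing a (text, label) pair drawn from the train split
demo = ("Loss of p53 allows cells to escape apoptosis ...", "resisting cell death")
print(build_prompt("Aberrant angiogenesis sustains tumour growth ...", demonstration=demo))
```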

Results

| Sampled dataset | Evaluation metric | Fine-tuned PubMedBERT (min-max) | Zero-shot GPT-3 | One-shot GPT-3 | Zero-shot GPT-4 | One-shot GPT-4 |
|---|---|---|---|---|---|---|
| BC5CDR-chemical | Entity-level F1 | 0.9028-0.9350 | 0.2925 | 0.1803 | 0.7443 | 0.8207 |
| NCBI-disease | Entity-level F1 | 0.8336-0.8986 | 0.2405 | 0.1273 | 0.5673 | 0.4837 |
| ChemProt | Macro F1 | 0.6653-0.7832 | 0.5743 | 0.6191 | 0.6618 | 0.6543 |
| DDI2013 | Macro F1 | 0.6673-0.8023 | 0.3349 | 0.3440 | 0.6325 | 0.6558 |
| HoC | Label-wise macro F1 | 0.6991-0.8915 | 0.6572 | 0.6932 | 0.7474 | 0.7402 |
| LitCovid | Label-wise macro F1 | 0.8024-0.8724 | 0.6390 | 0.6531 | 0.6746 | 0.6839 |
| PubMedQA | Macro F1 | 0.2237-0.3676 | 0.3553 | 0.3011 | 0.4374 | 0.5361 |
| BIOSSES | Pearson correlation | 0.6870-0.9332 | 0.8786 | 0.9194 | 0.8832 | 0.8922 |
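
As a rough illustration of the classification metrics in the table above (not the authors' evaluation code), the sketch below computes an entity-level F1 over sets of entity mentions and a macro F1 with scikit-learn; for the multi-label datasets (HoC, LitCovid), label-wise macro F1 would instead average the per-label F1 over binarized label vectors. The toy gold and predicted values are placeholders.

```python
# Illustrative metric sketch, not the authors' evaluation code.
# Entity-level F1 compares sets of predicted and gold entity mentions;
# macro F1 comes from scikit-learn. The toy inputs are placeholders.
from sklearn.metrics import f1_score

def entity_level_f1(gold, pred):
    """F1 over sets of (doc_id, start, end, type) entity tuples."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# NER-style rows (BC5CDR-chemical, NCBI-disease)
gold_entities = [("doc1", 0, 9, "Chemical"), ("doc1", 20, 29, "Chemical")]
pred_entities = [("doc1", 0, 9, "Chemical"), ("doc1", 40, 49, "Chemical")]
print("entity-level F1:", entity_level_f1(gold_entities, pred_entities))

# Classification rows (ChemProt, DDI2013)
y_true = ["CPR:3", "CPR:4", "CPR:4", "false"]
y_pred = ["CPR:3", "CPR:4", "false", "false"]
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```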

| Sampled dataset | Evaluation metric | Fine-tuned BART | Zero-shot GPT-3 | One-shot GPT-3 | Zero-shot GPT-4 | One-shot GPT-4 |
|---|---|---|---|---|---|---|
| PubMed | ROUGE-1 | 0.4489 | 0.0608 | 0.2320 | 0.3997 | 0.4054 |
| MS2 | ROUGE-1 | 0.2079 | 0.1731 | 0.1211 | 0.1877 | 0.1919 |
| CochranePLS | Flesch-Kincaid score | 12.6425 | 13.0505 | 13.1755 | 12.0001 | 13.1217 |
| PLOS | Flesch-Kincaid score | 14.6560 | 14.0605 | 13.9185 | 13.2190 | 13.2415 |
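
The summarization and simplification metrics above could be reproduced with off-the-shelf packages such as rouge-score and textstat; the snippet below is one possible setup, not necessarily the tooling used in this study, and the reference/candidate strings are placeholders. For the plain-language datasets (CochranePLS, PLOS), a lower Flesch-Kincaid grade indicates more readable output.

```python
# One possible way to compute ROUGE-1 and Flesch-Kincaid scores, using the
# rouge-score and textstat packages; the repository may rely on different
# tooling, and the reference/candidate strings are placeholders.
from rouge_score import rouge_scorer
import textstat

reference = "The drug lowered blood pressure in patients with hypertension."
candidate = "Blood pressure was lowered by the drug in hypertensive patients."

# ROUGE-1 F1 (PubMed, MS2 rows): unigram overlap between candidate and reference
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
print("ROUGE-1 F1:", scorer.score(reference, candidate)["rouge1"].fmeasure)

# Flesch-Kincaid grade level (CochranePLS, PLOS rows); lower means easier to read
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(candidate))
```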

NCBI's Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI.

The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional.

More information about NCBI's disclaimer policy is available.

Acknowledgment

This study is supported by the National Institutes of Health grants R01AG078154 and 1K99LM01402, and by the Intramural Research Program of the National Library of Medicine (NLM).