# Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations
This repository provides the datasets, baselines, and results for the paper 'Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations'. It contains the materials described below and will be actively updated.
## Benchmarks
The datasets are biomedical natural language processing (BioNLP) benchmarks commonly adopted for evaluating BioNLP language models. Each benchmark consists of the following:
- The sampled test set: under each dataset, there is a sample file consisting of 200 samples drawn from the test set. These samples are used to evaluate the accuracy of BioNLP language models in this study. For instance, the HoC sample file provides 200 samples from the HoC test set.
- The original full dataset: the complete train, dev, and test sets, as prepared by existing studies, under the full_set folder.
  - The train and dev files are used to fine-tune a PubMedBERT model as a baseline.
  - The train file is used to randomly select samples for one-shot learning (see the sketch after this list).
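For illustration, the sketch below shows one way to load a sampled test set and draw a random one-shot example from the corresponding train file. The file names and column layout are assumptions made for this example; the actual formats vary across the benchmarks in this repository.

```python
import random
import pandas as pd

# Hypothetical paths; each benchmark folder uses its own file names and format.
test_samples = pd.read_csv("HoC/sampled_test.csv")   # the 200 sampled test instances
train = pd.read_csv("HoC/full_set/train.csv")        # full training set

# Randomly pick one training instance to serve as the one-shot demonstration.
random.seed(42)
one_shot_example = train.iloc[random.randrange(len(train))]

print(f"Loaded {len(test_samples)} test samples")
print("One-shot example:", one_shot_example.to_dict())
```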
## Prompts
A prompt sample is also provided under each benchmark.
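As a rough sketch of how a prompt sample could be filled in and sent to a model, the snippet below assumes a prompt file with an `{input}` placeholder and uses the OpenAI Python client (openai>=1.0). The file name, placeholder, and request settings are illustrative assumptions, not the exact pipeline from the paper.

```python
from openai import OpenAI

# Hypothetical prompt file; each benchmark in this repository ships its own prompt sample.
prompt_template = open("HoC/prompt_sample.txt", encoding="utf-8").read()
abstract = "Activation of the EGFR pathway promotes sustained proliferative signaling ..."
prompt = prompt_template.replace("{input}", abstract)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic output is preferable for evaluation
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```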
## Results
Sampled dataset | Evaluation metric | Fine-tuned PubMedBERT (min-max) | Zero-shot GPT-3 | One-shot GPT-3 | Zero-shot GPT-4 | One-shot GPT-4 |
---|---|---|---|---|---|---|
BC5CDR-chemical | Entity-level F1 | 0.9028-0.9350 | 0.2925 | 0.1803 | 0.7443 | 0.8207 |
NCBI-disease | Entity-level F1 | 0.8336-0.8986 | 0.2405 | 0.1273 | 0.5673 | 0.4837 |
ChemProt | Macro F1 | 0.6653-0.7832 | 0.5743 | 0.6191 | 0.6618 | 0.6543 |
DDI2013 | Macro F1 | 0.6673-0.8023 | 0.3349 | 0.3440 | 0.6325 | 0.6558 |
HoC | Label-wise macro F1 | 0.6991-0.8915 | 0.6572 | 0.6932 | 0.7474 | 0.7402 |
LitCovid | Label-wise macro F1 | 0.8024-0.8724 | 0.6390 | 0.6531 | 0.6746 | 0.6839 |
PubMedQA | Macro F1 | 0.2237-0.3676 | 0.3553 | 0.3011 | 0.4374 | 0.5361 |
BIOSSES | Pearson correlation | 0.6870-0.9332 | 0.8786 | 0.9194 | 0.8832 | 0.8922 |
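The classification and similarity metrics in the table above can be computed with standard libraries. The sketch below uses toy labels and scores (not taken from the benchmarks) to show macro F1 and Pearson correlation with scikit-learn and SciPy.

```python
from sklearn.metrics import f1_score
from scipy.stats import pearsonr

# Toy data for illustration only.
gold_labels = ["yes", "no", "maybe", "yes", "no"]
pred_labels = ["yes", "no", "yes", "yes", "no"]
print("Macro F1:", f1_score(gold_labels, pred_labels, average="macro"))

gold_scores = [3.2, 1.5, 4.0, 0.8]   # e.g. similarity scores on a 0-4 scale
pred_scores = [3.0, 1.8, 3.6, 1.1]
print("Pearson r:", pearsonr(gold_scores, pred_scores)[0])
```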
Sampled dataset | Evaluation metric | Fine-tuned BART | Zero-shot GPT-3 | One-shot GPT-3 | Zero-shot GPT-4 | One-shot GPT-4 |
---|---|---|---|---|---|---|
PubMed | ROUGE-1 | 0.4489 | 0.0608 | 0.2320 | 0.3997 | 0.4054 |
MS2 | ROUGE-1 | 0.2079 | 0.1731 | 0.1211 | 0.1877 | 0.1919 |
CochranePLS | Flesch-Kincaid score | 12.6425 | 13.0505 | 13.1755 | 12.0001 | 13.1217 |
PLOS | Flesch-Kincaid score | 14.6560 | 14.0605 | 13.9185 | 13.2190 | 13.2415 |
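Similarly, the summarization and simplification metrics above can be reproduced with off-the-shelf packages. The sketch below assumes the rouge-score and textstat packages; the example texts are placeholders.

```python
from rouge_score import rouge_scorer   # pip install rouge-score
import textstat                        # pip install textstat

reference = "The study found that the drug reduced blood pressure in adults."
generated = "The drug lowered blood pressure in adult patients in this study."

# ROUGE-1 F1 between a generated summary and the reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
print("ROUGE-1 F1:", scorer.score(reference, generated)["rouge1"].fmeasure)

# Flesch-Kincaid grade level of a generated plain-language summary
# (lower = easier to read).
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(generated))
```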
## NCBI's Disclaimer
This tool shows the results of research conducted in the Computational Biology Branch, NCBI.
The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional.
More information about NCBI's disclaimer policy is available.
## Acknowledgment
This study is supported by the following National Institutes of Health grants: R01AG078154, 1K99LM01402, and the Intramural Research Program of the National Library of Medicine (NLM).