<h3 align="center"> <p>BioSyn <a href="https://github.com/dmis-lab/BioSyn/blob/master/LICENSE"> <img alt="GitHub" src="https://img.shields.io/badge/License-MIT-yellow.svg"> </a> </h3>
<div align="center"> <p><b>Bio</b>medical Entity Representations with <b>Syn</b>onym Marginalization </div>
<div align="center"> <img alt="BioSyn Overview" src="https://github.com/dmis-lab/BioSyn/blob/master/images/biosyn_demo.gif" width="600px"> </div>

We present BioSyn for learning biomedical entity representations. You can train BioSyn with the two main components described in our paper: 1) synonym marginalization and 2) iterative candidate retrieval. Once you have trained BioSyn, you can easily normalize any biomedical mention or represent it as an entity embedding.
Updates
- [Mar 17, 2022] Checkpoints of BioSyn for normalizing gene type are released. The BC2GN data used for the gene type has been pre-processed by Tutubalina et al., 2020.
- [Oct 25, 2021] Trained models are uploaded to the Hugging Face Hub (please check them out here). In addition to BioBERT, we also train our model with another pre-trained model, SapBERT, and obtain better performance than reported in our paper.
Requirements
$ conda create -n BioSyn python=3.7
$ conda activate BioSyn
$ conda install numpy tqdm scikit-learn
$ conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch
$ pip install transformers==4.11.3
Note that the PyTorch build you install has to match your CUDA version.
Datasets
Datasets consist of queries (train, dev, test, and traindev) and dictionaries (train_dictionary, dev_dictionary, and test_dictionary). The only difference between the dictionaries is that, to increase coverage, test_dictionary includes train and dev mentions and dev_dictionary includes train mentions. The queries are pre-processed by lowercasing, removing punctuation, resolving composite mentions, and resolving abbreviations (Ab3P). The dictionaries are pre-processed by lowercasing and removing punctuation. (If you need the pre-processing code, please let us know by opening an issue.)
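As a rough illustration of the dictionary pre-processing described above (lowercasing and punctuation removal), here is a minimal Python sketch. The helper normalize_name is hypothetical and not part of this repository. Replacing punctuation with spaces and collapsing whitespace are assumptions, so the actual pre-processing scripts may behave differently. Composite-mention and abbreviation resolution (Ab3P) are not covered.

```python
import re
import string

# Hypothetical helper for illustration only; not part of the BioSyn codebase.
# Matches any single ASCII punctuation character.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def normalize_name(text: str) -> str:
    """Lowercase, replace ASCII punctuation with spaces, collapse whitespace."""
    text = text.lower()
    # Assumption: punctuation becomes a space rather than being deleted,
    # so that "ataxia-telangiectasia" turns into two tokens.
    text = _PUNCT_RE.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_name("Ataxia-Telangiectasia, Type 1"))
    # prints: ataxia telangiectasia type 1
```

Applying the same normalization to both queries and dictionary names keeps string-level matches between them possible.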
Note that we use development (dev) set to search the hyperparameters, and train on traindev (train+dev) set to report the final performance.
The TAC2017ADR dataset cannot be shared because of a license issue. Please visit the website or see here for pre-processing scripts.
Train
The following example fine-tunes our model on the NCBI-Disease dataset (train+dev) with BioBERT v1.1.
MODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1
OUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease
CUDA_VISIBLE_DEVICES=1 python train.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_dictionary_path ${DATA_DIR}/train_dictionary.txt \
--train_dir ${DATA_DIR}/processed_traindev \
--output_dir ${OUTPUT_DIR} \
--use_cuda \
--topk 20 \
--epoch 10 \
--train_batch_size 16 \
--initial_sparse_weight 0 \
--learning_rate 1e-5 \
--max_length 25 \
--dense_ratio 0.5
Note that you can train the model on processed_train and evaluate it on processed_dev when you want to search for hyperparameters (the argument --save_checkpoint_all can be helpful).
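To make the roles of --topk and --dense_ratio more concrete, here is a minimal sketch of how a mixed candidate list could be assembled from a sparse ranking and a dense ranking. It assumes that dense_ratio is the fraction of the topk slots reserved for dense candidates and that candidates already selected are skipped. The function mix_candidates and its inputs are hypothetical and may not match what train.py actually does.

```python
from typing import List


def mix_candidates(
    sparse_ranked: List[int],
    dense_ranked: List[int],
    topk: int = 20,
    dense_ratio: float = 0.5,
) -> List[int]:
    """Combine two rankings of dictionary indices into one list of at most topk.

    Hypothetical illustration; the split between sparse and dense slots
    is an assumption about the meaning of dense_ratio.
    """
    n_dense = int(topk * dense_ratio)
    n_sparse = topk - n_dense

    # Start with the best sparse candidates.
    candidates = list(sparse_ranked[:n_sparse])
    seen = set(candidates)

    # Fill the remaining slots with dense candidates not selected yet.
    for idx in dense_ranked:
        if len(candidates) >= topk:
            break
        if idx not in seen:
            candidates.append(idx)
            seen.add(idx)
    return candidates


if __name__ == "__main__":
    print(mix_candidates([1, 2, 3, 4], [3, 1, 5, 6, 7], topk=4, dense_ratio=0.5))
    # prints: [1, 2, 3, 5]
```

Under this reading, --dense_ratio 0.5 with --topk 20 would give ten sparse and ten dense candidates per mention.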
Evaluation
The following example evaluates our trained model on the NCBI-Disease dataset (test).
MODEL_NAME_OR_PATH=./tmp/biosyn-biobert-ncbi-disease
OUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease
python eval.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--dictionary_path ${DATA_DIR}/test_dictionary.txt \
--data_dir ${DATA_DIR}/processed_test \
--output_dir ${OUTPUT_DIR} \
--use_cuda \
--topk 20 \
--max_length 25 \
--save_predictions \
--score_mode hybrid
Result
The predictions are saved in predictions_eval.json with mentions, candidates, and accuracies (the argument --save_predictions has to be on). The following is an example.
{
  "queries": [
    {
      "mentions": [
        {
          "mention": "ataxia telangiectasia",
          "golden_cui": "D001260",
          "candidates": [
            {
              "name": "ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia syndrome",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia variant",
              "cui": "C566865",
              "label": 0
            },
            {
              "name": "syndrome ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "telangiectasia",
              "cui": "D013684",
              "label": 0
            }
          ]
        }
      ]
    },
    ...
  ],
  "acc1": 0.9114583333333334,
  "acc5": 0.9385416666666667
}
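The saved file can also be post-processed, for example to recompute accuracy at a different k. The sketch below relies only on the fields shown in the example above (queries, mentions, candidates, label). It assumes that a query counts as correct only when every one of its mentions has a correctly labeled candidate within the top k, which may not be exactly how eval.py computes acc1 and acc5. The function names are hypothetical.

```python
import json
from typing import Any, Dict


def load_predictions(path: str) -> Dict[str, Any]:
    """Read a predictions file such as predictions_eval.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def topk_accuracy(data: Dict[str, Any], k: int) -> float:
    """Fraction of queries with a correct candidate in the top k.

    Assumption: a query is a hit only if all of its mentions are hits.
    """
    queries = data["queries"]
    if not queries:
        return 0.0
    hits = 0
    for query in queries:
        if all(
            any(c["label"] == 1 for c in mention["candidates"][:k])
            for mention in query["mentions"]
        ):
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    # Tiny in-memory example with the same shape as the file above.
    toy = {
        "queries": [
            {"mentions": [{"candidates": [{"label": 1}, {"label": 0}]}]},
            {"mentions": [{"candidates": [{"label": 0}, {"label": 1}]}]},
        ]
    }
    print(topk_accuracy(toy, 1), topk_accuracy(toy, 2))
    # prints: 0.5 1.0
```

For a real run, pass the result of load_predictions on the saved predictions_eval.json file to topk_accuracy.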
Inference
We provide a simple script that can normalize a biomedical mention or represent the mention as an embedding vector with BioSyn.
Trained models
NCBI-Disease
Model | Acc@1/Acc@5 |
---|---|
biosyn-biobert-ncbi-disease | 91.1/93.9 |
biosyn-sapbert-ncbi-disease | 92.4/95.8 |
BC5CDR-Disease
Model | Acc@1/Acc@5 |
---|---|
biosyn-biobert-bc5cdr-disease | 93.2/96.0 |
biosyn-sapbert-bc5cdr-disease | 93.5/96.4 |
BC5CDR-Chemical
Model | Acc@1/Acc@5 |
---|---|
biosyn-biobert-bc5cdr-chemical | 96.6/97.2 |
biosyn-sapbert-bc5cdr-chemical | 96.6/98.3 |
BC2GN-Gene
Model | Acc@1/Acc@5 |
---|---|
biosyn-biobert-bc2gn | 90.6/95.6 |
biosyn-sapbert-bc2gn | 91.3/96.3 |
Predictions (Top 5)
The example below gives the top 5 predictions for the mention ataxia telangiectasia. Note that the initial run will take some time to embed the whole dictionary. You can download the dictionary file here.
MODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease
python inference.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--dictionary_path ${DATA_DIR}/test_dictionary.txt \
--use_cuda \
--mention "ataxia telangiectasia" \
--show_predictions
Result
{
"mention": "ataxia telangiectasia",
"predictions": [
{"name": "ataxia telangiectasia", "id": "D001260|208900"},
{"name": "ataxia telangiectasia syndrome", "id": "D001260|208900"},
{"name": "telangiectasia", "id": "D013684"},
{"name": "telangiectasias", "id": "D013684"},
{"name": "ataxia telangiectasia variant", "id": "C566865"}
]
}
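Some identifiers in the output combine several ids with a vertical bar (for example D001260|208900). When comparing a prediction against a known identifier, one plausible approach is to treat the bar as a separator between alternative ids and check for any overlap. The helper shares_identifier below is hypothetical, and the interpretation of the bar is an assumption inferred from the examples in this README.

```python
def shares_identifier(predicted_id: str, gold_id: str, sep: str = "|") -> bool:
    """Return True if the two id strings have at least one id in common.

    Hypothetical helper; assumes sep separates alternative identifiers.
    """
    return bool(set(predicted_id.split(sep)) & set(gold_id.split(sep)))


if __name__ == "__main__":
    # Result copied from the example above.
    result = {
        "mention": "ataxia telangiectasia",
        "predictions": [
            {"name": "ataxia telangiectasia", "id": "D001260|208900"},
            {"name": "ataxia telangiectasia syndrome", "id": "D001260|208900"},
            {"name": "telangiectasia", "id": "D013684"},
            {"name": "telangiectasias", "id": "D013684"},
            {"name": "ataxia telangiectasia variant", "id": "C566865"},
        ],
    }
    matches = [
        p["name"]
        for p in result["predictions"]
        if shares_identifier(p["id"], "D001260")
    ]
    print(matches)
    # prints: ['ataxia telangiectasia', 'ataxia telangiectasia syndrome']
```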
Embeddings
The example below gives an embedding of the mention ataxia telangiectasia.
MODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease
python inference.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--use_cuda \
--mention "ataxia telangiectasia" \
--show_embeddings
Result
{
"mention": "ataxia telangiectasia",
"mention_sparse_embeds": array([0.05979538, 0., ..., 0., 0.], dtype=float32),
"mention_dense_embeds": array([-7.14258850e-02, ..., -4.03847933e-01,],dtype=float32)
}
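A sparse and a dense embedding like the ones above can be combined into a single similarity score between a mention and a dictionary name. The sketch below assumes that the hybrid score is the dense inner product plus a weighted sparse inner product. This is an assumption about what --score_mode hybrid computes, not a copy of the repository code. The names hybrid_score and sparse_weight are hypothetical, and plain Python lists stand in for the NumPy arrays.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def hybrid_score(
    mention_sparse: Sequence[float],
    mention_dense: Sequence[float],
    name_sparse: Sequence[float],
    name_dense: Sequence[float],
    sparse_weight: float = 1.0,
) -> float:
    """Dense similarity plus weighted sparse similarity (assumed form)."""
    dense = dot(mention_dense, name_dense)
    sparse = dot(mention_sparse, name_sparse)
    return dense + sparse_weight * sparse


if __name__ == "__main__":
    score = hybrid_score(
        mention_sparse=[1.0, 0.0],
        mention_dense=[0.5, 0.5],
        name_sparse=[1.0, 0.0],
        name_dense=[0.5, 0.5],
        sparse_weight=2.0,
    )
    print(score)
    # prints: 2.5
```

Setting sparse_weight to 0 reduces this to a purely dense score, which gives one way to compare the contribution of the two representations.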
Demo
How to run web demo
The web demo is implemented on the Tornado framework. If a dictionary is not yet cached, it will take a couple of minutes to create the dictionary cache.
MODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease
python demo.py \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--use_cuda \
--dictionary_path ./datasets/ncbi-disease/test_dictionary.txt
Citations
@inproceedings{sung2020biomedical,
title={Biomedical Entity Representations with Synonym Marginalization},
author={Sung, Mujeen and Jeon, Hwisang and Lee, Jinhyuk and Kang, Jaewoo},
booktitle={ACL},
year={2020},
}