# On-the-fly Definition Augmentation of LLMs for Biomedical NER
This repository contains code to run NER inference and evaluation as described in our NAACL 2024 paper: On-the-fly Definition Augmentation of LLMs for Biomedical NER
## Code Setup
This code was developed in Python 3.9 using the libraries listed in `environment.yml`. The easiest way to run this code is to set up a conda environment from that file:

```
conda env create -f environment.yml
```

Activate the conda environment with:

```
conda activate fsdar
```
In addition to the environment setup, you will need to download the datasets from Hugging Face (open source): download here.
## Datasets and Splits
Our paper evaluates NER inference performance on the following datasets:
- CDR
- CHEMPROT
- MEDM
- NCBI
- PICO
- CHIA
These datasets are curated by Fries et al. Note that the `data` directory contains the document IDs for the subsampled test split of each dataset. Please run the script to save the subsampled datasets before running `retrieval` and `fewshot_retrieval`.
Additionally, for the CHIA dataset, we create train, validation, and test splits, with the most recent documents placed in the test set. We also release the document IDs for each of these splits in `data/chia`.
To run Definition Augmentation, please make sure you generate the subsampled data using the article IDs from `data` and then update the paths accordingly; a sketch of one way to do this follows.
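As a minimal illustration (not the repo's own tooling), the snippet below shows one way to subsample a Hugging Face dataset down to the released document IDs. The hub and config names (`bigbio/ncbi_disease`), the IDs file `data/ncbi/test_ids.txt`, and the `document_id` field are all assumptions; substitute the dataset and paths you actually use.

```python
from datasets import load_dataset

# Load one of the BigBIO-curated datasets from the Hugging Face hub
# (hub/config names assumed; adjust to the dataset you need).
dataset = load_dataset("bigbio/ncbi_disease", name="ncbi_disease_bigbio_kb", split="test")

# Read the released document IDs for the subsampled test split
# (file path assumed; see the data/ directory).
with open("data/ncbi/test_ids.txt") as f:
    keep_ids = {line.strip() for line in f}

# Keep only the documents whose IDs appear in the released split
# ("document_id" field assumed from the BigBIO schema).
subsampled = dataset.filter(lambda ex: ex["document_id"] in keep_ids)
subsampled.save_to_disk("data/ncbi/subsampled_test")
```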
## Inference with open-source and closed-source models
To run inference:

```
make run TYPE=$TYPE MODEL=$MODEL DATASET=$DATASET
```
The following inference settings are available (an example invocation follows the list):

- `TYPE`: zeroshot, fewshot, zeroshot_def_aug, fewshot_def_aug
- `MODEL`: openai, llama, claude
- `DATASET`: cdr, chemprot, ncbi, medm, pico, chia
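For example, to run zero-shot inference with definition augmentation using an OpenAI model on NCBI (an illustrative combination of the settings above):

```
make run TYPE=zeroshot_def_aug MODEL=openai DATASET=ncbi
```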
### Fewshot
To create the shots, run `fewshot/shot_selection` for each dataset and save these samples in `DATA_DIR`. Use these when running the `fewshot_def_aug` setting.
## Evaluation
The evaluation scripts support two output formats (JSON/code), which can be run using the following command:

```
make run OUTPUT_TYPE=$OUTPUT_TYPE DATASET=$DATASET
```
The following evaluation settings are available (an example invocation follows the list):

- `OUTPUT_TYPE`: eval_code, eval_json
- `DATASET`: cdr, chemprot, ncbi, medm, pico, chia
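For example, to evaluate JSON-format outputs on CDR:

```
make run OUTPUT_TYPE=eval_json DATASET=cdr
```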
## Finetuned Model
### Data Formatting
To create the data with the 5 shots we used for few-shot experiments, run `finetuing_data/make_data.py`. These files follow the CoNLL 2003 format and consist of four space-separated columns. Each word is placed on a separate line, with the four columns containing the word itself, its POS tag, its syntactic chunk tag, and its named entity tag. After each sentence, there must be an empty line. An example sentence looks as follows:
```
Acute NN O O
low NN O B-DIS
back NN O I-DIS
pain NN O I-DIS
during NN O O
intravenous NN O O
administration NN O O
of NN O O
amiodarone NN O B-CHE
. NN O O
```
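As a minimal sketch (not the repo's `make_data.py`), the snippet below shows how tokenized sentences with BIO tags could be written in this four-column layout; the placeholder POS (`NN`) and chunk (`O`) columns mirror the example above.

```python
def write_conll(sentences, path):
    """sentences: list of (tokens, bio_tags) pairs."""
    with open(path, "w") as f:
        for tokens, tags in sentences:
            for token, tag in zip(tokens, tags):
                # word, POS tag, syntactic chunk tag, named entity tag
                f.write(f"{token} NN O {tag}\n")
            f.write("\n")  # empty line separates sentences

write_conll(
    [(["Acute", "low", "back", "pain"], ["O", "B-DIS", "I-DIS", "I-DIS"])],
    "train.txt",
)
```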
### Running the model
To finetune a Flan-T5 XL model, run the following command:

```
python peft_llm_trainer.py \
  --model_name_or_path google/flan-t5-xl \
  --output_dir <OUTDIR> \
  --train_file <TRAIN_PATH> \
  --validation_file <VAL_PATH> \
  --test_file <TEST_PATH> \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --learning_rate 3e-5 \
  --num_train_epochs 5 \
  --save_steps 10 \
  --logging_steps 10 \
  --load_best_model_at_end \
  --predict_with_generate \
  --eval_steps 10 \
  --evaluation_strategy steps
```
where `TRAIN_PATH`, `VAL_PATH`, and `TEST_PATH` are the locations of the CoNLL-format files saved above.
### Evaluation
Run the following command with the correct output paths:

```
python finetuning/eval.py
```
If you face any issues with the code, the models, or reproducing our results, please contact monicam@allenai.org or raise an issue here.
If you find our code useful, please cite the following paper:
```
@misc{munnangi2024onthefly,
      title={On-the-fly Definition Augmentation of LLMs for Biomedical NER},
      author={Monica Munnangi and Sergey Feldman and Byron C Wallace and Silvio Amir and Tom Hope and Aakanksha Naik},
      year={2024},
      eprint={2404.00152},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```