True Few-Shot BioIE: GPT-3 In-Context vs. Small PLM Fine-Tuning
This repository provides the pipeline used in our work to benchmark GPT-3 in-context learning and BERT-sized model fine-tuning on biomedical information extraction tasks (NER and relation extraction) under the true few-shot setting.
We run GPT-3 through the OpenAI API and use the HuggingFace library to fine-tune small PLMs.
Installation
Run the following commands to create a conda environment with the required packages.
conda create -n few-shot-bioIE python=3.9 pip
conda activate few-shot-bioIE
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Data
Data for our experiments was obtained from the BLURB Benchmark. To download and process the necessary datasets we use a modified version of their data processing scripts.
To download all IE datasets in BLURB, we first run the following script:
bash download_BLURB_data.sh
This should create the raw_data directory under the root directory, containing all the raw datasets except for ChemProt. To download ChemProt, please register and download it from this link, as described in BLURB.
Place the file ChemProt_Corpus.zip under the raw_data directory and run the following preprocessing script:
bash preprocess_BLURB_data.sh
This should yield a training, dev and test set for 5 NER datasets (BC5CDR-disease, BC5CDR-chem, BC2GM, JNLPBA and NCBI-disease) and 3 RE datasets (DDI, ChemProt and GAD).
To add new datasets to this system, check the format of the NER and RE datasets produced by this step. Matching the data format in any of the dataset-specific directories should be all that is needed.
Running Experiments
Understanding Configurations
Under the configs directory, we find directories for plms and gpt3, each containing .yaml configuration files intended for use with the WandB library. We have included one example configuration for each (task, method) combination: configs/plms/0 and configs/gpt3/0 are NER-specific configuration files, while configs/plms/1 and configs/gpt3/1 are RE-specific. We also include all configuration files necessary for reproducibility, which we discuss in a later section. For further details about all hyperparameters tested, please refer to our paper.
The configuration files configs/sample_ner_config.yaml and configs/sample_re_config.yaml contain the settings necessary to run the main pipeline, which manages 1) the creation of small training datasets, 2) the true few-shot hyperparameter search for PLM fine-tuning and 3) the true few-shot prompt selection process for GPT-3. These configuration files are most useful for running this process over many PLMs or datasets, since in our work we evaluate multiple small PLMs. Configurations which have already been run with a given training dataset will not be re-run, to prevent wasted resources.
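As a rough illustration of how skipping already-run configurations can work, here is a minimal sketch. It is not the pipeline's actual bookkeeping: the helper names `config_key` and `pending_runs`, the hyperparameter names and the subset labels are all hypothetical, and we simply assume that a run is identified by its configuration together with the training subset it was run on.

```python
import hashlib
import json


def config_key(config: dict, train_subset: str) -> str:
    """Return a stable fingerprint for a (configuration, training subset) pair."""
    # sort_keys makes the fingerprint independent of dictionary ordering.
    payload = json.dumps({"config": config, "subset": train_subset}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def pending_runs(configs: list, train_subset: str, completed: set) -> list:
    """Keep only the configurations that have not been run on this subset."""
    return [c for c in configs if config_key(c, train_subset) not in completed]


# Hypothetical hyperparameter settings, for illustration only.
configs = [
    {"learning_rate": 1e-5, "epochs": 10},
    {"learning_rate": 5e-5, "epochs": 10},
]

# Pretend the first configuration was already run on "subset_0".
completed = {config_key(configs[0], "subset_0")}

todo = pending_runs(configs, "subset_0", completed)
print(todo)  # only the second configuration remains
```

Keying on the training subset as well as the hyperparameters means the same configuration is still run again when the training data changes.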
The following is the configuration directory structure for reference:
configs
    plms
        {config_num}
            config.yaml (task specific script,
                         all hyperparameters)
    gpt3
        {config_num}
            config.yaml (dataset_name,
                         model_name,
                         overall_instructions,
                         sent_intro,
                         retrieval_message,
                         sampling strategy (knn module, random (seed)),
                         context size)
    sample_ner_config.yaml
    sample_re_config.yaml
    main_ner_config_plms.yaml (PLM Hyperparameter Search for NER in our paper)
    main_re_config_plms.yaml (PLM Hyperparameter Search for RE in our paper)
To run an experiment using the example configuration, make sure to look over the main pipeline configuration files and understand each parameter set.
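To make the directory layout above concrete, the following sketch builds the path to a per-method configuration file. The helper `config_path` and the `EXAMPLE_CONFIGS` mapping are hypothetical names introduced only for illustration; the paths themselves follow the layout described in this section.

```python
from pathlib import Path


def config_path(method: str, config_num: int, root: str = "configs") -> Path:
    """Build {root}/{method}/{config_num}/config.yaml for a given method."""
    if method not in ("plms", "gpt3"):
        raise ValueError(f"unknown method: {method!r}")
    return Path(root) / method / str(config_num) / "config.yaml"


# Example configurations named in this section: 0 is NER specific, 1 is RE specific.
EXAMPLE_CONFIGS = {"ner": 0, "re": 1}

for task, num in EXAMPLE_CONFIGS.items():
    for method in ("plms", "gpt3"):
        print(task, method, config_path(method, num).as_posix())
```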
Running the Main Pipeline
Running all our scripts requires WandB. Be sure to log in to WandB using wandb login and follow their instructions.
There are two main ways to run our main pipeline.
- If all parameter choices in a specific configuration can be tested with all others, we leverage the WandB hyperparameter sweep procedure.
- On the other hand, if not all parameter combinations should be tested, we loop over a set list of specified configurations and run them individually.
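The difference between the two modes can be sketched in a few lines. This is only an illustration of the idea, not the code our scripts use: `expand_grid` is a hypothetical helper and the hyperparameter names and values are made up. A grid-style sweep enumerates every combination of the listed values, while the second mode runs only an explicit list of configurations.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return every combination of the listed values, as a grid sweep would."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Mode 1: all parameter choices can be combined freely (hypothetical values).
grid = {"learning_rate": [1e-5, 5e-5], "batch_size": [16, 32]}
all_combinations = expand_grid(grid)
print(len(all_combinations), "configurations in the full grid")

# Mode 2: only some combinations make sense, so they are listed explicitly
# and run one at a time.
explicit = [
    {"learning_rate": 1e-5, "batch_size": 16},
    {"learning_rate": 5e-5, "batch_size": 32},
]
for config in explicit:
    print("would run", config)
```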
To make sure everything is in working order, we recommend starting by running the sample configurations with the following commands (be sure to specify which GPUs are available in your system):
cd src
GPUS={Comma-separated list of available GPUs}
CUDA_VISIBLE_DEVICES=$GPUS python run_sample_configs_ner.py
CUDA_VISIBLE_DEVICES=$GPUS python run_sample_configs_re.py
To confirm that the sample scripts ran smoothly, check that output directories for BC5CDR-disease and DDI were created under the outputs directory. Refer to the Output Structure section for more details on the expected output.
To test that the WandB hyperparameter sweep procedure is also working properly, delete the directories created under the outputs directory and run the following command (be sure to edit the .yaml file to specify which GPUs are available in your system):
wandb sweep ../configs/sample_ner_config.yaml
The previous command will create a sweep and output a sweep name of the form {user_name}/{project_name}/{sweep_id}. This sweep name must be used to run a WandB agent as follows:
CUDA_VISIBLE_DEVICES=$GPUS wandb agent {user_name}/{project_name}/{sweep_id}
The same output files should be created under outputs/BC5CDR-disease as in the previous section.
Any .yaml configuration file can be used in the way just described.
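When scripting the sweep step, it can help to validate the sweep name before launching an agent. The sketch below only splits a name of the form {user_name}/{project_name}/{sweep_id} and assembles the agent command shown above; `SweepName`, `parse_sweep_name` and `agent_command` are hypothetical helpers, and the example name is made up.

```python
from typing import NamedTuple


class SweepName(NamedTuple):
    user_name: str
    project_name: str
    sweep_id: str


def parse_sweep_name(name: str) -> SweepName:
    """Split 'user/project/sweep_id' into its three parts, or raise ValueError."""
    parts = name.strip().split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected user/project/sweep_id, got {name!r}")
    return SweepName(*parts)


def agent_command(name: str) -> list:
    """Build the argument list for launching an agent on the given sweep."""
    sweep = parse_sweep_name(name)
    return ["wandb", "agent", "/".join(sweep)]


# Made-up sweep name, for illustration only.
print(agent_command("alice/few-shot-bioIE/abc123"))
```

The resulting list could be passed to `subprocess.run` with `CUDA_VISIBLE_DEVICES` set in the environment, mirroring the command line above.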
Few-Shot Benchmarking for BLURB IE (Reproducibility)
To reproduce the PLM fine-tuning results presented in our paper, run through the previous WandB procedure with the .yaml configuration files configs/main_ner_config_plms.yaml and configs/main_re_config_plms.yaml. The hyperparameter choices used can be found both in our paper and in the configuration files under configs/plms/ (NER hyperparameters from 2 to 6 and RE hyperparameters from 7 to 11).
To reproduce our true few-shot benchmarking of the 175B GPT-3 model, be sure to first change the model field in the configuration files under configs/gpt3/ (from 2 to 9) from ada to davinci. Note that running the largest GPT-3 model can be quite expensive. To carry out the benchmarking, run the following script:
CUDA_VISIBLE_DEVICES=$GPUS python benchmark_gpt3_in_context.py
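The model change described before the command could be scripted rather than done by hand. The sketch below is a text-level edit under a loud assumption: that each config.yaml stores the model as a flat top-level line such as `model: ada`. The actual files may name or nest this field differently, so check one of them first. `set_model` and `update_configs` are hypothetical helpers, and the sample YAML text is made up.

```python
import re
from pathlib import Path


def set_model(text: str, new_model: str, key: str = "model") -> str:
    """Replace the value on a top-level `key: value` line of YAML text."""
    # Assumption: the field is a flat, unindented `key: value` line.
    pattern = re.compile(rf"^({re.escape(key)}[ \t]*:)[ \t]*.*$", flags=re.MULTILINE)
    updated, count = pattern.subn(lambda m: f"{m.group(1)} {new_model}", text)
    if count == 0:
        raise KeyError(f"no top-level {key!r} field found")
    return updated


def update_configs(root: Path, numbers=range(2, 10), new_model: str = "davinci") -> None:
    """Rewrite {root}/{n}/config.yaml in place for configurations 2 through 9."""
    for n in numbers:
        path = root / str(n) / "config.yaml"
        path.write_text(set_model(path.read_text(), new_model))


# Made-up file content, for illustration only.
sample = "dataset_name: DDI\nmodel: ada\ncontext_size: 5\n"
print(set_model(sample, "davinci"))
```

Calling `update_configs(Path("../configs/gpt3"))` from the src directory would then apply the change to all eight files; keep a copy of the originals if you want to switch back to the cheaper model.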
Output Structure
outputs
data_name
plms
experiment_num (new number every time)
config.yaml
subset_config.{subset_num}.json (copy of the subset dataset configuration)
cv.config.yaml
grid.config.yaml
cv.{sweep_id}
{run_id}
cv.params.p
fold_id
all_results.json
train.tokens.{epoch}.txt
train.predictions.{epoch}.txt
train.labels.{epoch}.txt
train.metrics.{epoch}_results.json
dev.tokens.{epoch}.txt
dev.predictions.{epoch}.txt
dev.labels.{epoch}_results.json
dev.metrics.{epoch}_results.json
dev.tokens.{epoch}.txt
dev.labels.{epoch}.txt
dev.predictions.{epoch}.txt
dev.metrics
grid.{sweep_id}
{run_id}
grid.params.p
train.tokens.{epoch}.txt
train.predictions.{epoch}.txt
train.labels.{epoch}.txt
dev.tokens.{epoch}.txt
dev.labels.{epoch}.txt
dev.predictions.{epoch}.txt
train.metrics.{epoch}_results.json
dev.metrics.{epoch}_results.json
test.metrics.{epoch}_results.json
dev.metrics
test.metrics
cv_all_results.csv
grid_all_results.csv
main_table.csv
gpt-3
experiment_num (new number every time)
subset_config.{subset_num}.json (copy of the subset dataset configuration)
config.yaml (subset_num,
model_name,
fine_tuning,
overall_instructions,
sent_intro,
retrieval_message,
sampling strategy (knn module, random (seed)),
context size)
* (Don't re-run anything in GPT-3, all outputs must be recycled)
cv (Directory containing DataFrames for every HP configuration)
run_num
params.json
subset.gpt3.csv
gpt3.output.csv
dev.metrics
dev.metrics
subset.cv_best.gpt3.csv (Training Dataframe)
test.cv_best.gpt3.csv (Test Dataframe)
test.cv_best.gpt3.output.csv (GPT-3 Outputs)
test.metrics (All Test Results (Best & Worst))
dev.cost.summary (Cost and compute time)
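Since every run creates a new numbered experiment directory, a small helper can locate the most recent one. This is a sketch under assumptions: we take the layout to be outputs/{data_name}/{method}/{experiment_num} with purely numeric experiment directory names, and `latest_experiment` is a hypothetical helper that is not part of the repository.

```python
from pathlib import Path
from typing import Optional


def latest_experiment(outputs: Path, data_name: str, method: str) -> Optional[Path]:
    """Return the highest-numbered experiment directory, or None if there is none."""
    base = outputs / data_name / method
    if not base.is_dir():
        return None
    # Assumption: experiment directories are named with plain integers.
    numbered = [d for d in base.iterdir() if d.is_dir() and d.name.isdigit()]
    # Compare numerically so that "10" sorts after "2".
    return max(numbered, key=lambda d: int(d.name), default=None)


if __name__ == "__main__":
    print(latest_experiment(Path("outputs"), "BC5CDR-disease", "plms"))
```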
Testing GPT-3 In-Context Learning Alone
Two Jupyter notebooks under the src directory can be used directly to test GPT-3 on manually designed prompts for NER and RE tasks. Use these only after running the example configurations above, since some configuration files used in these notebooks are created by those first runs. The Jupyter notebooks are the following:
src/GPT-3 NER Run Script.ipynb
src/GPT-3 RE Run Script.ipynb