SciFive

SciFive provides a text-to-text framework for biomedical and natural language in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.

🎉 UPDATE Jan 2023

We are migrating SciFive into BioT5X: Pretrained T5X Transformer for Biomedical Text Generation and Classification, which uses T5X and Flaxformer.

📝 Our example BioT5X fine-tuning notebook for the BLURB tasks: finetunning_biot5x_blurb.ipynb

🤗 HuggingFace

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-Pubmed")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-Pubmed")
model.cuda()  # the inputs below are moved to the GPU, so the model must be there too

sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
text = sentence + " </s>"

# `pad_to_max_length` is deprecated; pad explicitly to a fixed length instead
encoding = tokenizer.encode_plus(text, padding="max_length", max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)

Google Cloud Storage

Our base Google Cloud Storage URI is at gs://scifive

As described in our paper, we have released six versions of SciFive, each of which has been benchmarked and achieves state-of-the-art results on different biomedical tasks. All of them are available in our Google Cloud bucket, and we are working on releasing the models on HuggingFace as well.

Instructions for accessing Cloud Storage from the command line with gsutil, a Python-based command-line tool, are described here

gsutil URI for 6 SciFive models:

The following table contains pretrained SciFive checkpoints.

| Model | Size | Step | Config | Checkpoint |
|:---|:---|:---|:---|:---|
| SciFive Pubmed | base & large | 1194600 & 1196500 | T5 configs | gs://scifive/models/pubmed/{size}/ |
| SciFive Pubmed+PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pubmed_pmc/{size}/ |
| SciFive PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pmc/{size}/ |
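
The `{size}` placeholder in the checkpoint paths expands to `base` or `large`, giving six checkpoint directories in total. A minimal sketch of that expansion follows. The names `CHECKPOINT_TEMPLATES`, `checkpoint_uri`, `all_checkpoint_uris`, and `list_checkpoint` are hypothetical helpers introduced here for illustration, not part of the SciFive code, and the dictionary keys are just labels chosen to match the path segments. `list_checkpoint` assumes `gsutil` is installed and on your `PATH`.

```python
import subprocess
from typing import Dict, List

SIZES = ("base", "large")

# Path templates copied from the checkpoint table; {size} is "base" or "large".
CHECKPOINT_TEMPLATES: Dict[str, str] = {
    "pubmed": "gs://scifive/models/pubmed/{size}/",
    "pubmed_pmc": "gs://scifive/models/pubmed_pmc/{size}/",
    "pmc": "gs://scifive/models/pmc/{size}/",
}


def checkpoint_uri(corpus: str, size: str) -> str:
    """Return the gs:// directory for one pretraining corpus and model size."""
    if corpus not in CHECKPOINT_TEMPLATES:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return CHECKPOINT_TEMPLATES[corpus].format(size=size)


def all_checkpoint_uris() -> List[str]:
    """Expand every template for every size (3 corpora x 2 sizes)."""
    return [checkpoint_uri(c, s) for c in CHECKPOINT_TEMPLATES for s in SIZES]


def list_checkpoint(corpus: str, size: str) -> str:
    """Run `gsutil ls` on a checkpoint directory and return its output.

    Assumes gsutil is installed; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        ["gsutil", "ls", checkpoint_uri(corpus, size)],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    for uri in all_checkpoint_uris():
        print(uri)
```

Running the script only prints the six URIs; no network access happens unless `list_checkpoint` is called.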

gsutil URI for pretraining data:

Example

Below, we give an example of how to use SciFive on HuggingFace to generate MedNLI outputs. We also publish a SciFive checkpoint fine-tuned on MedNLI so that the experiments can be reproduced.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")  
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model.cuda()

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
text =  f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer.encode_plus(text, padding='max_length', max_length=256, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=8,
    early_stopping=True
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
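
The example above handles a single sentence pair and requires a GPU. As a rough sketch, the same prompt format can be wrapped in a helper and run over several pairs at once, falling back to CPU when CUDA is unavailable. The function names `build_mednli_prompt` and `predict_mednli` are hypothetical and introduced only for illustration. Batching with `padding=True` instead of padding to a fixed length of 256 is an assumption on our part, not something the original example does.

```python
from typing import List, Optional, Sequence, Tuple


def build_mednli_prompt(sent_1: str, sent_2: str) -> str:
    """Format one sentence pair exactly as in the single-pair example."""
    return f"mednli: sentence1: {sent_1} sentence2: {sent_2}"


def predict_mednli(
    pairs: Sequence[Tuple[str, str]],
    model_name: str = "razent/SciFive-large-Pubmed_PMC-MedNLI",
    device: Optional[str] = None,
) -> List[str]:
    """Generate one output string per (sentence1, sentence2) pair."""
    # Imported lazily so the prompt helper works without torch/transformers.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Assumption: use the GPU when present, otherwise run on CPU.
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    model.eval()

    prompts = [build_mednli_prompt(s1, s2) for s1, s2 in pairs]
    # Pad to the longest prompt in the batch and truncate at 256 tokens.
    encoding = tokenizer(
        prompts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"],
            max_length=8,
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


if __name__ == "__main__":
    example = (
        "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.",
        "The patient is hemodynamically stable",
    )
    print(build_mednli_prompt(*example))
```

Calling `predict_mednli([example])` downloads the checkpoint on first use, so the script itself only prints the formatted prompt.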

Datasets

All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this link

📊  Expected Results

Citations

If you use the SciFive models or our code in a publication, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}