A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

News

Table of Contents

Important Tables and Figures

Fig. 2. The organizational framework for the content. Sections III, IV, and V cover technology details, while Sections II, VI, and VII are of more value to healthcare professionals.


<br> <br>

LLM Information

| Model Name | Base | Para. (B) | Features | Date | Link |
| --- | --- | --- | --- | --- | --- |
| GatorTron | Transformer | 0.345, 3.9, 8.9 | Training from scratch | 06/2022 | https://github.com/uf-hobi-informatics-lab/GatorTron |
| Codex-Med | GPT-3.5 | 175 | CoT, Zero-shot | 07/2022 | https://github.com/vlievin/medical-reasoning |
| Galactica | Transformer | 1.3, 6.4, 30, 120 | Reasoning, Multidisciplinary | 11/2022 | https://galactica.org |
| Med-PaLM | Flan-PaLM/PaLM | 540 | CoT, Self-consistency | 12/2022 | - |
| GPT-4-Med | GPT-4 | - | No specialized prompt crafting | 03/2023 | - |
| DeID-GPT | GPT-4 | - | De-identification | 03/2023 | https://github.com/yhydhx/ChatGPT-API |
| ChatDoctor | LLaMA | 7 | Online retrieval, external knowledge | 03/2023 | https://github.com/Kent0n-Li/ChatDoctor |
| DoctorGLM | ChatGLM | 6 | Extra prompt designer | 04/2023 | https://github.com/xionghonglin/DoctorGLM |
| MedAlpaca | LLaMA | 7, 13 | Adapted to medicine | 04/2023 | https://github.com/kbressem/medAlpaca |
| BenTsao | LLaMA | 7 | Knowledge graph | 04/2023 | https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese |
| PMC-LLaMA | LLaMA | 7 | Adapted to medicine | 04/2023 | https://github.com/chaoyi-wu/PMC-LLaMA |
| Visual Med-Alpaca | LLaMA | 7 | Multimodal generative model, Self-Instruct | 04/2023 | https://github.com/cambridgeltl/visual-med-alpaca |
| BianQue~ | ChatGLM | 6 | Chain of Questioning | 04/2023 | https://github.com/scutcyr/BianQue |
| Med-PaLM 2 | PaLM 2 | 340 | Ensemble refinement, CoT, Self-consistency | 05/2023 | - |
| GatorTronGPT | GPT-3 | 5, 20 | Training from scratch for medicine | 05/2023 | https://github.com/uf-hobi-informatics-lab/GatorTronGPT |
| HuatuoGPT | Bloomz | 7 | Reinforcement learning from AI feedback | 05/2023 | https://github.com/FreedomIntelligence/HuatuoGPT |
| ClinicalGPT | BLOOM | 7 | Multi-round dialogue consultations | 06/2023 | - |
| MedAGI | MiniGPT-4 | - | Multimodal, AGI | 06/2023 | https://github.com/JoshuaChou2018/MedAGI |
| LLaVA-Med | LLaVA | 13 | Multimodal, self-instruct, curriculum learning | 06/2023 | https://github.com/microsoft/LLaVA-Med |
| OphGLM | ChatGLM | 6 | Multimodal, ophthalmology LLM | 06/2023 | https://github.com/ML-AILab/OphGLM |
| SoulChat | ChatGLM | 6 | Mental healthcare | 06/2023 | https://github.com/scutcyr/SoulChat |
| Med-Flamingo | Flamingo | 80 | Multimodal, few-shot generative medical VQA | 07/2023 | https://github.com/snap-stanford/med-flamingo |

<br> <br>

PLM Information

TABLE I BRIEF SUMMARIZATION OF EXISTING PLMS FOR HEALTHCARE.

| Model Name | Base | Para. (B) | Features | Date | Link |
| --- | --- | --- | --- | --- | --- |
| BioBERT | BERT | 0.34 | Biomedical Adaptation | 05/2019 | https://github.com/naver/biobert-pretrained |
| BlueBERT | BERT | 0.34 | Biomedical Benchmark | 06/2019 | https://github.com/ncbi-nlp/BLUE_Benchmark |
| MIMIC-BERT | BERT | 0.34 | Clinical Concept Extraction | 08/2019 | - |
| BioFLAIR~ | BERT | 0.34 | Less Computationally Intensive | 08/2019 | https://github.com/zalandoresearch/flair |
| Bio-ELECTRA-small | ELECTRA | 0.03 | Training From Scratch | 03/2020 | - |
| AlphaBERT | BERT | 0.11 | Character-level | 04/2020 | https://github.com/wicebing/AlphaBERT.git |
| Spanish-bert | BERT | - | Spanish | 04/2020 | - |
| GreenCovidSQuADBERT | BERT | 0.34 | CPU-only, CORD-19 | 04/2020 | https://github.com/npoe/covid-qa |
| BEHRT | Transformer | - | Training From Scratch | 04/2020 | https://github.com/deepmedicine/BEHRT |
| BioMed-RoBERTa | RoBERTa | 0.11 | Biomedical Adaptation | 05/2020 | https://github.com/allenai/dont-stop-pretraining |
| RadBERT~ | BERT | - | RadCore Radiology Reports | 05/2020 | - |
| CT-BERT~ | BERT | 0.34 | COVID-19 | 05/2020 | https://github.com/digitalepidemiologylab/covid-twitter-bert |
| French-BERT | BERT | 0.11 | French Language Models | 06/2020 | - |
| FS-/RAD-/GER-BERT | BERT | 0.11 | Chest Radiograph Reports | 07/2020 | https://github.com/fast-raidiology/bertfor-radiology |
| Japanese-BERT | BERT | 0.11 | Japanese Clinical Narrative | 07/2020 | ai-health.m.u-tokyo.ac.jp/home/research/uth-bert |
| MC-BERT | BERT | 0.11 | Chinese Biomedical Benchmark | 08/2020 | https://github.com/alibabaresearch/ChineseBLUE |
| BioALBERT-ner | ALBERT | 0.18 | Biomedical NER | 09/2020 | https://github.com/usmaann/BioALBERT |
| BioMegatron | Megatron | 1.2 | Training From Scratch | 10/2020 | https://github.com/NVIDIA/NeMo |
| CharacterBERT | BERT | 0.11 | Character-CNN module | 10/2020 | https://github.com/helboukkouri/character-bert |
| ClinicalBert | BERT | 0.11 | For Predicting Hospital Readmission | 11/2020 | https://github.com/kexinhuang12345/clinicalBERT |
| Clinical XLNet | XLNet | 0.11 | Temporal Information | 11/2020 | https://github.com/lindvalllab/clinicalXLNet |
| Bio-LM | RoBERTa | 0.34 | Biomedical Adaptation | 11/2020 | https://github.com/facebookresearch/bio-lm |
| BioBERTpt | BERT | 0.11 | Portuguese Clinical | 11/2020 | https://github.com/HAILab-PUCPR/BioBERTpt |
| RoBERTa-MIMIC | RoBERTa | 0.11 | Clinical Concept Extraction | 12/2020 | https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER |
| Clinical KB-ALBERT | ALBERT | 0.03 | Introducing Medical KB | 12/2020 | https://github.com/noc-lab/clinical-kb-bert |
| CHMBERT | BERT | 0.11 | Chinese Medical, Cloud Computing | 01/2021 | - |
| PubMedBERT | BERT | 0.11 | Training From Scratch | 01/2021 | https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext |
| ouBioBERT | BERT | 0.11 | Up-sampling, Amplified Vocabulary | 02/2021 | https://github.com/sy-wada/blue_benchmark_with_transformers |
| BERT-EHR | BERT | - | Depression, Chronic Disease Prediction | 03/2021 | https://github.com/lanyexiaosa/brltm |
| AraBERT | BERT | 0.11 | Arabic Language | 03/2021 | https://github.com/aub-mind/araBERT |
| ABioNER | BERT | 0.11 | Arabic NER | 03/2021 | - |
| ELECTRAMed | ELECTRA | 0.11 | Biomedical Adaptation | 04/2021 | https://github.com/gmpoli/electramed |
| KeBioLM | PubMedBERT | 0.11 | Introducing Medical KB | 04/2021 | https://github.com/GanjinZero/KeBioLM |
| SINA-BERT | BERT | 0.11 | Persian Language | 04/2021 | - |
| Med-BERT | BERT | 0.11 | Stay Length Prediction | 05/2021 | https://github.com/ZhiGroup/MedBERT |
| Galén | RoBERTa | 0.11 | Spanish Language | 05/2021 | https://github.com/guilopgar/ClinicalCodingTransformerES |
| SCIFIVE~ | T5 | 0.77 | Biomedical Text Generation | 05/2021 | https://github.com/justinphan3110/SciFive |
| BioELECTRA | ELECTRA | 0.34 | Training From Scratch | 06/2021 | https://github.com/kamalkraj/BioELECTRA |
| UmlsBERT | BERT | 0.11 | Introducing Medical KB | 06/2021 | https://github.com/gmichalo/UmlsBERT |
| MedGPT | GPT-2 | 1.5 | Temporal Modelling | 07/2021 | - |
| MentalBERT | BERT | 0.11 | Mental Healthcare | 10/2021 | https://huggingface.co/mental |
| CODER | mBERT | 0.34 | Cross-lingual, Introducing Medical KB | 02/2022 | https://github.com/GanjinZero/CODER |
| BioLinkBERT~ | BERT | 0.34 | PubMed with Citation Links | 03/2022 | https://github.com/michiyasunaga/LinkBERT |
| BioALBERT | ALBERT | 0.03 | Biomedical Adaptation | 04/2022 | https://github.com/usmaann/BioALBERT |
| BioBART~ | BART | 0.4 | Biomedical NLG | 04/2022 | https://github.com/GanjinZero/BioBART |
| SAPBERT | BERT | 0.11 | Self-Alignment Pretraining | 10/2022 | https://github.com/cambridgeltl/sapbert |
| VPP | BART | 0.14 | Soft prompt, Biomedical NER | 03/2023 | https://github.com/KaiHe-better/VPP |
| KAD | BERT | - | Multimodal, Chest Radiology Images | 03/2023 | https://github.com/xiaoman-zhang/KAD |

<br> <br>

TABLE II SUMMARIZATION OF TRAINING DATA AND EVALUATION TASKS FOR EXISTING PLMS FOR HEALTHCARE.

| Model Name | Method | Training Data | Eval Task |
| --- | --- | --- | --- |
| BioBERT | FT | PubMed, PMC | Biomedical NER, RE, QA |
| BlueBERT | FT | PubMed, MIMIC-III | BLUE |
| MIMIC-BERT | FT | MIMIC-III | Biomedical NER |
| BioFLAIR~ | FT | PubMed | Bio NER |
| Bio-ELECTRA-small | PT | PubMed | Biomedical NER |
| AlphaBERT | FT | Discharge diagnoses | Extractive Summarization Task |
| Spanish-bert | FT | Spanish | Spanish Clinical Case Corpus |
| GreenCovidSQuADBERT | FT | CORD19, PubMed, PMC | NER, QA |
| BEHRT | PT | CPRD, HES | Disease Prediction |
| BioMed-RoBERTa | FT | BIOMED | CHEMPROT, RCT |
| RadBERT~ | FT | Radiology Report Corpus | Report Coding, Summarization |
| CT-BERT~ | FT | Tweet | COVID-19 Text Classification |
| French-BERT | FT | French clinical documents | DEFT challenge |
| FS-/RAD-/GER-BERT | FT, PT | Unstructured radiology reports | Chest Radiograph Reports Classification |
| Japanese-BERT | FT | Japanese EHR | Symptoms Classification |
| MC-BERT | FT | Chinese EHR | Chinese Biomedical Evaluation benchmark |
| BioALBERT-ner | FT | PubMed, PMC | Biomedical NER |
| BioMegatron | PT | PubMed | Biomedical NER, RE, QA |
| CharacterBERT | Bert | OpenWebText, MIMIC-III, PMC | Medical NER, NLI, RE, SS |
| ClinicalBert | FT | MIMIC-III | Hospital Readmission Prediction |
| Clinical XLNet | FT | MIMIC-III | PMV, Mortality |
| Bio-LM | FT | PubMed, PMC, MIMIC-III | 18 Biomedical NLP Tasks |
| BioBERTpt | FT | Private clinical notes, WMT16 | SemClinBr |
| RoBERTa-MIMIC | FT | i2b2 2010, 2012, n2c2 2018 | i2b2 2010, 2012, n2c2 2018 |
| Clinical KB-ALBERT | FT | MIMIC-III, UMLS | MedNLI, i2b2 2010, 2012 |
| CHMBERT | FT | Medical text data | Disease Prediction |
| PubMedBERT | PT | PubMed | BLURB |
| ouBioBERT | FT | PubMed, Wikipedia | BLUE |
| BERT-EHR | FT | General EHR | Myocardial Infarction, Breast Cancer, Liver Cirrhosis |
| AraBERT | PT | Arabic Wikipedia, OSIAN | Arabic SA, NER, QA |
| ABioNER | FT | Arabic scientific literature | Arabic NER |
| ELECTRAMed | FT | PubMed | Biomedical NER, RE, QA |
| KeBioLM | FT | PubMed | BLURB |
| SINA-BERT | FT | Online Persian sources | Persian QA, SA |
| Med-BERT | FT | General EHR | Disease Prediction |
| Galén | FT | Private clinical cases | CodiEsp-D, CodiEsp-P, Cantemist-Coding tasks |
| SCIFIVE~ | T5 | PubMed, PMC | Biomedical NER, RE, NLI, QA |
| BioELECTRA | PT | PubMed, PMC | BLURB, BLUE |
| UmlsBERT | FT | MIMIC-III | MedNLI, i2b2 2006, 2010, 2012, 2014 |
| MedGPT | FT | MIMIC-III, private EHRs | Disorder Prediction |
| MentalBERT | FT | Reddit | Depression, Stress, Suicide Detection |
| CODER | FT | UMLS | MCSM, Medical RE |
| BioLinkBERT~ | FT | PubMed | BLURB, USMLE |
| BioALBERT | FT | PubMed, PMC, MIMIC-III | 6 BioNLP Tasks |
| BioBART~ | FT | PubMed | Biomedical EL, NER, QA, Dialogue, Summarization |
| SAPBERT | FT | UMLS | MEL |
| VPP | FT | PubMed | Biomedical NER |
| KAD | FT | MIMIC-CXR | PadChest, ChestXray14, CheXpert, ChestX-Det10 |

<br> <br>

Available Training Data

| Data | Type | Size | Link |
| --- | --- | --- | --- |
| MIMIC-III | EHR | 58,976 hospital admissions for 38,597 patients | https://mimic.mit.edu/docs/iii/ |
| MIMIC-IV | EHR | A decade of admissions between 2008 and 2019 | https://mimic.mit.edu/docs/iv/ |
| CPRD | EHR | Over 2,000 primary care practices covering 60 million patients | https://cprd.com/data |
| PubMed | Scientific Literature | 35M citations and abstracts of biomedical literature | https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ |
| PMC | Scientific Literature | 8 million full-text article records | https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk |
| RCT | Scientific Literature | 4,528 abstracts | https://github.com/bwallace/RCT-summarization-data |
| MS^2 | Scientific Literature | 470,402 abstracts | https://github.com/allenai/ms2/ |
| CDSR | Scientific Literature | 7,805 abstracts | https://github.com/qiuweipku/Plain_language_summarization |
| SumPubMed | Scientific Literature | 33,772 abstracts | https://github.com/vgupta123/sumpubmed |
| The Pile | Scientific Literature | 825 GB of English text | https://pile.eleuther.ai/ |
| S2ORC | Scientific Literature | 63,709 abstracts | https://github.com/jbshp/GenCompareSum |
| CORD-19 | Scientific Literature | 1M papers | https://github.com/allenai/cord19 |
| MeQSum | Medical Question Summarization | 1,000 instances | https://github.com/abachaa/MeQSum |
| CHQ-Sum | Medical Question Summarization | 1,507 instances | https://github.com/shwetanlp/Yahoo-CHQ-Summ |
| UMLS | Knowledge Base | 2M entities for 900K concepts | https://www.nlm.nih.gov/research/umls/index.html |
| COMETA | Web Data (social media) | 800K Reddit posts | https://github.com/cambridgeltl/cometa |
| MedDialog | Dialogue | 3.66 million conversations | https://github.com/UCSD-AI4H/COVID-Dialogue |
| CovidDialog | Dialogue | 603 consultations | https://github.com/UCSD-AI4H/COVID-Dialogue |
| Medical Flashcards | Dialogue | 33,955 instances | https://github.com/kbressem/medalpaca |
| Wikidoc | Dialogue | 67,704 instances | https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc |
| Wikidoc Patient Information | Dialogue | 5,942 instances | https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information |
| MEDIQA | Dialogue | 2,208 instances | https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information |
| CORD-19 | Dialogue | 1,056,660 instances | https://huggingface.co/datasets/medalpaca/medical_meadow_cord19 |
| MMMLU | Dialogue | 3,787 instances | https://huggingface.co/datasets/medalpaca/medical_meadow_mmmlu |
| Pubmed Causal | Dialogue | 2,446 instances | https://huggingface.co/datasets/medalpaca/medical_meadow_pubmed_causal |
| ChatDoctor | Dialogue | 215,000 instances | https://github.com/Kent0n-Li/ChatDoctor |
| Alpaca-EN-AN | English Instructions | 52K instructions | https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json |
| Alpaca-CH-AN | Chinese Instructions | 52K instructions | https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/tree/main/data |
| ShareGPT | Conversations | 61,653 long conversations | https://huggingface.co/datasets/philschmid/sharegpt-raw |
| WebText | Web Data | 40 GB of text | https://commoncrawl.org/the-data/get-started/ |
| OpenWebText | Web Data | 38 GB of text | https://skylion007.github.io/OpenWebTextCorpus/ |
| Colossal Clean Crawled Corpus | Web Data | 806 GB of text | https://www.tensorflow.org/datasets/catalog/c4 |
| OpenI | EHR, Multimodal | 3.7 million images from about 1.2 million papers | https://openi.nlm.nih.gov/faq#collection |
| U-Xray | Multimodal | 3,955 reports and 7,470 images | https://openi.nlm.nih.gov/ |
| ROCO | Multimodal | 81,000 radiology images with corresponding captions | https://github.com/razorx89/roco-dataset |
| MedICaT | Multimodal | 17,000 images with captions | https://github.com/allenai/medicat |
| PMC-OA | Multimodal | 1.6M image-caption pairs | https://huggingface.co/datasets/axiong/pmc_oa_beta |
| CheXpert | Multimodal | 224,316 chest radiographs with associated reports | https://aimi.stanford.edu/chexpert-chest-x-rays |
| PadChest | Multimodal | 160,000 images with related text | http://bimcv.cipf.es/bimcv-projects/padchest/ |
| MIMIC-CXR | Multimodal | 227,835 imaging studies for 64,588 patients | https://mimic.mit.edu/docs/iv/modules/cxr/ |
| PMC-15M | Multimodal | 15 million figure-caption pairs | https://arxiv.org/abs/2303.00915 |
| OpenPath | Multimodal | 208,414 pathology images with related descriptions | https://laion.ai/blog/laion-5b/ |
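
Many of the instruction-style sets above are hosted on the Hugging Face Hub and can be pulled directly with the `datasets` library. The snippet below is a minimal sketch (not part of the survey); it assumes `datasets` is installed and that the `medalpaca/medical_meadow_wikidoc` ID taken from the Wikidoc link above is still published under that name.

```python
# Minimal sketch: load one of the Hugging Face-hosted sets listed above.
# Assumes `pip install datasets`; the dataset ID is taken from the Wikidoc link.
from datasets import load_dataset

wikidoc = load_dataset("medalpaca/medical_meadow_wikidoc", split="train")

print(len(wikidoc))   # the table above reports 67,704 instances
print(wikidoc[0])     # one instruction-style record (field names depend on the dataset)
```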

The Statistics of Computation Cost

TABLE VIII THE STATISTICS OF COMPUTATION COST FOR EXISTING HEALTHCARE LLMS.

| Model Name | Total Data Size | Epoch | Batch Size | GPU Type | GPU Number | GPU Time |
| --- | --- | --- | --- | --- | --- | --- |
| Visual Med-Alpaca | 54k data points | 3 | 128 | A100-80G | 4 | 2.51 hours |
| GatorTron | >90 billion words | 10 | - | A100 | 992 | 6 days |
| Galactica | - | - | - | A100-80G | 128 | - |
| ChatDoctor | 100k conversations | 3 | 192 | A100 | 6 | 3 hours |
| DoctorGLM | 3.5G | 1 | 4 | A100-80G | 1 | 8 hours |
| PMC-LLaMA | 75B tokens | 5 | 128 | A100 | 8 | 7 days |
| Visual Med-Alpaca | 44.8MB* (without images) | - | 128 | A100-80G | 4 | 2.51 hours |
| BianQue | 1.09 million samples | 1 | - | RTX 4090 | 8 | 16 days |
| GatorTronGPT | 277B tokens | - | 1,120/560 | A100-80G | 560 | 26 days |
| HuatuoGPT | 226,042 instances | 3 | 128 | A100 | 8 | - |
| LLaVA-Med | 15 million figure-caption pairs | - | - | A100 | 8 | 15 hours |
| Med-Flamingo | 1.3M image-caption pairs | - | 400 | A100-80G | 8 | 6.75 days |
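
For a rough like-for-like comparison, the last two columns can be collapsed into a single GPU-hours figure (GPU number × GPU time). The sketch below is back-of-the-envelope arithmetic over a few rows copied from Table VIII, not a figure reported by the survey.

```python
# Back-of-the-envelope GPU-hours (GPU number x GPU time) for a few Table VIII rows.
# Values are copied from the table above; treat the results as order-of-magnitude only.
runs = {
    "ChatDoctor":   (6,   3),        # 6 A100s for 3 hours
    "LLaVA-Med":    (8,   15),       # 8 A100s for 15 hours
    "PMC-LLaMA":    (8,   7 * 24),   # 8 A100s for 7 days
    "GatorTronGPT": (560, 26 * 24),  # 560 A100-80G for 26 days
}

for name, (gpus, hours) in runs.items():
    print(f"{name:<13} ~{gpus * hours:,} GPU-hours")
```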

<br> <br>

TABLE IX ESTIMATED FLOPS AND TRAINING TOKENS FOR DIFFERENT MODEL SIZES.

| Parameters | FLOPs | FLOPs (in Gopher unit) | Tokens |
| --- | --- | --- | --- |
| 400 Million | 1.92e+19 | 1/29,968 | 8.0 Billion |
| 1 Billion | 1.21e+20 | 1/4,761 | 20.2 Billion |
| 10 Billion | 1.23e+22 | 1/46 | 205.1 Billion |
| 67 Billion | 5.76e+23 | 1 | 1.5 Trillion |
| 175 Billion | 3.85e+24 | 6.7 | 3.7 Trillion |
| 280 Billion | 9.90e+24 | 17.2 | 5.9 Trillion |
| 520 Billion | 3.43e+25 | 59.5 | 11.0 Trillion |
| 1 Trillion | 1.27e+26 | 221.3 | 21.2 Trillion |
| 10 Trillion | 1.30e+28 | 22515.9 | 216.2 Trillion |
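
The FLOPs column in Table IX is consistent with the widely used rule of thumb C ≈ 6·N·D (training compute ≈ 6 × parameters × training tokens). The short check below is an illustration of that approximation, not the exact accounting used to produce the table.

```python
# Check that Table IX's FLOPs roughly follow C ~= 6 * N * D
# (N = parameters, D = training tokens). Small deviations are expected,
# since the table's figures come from a more detailed FLOP count.
rows = [
    # (parameters, tokens, FLOPs reported in Table IX)
    (400e6,  8.0e9,   1.92e19),
    (1e9,    20.2e9,  1.21e20),
    (175e9,  3.7e12,  3.85e24),
    (1e12,   21.2e12, 1.27e26),
]

for n, d, reported in rows:
    print(f"N={n:.0e}  6ND={6 * n * d:.2e}  table={reported:.2e}")
```

For these rows the 6·N·D estimate and the tabulated FLOPs agree to within a few percent.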

Citation

@misc{he2023survey,
      title={A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics}, 
      author={Kai He and Rui Mao and Qika Lin and Yucheng Ruan and Xiang Lan and Mengling Feng and Erik Cambria},
      year={2023},
      eprint={2310.05694},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}