Awesome-Foundation-Models-for-Advancing-Healthcare

[NEWS.20240405] The related survey paper has been released.

[NOTE] If you have any questions, please don't hesitate to contact us.

Foundation models, which are pre-trained on broad data and can adapt to a wide range of tasks, are advancing healthcare. They promote the development of healthcare artificial intelligence (AI) models by resolving the mismatch between limited AI models and diverse healthcare practices. A much wider range of healthcare scenarios will benefit from the development of healthcare foundation models (HFMs), which enable more advanced intelligent healthcare services.

This repository is a collection of awesome resources on foundation models in healthcare, including language foundation models (LFMs), vision foundation models (VFMs), bioinformatics foundation models (BFMs), and multimodal foundation models (MFMs). Feel free to star and fork.

<p align="center"><img width="100%" src="figs/main.png" /></p>

This repository tracks the improvement and advancement of current healthcare foundation models, based on the following paper:

Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions (Chinese translation)<br/> Yuting He, Fuxiang Huang, Xinrui Jiang, Yuxiang Nie, Minghao Wang, Jiguang Wang, Hao Chen<br/> SMART Lab, The Hong Kong University of Science and Technology<br/> <br/>

If you find this repository useful, please cite our paper:

@misc{he2024foundation,
      title={Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions}, 
      author={Yuting He and Fuxiang Huang and Xinrui Jiang and Yuxiang Nie and Minghao Wang and Jiguang Wang and Hao Chen},
      year={2024},
      eprint={2404.03264},
      archivePrefix={arXiv},
      primaryClass={cs.CY}
}

Contents

Related survey

2024

2023

Methods

LFM methods

2024

2023

2022

2021

2020

2019

VFM methods

2024

2023

2022

2021

2020

2019

BFM methods

2024

2022

2021

MFM methods

2024

2023

2022

2021

Datasets

LFM datasets

| Dataset Name | Text Types | Scale | Task | Link |
| --- | --- | --- | --- | --- |
| PubMed | Literature | 18B tokens | Language modeling | * |
| MedC-I | Literature | 79.2B tokens | Dialogue | * |
| Guidelines | Literature | 47K instances | Language modeling | * |
| PMC-Patients | Literature | 167K instances | Information retrieval | * |
| MIMIC-III | Health record | 122K instances | Language modeling | * |
| MIMIC-IV | Health record | 299K instances | Language modeling | * |
| eICU-CRD v2.0 | Health record | 200K instances | Language modeling | * |
| EHRs | Health record | 82B tokens | Named entity recognition, Relation extraction, Semantic textual similarity, Natural language inference, Dialogue | - |
| MD-HER | Health record | 96K instances | Dialogue, Question answering | - |
| IMCS-21 | Dialogue | 4K instances | Dialogue | * |
| Huatuo-26M | Dialogue | 26M instances | Question answering | * |
| MedInstruct-52k | Dialogue | 52K instances | Dialogue | * |
| MASH-QA | Dialogue | 35K instances | Dialogue | * |
| MedQuAD | Dialogue | 47K instances | Dialogue | * |
| MedDG | Dialogue | 17K instances | Dialogue | * |
| CMExam | Dialogue | 68K instances | Dialogue | * |
| cMedQA2 | Dialogue | 108K instances | Dialogue | * |
| CMtMedQA | Dialogue | 70K instances | Dialogue | * |
| CliCR | Dialogue | 100K instances | Dialogue | * |
| webMedQA | Dialogue | 63K instances | Dialogue | * |
| ChiMed | Dialogue | 1.59B tokens | Dialogue | * |
| MedDialog | Dialogue | 20K instances | Dialogue | * |
| CMD | Dialogue | 882K instances | Dialogue | * |
| BianqueCorpus | Dialogue | 2.4M instances | Dialogue | * |
| MedQA | Dialogue | 4K instances | Dialogue | * |
| HealthcareMagic | Dialogue | 100K instances | Dialogue | * |
| iCliniq | Dialogue | 10K instances | Dialogue | * |
| CMeKG-8K | Dialogue | 8K instances | Dialogue | * |
| Hybrid SFT | Dialogue | 226K instances | Dialogue | * |
| VariousMedQA | Dialogue | 54K instances | Dialogue | * |
| Medical Meadow | Dialogue | 160K instances | Dialogue | * |
| MultiMedQA | Dialogue | 193K instances | Dialogue | - |
| BiMed1.3M | Dialogue | 250K instances | Dialogue | * |
| OncoGPT | Dialogue | 180K instances | Dialogue | * |

VFM datasets

| Dataset Name | Modality | Scale | Task | Link |
| --- | --- | --- | --- | --- |
| LIMUC | Endoscopy | 1,043 videos (11,276 frames) | Detection | * |
| SUN | Endoscopy | 1,018 videos (158,690 frames) | Detection | * |
| Kvasir-Capsule | Endoscopy | 117 videos (4,741,504 frames) | Detection | * |
| EndoSLAM | Endoscopy | 1,020 videos (158,690 frames) | Detection, Registration | * |
| LDPolypVideo | Endoscopy | 263 videos (895,284 frames) | Detection | * |
| HyperKvasir | Endoscopy | 374 videos (1,059,519 frames) | Detection | * |
| CholecT45 | Endoscopy | 45 videos (90,489 frames) | Segmentation, Detection | * |
| DeepLesion | CT slices (2D) | 32,735 images | Segmentation, Registration | * |
| LIDC-IDRI | 3D CT | 1,018 volumes | Segmentation | * |
| TotalSegmentator | 3D CT | 1,204 volumes | Segmentation | * |
| TotalSegmentator v2 | 3D CT | 1,228 volumes | Segmentation | * |
| AutoPET | 3D CT, 3D PET | 1,214 PET-CT pairs | Segmentation | * |
| ULS | 3D CT | 38,842 volumes | Segmentation | * |
| FLARE 2022 | 3D CT | 2,300 volumes | Segmentation | * |
| FLARE 2023 | 3D CT | 4,500 volumes | Segmentation | * |
| AbdomenCT-1K | 3D CT | 1,112 volumes | Segmentation | * |
| CTSpine1K | 3D CT | 1,005 volumes | Segmentation | * |
| CTPelvic1K | 3D CT | 1,184 volumes | Segmentation | * |
| MSD | 3D CT, 3D MRI | 1,411 CT, 1,222 MRI | Segmentation | * |
| BraTS21 | 3D MRI | 2,040 volumes | Segmentation | * |
| BraTS2023-MEN | 3D MRI | 1,650 volumes | Segmentation | * |
| ADNI | 3D MRI | - | Clinical study | * |
| PPMI | 3D MRI | - | Clinical study | * |
| ATLAS v2.0 | 3D MRI | 1,271 volumes | Segmentation | * |
| PI-CAI | 3D MRI | 1,500 volumes | Segmentation | * |
| MRNet | 3D MRI | 1,370 volumes | Segmentation | * |
| Retinal OCT-C8 | 2D OCT | 24,000 volumes | Classification | * |
| Ultrasound Nerve Segmentation | US | 11,143 images | Segmentation | * |
| Fetal Planes | US | 12,400 images | Classification | * |
| EchoNet-LVH | US | 12,000 videos | Detection, Clinical study | * |
| EchoNet-Dynamic | US | 10,030 videos | Function assessment | * |
| AIROGS | CFP | 113,893 images | Classification | * |
| ISIC 2020 | Dermoscopy | 33,126 images | Classification | * |
| LC25000 | Pathology | 25,000 images | Classification | * |
| DeepLIIF | Pathology | 1,667 WSIs | Classification | * |
| PAIP | Pathology | 2,457 WSIs | Segmentation | * |
| TissueNet | Pathology | 1,016 WSIs | Classification | * |
| NLST | 3D CT, Pathology | 26,254 CT, 451 WSIs | Clinical study | * |
| CRC | Pathology | 100K images | Classification | * |
| MURA | X-ray | 40,895 images | Detection | * |
| ChestX-ray14 | X-ray | 112,120 images | Detection | * |
| SNOW | Synthetic pathology | 20K image tiles | Segmentation | * |

BFM datasets

| Dataset Name | Modality | Scale | Task | Link |
| --- | --- | --- | --- | --- |
| CellxGene Corpus | scRNA-seq | over 72M scRNA-seq data | Single cell omics study | * |
| NCBI GenBank | DNA | 3.7B sequences | Genomics study | * |
| SCP | scRNA-seq | over 40M scRNA-seq data | Single cell omics study | * |
| Gencode | DNA | - | Genomics study | * |
| 10x Genomics | scRNA-seq, DNA | - | Single cell omics and genomics study | * |
| ABC Atlas | scRNA-seq | over 15M scRNA-seq data | Single cell omics study | * |
| Human Cell Atlas | scRNA-seq | over 50M scRNA-seq data | Single cell omics study | * |
| UCSC Genome Browser | DNA | - | Genomics study | * |
| CPTAC | DNA, RNA, protein | - | Genomics and proteomics study | * |
| Ensembl Project | Protein | - | Proteomics study | * |
| RNAcentral database | RNA | 36M sequences | Transcriptomics study | * |
| AlphaFold DB | Protein | 214M structures | Proteomics study | * |
| PDBe | Protein | - | Proteomics study | * |
| UniProt | Protein | over 250M sequences | Proteomics study | * |
| LINCS L1000 | Small molecules | 1,000 genes with 41K small molecules | Disease research, drug response | * |
| GDSC | Small molecules | 1,000 cancer cells with 400 compounds | Disease research, drug response | * |
| CCLE | - | - | Bioinformatics study | * |

MFM datasets

| Dataset Name | Modalities | Scale | Task | Link |
| --- | --- | --- | --- | --- |
| MIMIC-CXR | X-ray, Medical report | 377K images, 227K texts | Vision-language learning | * |
| PadChest | X-ray, Medical report | 160K images, 109K texts | Vision-language learning | * |
| CheXpert | X-ray, Medical report | 224K images, 224K texts | Vision-language learning | * |
| ImageCLEF2018 | Multimodal, Captions | 232K images, 232K texts | Image captioning | * |
| OpenPath | Pathology, Tweets | 208K images, 208K texts | Vision-language learning | * |
| PathVQA | Pathology, QA | 4K images, 32K QA pairs | VQA | * |
| Quilt-1M | Pathology images, Mixed-source text | 1M images, 1M texts | Vision-language learning | * |
| PatchGastricADC22 | Pathology, Captions | 991 WSIs, 991 texts | Image captioning | * |
| PTB-XL | ECG, Medical report | 21K records, 21K texts | Vision-language learning | * |
| ROCO | Multimodal, Captions | 87K images, 87K texts | Vision-language learning | * |
| MedICaT | Multimodal, Captions | 217K images, 217K texts | Vision-language learning | * |
| PMC-OA | Multimodal, Captions | 1.6M images, 1.6M texts | Vision-language learning | * |
| ChiMed-VL | Multimodal, Medical report | 580K images, 580K texts | Vision-language learning | * |
| PMC-VQA | Multimodal, QA | 149K images, 227K QA pairs | VQA | * |
| SwissProtCLAP | Protein sequence, Text | 441K protein sequences, 441K texts | Protein-language learning | * |
| Duke Breast Cancer MRI | Genomic, MRI images, Clinical data | 922 patients | Multimodal learning | * |
| I-SPY2 | MRI images, Clinical data | 719 patients | Multimodal learning | * |

Large-scale comprehensive databases

| Database | Description | Link |
| --- | --- | --- |
| CGGA | The Chinese Glioma Genome Atlas (CGGA) database contains clinical and sequencing data for over 2,000 brain tumor samples from Chinese cohorts. | * |
| UK Biobank | UK Biobank is a large-scale biomedical database and research resource containing de-identified genetic, lifestyle, and health information and biological samples from half a million UK participants. | * |
| TCGA | The Cancer Genome Atlas (TCGA) program molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types, generating over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. | * |
| TCIA | The Cancer Imaging Archive (TCIA) is a service that de-identifies and hosts a large, publicly available archive of medical images of cancer. | * |

Other resources

Lectures and tutorials

Blogs

Related awesome repositories