Data-Centric Foundation Models in Computational Healthcare

:fire::fire::fire: A survey on data-centric foundation models in computational healthcare

Project Page | Paper [arXiv]

Last updated: 2024/10/08

:pencil: If you find this repo helpful, please cite our survey. Thanks!

@article{zhang2024data,
  title={Data-Centric Foundation Models in Computational Healthcare: A Survey},
  author={Zhang, Yunkun and Gao, Jin and Tan, Zheling and Zhou, Lingfeng and Ding, Kexin and Zhou, Mu and Zhang, Shaoting and Wang, Dequan},
  journal={arXiv preprint arXiv:2401.02458},
  year={2024}
}

In this repository, we provide an up-to-date list of healthcare-related foundation models and datasets, which are also mentioned in our survey paper.

Healthcare and Medical Foundation Models

A star (*) after the pre-training data indicates that the authors constructed the data from more than three sources.
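As a minimal sketch of how the star convention could be read programmatically, the helper below separates the dataset name from the star flag in a Pre-Training Data cell. The name `split_star` is hypothetical and not part of this repository, and the sketch assumes the star is always the trailing character of the cell.

```python
def split_star(cell: str) -> tuple[str, bool]:
    """Split a Pre-Training Data cell into (dataset name, starred flag).

    Assumption: a trailing "*" marks data built from more than three sources,
    and a cell that is only "*" means the data has no single name.
    """
    text = cell.strip()
    starred = text.endswith("*")
    # Drop the trailing star(s) and any whitespace left before them.
    name = text.rstrip("*").strip()
    return name, starred


if __name__ == "__main__":
    for cell in ["MMedC*", "PMC", "*"]:
        print(cell, "->", split_star(cell))
```

Cells such as `BigBio + CBLUE` pass through unchanged with the flag set to `False`, so the same helper can be applied to every row of the model tables.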

Language Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| MMedLM 2 | Medicine | Towards Building Multilingual Language Model for Medicine | Github | InternLM 2 | MMedC* |
| BiMediX | Medicine | BiMediX: Bilingual Medical Mixture of Experts LLM | Github | Mixtral | BiMed1.3M* |
| Me LLaMA | Medicine | Me LLaMA: Foundation Large Language Models for Medical Applications | Github | LLaMA 2 | * |
| BioMistral | Biomedicine | BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains | - | Mistral | PMC |
| PULSE | Medicine | - | Github | InternLM | * |
| Meditron | Medicine | Meditron-70B: Scaling Medical Pretraining for Large Language Models | Github | LLaMA 2 | GAP-Replay* |
| Taiyi | Biomedicine | Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks | Github | Qwen | BigBio + CBLUE |
| BioMedGPT | Biomedicine | BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine | Github | LLaMA 2 | S2ORC |
| Clinical LLaMA-LoRA | Clinic | Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain | - | LLaMA | MIMIC-IV |
| Med-PaLM 2 | Clinic | Towards Expert-Level Medical Question Answering with Large Language Models | Google | PaLM 2 | MedQA |
| PMC-LLaMA | Medicine | PMC-LLaMA: Towards Building Open-source Language Models for Medicine | Github | LLaMA | MedC |
| MedAlpaca | Medicine | MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data | Github | LLaMA | Medical Meadow |
| BenTsao (HuaTuo) | Biomedicine | HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge | Github | LLaMA | CMeKG |
| ChatDoctor | Medicine | ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge | Github | LLaMA | HealthCareMagic* |
| Clinical-T5 | Clinic | Clinical-T5: Large Language Models Built Using MIMIC Clinical Text | PhysioNet | T5 | MIMIC-III + MIMIC-IV |
| Med-PaLM | Clinic | Large Language Models Encode Clinical Knowledge | Google | PaLM | MedQA |
| BioGPT | Biomedicine | BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining | Github | GPT-2 | PubMed |
| BioLinkBERT | Biomedicine | LinkBERT: Pretraining Language Models with Document Links | Github | BERT | PubMed |
| PubMedBERT | Biomedicine | Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing | Microsoft | BERT | PubMed |
| BioBERT | Biomedicine | BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining | Github | BERT | PubMed + PMC |
| BlueBERT | Biomedicine | An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining | Github | BERT | PubMed + MIMIC-III |
| Clinical BERT | Clinic | Publicly Available Clinical BERT Embeddings | Github | BERT | MIMIC-III |
| SciBERT | Biomedicine | SciBERT: A Pretrained Language Model for Scientific Text | Github | BERT | Semantic Scholar |

Vision Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| Prov-GigaPath | Pathology | A Whole-Slide Foundation Model for Digital Pathology from Real-World Data | Github | - | Prov-Path* |
| BEPH | Pathology | A Foundation Model for Generalizable Cancer Diagnosis and Survival Prediction from Histopathological Images | Github | BEiTv2 | * |
| (No name) | Radiology | Foundation Model for Cancer Imaging Biomarkers | Github | SimCLR | * |
| VISION-MAE | Radiology | VISION-MAE: A Foundation Model for Medical Image Segmentation and Classification | - | MAE | * |
| RudolfV | Pathology | RudolfV: A Foundation Model by Pathologists for Pathologists | - | DINOv2 | * |
| PathoDuet | Pathology | PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains | Github | MoCo v3 | TCGA + HyReCo + BCI |
| UNI | Pathology | A General-Purpose Self-Supervised Model for Computational Pathology | - | DINOv2 | Mass-100K |
| REMEDIS | Radiology | Robust and Data-Efficient Generalization of Self-Supervised Machine Learning for Diagnostic Imaging | Github | SimCLR | MIMIC-IV + CheXpert |
| Virchow | Pathology | Virchow: A Million-Slide Digital Pathology Foundation Model | - | DINOv2 | * |
| RETFound | Retinopathy | A Foundation Model for Generalizable Disease Detection from Retinal Images | Github | MAE | * |
| CTransPath | Pathology | Transformer-Based Unsupervised Contrastive Learning for Histopathological Image Classification | Github | - | TCGA + PAIP |
| HIPT | Pathology | Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning | Github | DINO | TCGA |

Vision-Language Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| Uni-Med | Medicine | Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE | - | CLIP + LLaMA 2 | * |
| RadFound | Radiology | Expert-Level Vision-Language Foundation Model for Real-World Radiology and Comprehensive Evaluation | - | - | RadVLCorpus* |
| PRISM | Pathology | PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology | - | CoCa | * |
| Med-Gemini | Medicine | Capabilities of Gemini Models in Medicine | - | Gemini | * |
| EchoCLIP | Cardiology | Vision-Language Foundation Model for Echocardiogram Interpretation | Github | CLIP | * |
| ChemDFM | Chemistry | ChemDFM: Dialogue Foundation Model for Chemistry | - | LLaMA | PubMed + USPTO |
| CheXagent | Radiology | CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation | Github | BLIP-2 | CheXinstruct* |
| SAT | Radiology | One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts | Github | - | SAT-DS* |
| PathChat | Pathology | A Foundational Multimodal Vision Language AI Assistant for Human Pathology | - | LLaVA | PathChatInstruct* |
| Qilin-Med-VL | Radiology | Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare | Github | LLaVA | Chi-Med-VL* |
| CXR-CLIP | Radiology | CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training | Github | CLIP | MIMIC-CXR + CheXpert + ChestX-ray14 |
| MaCo | Radiology | Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning | - | MAE + CLIP | MIMIC-CXR |
| PathLDM | Pathology | PathLDM: Text conditioned Latent Diffusion Model for Histopathology | Github | Latent Diffusion | TCGA-BRCA + GPT-3.5 |
| RadFM | Radiology | Towards Generalist Foundation Model for Radiology | Github | - | MedMD* |
| KAD | Radiology | Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images | Github | CLIP | MIMIC-CXR + UMLS |
| Med-Flamingo | Medicine | Med-Flamingo: A Multimodal Medical Few-Shot Learner | Github | Flamingo | MTB + PMC-OA |
| CONCH | Pathology | A Visual-Language Foundation Model for Computational Pathology | Github | CoCa | PubMed + PMC |
| QuiltNet | Pathology | Quilt-1M: One Million Image-Text Pairs for Histopathology | Github | CLIP | Quilt-1M* |
| PathAsst | Pathology | PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology | Github | CLIP | PathCap + PathInstruct* |
| PLIP | Pathology | A Visual-Language Foundation Model for Pathology Image Analysis Using Medical Twitter | Huggingface | CLIP | OpenPath* |
| MI-Zero | Pathology | Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images | Github | CLIP | ARCH |
| LLaVA-Med | Biomedicine | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Github | LLaVA | PMC-15M + GPT-4 |
| MedVInT | Biomedicine | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Github | - | PMC-VQA* |
| PMC-CLIP | Biomedicine | PMC-CLIP: Contrastive Language-Image Pre-Training Using Biomedical Documents | Github | CLIP | PMC-OA* |
| BiomedCLIP | Biomedicine | Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing | Huggingface | CLIP | PMC-15M* |
| MedKLIP | Radiology | MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training | Github | CLIP | MIMIC-CXR |
| MedCLIP | Medicine | MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | Github | CLIP | CheXpert + MIMIC-CXR |
| CheXzero | Radiology | Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning | Github | CLIP | MIMIC-CXR |
| PubMedCLIP | Radiology | Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? | Github | CLIP | ROCO |

Protein and Molecule Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| nach0 | Molecules | nach0: Multimodal Natural and Chemical Languages Foundation Model | Github | T5 | * |
| MoleculeSTM | Drug | Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing | Github | CLIP | PubChem |
| AlphaMissense | Proteomics | Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense | Github | AlphaFold | PDB + UniRef |
| GET | Genomics | GET: A Foundation Model of Transcription across Human Cell Types | Huggingface | Transformer | * |
| GIT-Mol | Molecules | GIT-Mol: A Multi-Modal Large Language Model for Molecular Science with Graph, Image, and Text | Github | T5 + BLIP-2 | PubChem |
| ESM-2 | Proteomics | Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model | Github | Transformer | UniRef |
| AlphaFold 2 | Proteomics | Highly Accurate Protein Structure Prediction with AlphaFold | Github | - | PDB + Uniclust30 |

Other Models

| Model | Subfield | Paper | Code | Base | Pre-Training Data |
| --- | --- | --- | --- | --- | --- |
| OmniNA | Nucleotide sequence | OmniNA: A Foundation Model for Nucleotide Sequences | - | LLaMA | NCBI |
| LaBraM | EEG | Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI | - | Transformer | * |
| Neuro-GPT | EEG | Neuro-GPT: Developing A Foundation Model for EEG | - | - | TUH EEG |

Datasets for Foundation Model

Text

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| MedBench (arXiv) | A Chinese medical LLM benchmark with 300,901 Chinese questions covering 43 clinical specialties, combined with an automatic evaluation system | Official site |
| MMedBench (arXiv) | A multilingual medical QA benchmark, where questions are categorized into 21 topics | Github |
| MMedC (arXiv) | A multilingual medical corpus containing over 25.5B tokens | Github |
| BiMed1.3M (arXiv) | An English and Arabic bilingual dataset of 1.3M samples of medical QA and chat | Github |
| GAP-Replay (arXiv) | 48.1B tokens from 4 medical corpora including guidelines, abstracts, papers, and replay | Github |
| Huatuo-26M (arXiv) | 26M Chinese medical QA pairs | Github |
| Medical Meadow (arXiv) | 16M medical QA pairs collected from 9 sources | Github |
| MultiMedQA (Nature) | 6 existing and 1 online-collected medical QA datasets | Nature |
| BigBio (Nature) | 126+ biomedical NLP datasets covering 13 task categories and 10+ languages | Github |
| MedMCQA (MLR) | 194K multiple-choice questions covering 2.4K healthcare topics | Official site |
| MedQA-USMLE (MDPI) | 61,097 multiple-choice questions based on USMLE in three languages | Github |
| CBLUE (arXiv) | A Chinese biomedical language understanding evaluation benchmark with 18 datasets | Official site |
| BLURB (arXiv) | 13 biomedical NLP datasets in 6 tasks | Official site |
| PubMedQA (arXiv) | 1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated biomedical QA instances | Official site |
| BLUE (arXiv) | 5 language tasks with 10 biomedical and clinical text datasets | Github |
| webMedQA (BMC) | 63,284 real-world Chinese medical questions with over 300K answers | Github |
| MedMentions (arXiv) | 4,392 papers annotated by experts with mentions of UMLS entities | Github |
| MIMIC-III (Nature) | Critical care data for over 40,000 patients | Official site |
| ClinicalTrials.gov | An online database of clinical research studies, including clinical trials and observational studies | Official site |

Imaging

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| Mass-100K (arXiv) | 100M tissue patches from 100,426 diagnostic H&E WSIs across 20 major tissue types | - |
| RETFound (Nature) | Unannotated retinal images, containing 904,170 CFPs and 736,442 OCT scans | Nature |
| AbdomenAtlas-8K (arXiv) | 8,448 CT volumes with per-voxel annotated eight abdominal organs | Github |
| Med-MNIST v2 (Nature) | 12 2D and 6 3D datasets for biomedical image classification | Official site |
| EchoNet-Dynamic (Nature) | 10,030 expert-annotated echocardiogram videos | Official site |
| CheXpert (arXiv) | 224,316 chest radiographs of 65,240 patients | Official site |
| Kather Colon Dataset (PMC) | 100K histological images of human colorectal cancer and healthy tissue | Zenodo |
| DeepLesion (PMC) | 32K CT scans with annotations and semantic labels from radiological reports | NIH |
| ChestXray-NIHCC (arXiv) | 100K radiographs with labels from more than 30,000 patients | NIH |
| ISIC | An archive containing 23K skin lesion images with labels & Imaging | Official site |

Genomics

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| 1000 Genomes Project (Nature) | A comprehensive catalog of human genetic variations | Official site |
| ENCODE (Nature) | A platform of genomics data and encyclopedia with integrative-level and ground-level annotations | NIH |
| dbSNP (NIH) | A collection of human single nucleotide variations, microsatellites, and small-scale insertions and deletions | NIH |

Drug

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| DrugChat (arXiv) | 143,517 question-answer pairs covering 10,834 drug compounds, collected from PubChem and ChEMBL | Github |
| PubChem (NIH) | A collection of 900+ sources of chemical information data | NIH |
| DrugBank (NIH) | A web-enabled structured database of molecular information about drugs | Official site |
| ChEMBL (NIH) | 20M bioactivity measurements for 2.4M distinct compounds and 15K protein targets | Official site |

Multi-Modal

| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| RadGenome-Chest CT (arXiv) | A dataset of 3D chest CT, including 197 organ-level segmentation masks, 665K multi-granularity grounded reports, and 1.3M grounded VQA pairs | - |
| OmniMedVQA (arXiv) | 131,813 question-answering items with 120,530 images from 12 modalities and 26 human anatomical regions, collected from 75 medical datasets | - |
| SAT-DS (arXiv) | 11,462 scans with 142,254 segmentation annotations spanning 8 human body regions from 31 medical image segmentation datasets, together with domain knowledge from e-Anatomy and UMLS | Github |
| PathChatInstruct (arXiv) | 257,004 instructions of pathology-specific queries with image and text | - |
| Chi-Med-VL (arXiv) | 580,014 image-text pairs and 469,441 question-answer pairs for general healthcare in Chinese | Github |
| MedMD (arXiv) | 15.5M 2D scans and 180K 3D radiology scans with textual descriptions | Github |
| OpenPath (Nature) | 208,414 pathology images paired with natural language descriptions | Huggingface |
| Quilt-1M (arXiv) | 1M image-text pairs for histopathology | Github |
| Med-MMHL (arXiv) | Human- and LLM-generated misinformation detection dataset | Github |
| Mol-Instructions (arXiv) | 148K molecule-oriented, 505K protein-oriented, and biomolecular text instructions | Huggingface |
| PathInstruct (arXiv) | 180K samples of LLM-generated instruction-following data | Github |
| PMC-VQA (arXiv) | 227K VQA pairs of 149K images of various modalities or diseases | Github |
| PMC-OA (arXiv) | 1.6M fine-grained biomedical image-text pairs | Github |
| PathCap (arXiv) | 142K pathology image-caption pairs from various sources | Github |
| SwissProtCLAP (arXiv) | 441K text-protein sequence pairs | Github |
| MIMIC-IV (Nature) | Clinical information for hospital stays of over 60,000 patients | Official site |
| MIMIC-CXR (Nature) | 227,835 chest imaging studies with free-text reports for 65,379 patients | PhysioNet |
| TCGA | A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types | Official site |