Awesome

Medical_NLP

医疗NLP领域评测/比赛，数据集，论文和预训练模型资源汇总。

Summary of medical NLP evaluations/competitions, datasets, papers and pre-trained models.

News

🟡2024/11/14 新增 4. VLM数据集、5.3 医疗VLM、5.4 医疗VLM Benchmark，后续将重点维护 Medical VLM 方向相关资源汇总，repo由Rongsheng Wang维护。
🟡2024/11/14之前 由于Cris Lee2021年离开医疗NLP领域，此repo现由Xidong Wang, Ziyue Lin, Jing Tang继续维护。

1. 评测
- 1.1 中文医疗基准测评：CMB / CMExam / PromptCBLUE
- 1.2 英文医疗基准测评:
2. 比赛
- 2.1 正在进行的比赛
- 2.2 已经结束的比赛
3. LLM 数据集
- 3.1 中文
- 3.2 英文
4. VLM 数据集 🔥
5. 开源预训练模型
- 5.1 医疗PLM
- 5.2 医疗LLM
- 5.3 医疗VLM 🔥
- 5.4 医疗VLM Benchmark 🔥
6. 相关论文
7. 开源工具包
8. 工业级产品解决方案
9. blog分享
10. 友情链接

1. 评测

1.1 中文医疗基准测评：CMB / CMExam / PromptCBLUE

CMB
- 地址：https://github.com/FreedomIntelligence/CMB
- 来源：各个临床医学工种各阶段考试；临床复杂病例问诊
CMExam
- 地址：https://github.com/williamliujl/CMExam
- 来源：执业医师资格考试往年题
PromptCBLUE
- 地址：https://github.com/michael-wzhu/PromptCBLUE
- 来源：CBLUE
PromptCBLUE
- 地址：https://github.com/CBLUEbenchmark/CBLUE
- 来源：CHIP会议往届的学术评测比赛和阿里夸克医疗搜索业务的数据集组成
MedBench
- 地址：https://arxiv.org/abs/2312.12806
- 来源：包含来自执医考试和报告的40,041个问题，覆盖各个专科。

1.2 英文医疗基准测评:

MultiMedBench
- 简介：是一种源自Google的大型多模态生成模型

2. 比赛

2.1 正在进行的比赛

医学搜索Query相关性判断
- 地址：https://tianchi.aliyun.com/competition/entrance/532001/introduction
- 来源：阿里天池

2.2 已经结束的比赛

2.2.1 英文比赛

BioNLP Workshop 2023 共享任务
- 地址：https://aclweb.org/aclwiki/BioNLP_Workshop#SHARED_TASKS_2023
- 来源：BioNLP Workshop
MedVidQA 2023
- 地址：https://medvidqa.github.io/index.html
- 来源：美国国立卫生研究院
MEDIQA-2021
- 地址：https://sites.google.com/view/mediqa2021
- 来源：NAACL-BioNLP 2021 workshop
ICLR-2021-医疗对话生成与自动诊断国际竞赛
- 地址：https://mlpcp21.github.io/pages/challenge
- 来源：ICLR 2021 workshop

2.2.2 中文比赛

影像学NLP —— 医学影像诊断报告生成
- 地址：https://gaiic.caai.cn/ai2023/
- 来源：2023全球人工智能技术创新大赛赛道一
非标准化疾病诉求的简单分诊挑战赛2.0
- 地址：http://challenge.xfyun.cn/topic/info?type=disease-claims-2022&ch=ds22-dw-sq03
- 来源：科大讯飞
第八届中国健康信息处理大会(CHIP2022)测评任务
- 地址：http://cips-chip.org.cn/
- 来源：CHIP2022
科大讯飞-医疗实体及关系识别挑战赛
- 地址：http://www.fudan-disc.com/sharedtask/imcs21/index.html
- 来源：科大讯飞
“肝”柔相济，大模型开创肝病医患交互服务新格局
- 地址：http://www.fudan-disc.com/sharedtask/imcs21/index.html](https://www.dcic-china.com/competitions/10090
- 来源：数字中国建设峰会组委会

3. LLM数据集

3.1 中文

Huatuo-26M
- 地址：https://github.com/FreedomIntelligence/Huatuo-26M
- 简介：Huatuo-26M 是迄今为止最大的中医问答数据集。
中文医疗对话数据集
- 地址：https://github.com/Toyhom/Chinese-medical-dialogue-data
- 简介：包含六个科室的医学问答数据
CBLUE
- 地址：https://github.com/CBLUEbenchmark/CBLUE
- 简介：涵盖了医学文本信息抽取（实体识别、关系抽取）
cMedQA2 (108K)
- 地址：https://github.com/zhangsheng93/cMedQA2
- 简介：中文医药方面的问答数据集，超过10万条
xywy-KG(294K三元组)
- 地址：https://github.com/baiyang2464/chatbot-base-on-Knowledge-Graph
- 简介：44.1K实体 294.1K 三元组
39Health-KG (210K三元组)
- 地址：https://github.com/zhihao-chen/QASystemOnMedicalGraph
- 简介：包括15项信息，其中7类实体，约3.7万实体，21万实体关系。
Medical-Dialogue-System
- 地址：https://github.com/UCSD-AI4H/Medical-Dialogue-System
- MedDialog 数据集（中文）包含医生和病人之间的对话（中文）。该数据集有 110 万条对话和 400 万条语句。数据还在不断增长，未来有更多的对话将被添加进来。
Chinese medical dialogue data
- 地址：https://github.com/Toyhom/Chinese-medical-dialogue-data
- 该数据集含有包括男科，儿科，妇产科，内科，外科，肿瘤科在内的六个不同科室总计792099条数据。
Yidu-S4K
- 地址：http://openkg.cn/dataset/yidu-s4k
- 简介：命名实体识别, 实体及属性抽取
Yidu-N7K
- 地址：http://openkg.cn/dataset/yidu-n7k
- 简介：临床语标准化
中文医药方面的问答数据集
- 地址：https://github.com/zhangsheng93/cMedQA2
- 简介：医疗问答
中文医患问答对话数据
- 地址：https://github.com/UCSD-AI4H/Medical-Dialogue-System
- 简介：医疗问答
CPubMed-KG (4.4M三元组)
- 地址：https://cpubmed.openi.org.cn/graph/wiki
- 简介：中华医学会高质量全文期刊数据
中文医学知识图谱 CMeKG (1M三元组)
- 地址：http://cmekg.pcl.ac.cn/
- 简介：CMeKG（Chinese Medical Knowledge Graph）
CHIP历年测评 (官方测评)
- 地址：http://cips-chip.org.cn/2022/callforeval ; http://www.cips-chip.org.cn/2021/ ; http://cips-chip.org.cn/2020/
- 简介：CHIP历年测评 (官方测评)
瑞金医院糖尿病数据集 (糖尿病)
- 地址：https://tianchi.aliyun.com/competition/entrance/231687/information
- 简介：瑞金医院糖尿病数据集 (糖尿病)
天池新冠肺炎问句匹配比赛 (新冠)
- 地址：https://tianchi.aliyun.com/competition/entrance/231776/information
- 简介：本次大赛数据包括：脱敏之后的医疗问题数据对和标注数据。

3.2 英文

MedMentions
- 地址：https://github.com/chanzuckerberg/MedMentions
- 简介：基于Pubmed摘要的生物医学实体链接数据集
webMedQA
- 地址：https://github.com/hejunqing/webMedQA
- 简介：医疗问答
COMETA
- 地址：https://www.siphs.org/
- 简介：社交媒体中的医疗实体链接数据。发表于EMNLP2020
PubMedQA
- 地址：https://arxiv.org/abs/1909.06146
- 简介：基于Pubmed提取的医学问答数据集
MediQA
- 地址：https://sites.google.com/view/mediqa2021
- 简介：文本概括
ChatDoctor Dataset-1
- 地址：https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view?usp=sharing
- 简介：来自 HealthCareMagic.com 的 10 万条病人与医生之间的真实对话
ChatDoctor Dataset-2
- 地址：https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view?usp=sharing
- 简介：来自 icliniq.com 的 10k 条病人与医生之间的真实对话
BioInstruct
- 地址：https://github.com/bio-nlp/BioInstruct
- 简介：超过25,000条为生物医学任务量身定制的指令，包括但不限于问答（QA）、信息提取（IE）和文本生成
Visual Med-Alpaca Data
- 地址：https://github.com/cambridgeltl/visual-med-alpaca/tree/main/data
- 简介：用于Visual Med-Alpaca训练的数据，源自 BigBio, ROCO and GPT-3.5-Turbo
CheXpert Plus
- 地址：https://github.com/Stanford-AIMI/chexpert-plus
- 简介：放射学领域公开发布的最大文本数据集，共有 3600 万个文本tokens，均配有 DICOM 格式的高质量图像，以及涵盖各种临床和社会群体的大量图像和患者元数据，以及许多病理标签和 RadGraph注释

4. VLM数据集

Dataset	Paper	Github	Keywords
MedTrinity-25M	link	link	`25 million images`、`10 modalities`、`65 diseases`、`VQA`、`EN`
LLaVA-Med	link	link	`630k images`、`VQA`、`EN`
Chinese-LLaVA-Med	-	link	`60k images`、`VQA`、`ZH`
HuatuoGPT-Vision	link	link	`647k images`、`VQA`、`EN`
MedVidQA	link	link	`7k videos`、`VQA`、`EN`
ChiMed-VL	link	link	`1M images`、`VQA`、`EN`、`ZH`
RadFM	link	link	`16M images`、`5000 diseases`、`VQA`、`EN`、`2D/3D`
BiomedParseData	link	link	`6.8 million image-mask-description`、`45 biomedical image segmentation datasets`、`9 modalities`、`EN`、`2D`
OmniMedVQA	link	link	`118,010 images`、`12 modalities`、`2D`、`20 human anatomical regions`
PreCT	link	link	`160K volumes`、`42M slices`、`3D`、`CT`
GMAI-VL-5.5M	link	link	`5.5m image and text`、`219 specialized medical imaging datasets`、`2D`、`VQA`
SA-Med2D-20M	link	link	`4.6 million 2D medical images and 19.7 million corresponding masks`、`2D`、`EN`
IMIS-Bench	link	link	`6.4 million images, 273.4 million masks (56 masks per image), 14 imaging modalities, and 204 segmentation targets`、`EN`

5. 开源预训练模型

5.1 医疗PLM

BioBERT：
- 地址：https://github.com/naver/biobert-pretrained
- 简介：BioBERT是一种生物医学领域的语言表示模型，专门用于生物医学文本挖掘任务，如生物医学命名实体识别、关系提取、问答等。

BlueBERT：
- 地址：https://github.com/ncbi-nlp/BLUE_Benchmark
- 简介：BLUE基准包括5个不同的生物医学文本挖掘任务和10个语料库。BLUE基准依赖于预先存在的数据集，因为它们已被BioNLP社区广泛用作共享任务。这些任务涵盖了各种文本类型(生物医学文献和临床笔记)、数据集大小和难度，更重要的是，突出了常见的生物医学文本挖掘挑战。
BioFLAIR：
- 地址：https://github.com/flairNLP/flair
- 简介：Flair是一个强大的NLP库，能将最先进的自然语言处理(NLP)模型应用于文本，例如命名实体识别(NER)，情感分析，词性标记(PoS)，对生物医学数据的特殊支持，语义消歧和分类，支持快速增长的语言数量。Flair 还是一个文本嵌入库，一个基于PyTorch的自然语言处理框架。
COVID-Twitter-BERT：
- 地址：https://github.com/digitalepidemiologylab/covid-twitter-bert
- 简介：COVID-Twitter-BERT（简称CT-BERT）是一种基于Transformer的模型，它在关于COVID-19主题的大量推特消息上进行了预训练。v2模型是在9700万条推文（12亿训练样本）上训练的。
bio-lm (Biomedical and Clinical Language Models)
- 地址：https://github.com/facebookresearch/bio-lm
- 简介：这项工作评估了用于生物医学和临床自然语言处理任务的许多模型，并训练了一些性能更佳的新模型。
BioALBERT
- 地址：https://github.com/usmaann/BioALBERT
- 简介：这是一种针对大型领域特定(生物医学)语料库训练的生物医学语言表示模型，专为生物医学文本挖掘任务而设计。

5.2 医疗LLM

5.2.1 多语言医疗大模型

ApolloMoE：
- 地址：https://github.com/FreedomIntelligence/ApolloMoE
- 简介：通过语言家族专家的混合，有效地实现 50 种语言医学LLM的民主化
Apollo：
- 地址：https://github.com/FreedomIntelligence/Apollo
- 简介：轻量级多语言医学LLM，将医疗人工智能普及到 60亿人群
MMedLM：
- 地址：https://github.com/MAGIC-AI4Med/MMedLM
- 简介：第一个开源的多语言医学语言模型

5.2.2 中文医疗大语言模型

BenTsao：
- 地址：https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese
- 简介：BenTsao以LLaMA-7B为基础，经过中文医学指令精调/指令微调得到。研究人员通过医学知识图谱和GPT3.5 API构建了中文医学指令数据集，并在此基础上对LLaMA进行了指令微调，提高了LLaMA在医疗领域的问答效果。
BianQue：
- 地址：https://github.com/scutcyr/BianQue
- 简介：一个经过指令与多轮问询对话联合微调的医疗对话大模型，以ClueAI/ChatYuan-large-v2作为底座，使用中文医疗问答指令与多轮问询对话混合数据集进行微调。
SoulChat：
- 地址：https://github.com/scutcyr/SoulChat
- 简介：灵心以ChatGLM-6B作为初始化模型，经过百万规模心理咨询领域中文长文本指令与多轮共情对话数据联合指令微调，提升模型的共情能力、引导用户倾诉能力以及提供合理建议的能力。
DoctorGLM：
- 地址：https://github.com/xionghonglin/DoctorGLM
- 简介：一个基于 ChatGLM-6B的中文问诊大模型。该模型通过中文医疗对话数据集进行微调，实现了包括lora、p-tuningv2等微调及部署。
HuatuoGPT：
- 地址：https://github.com/FreedomIntelligence/HuatuoGPT
- 简介：华佗GPT是一个经过中文医学指令精调/指令微调(Instruct-tuning)得到的一个GPT-like模型。该模型是专门为医疗咨询设计的中文LLM，它的训练数据包含了从ChatGPT处蒸馏得到的数据和来自医生的真实数据，在训练过程中加入了RLHF的反馈。
HuatuoGPT-II：
- 地址：https://github.com/FreedomIntelligence/HuatuoGPT-II
- 简介：华佗GPT2采用了创新的领域适应方法，大大提高了其医学知识和对话能力。它在多个医疗基准测试中表现出了一流的性能，尤其是在专家评估和新医学执业资格考试中超越了 GPT-4。

5.2.3 英文医疗大语言模型

GatorTron：
- 地址：https://github.com/uf-hobi-informatics-lab/GatorTron
- 简介：一个医疗健康领域的早期大模型，致力于研究使用非结构化的电子健康病例的系统是如何从有数十亿参数的医疗大模型中获益。
Codex-Med：
- 地址：https://github.com/vlievin/medical-reasoning
- 简介：致力于研究GPT-3.5模型回答和推理实际医疗问题的能力。使用了医疗测试数据集USMLE和MedMCQA，医疗阅读理解数据集PubMedQA。
Galactica：
- 地址：https://galactica.org/
- 简介：Galactica致力于解决科学领域的信息过载问题，储存合并了包括医疗医疗健康领域在内的科学知识。Galactica在大型论文语料库，参考文献的基础上训练而成，尝试发现不同领域研究之间的潜在关系。
DeID-GPT：
- 地址：https://github.com/yhydhx/ChatGPT-API
- 简介：一个创新的的支持GPT4的去识别框架，可以自动识别和删除识别信息。
ChatDoctor：
- 地址：https://github.com/Kent0n-Li/ChatDoctor
- 简介：一个利用医疗领域基础知识，基于LLaMA进行微调得到的医疗对话模型。
MedAlpaca：
- 地址：https://github.com/kbressem/medAlpaca
- 简介：MedAlpaca采用了一种开源策略，致力于解决医疗系统中的隐私问题。该模型基于70亿和130亿参数量的LLaMa构建。
PMC-LLaMA：
- 地址：https://github.com/chaoyi-wu/PMC-LLaMA
- 简介： PMC-LLaMA是一个开源语言模型，通过对LLaMA-7B在总计480万篇生物医学学术论文上进行调质，进一步灌输医学知识，以增强其在医学领域的能力。
Visual Med-Alpaca：
- 地址：https://github.com/cambridgeltl/visual-med-alpaca
- 简介： Visual Med-Alpaca是一个开源的、参数高效的生物医学基础模型，可以与医学的“视觉专家”集成，用于多模式生物医学任务。该模型基于LLaMa-7B架构构建，使用由GPT-3.5-Turbo和人类专家共同策划的指令集进行训练。
GatorTronGPT：
- 地址：https://github.com/uf-hobi-informatics-lab/GatorTronGPT
- 简介：GatorTronGPT 是一个医疗生成大语言模型。该模型基于GPT-3构建，含有50亿或200亿参数。该模型使用了含有2770亿单词的，由临床和英语文本组成的庞大语料库。
MedAGI：
- 地址：https://github.com/JoshuaChou2018/MedAGI
- 简介：MedAGI一个范例，以最低的成本将领域特定的医疗语言模型统一起来，为实现医疗通用人工智能提供了一条可能的途径。
LLaVA-Med：
- 地址：https://github.com/microsoft/LLaVA-Med
- 简介：LLaVA- med使用通用领域LLaVA进行初始化，然后以课程学习方式进行持续训练(首先是生物医学概念对齐，然后是全面的指令调整)。
Med-Flamingo：
- 地址：https://github.com/snap-stanford/med-flamingo
- 简介：Med-Flamingo是一个视觉语言模型，专门设计用于处理包含图像和文本的交错多模态数据。以Flamingo为基础，Med-Flamingo通过对不同医学学科的多种多模式知识来源进行预训练，进一步增强了在这些医学领域的能力。

5.3 医疗VLM

Model	Paper	Github
MedVInT	link	link
Med-Flamingo	link	link
LLaVA-Med	link	link
Qilin-Med-VL	link	link
RadFM	link	link
MedDr	link	link
HuatuoGPT-Vision	link	link
BiomedGPT	link	link
Med-MoE	link	link
R-LLaVA	link	-
Med-2E3	link	-
GMAI-VL	link	link

5.4 医疗VLM Benchmark

Benchmark	Paper	Github
GMAI-MMBench	link	link
OmniMedVQA	link	link
MMMU	link	link
MultiMedEval	link	link
WorldMedQA-V	link	-

6. 相关论文

6.1 后ChatGPT时代可能有帮助的论文

大型语言模型编码临床知识论文地址：https://arxiv.org/abs/2212.13138
ChatGPT在USMLE上的表现：使用大型语言模型进行 AI 辅助医学教育的潜力论文地址：https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000198
对 ChatGPT 的医疗建议进行（图灵）测试论文地址：https://arxiv.org/abs/2301.10035
Toolformer：语言模型可以自学使用工具论文地址：https://arxiv.org/abs/2302.04761
检查你的事实并再试一次：利用外部知识和自动反馈改进大型语言模型论文地址：https://arxiv.org/abs/2302.12813
GPT-4 在医学挑战问题上的能力论文地址：https://arxiv.org/abs/2303.13375

6.2 综述类文章

生物医学领域的预训练语言模型：系统调查论文地址
医疗保健深度学习指南论文地址 nature medicine发表的综述
医疗保健领域大语言模型综述论文地址

6.3 特定任务文章

电子病历相关文章

Transfer Learning from Medical Literature for Section Prediction in Electronic Health Records 论文地址
MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records 论文地址

医学关系抽取

Leveraging Dependency Forest for Neural Medical Relation Extraction 论文地址

医学知识图谱

Learning a Health Knowledge Graph from Electronic Medical Records 论文地址

辅助诊断

Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence 论文地址

医疗实体Linking（标准化）

Medical Entity Linking using Triplet Network 论文地址
A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization 论文地址
Deep Neural Models for Medical Concept Normalization in User-Generated Texts 论文地址

6.4 会议索引

ACL2020医学领域相关论文列表

A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization 论文地址
Biomedical Entity Representations with Synonym Marginalization 论文地址
Document Translation vs. Query Translation for Cross-Lingual Information Retrieval in the Medical Domain 论文地址
MIE: A Medical Information Extractor towards Medical Dialogues 论文地址
Rationalizing Medical Relation Prediction from Corpus-level Statistics 论文地址

AAAI2020 医学NLP相关论文列表

On the Generation of Medical Question-Answer Pairs 论文地址
LATTE: Latent Type Modeling for Biomedical Entity Linking 论文地址
Learning Conceptual-Contextual Embeddings for Medical Text 论文地址
Understanding Medical Conversations with Scattered Keyword Attention and Weak Supervision from Responses 论文地址
Simultaneously Linking Entities and Extracting Relations from Biomedical Text without Mention-level Supervision 论文地址
Can Embeddings Adequately Represent Medical Terminology? New Large-Scale Medical Term Similarity Datasets Have the Answer! 论文地址

EMNLP2020 医学NLP相关论文列表

Towards Medical Machine Reading Comprehension with Structural Knowledge and Plain Text 论文地址
MedDialog: Large-scale Medical Dialogue Datasets 论文地址
COMETA: A Corpus for Medical Entity Linking in the Social Media 论文地址
Biomedical Event Extraction as Sequence Labeling 论文地址
FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction 论文地址论文解析:FedED:用于医学关系提取的联邦学习(基于融合蒸馏)
Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition 论文地址
A Knowledge-driven Generative Model for Multi-implication Chinese Medical Procedure Entity Normalization 论文地址
BioMegatron: Larger Biomedical Domain Language Model 论文地址
Querying Across Genres for Medical Claims in News 论文地址

7. 开源工具包

分词工具：PKUSEG 项目地址项目说明：北京大学推出的多领域中文分词工具，支持选择医学领域。

8. 工业级产品解决方案

9. blog分享

10. 友情链接

11. reference

@misc{medical_NLP_github,
  author = {Xidong Wang, Ziyue Lin and Jing Tang, Rongsheng Wang, Benyou Wang},
  title = {Medical NLP},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/FreedomIntelligence/Medical_NLP}}
}