Chinese-NLP-Corpus

A collection of Chinese NLP corpora

Open Domain

Open-domain corpora, covering law, social media, and user reviews

Word Segmentation and Part-of-Speech

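| Name | Description | Link |
| --- | --- | --- |
| ZhuXian(诛仙) | POS and word segmentation annotations for the novel "Zhu Xian" (诛仙) | zhuxian |
| CNLC | Data from the National Language Commission (国家语言委员会); train:dev:test = 8:1:1 | CNLC |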

* The URLs in the table are out of date; the data can be found at the reference below.
Reference: https://github.com/hankcs/multi-criteria-cws/tree/master/data
the details of the corpus
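
The 8:1:1 train/dev/test ratio noted for CNLC can be applied to other, unsplit data in a similar way. The sketch below is only an illustration: the function name `split_8_1_1` and the fixed seed are assumptions, and it does not claim to reproduce how the CNLC split was actually made.

```python
import random


def split_8_1_1(samples, seed=42):
    """Shuffle samples and split them into train/dev/test at a ratio of 8:1:1.

    Hypothetical helper for illustration; the seed only makes the shuffle repeatable.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10

    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_8_1_1(range(100))
    print(len(train), len(dev), len(test))
```

Any remainder from the integer division ends up in the test portion, so no sample is dropped.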

Named Entity Recognition (NER)

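| Name | Description | Link |
| --- | --- | --- |
| MSRA | One of the most commonly used datasets for Chinese NER | MSRA |
| People's Daily | Another of the most commonly used datasets for Chinese NER | People's Daily |
| Weibo Data | A third commonly used dataset for Chinese NER | Weibo |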

Text Classification

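| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| CAIL2018 | Data from the 2018 China "Fayan Cup" (法研杯) Legal AI Challenge (tasks: charge prediction, law article recommendation, prison term prediction). The dataset contains 2.68 million criminal law documents covering 183 charges and 202 law articles; sentences range over 0-25 years, life imprisonment, and the death penalty. | CAIL2018 competition website, github | |
| CSL - Classification | Abstracts of papers from natural-science journals, selected from the Chinese Scientific Literature dataset (CSL) and classified by discipline according to the National Natural Science Foundation of China. | CSL - Classification | |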

Sentiment Analysis and Rating

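| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| ChnSentiCorp_htl_all | 7,000+ hotel reviews: 5,000+ positive and 2,000+ negative | ChnSentiCorp_htl_all | |
| waimai_10k | User reviews collected from a food delivery platform: 4,000 positive and about 8,000 negative | waimai_10k | |
| online_shopping_10_cats | 10 categories (books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu (蒙牛), clothing, computers, hotels), 60,000+ reviews in total, about 30,000 each positive and negative | online_shopping_10_cats | |
| weibo_senti_100k | 100,000+ sentiment-labeled Sina Weibo posts, about 50,000 each positive and negative | weibo_senti_100k | Reference page. The dataset contains a large number of emoji, and results may depend on them; it is best to remove emoji before training. |
| simplifyweibo_4_moods | 360,000+ sentiment-labeled Sina Weibo posts with 4 emotions: about 200,000 joy, and about 50,000 each of anger, disgust, and depression | simplifyweibo_4_moods | |
| dmsc_v2 | 28 movies, 700,000+ users, 2 million+ ratings/reviews | dmsc_v2 | |
| yf_dianping | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | yf_dianping | |
| yf_amazon | 520,000 products, 1,100+ categories, 1.42 million users, 7.2 million reviews/ratings | yf_amazon | |
| ez_douban | 50,000+ movies (30,000+ with titles, 20,000+ without), 28,000 users, 2.8 million ratings | ez_douban | |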
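
The note on weibo_senti_100k recommends removing emoji before training. A minimal sketch of one way to do that is shown here, assuming the emoji appear as Unicode characters; the code point ranges are an approximation rather than a complete emoji definition, and `strip_emoji` is a hypothetical helper name. If the data also encodes emoticons as bracketed text, that case would need separate handling.

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)


def strip_emoji(text: str) -> str:
    """Return text with characters in the assumed emoji ranges removed."""
    return EMOJI_PATTERN.sub("", text)


if __name__ == "__main__":
    print(strip_emoji("今天天气真好\U0001F600\U0001F44D"))
```

Chinese characters and full-width punctuation fall outside these ranges, so ordinary text is left unchanged.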

Other GitHub Repos

| Description | Link | Notes |
| --- | --- | --- |
| Chinese NLP Corpus | https://github.com/SophonPlus/ChineseNlpCorpus | |
| awesome-chinese-nlp/Corpus (Chinese corpora) | https://github.com/crownpku/Awesome-Chinese-NLP#corpus-中文语料 | |
| Large Scale Chinese Corpus for NLP | https://github.com/brightmart/nlp_chinese_corpus | |
| Chinese NLP datasets | https://github.com/InsaneLife/ChineseNLPCorpus | |
| funNLP | https://github.com/fighting41love/funNLP | |

Medical Domain

Corpora for the Chinese medical domain, including medical terminology, QA, and clinical NER

Benchmark

| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| ChineseBLUE | The Chinese Biomedical Language Understanding Evaluation benchmark by Alibaba | ChineseBLUE | Conceptualized Representation Learning for Chinese Biomedical Text Mining |

Word Segmentation

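| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| AMTTL | Word segmentation dataset for medical language. The source appears to be medical forums, so the data leans toward the open domain and differs from the language used in medical texts. | AMTTL | Adaptive Multi-Task Transfer Learning for Chinese Word Segmentation in Medical Text |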

Clinical NER

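| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| CNMER | Chinese medical entity recognition dataset; entities include body parts, symptoms and signs, examinations, diseases, and treatments. | CNMER | Presumably the CCKS2017 data. |
| CNMER | Recognition of six entity types: diseases and diagnoses, anatomical sites, imaging examinations, laboratory tests, surgeries, and drugs | CCKS2018 data | |
| CNMER | Recognition of Chinese medical named entities | CCKS2019 data | Shared via OpenKG |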

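Question Answering (QA)

| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| cMedQA | Data from an online medical forum, with 54,000 questions and about 100,000 corresponding answers. | cMedQA | Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs |
| cMedQA2 | Extended version of cMedQA, with about 100,000 medical questions and about 200,000 corresponding answers. | cMedQA2 | Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection |
| webMedQA | Another online medical QA dataset, with 60,000 questions and 310,000 answers; it also includes question categories. | webMedQA | Applying deep matching networks to Chinese medical question answering: A study and a dataset |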

Others

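| Name | Description | Link | Notes |
| --- | --- | --- | --- |
| medical-books | Open source medical books in LaTeX | medical-books | |
| awesome_Chinese_medical_NLP | A curated collection of public resources for Chinese medical NLP | awesome_Chinese_medical_NLP | |
| Chinese_medical_NLP | Evaluation datasets, papers, and other resources for medical NLP (with a focus on Chinese). | Chinese_medical_NLP | |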