Home

Awesome

Introduction

Code for Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition.

Usage

N-gram and trained BERT classifier cannot be public since privacy policy.

Use in command lines

python -m graces -s 饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。
python -m graces -f ./input.txt -o ./output.txt

Import from python

import graces
graces.cut("饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。") # Segment a single sentence
graces.cut_k("饮食可,睡眠可,大便不规律,小便正常,体重无明显减轻。", k=8) # Segment a single sentence with fixed word count k.
graces.cut_file("./input.txt", "./output.txt") # Segment a file

Data

We ask MD students to construct coarse and fine level word segmentation on EHRs for validation. We do not use data for training!

Citation

If you find our codes or data useful, please cite:

@article{YUAN2020103542,
title = "Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition",
journal = "Journal of Biomedical Informatics",
volume = "110",
pages = "103542",
year = "2020",
issn = "1532-0464",
doi = "https://doi.org/10.1016/j.jbi.2020.103542",
url = "http://www.sciencedirect.com/science/article/pii/S1532046420301702",
author = "Zheng Yuan and Yuanhao Liu and Qiuyang Yin and Boyao Li and Xiaobin Feng and Guoming Zhang and Sheng Yu",
}