jieba

"结巴"中文分词:做最好的Python中文分词组件 "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

News

Support home-grown open-source software by voting for jieba, thank you :-). Voting link: https://code.csdn.net/2013ossurvey

Features

Online demo

http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

Demo site source code: https://github.com/fxsjy/jiebademo

Installation under Python 2.x
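
jieba is published on PyPI, so a typical Python 2 installation is pip install jieba (or easy_install jieba); alternatively, download the source and run python setup.py install.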

Installation under Python 3.x
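
At the time of writing, Python 3 support was maintained on a separate branch of the repository: https://github.com/fxsjy/jieba/tree/jieba3k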

Algorithm
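
In brief: jieba builds a trie from the dictionary, scans the input sentence to generate a DAG of all possible word combinations, uses dynamic programming to find the most probable segmentation path based on word frequencies, and falls back to an HMM, decoded with the Viterbi algorithm, for words not found in the dictionary (see the "杭研" example below).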

Feature 1): Segmentation

Code example (segmentation)

#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print "Full Mode:", "/ ".join(seg_list)  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default Mode:", "/ ".join(seg_list)  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print ", ".join(seg_list)

Output:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[Unknown Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    (Here, "杭研" is not in the dictionary, but it is still recognized by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
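
Note that jieba.cut and jieba.cut_for_search return a generator rather than a list, which is why the examples above join or iterate over the result; wrap it in list() if you need to index into it.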

Feature 2): Adding a custom dictionary
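
Custom entries are loaded with jieba.load_userdict. A minimal sketch, assuming a hypothetical userdict.txt whose lines each hold a word followed by its frequency:

#encoding=utf-8
import jieba

# userdict.txt is a hypothetical file with one entry per line,
# e.g. "云计算 5" (the word, then its frequency)
jieba.load_userdict("userdict.txt")

seg_list = jieba.cut("李小福是创新办主任也是云计算方面的专家")
print "/ ".join(seg_list)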

Feature 3): Keyword extraction

Code example (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
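
The linked script boils down to something like the following sketch (the input path and topK value are illustrative); jieba.analyse.extract_tags returns the topK words ranked by TF-IDF weight:

#encoding=utf-8
import jieba.analyse

content = open("some_text.txt", "rb").read()  # hypothetical input file

# extract the 10 highest-weighted keywords
tags = jieba.analyse.extract_tags(content, topK=10)
print ",".join(tags)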

Feature 4): Part-of-speech tagging
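
Part-of-speech tagging lives in the jieba.posseg submodule; a minimal sketch (the sentence is illustrative):

#encoding=utf-8
import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")
for w in words:
    print w.word, w.flag  # each word with its POS tag, e.g. 我 r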

Feature 5): Parallel segmentation
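
Parallel mode splits the input across multiple worker processes; a minimal sketch (the worker count is illustrative, and this relies on os.fork, so it is POSIX-only):

#encoding=utf-8
import jieba

jieba.enable_parallel(4)  # start 4 worker processes (POSIX only)
print "/ ".join(jieba.cut("我来到北京清华大学"))
jieba.disable_parallel()  # switch back to single-process mode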

Other dictionaries

Besides the default dictionary, two alternative main dictionary files are available for download:

  1. A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small

  2. A bigger dictionary file with better support for traditional Chinese (繁體): https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

By default, an in-between dictionary, dict.txt, is used and shipped with the distribution. Download the dictionary you need, then either overwrite jieba/dict.txt with it or load it with jieba.set_dictionary('data/dict.txt.big').

Change to the module initialization mechanism: lazy load (since version 0.28)

jieba uses lazy loading: "import jieba" does not immediately trigger loading of the dictionary; the dictionary is loaded and the trie is built only once they are needed. This takes 1-3 seconds, once, after which it is not repeated. If you want to initialize jieba by hand, you can do so explicitly:

import jieba
jieba.initialize()  # manual initialization (optional)

Versions before 0.28 could not specify the path of the main dictionary; with the lazy-loading mechanism in place, you can now change it:

jieba.set_dictionary('data/dict.txt.big')

Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py

Segmentation speed

FAQ

1) How is the model data generated? https://github.com/fxsjy/jieba/issues/7

2) What license does this library use? https://github.com/fxsjy/jieba/issues/2

For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed

Change Log

https://github.com/fxsjy/jieba/blob/master/Changelog
