SudachiPy
SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
Warning
This repository is for the 0.5.* versions of SudachiPy; versions 0.6.* and above are developed as Sudachi.rs.
TL;DR
$ pip install sudachipy sudachidict_core
$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
EOS
$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪 名詞,固有名詞,地名,一般,*,* 高輪
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
駅 名詞,普通名詞,一般,*,*,* 駅
EOS
$ echo "空缶空罐空きカン" | sudachipy -a
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
EOS
Setup
You need SudachiPy and a dictionary.
Step 1. Install SudachiPy
$ pip install sudachipy
Step 2. Get a Dictionary
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).
$ pip install sudachidict_core
Alternatively, you can choose another dictionary edition. See the Dictionary Edition section for details.
Usage: As a command
There is a CLI command, sudachipy.
$ echo "外国人参政権" | sudachipy
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国 名詞,普通名詞,一般,*,*,* 外国
人 接尾辞,名詞的,一般,*,*,* 人
参政 名詞,普通名詞,一般,*,*,* 参政
権 接尾辞,名詞的,一般,*,*,* 権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
[-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-s string sudachidict type
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
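When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that builds an argument list from the options shown in the help text above. The helper name `build_tokenize_argv` and its parameter names are hypothetical (they are not part of SudachiPy), and the function only builds the list; it does not run the command.

```python
from typing import List, Optional, Sequence


def build_tokenize_argv(
    files: Sequence[str] = (),
    mode: Optional[str] = None,
    all_fields: bool = False,
    dict_type: Optional[str] = None,
    output: Optional[str] = None,
    setting_file: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `sudachipy tokenize` (hypothetical helper)."""
    argv = ["sudachipy", "tokenize"]
    if setting_file is not None:
        argv += ["-r", setting_file]  # setting file in JSON format
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError(f"mode must be A, B, or C, got {mode!r}")
        argv += ["-m", mode]  # split mode
    if output is not None:
        argv += ["-o", output]  # output file
    if dict_type is not None:
        argv += ["-s", dict_type]  # sudachidict type
    if all_fields:
        argv.append("-a")  # print all of the fields
    argv.extend(files)  # input files, if any
    return argv
```

The resulting list can be passed to `subprocess.run(argv, input=text, capture_output=True, text=True)` when the input comes from a string rather than from files.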
Output
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the -a option, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - 0 for the system dictionary
  - 1 and above for the user dictionaries
  - -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0
EOS
$ echo "阿quei" | sudachipy -a
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 (OOV)
quei 名詞,普通名詞,一般,*,*,* quei quei -1 (OOV)
EOS
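Because the output is tab separated with a fixed column order, it is straightforward to read back into a script. The sketch below parses captured CLI output into simple records. The names `TokenLine` and `parse_output` are hypothetical, and the code assumes the column order described above, with the reading form present as a (possibly empty) column on every -a line.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class TokenLine:
    surface: str
    pos: List[str]
    normalized_form: str
    # The fields below are only filled in for `-a` output.
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    is_oov: bool = False


def parse_output(text: str) -> Iterator[List[TokenLine]]:
    """Yield one list of TokenLine per sentence; sentences end with an EOS line."""
    sentence: List[TokenLine] = []
    for line in text.splitlines():
        if line == "EOS":
            yield sentence
            sentence = []
            continue
        if not line:
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            raise ValueError(f"expected at least 3 tab-separated columns: {line!r}")
        token = TokenLine(cols[0], cols[1].split(","), cols[2])
        if len(cols) >= 6:  # assumed layout of `-a` output
            token.dictionary_form = cols[3]
            token.reading_form = cols[4]
            token.dictionary_id = int(cols[5])
            token.is_oov = token.dictionary_id == -1
        sentence.append(token)
    if sentence:  # trailing tokens without a final EOS
        yield sentence
```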
Usage: As a Python package
Here is an example:
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular Tokenization
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']
mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']
mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
(With the 20200330 core dictionary. The results may change when you use other versions.)
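The morpheme accessors shown above can be gathered into one record per token, which is convenient for logging or for building a table. This is a minimal sketch; the helper name `describe` is hypothetical, and it relies only on the `tokenize`, `surface`, `dictionary_form`, `reading_form`, `normalized_form`, and `part_of_speech` calls used in the example above. The tokenizer is passed in as an argument, so the helper itself does not import SudachiPy.

```python
from typing import Any, Dict, List


def describe(tokenizer_obj: Any, text: str, mode: Any) -> List[Dict[str, Any]]:
    """Return one dict of morpheme information per token (hypothetical helper)."""
    rows = []
    for m in tokenizer_obj.tokenize(text, mode):
        rows.append({
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading_form": m.reading_form(),
            "normalized_form": m.normalized_form(),
            "part_of_speech": list(m.part_of_speech()),
        })
    return rows
```

With SudachiPy and a dictionary installed, it could be called as `describe(dictionary.Dictionary().create(), "国家公務員", tokenizer.Tokenizer.SplitMode.A)`.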
Dictionary Edition
**WARNING: sudachipy link is no longer available in SudachiPy v0.5.2 and later.**
There are three editions of Sudachi Dictionary, namely, small, core, and full. See WorksApplications/SudachiDict for details.
SudachiPy uses sudachidict_core by default.
Dictionaries are installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.
The dictionary files are not in the packages themselves; they are downloaded upon installation.
Dictionary option: command line
You can specify the dictionary with the tokenize option -s.
$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
Dictionary option: Python package
You can specify the dictionary with the Dictionary() arguments config_path or dict_type.
class Dictionary(config_path=None, resource_dir=None, dict_type=None)
- config_path
  - You can specify the file path to the setting file with config_path (see Dictionary in The Setting File for details).
  - If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
- dict_type
  - You can also specify the dictionary type with dict_type.
  - The available arguments are small, core, or full.
  - If different dictionaries are specified with config_path and dict_type, the dictionary defined by dict_type overrides the one defined in the config file.
from sudachipy import tokenizer
from sudachipy import dictionary
# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()
# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()
# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create() # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create() # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create() # sudachidict_full
# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()
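Since each edition is a separate Python package, a script may want to check which editions are installed before passing dict_type. The sketch below uses the standard library's `importlib.util.find_spec` for that check. The helper names `installed_dict_types` and `pick_dict_type` are hypothetical, and the code assumes the package names listed above (sudachidict_small, sudachidict_core, sudachidict_full).

```python
import importlib.util
from typing import List, Optional, Sequence

DICT_TYPES = ("small", "core", "full")


def installed_dict_types() -> List[str]:
    """Return the dictionary editions whose packages can be found."""
    found = []
    for dict_type in DICT_TYPES:
        # find_spec returns None when the top-level package is not installed.
        if importlib.util.find_spec(f"sudachidict_{dict_type}") is not None:
            found.append(dict_type)
    return found


def pick_dict_type(
    preferred: str,
    fallback: str = "core",
    available: Optional[Sequence[str]] = None,
) -> str:
    """Return `preferred` if installed, else `fallback`; raise if neither is."""
    if available is None:
        available = installed_dict_types()
    if preferred in available:
        return preferred
    if fallback in available:
        return fallback
    raise RuntimeError(
        f"neither sudachidict_{preferred} nor sudachidict_{fallback} is installed"
    )
```

The returned string can then be passed as `dictionary.Dictionary(dict_type=...)`.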
Dictionary in The Setting File
Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.
{
"systemDict" : "relative/path/to/system.dic",
...
}
The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
User Dictionary
To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.
{
"userDict" : ["relative/path/to/user.dic"],
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
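The relative paths in userDict are easy to get wrong when the setting file and the dictionaries live in different directories. The following sketch copies an existing setting file and fills in userDict with paths computed relative to the new file's location. The helper name `write_setting_with_user_dicts` is hypothetical, and the code assumes the base setting file is plain JSON; all keys other than userDict are carried over unchanged.

```python
import json
import os
from typing import Any, Dict, Sequence


def write_setting_with_user_dicts(
    base_setting_path: str,
    output_setting_path: str,
    user_dic_paths: Sequence[str],
) -> Dict[str, Any]:
    """Write a copy of the base settings with `userDict` set (hypothetical helper)."""
    with open(base_setting_path, encoding="utf-8") as f:
        settings = json.load(f)

    # userDict entries are relative to the directory holding the setting file.
    setting_dir = os.path.dirname(os.path.abspath(output_setting_path))
    os.makedirs(setting_dir, exist_ok=True)
    settings["userDict"] = [
        os.path.relpath(os.path.abspath(path), setting_dir)
        for path in user_dic_paths
    ]

    with open(output_setting_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings
```

The written file can then be passed to the CLI with the -r option, or to `Dictionary(config_path=...)` in Python.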
You can build a user dictionary with the subcommand ubuild.
WARNING: v0.3.* ubuild contains a bug.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary path (default: system core dictionary path)
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
Customized System Dictionary
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
For Developers
Cython Build
$ python setup.py build_ext --inplace
Code Format
Run scripts/format.sh to check if your code is formatted correctly.
You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).
Test
Run scripts/test.sh to run the tests.
Contact
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!