TinySegmenter
TinySegmenter, a super compact Japanese tokenizer, was originally created by Taku Kudo (c) 2008 for JavaScript under the terms of the New BSD License. For details, see here.
tinysegmenter for Python 2.x was written by Masato Hagiwara. For information about him, see here.
This version of tinysegmenter was modified by Tatsuro Yasukawa to support both Python 3.x and Python 2.x and packaged for distribution. Additionally, it was made faster thanks to @chezou, @cocoatomo and @methane.
See info about tinysegmenter
Installation
pip install tinysegmenter3
Usage
import tinysegmenter
statement = '私はpython大好きStanding Engineerです.'
tokenized_statement = tinysegmenter.tokenize(statement)
print(tokenized_statement)
# ['私', 'は', 'python', '大好き', 'Standing', ' Engineer', 'です', '.']
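The sample output above includes a token with a leading space (' Engineer'), so downstream code may want to normalize tokens before counting or comparing them. The following is a minimal sketch of one way to do that, not part of the library itself: `token_frequencies` is a hypothetical helper name, and it assumes only that the tokenizer passed in is a callable that takes a string and returns a list of strings, as `tinysegmenter.tokenize` does in the example above.

```python
from collections import Counter


def token_frequencies(text, tokenize):
    """Count how often each token occurs in `text`.

    `tokenize` is any callable mapping a string to a list of tokens
    (for example, tinysegmenter.tokenize). This helper is an
    illustrative assumption, not an API provided by tinysegmenter.
    """
    # Strip surrounding whitespace so ' Engineer' and 'Engineer' count as one token.
    stripped = (token.strip() for token in tokenize(text))
    # Drop tokens that were whitespace only and are now empty.
    return Counter(token for token in stripped if token)
```

With the package installed, this could be called as `token_frequencies(statement, tinysegmenter.tokenize)`. Passing the tokenizer in as an argument keeps the helper independent of any particular tokenizer.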
Test Text
The test text (in the tests directory) is The Time Machine by H.G. Wells, translated into Japanese by Hiroo Yamagata under the CC BY-SA 2.0 License.
How to Run the Tests
Install the requirements from requirements.txt:
pip install -r requirements.txt
Then run the tests:
./runtests.sh