
Raw text tokenisation for Hungarian (and, to a limited extent, English)

License: GNU LGPL

2003-2004 (c) Németh László

2013- (c) Zséder Attila

make
make install

Need


huntoken <input_raw_text >xml_output

-h, --help: show help
-r: sentence boundary detection only
-x: process without the hun_abbrev filter
-b: break long sentences (needed when tokenising sentences longer than 4000 characters)
-n: output without XML header and footer
-e: tokenise English (use English abbreviations)
-v, --version: show version
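
A minimal sketch of how the options above might be combined in a script. The helper name tokenise_file and any file names are hypothetical and not part of huntoken; the sketch only assumes that huntoken is installed on PATH and reads standard input and writes standard output, as in the usage line above.

```shell
#!/bin/sh
# tokenise_file SRC DST [huntoken options...]
# Hypothetical wrapper: feeds SRC to huntoken on stdin and writes the
# result to DST, passing any remaining arguments through unchanged.
tokenise_file() {
    src=$1
    dst=$2
    shift 2
    # Fail early with a clear message if huntoken is not installed.
    if ! command -v huntoken >/dev/null 2>&1; then
        echo "huntoken not found on PATH (run 'make install' first)" >&2
        return 127
    fi
    huntoken "$@" <"$src" >"$dst"
}
```

For example, `tokenise_file input.txt sentences.txt -r -n` would run sentence boundary detection only and omit the XML header and footer, while `tokenise_file input.txt output.xml -b` would also break long sentences.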

For details, see the flex sources and the huntoken shell script.