Vitk -- A Vietnamese Text Processing Toolkit


NOTE: This repository is now obsolete. Interested programmers should consider using the new repository vlp (github.com/phuonglh/vlp). We have preferred Scala over Java since 2016.

NOTE: Since early 2018, a new, updated and lightweight toolkit that does not use Apache Spark has been available as vnTokenizer and vnTagger, together with an online demo website.

This is the third release of a Vietnamese text processing toolkit called "Vitk", developed by Phuong LE-HONG at the College of Science, Vietnam National University in Hanoi.

Several toolkits for Vietnamese text processing have already been published. However, most of them do not scale readily to large data sets. This toolkit aims to process big text data, so it uses Apache Spark as its core platform. Apache Spark is a fast, general engine for large-scale data processing, which lets Vitk run as a cluster computing toolkit.

If you don't want to use Apache Spark, you can download a standalone Vietnamese tokenizer or tagger (vnTokenizer 5.1 and vnTagger) from their website; only one JAR file is needed to run the program.

Despite its name, this toolkit supports processing in various natural languages, provided that suitable underlying models or linguistic resources are available for those languages. The toolkit is packaged with models and resources for processing Vietnamese. Users can build models for other languages using the underlying tools.

Some examples:

Tools

Currently, Vitk consists of three fundamental tools for text processing:

  1. Word segmentation
  2. Part-of-speech tagging
  3. Dependency parsing

The word segmentation tool is specific to the Vietnamese language. The other tools are general and can be trained to parse any language. We are working to develop and integrate more fundamental tools into Vitk, such as named entity recognition, constituency parsing and opinion mining.

Setup and Compilation

Running

Data Files

Data files used by Vitk are stored in sub-directories of the directory dat, one for each of its integrated tools.

These folders can contain data specific to the natural language in use. Each language has its own sub-directory, named with an abbreviation of the language name, for example vi for Vietnamese, en for English, fr for French, etc.
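To check which languages a given data directory provides, the layout described above can be walked with a small shell function. This is a minimal sketch that assumes the structure is `<root>/<tool>/<language>/`; the function name `list_vitk_languages` is hypothetical and not part of Vitk.

```shell
# Print one "<tool> <language>" pair per language folder found under a data root.
# Assumption: data is laid out as <root>/<tool>/<language>/.
list_vitk_languages() {
  root="${1:-/export/dat}"
  for dir in "$root"/*/*/; do
    # Skip the unexpanded pattern when nothing matches.
    [ -d "$dir" ] || continue
    lang=$(basename "$dir")
    tool=$(basename "$(dirname "$dir")")
    printf '%s %s\n' "$tool" "$lang"
  done
}
```

Calling `list_vitk_languages dat` from the source tree lists the pairs found in the packaged data. Files stored directly inside a tool folder are not reported.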

Vitk can run in stand-alone cluster mode or on a real cluster. If it runs on a cluster, all machines in the cluster must be able to access the same data files, which are normally located in a shared directory readable by all the machines.

If you use a Unix-like operating system (Unix/Linux/MacOS), it is easy to share or "export" a directory via a network file system (NFS). By default, Vitk searches for data files in the directory /export/dat/. Therefore, you need to copy the sub-directories dat/* into that directory, so that you have some folders as follows:

If you run Vitk in stand-alone cluster mode, it is sufficient to create the data folders specified above on your single machine; the NFS setup can be ignored.
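Copying the packaged data into the default location could look like the sketch below. The function name `install_vitk_data` is hypothetical; the defaults assume you run it from the root of the Vitk source tree and that you have write permission on /export (which may require sudo).

```shell
# Copy the contents of a Vitk "dat" directory into the data root.
# Defaults: source "dat" (relative to the current directory), target /export/dat.
install_vitk_data() {
  src="${1:-dat}"
  dest="${2:-/export/dat}"
  # "src/." copies the contents of src, not the directory itself.
  mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}
```

Run `install_vitk_data` with no arguments to use the defaults, or pass a different source and target, for example `install_vitk_data dat /tmp/export/dat` for a trial run.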

Usage

Vitk is an Apache Spark application; you run it by submitting the main JAR file vn.vitk-3.0.jar to Apache Spark. The main class of the toolkit is vn.vitk.Vitk, which selects the desired tool according to the arguments provided by the user.
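As a rough illustration, the submission can be wrapped in a shell function. This sketch uses only the JAR name and main class given above and the standard `--class` option of spark-submit. The function name `run_vitk` and the `VITK_JAR` variable are hypothetical, and the Vitk arguments themselves are passed through unchanged rather than spelled out here.

```shell
# Submit Vitk to Apache Spark, forwarding all arguments to the main class.
# Assumption: spark-submit is on the PATH and the JAR is in the current
# directory unless VITK_JAR points elsewhere.
run_vitk() {
  spark-submit --class vn.vitk.Vitk "${VITK_JAR:-vn.vitk-3.0.jar}" "$@"
}
```

Everything after `run_vitk` on the command line reaches vn.vitk.Vitk as its program arguments.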

The general arguments of Vitk are as follows:

Note that if you are processing very large texts, you should consider setting appropriate options of the spark-submit command for better performance, in particular --executor-memory and --driver-memory. See more on submitting Apache Spark applications.
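A sketch of a submission with explicit memory settings follows. The 4g values are placeholders rather than recommendations, and the function name `run_vitk_large` and the variables `DRIVER_MEM` and `EXECUTOR_MEM` are hypothetical. Note that spark-submit options must appear before the JAR file; anything after the JAR is handed to the application.

```shell
# Submit Vitk with explicit driver and executor memory.
# Assumption: spark-submit is on the PATH and vn.vitk-3.0.jar is in the
# current directory. Memory sizes default to an illustrative 4g.
run_vitk_large() {
  spark-submit \
    --class vn.vitk.Vitk \
    --driver-memory "${DRIVER_MEM:-4g}" \
    --executor-memory "${EXECUTOR_MEM:-4g}" \
    vn.vitk-3.0.jar "$@"
}
```

Suitable sizes depend on the input and on the memory available on your machines, so treat the defaults as a starting point only.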

In addition to the general arguments above, each tool of Vitk requires its own arguments. The usage of each tool is described on its corresponding page:

  1. How to run word segmentation
  2. How to run part-of-speech tagging
  3. How to run dependency parsing

You can also import the source code of Vitk into your favorite IDE (Eclipse, NetBeans, etc.), then compile and run from source. For example, launch the class vn.vitk.tok.Tokenizer for word segmentation, providing appropriate arguments as described above.

Documentation

The algorithms used by the tools of Vitk can be found in related scientific publications. However, some of the main methods implemented in Vitk have been, and will be, described in a more accessible way in blog posts. For example, the word segmentation method is described in:

Contribution Guidelines

Contact

Any bug reports, suggestions and collaborations are welcome. I am reachable at: