Low-Resource POS-Tagging: 2014

Author: Dan Garrette (dhg@cs.utexas.edu)

This is a rewritten version of the code used in the papers:

Learning a Part-of-Speech Tagger from Two Hours of Annotation
Dan Garrette and Jason Baldridge
In Proceedings of NAACL 2013

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
Dan Garrette, Jason Mielens, and Jason Baldridge
In Proceedings of ACL 2013

This archive contains code, written in Scala, for training and tagging using the approach described in the papers. You do not need Scala installed to run the software, because Scala runs on the Java Virtual Machine (JVM). If you have Java installed, you should be able to run the system as described below.

Setting things up

Getting the code

Clone the project:

$ git clone https://github.com/dhgarrette/low-resource-pos-tagging-2014.git

The rest of these instructions assume starting from the low-resource-pos-tagging-2014 directory.

Compile the project

$ ./compile

NOTE: You need an internet connection the first time you run this, because it downloads several libraries that the code requires.

Running the system

$ ./run OPTIONS

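OPTIONS fall into four groups (the option names listed here are the ones that appear in the examples below):

Training data: --rawFile, --toksupFile, --typesupFile. If --rawFile is given, --toksupFile or --typesupFile (or both) must also be given.

Model serialization file: --modelFile. Required if training data is not given.

Data to run the tagger on: --inputFile, --outputFile, --evalFile.

Additional options, for example --memmCutoff.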

For example:

$ ./run --rawFile data/raw.txt --toksupFile data/toksup.txt --typesupFile data/typesup.txt --modelFile data/model.ser --memmCutoff 10
$ ./run --modelFile data/model.ser --inputFile data/input.txt --outputFile data/output.txt
$ ./run --modelFile data/model.ser --evalFile data/eval.txt

Note: You should set the JAVA_OPTS environment variable to increase the available memory:

$ export JAVA_OPTS="-Xmx4g"

Data Format

Unannotated files (rawFile, inputFile) should contain whitespace-separated tokens, one sentence per line:

the man chases a cat .
the dog chases a man .
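As a quick sanity check before training, it can help to count the sentences and tokens in an unannotated file. The following is a minimal sketch that is not part of this project; the function name corpus_stats is hypothetical, and it assumes only the one-sentence-per-line, whitespace-separated layout described above.

```shell
# Hypothetical helper (not part of the project): count sentences and tokens.
# Each non-empty line counts as one sentence; each whitespace-separated
# field counts as one token.
corpus_stats() {
  awk 'NF > 0 { sentences++; tokens += NF }
       END { printf "%d sentences, %d tokens\n", sentences, tokens }' "$1"
}
```

For example, `corpus_stats data/raw.txt` prints a one-line summary for the raw file used in the training example.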

Annotated files (toksupFile, typesupFile, evalFile) should contain whitespace-separated tokens, one sentence per line, where each token has the form word|tag:

the|D man|N sees|V the|D dog|N .|.
the|D dog|N runs|V .|.
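The annotated format can be checked, or converted to the unannotated format, with a few lines of awk. This is a sketch under stated assumptions rather than anything shipped with the project: the function names check_annotated and strip_tags are hypothetical, and the last | in a token is assumed to separate the word from its tag.

```shell
# Hypothetical helpers (not part of the project).
# Assumption: the LAST "|" in a token separates the word from its tag.

# Report tokens that lack a "|", have an empty word, or have an empty tag.
# Exits with status 1 if any bad token is found, 0 otherwise.
check_annotated() {
  awk 'BEGIN { bad = 0 }
  {
    for (i = 1; i <= NF; i++) {
      if (!match($i, /[|][^|]*$/) || RSTART == 1 || RLENGTH == 1) {
        printf "line %d: bad token \"%s\"\n", NR, $i
        bad = 1
      }
    }
  }
  END { exit bad }' "$1"
}

# Drop the "|tag" suffix from every token, producing unannotated text.
strip_tags() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if (match($i, /[|][^|]*$/)) $i = substr($i, 1, RSTART - 1)
    }
    print
  }' "$1"
}
```

For example, `check_annotated data/toksup.txt` lists any malformed tokens, and `strip_tags data/eval.txt > data/input.txt` would derive an untagged input file from an evaluation file.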

Universal Tagset Mappings for Malagasy and Kinyarwanda

If you want to use Universal POS Tags, please use the following mappings, created by Long Duong:

Kinyarwanda

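| No | Kinyarwanda Tag | Universal Tag | Description |
|----|-----------------|---------------|-------------|
| 1 | , | PUNCT | Comma character |
| 2 | . | PUNCT | Dot character |
| 3 | ADJ | ADJ | Adjective |
| 4 | ADV | ADV | Adverb |
| 5 | C | CONJ | Conjunction |
| 6 | CC | CONJ | Conjunction |
| 7 | DT | DET | Determiner |
| 8 | N | NOUN | Noun |
| 9 | PREP | ADP | Preposition |
| 10 | V | VERB | Verb |
| 11 | X | X | Foreign words |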

Malagasy

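| No | Malagasy Tag | Universal Tag | Description |
|----|--------------|---------------|-------------|
| 1 | , | PUNC | Comma character |
| 2 | : | PUNC | Semicolon character |
| 3 | . | PUNC | Dot character |
| 4 | ... | PUNC | Ellipsis |
| 5 | " | PUNC | Quotation character |
| 6 | @-@ | PUNC | Dash character |
| 7 | ADJ | ADJ | Adjective |
| 8 | ADV | ADV | Adverb |
| 9 | C | CONJ | Conjunction |
| 10 | CONJ | CONJ | Conjunction |
| 11 | DT | DET | Determiner |
| 12 | FOC | DET | Focus marker (similar to determiner) |
| 13 | -LRB- | PUNC | Left round bracket |
| 14 | N | NOUN | Noun |
| 15 | NEG | ADV | Negation |
| 16 | PCL | PRT | Particle |
| 17 | PN | NOUN | Proper noun |
| 18 | PREP | ADP | Preposition |
| 19 | PRO | PRON | Pronoun |
| 20 | -RRB- | PUNC | Right round bracket |
| 21 | T | VERB | Passive verb |
| 22 | V | VERB | Normal verb |
| 23 | X | X | Foreign root |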
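A mapping like the ones above can be applied to an annotated file with a short awk script. The sketch below is illustrative only and is not part of this project: the function name map_tags and the two-column mapping file (one "SOURCE_TAG UNIVERSAL_TAG" pair per line, transcribed by hand from the table) are assumptions. Tags that do not appear in the mapping file are left unchanged.

```shell
# Hypothetical helper (not part of the project).
# $1: mapping file, one "SOURCE_TAG UNIVERSAL_TAG" pair per line (assumed format)
# $2: annotated file with word|tag tokens, one sentence per line
map_tags() {
  awk 'NR == FNR { map[$1] = $2; next }   # first file: load the mapping
  {
    for (i = 1; i <= NF; i++) {
      # The last "|" is assumed to separate the word from its tag.
      if (match($i, /[|][^|]*$/)) {
        word = substr($i, 1, RSTART - 1)
        tag = substr($i, RSTART + 1)
        if (tag in map) $i = word "|" map[tag]
      }
    }
    print
  }' "$1" "$2"
}
```

For instance, with a mapping file containing the lines "DT DET", "N NOUN", "V VERB" and ". PUNC" (taken from the Malagasy table), the sentence the|DT dog|N runs|V .|. becomes the|DET dog|NOUN runs|VERB .|PUNC.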