HunTag3 - A sequential tagger for NLP combining the Scikit-learn/LogisticRegression linear classifier and Hidden Markov Models

Based on training data, HunTag3 can perform any kind of sequential sentence
tagging and has been used for NP chunking and Named Entity Recognition for English and Hungarian.

HunTag3 is the official successor of the HunTag project. A previous version of the code is stored at https://github.com/ppke-nlpg/HunTag3 . (See the git tags for past major milestones.)

Highlights

Requirements

Installation

The authors recommend using HunTag3 within emtsv, the new version of the e-magyar language processing system, where the module is called emChunk/emNER.

Standalone installation

Download and install the wheel package or use this repository with the following commands:

Data format

Features

The flexibility of HunTag comes from the fact that it can generate any kind of feature from the input data, given the appropriate Python functions (please refer to features.py and the config files).
Several dozen features regularly used in NLP tasks are already implemented in features.py, and users are encouraged to add any number of their own.

Once the desired features are implemented, a data set and a configuration file listing the feature functions to be used are all HunTag needs to perform training and tagging.
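To illustrate what a user-defined feature function might look like, here is a minimal sketch. The function names and the signature (one token in, a list of string values out) are assumptions made for this example; the actual interface is defined in features.py and may differ.

```python
# Hypothetical token-level feature functions; the real signatures live in
# features.py and may differ from this sketch.

def is_capitalized(token):
    """Return ['1'] if the token starts with an uppercase letter, else ['0']."""
    return [str(int(token[:1].isupper()))]


def last_n_chars(token, n=3):
    """Return the last n characters of the token as a single feature value."""
    return [token[-n:]]


def featurize_sentence(tokens, functions):
    """Apply every feature function to every token, collecting name=value strings."""
    return [
        [f"{func.__name__}={value}" for func in functions for value in func(token)]
        for token in tokens
    ]


features = featurize_sentence(["Budapest", "is", "big"], [is_capitalized, last_n_chars])
for token_features in features:
    print(token_features)
```

Each token ends up with one list of feature strings, which is the kind of input a per-token classifier can consume.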

Config file

The configuration file lists the features that are to be used for a given task. The file may start with a section specifying default values (cutoff and radius) for features. This section is optional. Example:

default:
    cutoff: 1  # 1 if not set
    radius: 5  # -1 if not set

There are three types of features:

For each feature mandatory fields are the following:

See the configs folder for examples of the format.
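The fallback values noted in the comments of the default example (cutoff 1, radius -1) can be sketched as follows. This assumes the configuration has already been loaded into a plain dictionary, for instance by a YAML parser; the helper name `resolve_defaults` is hypothetical and is not part of HunTag3.

```python
# Fallbacks as stated in the config example: cutoff is 1 and radius is -1
# when they are not set.
FALLBACKS = {"cutoff": 1, "radius": -1}


def resolve_defaults(config):
    """Return the effective 'default' section, filling in any missing keys."""
    given = config.get("default") or {}
    return {key: given.get(key, fallback) for key, fallback in FALLBACKS.items()}


print(resolve_defaults({"default": {"cutoff": 1, "radius": 5}}))
print(resolve_defaults({}))
```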

Usage

HunTag may be run in any of the following modes (see the Makefile for an overview and python3 -m huntag --help for details):

train and train-featurize

Used to train a model or just featurize given a training corpus with a set of feature functions. When run in TRAIN mode, HunTag creates three files, one containing the model and two listing features and labels and the integers they are mapped to when passed to the learner. With the --model option set to NAME, the three files will be stored under NAME.model, NAME.featureNumbers.gz and NAME.labelNumbers.gz respectively.

 cat TRAINING_DATA | python3 -m huntag train OPTIONS
 or
 python3 -m huntag train -i TRAINING_DATA OPTIONS  
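The idea of mapping features and labels to integers before they reach the learner, and back again afterwards, can be sketched like this. The class `NumberMap` is a hypothetical illustration, not HunTag3's own data structure, and it says nothing about the on-disk format of the .featureNumbers.gz and .labelNumbers.gz files.

```python
class NumberMap:
    """Bidirectional mapping between names (features or labels) and integers."""

    def __init__(self):
        self.name_to_number = {}
        self.number_to_name = []

    def get_number(self, name):
        """Return the integer for name, assigning the next free one if it is new."""
        if name not in self.name_to_number:
            self.name_to_number[name] = len(self.number_to_name)
            self.number_to_name.append(name)
        return self.name_to_number[name]

    def get_name(self, number):
        """Translate an integer back to the original name."""
        return self.number_to_name[number]


labels = NumberMap()
encoded = [labels.get_number(tag) for tag in ["B-NP", "I-NP", "O", "B-NP"]]
decoded = [labels.get_name(number) for number in encoded]
print(encoded, decoded)
```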

Mandatory options:

Non-mandatory options:

transmodel-train

Used to train a transition model (from a bigram or trigram language model) using a given field of the training data.

 cat TRAINING_DATA | python3 -m huntag transmodel-train OPTIONS
 or  
 python3 -m huntag transmodel-train -i TRAINING_DATA OPTIONS  
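As a rough illustration of what a bigram transition model captures, the sketch below estimates P(label | previous label) by plain relative frequency. It is not HunTag3's implementation: the function name and the `<s>` start symbol are assumptions, and no smoothing or trigram support is included.

```python
from collections import Counter, defaultdict


def train_bigram_transitions(label_sequences, start="<s>"):
    """Estimate P(label | previous label) by relative frequency (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for labels in label_sequences:
        previous = start
        for label in labels:
            pair_counts[previous][label] += 1
            previous = label
    return {
        previous: {label: count / sum(counts.values()) for label, count in counts.items()}
        for previous, counts in pair_counts.items()
    }


model = train_bigram_transitions([["B-NP", "I-NP", "O"], ["B-NP", "O"]])
print(model["B-NP"])
```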

Mandatory options:

Non-mandatory options:

tag or tag-featurize

Used to tag or just featurize the input. Given a maxent model providing the value P(l|w) for all labels l and words (set of feature values) w, and a transition model supplying P(l|l0) for all pairs of labels, HunTag will assign to each sentence the most likely label sequence.

 cat INPUT | python3 -m huntag tag OPTIONS
 or  
 python3 -m huntag tag -i INPUT OPTIONS
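Finding the most likely label sequence from per-token label probabilities and label transition probabilities is typically done with Viterbi decoding. The following is a minimal sketch of that technique for a bigram transition model; the function name, the `<s>` start symbol, the probability floor and the simple product of the two probabilities are assumptions of this example and may not match how HunTag3 combines the models.

```python
import math


def viterbi(emissions, transitions, labels, start="<s>", floor=1e-12):
    """Return the most likely label sequence.

    emissions: one dict per token, mapping label -> P(l|w)
    transitions: dict mapping previous label -> {label: P(l|l0)}
    """
    def log_prob(p):
        # Floor the probability so that unseen events do not produce log(0).
        return math.log(max(p, floor))

    def trans(prev, label):
        return log_prob(transitions.get(prev, {}).get(label, 0.0))

    if not emissions:
        return []

    # scores[l] = best log-score of any path ending in label l at the current token
    scores = {l: trans(start, l) + log_prob(emissions[0].get(l, 0.0)) for l in labels}
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = {}, {}
        for l in labels:
            best_prev = max(labels, key=lambda p: scores[p] + trans(p, l))
            new_scores[l] = scores[best_prev] + trans(best_prev, l) + log_prob(emission.get(l, 0.0))
            pointers[l] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda l: scores[l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["B", "I", "O"]
transitions = {
    "<s>": {"B": 0.6, "O": 0.4},
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.1, "I": 0.3, "O": 0.6},
    "O": {"B": 0.5, "O": 0.5},
}
emissions = [
    {"B": 0.7, "I": 0.2, "O": 0.1},
    {"B": 0.1, "I": 0.6, "O": 0.3},
    {"B": 0.1, "I": 0.1, "O": 0.8},
]
print(viterbi(emissions, transitions, labels))
```

Working in log space avoids numerical underflow on long sentences.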

Mandatory options:

Non-mandatory options:

most-informative-features

Generates a feature ranking by computing, for each feature, its frequency and the probability of each label given that feature (that is, its correlation with the labels), then sorting the results in decreasing order of confidence and frequency. This output is useful for inspecting feature quality.

cat TRAINING_DATA | python3 -m huntag most-informative-features OPTIONS > modelName.most_informative_features
or  
python3 -m huntag most-informative-features -i TRAINING_DATA  OPTIONS
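One way such a ranking could be computed is sketched below: count how often each feature occurs, how often it co-occurs with each label, and sort by the resulting label probability and frequency. The function name and the input layout are assumptions for this example, and the exact statistics and output format of HunTag3 may differ.

```python
from collections import Counter, defaultdict


def rank_features(examples):
    """Rank (feature, label) pairs by P(label | feature), then by feature frequency.

    examples: iterable of (features, label) pairs, one per token.
    """
    feature_counts = Counter()
    pair_counts = defaultdict(Counter)
    for features, label in examples:
        for feature in set(features):
            feature_counts[feature] += 1
            pair_counts[feature][label] += 1

    ranking = [
        (count / feature_counts[feature], feature_counts[feature], feature, label)
        for feature, label_counts in pair_counts.items()
        for label, count in label_counts.items()
    ]
    # Highest confidence first; ties broken by higher frequency, then by name.
    ranking.sort(key=lambda row: (-row[0], -row[1], row[2], row[3]))
    return ranking


examples = [
    (["cap=1", "suffix=est"], "B-LOC"),
    (["cap=1"], "B-PER"),
    (["cap=0"], "O"),
    (["cap=0"], "O"),
]
ranking = rank_features(examples)
for confidence, frequency, feature, label in ranking:
    print(f"{feature}\t{label}\t{confidence:.2f}\t{frequency}")
```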

Mandatory options:

Non-mandatory options:

tag --print-weights N

Useful for inspecting the feature weights (per label) assigned by the MaxEnt learner. (As the name suggests, training must happen before tagging.)
Negative weights indicate negative correlation with the label, which is also useful information.

 python3 -m huntag tag --print-weights N OPTIONS > modelName.modelWeights
 or  
 python3 -m huntag tag --print-weights N OPTIONS -o modelName.modelWeights  
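Selecting the N strongest weights per label can be sketched as follows. The weights are assumed to be available as one row per label and one column per feature; ranking by absolute value (so that strongly negative weights are also shown) is an assumption of this sketch and not necessarily what --print-weights does.

```python
def top_weights(weights, feature_names, label_names, n):
    """For each label, return the n features with the largest absolute weight.

    weights: list of rows, one row per label and one column per feature.
    """
    result = {}
    for label, row in zip(label_names, weights):
        ranked = sorted(zip(feature_names, row), key=lambda pair: abs(pair[1]), reverse=True)
        result[label] = ranked[:n]
    return result


weights = [[1.5, -2.0, 0.1], [-0.3, 0.2, 0.9]]
top = top_weights(weights, ["cap=1", "cap=0", "suffix=est"], ["B-LOC", "O"], n=2)
for label, pairs in top.items():
    print(label, pairs)
```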

Mandatory options:

Non-mandatory options:

train-featurize and tag-featurize

These options generate suitable input for CRFsuite from training and tagging data. A model name is required because the features and labels are translated to numbers and back. CRFsuite uses its own bigram model.
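CRFsuite reads one item per line, with the label followed by tab-separated attributes, and an empty line after each sequence. A minimal sketch of producing that layout from numbered features and labels is shown below; the function name and the input structure are assumptions for this example, and the exact output of train-featurize and tag-featurize may differ.

```python
def to_crfsuite(sentences):
    """Serialize sentences as tab-separated items, with a blank line after each sentence.

    sentences: list of sentences; each sentence is a list of (label, features) pairs.
    """
    lines = []
    for sentence in sentences:
        for label, features in sentence:
            lines.append("\t".join([str(label)] + [str(f) for f in features]))
        lines.append("")  # an empty line terminates the sequence
    return "\n".join(lines) + "\n"


data = [[(0, [3, 7]), (1, [4])], [(2, [5])]]
text = to_crfsuite(data)
print(text, end="")
```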

Usage examples

A 100-token example can be found in the git repository to clarify the format to be used.

Basic usage: train-tag

# train
cat input.txt | python3 -m huntag train --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml
# transmodel-train
cat input.txt | python3 -m huntag transmodel-train --model=modelName  # --trans-model-order [2 or 3, default: 3]
# tag
cat input.txt | python3 -m huntag tag --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml

Advanced usage (for example with CRFsuite):

# train-featurize
cat input.txt | python3 -m huntag train-featurize --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.CRFsuite.train
# tag-featurize
cat input.txt | python3 -m huntag tag-featurize --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.CRFsuite.tag

Debugging features:

# most-informative-features
cat input.txt | python3 -m huntag most-informative-features --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.most_informative_features 
# tag FeatureWeights
cat input.txt | python3 -m huntag tag --print-weights 100 --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.modelWeights

Models

Pretrained models for Hungarian are available in the models directory.

Authors

HunTag3 is a massive overhaul, cleanup and functional extension of the original HunTag idea and codebase. HunTag3 was created by Balázs Indig with contributions from Márton Miháltz.

HunTag was created by Gábor Recski and Dániel Varga. It is a reimplementation and generalization of a Named Entity Recognizer built by Dániel Varga and Eszter Simon.

The patch for Liblinear (to lower memory usage) was created by Attila Zséder. See the following link for details: http://www.csie.ntu.edu.tw/~cjlin/liblinear/faqfiles/python_datastructures.html

License

HunTag3 is made available under the GNU Lesser General Public License v3.0. If you received HunTag3 in a package that also contains the Hungarian training corpora for the named-entity recognition or chunking tasks, then please note that these corpora are derivative works based on the Szeged Treebank, and they are made available under the same restrictions that apply to the original Szeged Treebank.

Reference

This tool is also integrated into the e-magyar language processing system. It is called emChunk/emNER.

If you use the tool, please cite the following paper:

István Endrédy and Balázs Indig (2015): HunTag3: a general-purpose, modular sequential tagger -- chunking phrases in English and maximal NPs and NER for Hungarian
In: Zygmunt Vetulani; Joseph Mariani (eds.) 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. (2015.11.27-2015.11.30, Poznań, Poland)
Poznań: Uniwersytet im. Adama Mickiewicza w Poznaniu. 558 p. ISBN: 978-83-932640-8-7, pp. 213-218.

@inproceedings{HunTag3,  
 title       = {{HunTag3:} a general-purpose, modular sequential tagger -- chunking phrases in {English and maximal NPs and NER for Hungarian}},
 author      = {Endr\'edy, Istv\'an and Indig, Bal\'azs},
 booktitle   = {7th {L}anguage \& {T}echnology {C}onference, {Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC '15)}},
 year        = {2015},
 month       = {November},
 publisher   = {Pozna\'n: {U}niwersytet im. {Adama Mickiewicza w Poznaniu}},
 isbn        = {978-83-932640-8-7},
 pages       = {213-218},
 address     = {{P}ozna\'n, {P}oland}}  

If you use some specialized version for Hungarian, please also cite the following paper:

Dóra Csendes, János Csirik, Tibor Gyimóthy and András Kocsor (2005): The Szeged Treebank. In: Text, Speech and Dialogue. Lecture Notes in Computer Science Volume 3658/2005, Springer: Berlin. pp. 123-131.

@inproceedings{Csendes:2005,  
 author={Csendes, D{\'o}ra and Csirik, J{\'a}nos and Gyim{\'o}thy, Tibor and Kocsor, Andr{\'a}s},
 title={The {S}zeged {T}reebank},
 booktitle={Lecture Notes in Computer Science: Text, Speech and Dialogue},
 year={2005},
 pages={123-131},
 publisher={Springer}}