Warning: This repository might not contain the newest version of the source code. The development is continued at https://github.com/dlt-rilmta/HunTag3


HunTag3 - A sequential tagger for NLP combining the Scikit-learn/LogisticRegression linear classifier and Hidden Markov Models

Based on training data, HunTag3 can perform any kind of sequential sentence
tagging and has been used for NP chunking and Named Entity Recognition for English and Hungarian.

HunTag3 is the official successor of the HunTag project.
(See git tags for past major milestones.)

Highlights

Requirements

Installation

The authors recommend using HunTag3 in emtsv, the new version of the e-magyar language processing system, where the module is called emChunk/emNER.

Standalone installation

Data format

Features

The flexibility of HunTag comes from the fact that it can generate any kind of feature from the input data, given the appropriate Python functions (please refer to features.py and the config files).
Several dozen features regularly used in NLP tasks are already implemented in features.py; however, users are encouraged to add any number of their own.

Once the desired features are implemented, a data set and a configuration file listing the feature functions to be used are all HunTag needs to perform training and tagging.
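
As an illustration of what user-defined feature functions might look like, here is a minimal sketch. The function names, signatures and the `name=value` output convention are assumptions made for this example; they are not copied from features.py, which remains the reference for the real interface.

```python
# Hypothetical token-level feature functions (NOT taken from features.py).
# Each one takes a token string and returns a list of feature values.

def is_capitalized(token):
    """Return [1] if the token starts with an uppercase letter, else [0]."""
    return [int(token[:1].isupper())]


def char_ngrams(token, n=3):
    """Return all character n-grams of the token padded with boundary marks."""
    padded = '@' + token + '@'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def featurize_sentence(tokens, functions):
    """Apply every feature function to every token of a sentence.

    functions maps a (hypothetical) feature name to a feature function.
    The result holds one list of 'name=value' strings per token.
    """
    sentence_features = []
    for token in tokens:
        token_features = []
        for name, func in functions.items():
            for value in func(token):
                token_features.append('{0}={1}'.format(name, value))
        sentence_features.append(token_features)
    return sentence_features


if __name__ == '__main__':
    funcs = {'iscap': is_capitalized, 'ngr': char_ngrams}
    for feats in featurize_sentence(['Budapest', 'is'], funcs):
        print(feats)
```

Keeping each feature as a small pure function makes it easy to test in isolation before listing it in a configuration file.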

Config file

The configuration file lists the features to be used for a given task. It may start with an optional block specifying the default cutoff and radius for features. Example:

default:
    cutoff: 1  # 1 if not set
    radius: 5  # -1 if not set

There are three types of features:

For each feature, the mandatory fields are the following:

See configs folder for examples on the format.
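
The fallback behaviour of the optional default block described above (cutoff 1 and radius -1 when not set) could be sketched as follows. This assumes the configuration has already been parsed into a Python dict by a YAML parser; the function and constant names are hypothetical and do not come from the HunTag3 sources.

```python
# Fallback values as stated in the comments of the example default block.
FALLBACK = {'cutoff': 1, 'radius': -1}


def resolve_defaults(config):
    """Merge the optional 'default' block of a parsed config with the fallbacks.

    config is assumed to be the dict produced by parsing the YAML file.
    """
    resolved = dict(FALLBACK)
    # 'default' may be missing or empty, in which case only fallbacks apply
    resolved.update(config.get('default') or {})
    return resolved


if __name__ == '__main__':
    print(resolve_defaults({'default': {'cutoff': 1, 'radius': 5}}))
    print(resolve_defaults({}))
```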

Usage

HunTag may be run in any of the following modes (see startHuntag.sh for overview and huntag_main.py --help for details):

train and train-featurize

Used to train a model or just featurize given a training corpus with a set of feature functions. When run in TRAIN mode, HunTag creates three files, one containing the model and two listing features and labels and the integers they are mapped to when passed to the learner. With the --model option set to NAME, the three files will be stored under NAME.model, NAME.featureNumbers.gz and NAME.labelNumbers.gz respectively.

 cat TRAINING_DATA | python3 huntag_main.py train OPTIONS
 or
 python3 huntag_main.py train -i TRAINING_DATA OPTIONS  

Mandatory options:

Non-mandatory options:
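
The mapping of features and labels to integers mentioned for TRAIN mode can be pictured with the following minimal sketch. The class and method names are hypothetical, and the actual contents of the featureNumbers and labelNumbers files are not reproduced here.

```python
class NumberMap:
    """Assign consecutive integers to names on first sight and map them back."""

    def __init__(self):
        self.name_to_no = {}
        self.no_to_name = []

    def get_no(self, name):
        """Return the integer of a name, registering the name if it is new."""
        if name not in self.name_to_no:
            self.name_to_no[name] = len(self.no_to_name)
            self.no_to_name.append(name)
        return self.name_to_no[name]

    def get_name(self, no):
        """Translate an integer back to the original name."""
        return self.no_to_name[no]


if __name__ == '__main__':
    labels = NumberMap()
    numbers = [labels.get_no(label) for label in ['B-NP', 'I-NP', 'O', 'B-NP']]
    print(numbers)
    print([labels.get_name(no) for no in numbers])
```

Because the learner only sees integers, the same two mappings have to be available again at tagging time so that predictions can be translated back to label names.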

transmodel-train

Used to train a transition model (from a bigram or trigram language model) using a given field of the training data.

 cat TRAINING_DATA | python3 huntag_main.py transmodel-train OPTIONS
 or  
 python3 huntag_main.py transmodel-train -i TRAINING_DATA OPTIONS  

Mandatory options:

Non-mandatory options:
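
To make the idea of a transition model concrete, the sketch below estimates bigram transition probabilities P(l|l0) from label sequences by plain relative frequency. It is an assumption-laden illustration: the function name and the start symbol are hypothetical, no smoothing is applied, and HunTag3's own estimation may differ.

```python
from collections import Counter, defaultdict


def train_bigram_transitions(label_sequences, start='<s>'):
    """Estimate P(label | previous label) by relative frequency.

    label_sequences: one list of labels per sentence.
    start: hypothetical symbol standing for the sentence beginning.
    """
    pair_counts = defaultdict(Counter)
    for labels in label_sequences:
        prev = start
        for label in labels:
            pair_counts[prev][label] += 1
            prev = label

    probs = {}
    for prev, counts in pair_counts.items():
        total = sum(counts.values())
        probs[prev] = {label: c / total for label, c in counts.items()}
    return probs


if __name__ == '__main__':
    model = train_bigram_transitions([['B', 'I', 'O'], ['B', 'O']])
    for prev, dist in model.items():
        print(prev, dist)
```

A trigram model would condition on the two previous labels instead of one, at the cost of sparser counts.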

tag or tag-featurize

Used to tag or just featurize the input. Given a maxent model providing the value P(l|w) for all labels l and words (set of feature values) w, and a transition model supplying P(l|l0) for all pairs of labels, HunTag will assign to each sentence the most likely label sequence.

 cat INPUT | python3 huntag_main.py tag OPTIONS
 or  
 python3 huntag_main.py tag -i INPUT OPTIONS

Mandatory options:

Non-mandatory options:
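
Finding the most likely label sequence from per-token label probabilities P(l|w) and transition probabilities P(l|l0) is typically done with Viterbi decoding. The following is a minimal sketch of that technique for a bigram transition model; the function name, the start symbol and the probability floor are assumptions of this example, not HunTag3's actual implementation.

```python
import math


def viterbi(emissions, transitions, labels, start='<s>', floor=1e-12):
    """Return the most likely label sequence for one sentence.

    emissions:   one dict per token, label -> P(label | token features)
    transitions: dict prev_label -> dict label -> P(label | prev_label)
    floor:       stand-in probability for unseen events (avoids log(0))
    """
    if not emissions:
        return []

    def log_p(table, key):
        return math.log(max(table.get(key, 0.0), floor))

    # best[i][label] = (log score of the best path ending in label, backpointer)
    best = [{l: (log_p(transitions.get(start, {}), l) + log_p(emissions[0], l), None)
             for l in labels}]
    for emission in emissions[1:]:
        column = {}
        for l in labels:
            prev, score = max(
                ((p, best[-1][p][0] + log_p(transitions.get(p, {}), l)) for p in labels),
                key=lambda item: item[1])
            column[l] = (score + log_p(emission, l), prev)
        best.append(column)

    # Follow the backpointers from the best final label
    label = max(labels, key=lambda l: best[-1][l][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    return list(reversed(path))


if __name__ == '__main__':
    trans = {'<s>': {'B': 0.5, 'O': 0.5}, 'B': {'I': 0.5, 'O': 0.5},
             'I': {'I': 0.3, 'O': 0.7}, 'O': {'B': 0.5, 'O': 0.5}}
    emis = [{'B': 0.4, 'I': 0.5, 'O': 0.1}, {'B': 0.1, 'I': 0.6, 'O': 0.3}]
    print(viterbi(emis, trans, ['B', 'I', 'O']))
```

In the usage example the first token on its own prefers I, but a sentence cannot start with I under the toy transition model, so the decoder chooses B followed by I.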

most-informative-features

Generates a feature ranking by computing, for each feature, its label probabilities (for each label) and its frequency (correlations with labels), and sorts the result in decreasing order of confidence and frequency. This output is useful for inspecting feature quality.

cat TRAINING_DATA | python3 huntag_main.py most-informative-features OPTIONS > modelName.most_informative_features
or  
python3 huntag_main.py most-informative-features -i TRAINING_DATA  OPTIONS

Mandatory options:

Non-mandatory options:
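
The kind of ranking described for this mode can be approximated with the sketch below, which counts how often each feature co-occurs with each label and sorts by the resulting label probability and by frequency. The function name, the input layout and the tie-breaking are assumptions of this example; the real output format of HunTag3 is not reproduced.

```python
from collections import Counter, defaultdict


def rank_features(featurized_tokens):
    """Rank (feature, label) pairs by label probability, then by frequency.

    featurized_tokens: iterable of (label, list of feature strings) pairs.
    Returns tuples of (P(label | feature), feature frequency, feature, label).
    """
    feature_label_counts = defaultdict(Counter)
    for label, features in featurized_tokens:
        for feature in set(features):  # count each feature once per token
            feature_label_counts[feature][label] += 1

    rows = []
    for feature, counts in feature_label_counts.items():
        freq = sum(counts.values())
        for label, count in counts.items():
            rows.append((count / freq, freq, feature, label))

    # Decreasing confidence, then decreasing frequency; names break ties
    rows.sort(key=lambda r: (-r[0], -r[1], r[2], r[3]))
    return rows


if __name__ == '__main__':
    data = [('B', ['cap', 'long']), ('O', ['long']), ('B', ['cap'])]
    for row in rank_features(data):
        print(row)
```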

tag --print-weights N

Useful for inspecting the feature weights (per label) assigned by the MaxEnt learner. (As the name suggests, training must happen before tagging.)
Negative weights mean negative correlation, which is also useful.

 python3 huntag_main.py tag --print-weights N OPTIONS > modelName.modelWeights
 or  
 python3 huntag_main.py tag --print-weights N OPTIONS -o modelName.modelWeights  

Mandatory options:

Non-mandatory options:
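
One plausible way to inspect such weights is to list, for every label, the N features with the largest absolute weight, so that strongly negative features show up as well. The sketch below assumes a plain weight matrix with one row per label and one column per feature; the function name and this ranking criterion are assumptions, not a description of what --print-weights actually prints.

```python
def top_weights(weights, feature_names, labels, n):
    """Return the n largest-magnitude (weight, feature) pairs for every label.

    weights[i][j] is assumed to be the weight of feature j for label i.
    """
    report = {}
    for label, row in zip(labels, weights):
        ranked = sorted(zip(row, feature_names),
                        key=lambda pair: abs(pair[0]), reverse=True)
        report[label] = ranked[:n]
    return report


if __name__ == '__main__':
    weights = [[0.5, -2.0, 0.1],
               [1.5, 0.2, -0.3]]
    report = top_weights(weights, ['cap', 'lower', 'digit'], ['B', 'O'], n=2)
    for label, pairs in report.items():
        print(label, pairs)
```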

train-featurize and tag-featurize

These options generate suitable input for CRFsuite from training and tagging data. A model name is required because the features and labels are translated to numbers and back. CRFsuite uses its own bigram model.
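
For orientation, CRFsuite reads one token per line as tab-separated fields, with the label first and the attributes after it, and an empty line closing each sentence. The sketch below writes already numbered labels and features in that layout; the function name and the input structure are assumptions of this example and do not mirror HunTag3's internal code.

```python
def to_crfsuite(sentences):
    """Serialize featurized sentences in CRFsuite's tab-separated layout.

    sentences: list of sentences, each a list of (label, features) pairs,
    where labels and features are assumed to be already mapped to numbers.
    """
    lines = []
    for sentence in sentences:
        for label, features in sentence:
            lines.append('\t'.join([str(label)] + [str(f) for f in features]))
        lines.append('')  # an empty line terminates the sentence
    return '\n'.join(lines) + '\n' if lines else ''


if __name__ == '__main__':
    data = [[(0, [3, 7]), (1, [4])], [(2, [5])]]
    print(to_crfsuite(data), end='')
```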

Usage examples

A 100-token example can be found in the git repository to clarify the format to be used.

Basic usage: train-tag

# train
cat input.txt | python3 huntag_main.py train --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml
# transmodel-train
cat input.txt | python3 huntag_main.py transmodel-train --model=modelName  # --trans-model-order [2 or 3, default: 3]
# tag
cat input.txt | python3 huntag_main.py tag --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml

Advanced usage (for example with CRFsuite):

# train-featurize
cat input.txt | python3 huntag_main.py train-featurize --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.CRFsuite.train
# tag-featurize
cat input.txt | python3 huntag_main.py tag-featurize --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.CRFsuite.tag

Debugging features:

# most-informative-features
cat input.txt | python3 huntag_main.py most-informative-features --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.most_informative_features 
# tag FeatureWeights
cat input.txt | python3 huntag_main.py tag --print-weights 100 --model=modelName --config-file=configs/maxnp.szeged.emmorph.yaml > modelName.modelWeights

Models

Pretrained models for Hungarian are available in the models directory.

Authors

HunTag3 is a massive overhaul, cleanup and functional extension of the original HunTag idea and codebase. HunTag3 was created by Balázs Indig with contributions from Márton Miháltz.

HunTag was created by Gábor Recski and Dániel Varga. It is a reimplementation and generalization of a Named Entity Recognizer built by Dániel Varga and Eszter Simon.

The patch for Liblinear (to lower memory usage) was created by Attila Zséder. See the link for details: http://www.csie.ntu.edu.tw/~cjlin/liblinear/faqfiles/python_datastructures.html

License

HunTag3 is made available under the GNU Lesser General Public License v3.0. If you received HunTag3 in a package that also contains the Hungarian training corpora for the named-entity recognition or chunking tasks, please note that these corpora are derivative works based on the Szeged Treebank, and they are made available under the same restrictions that apply to the original Szeged Treebank.

Reference

This tool is also integrated into the e-magyar language processing system. It is called emChunk/emNER.

If you use the tool, please cite the following paper:

István Endrédy and Balázs Indig (2015): HunTag3: a general-purpose, modular sequential tagger -- chunking phrases in English and maximal NPs and NER for Hungarian
In: Zygmunt Vetulani; Joseph Mariani (eds.) 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. (2015.11.27-2015.11.30, Poznań, Poland)
"Poznań: Uniwersytet im. Adama Mickiewicza w Poznaniu" 558 p. ISBN:978-83-932640-8-7 pp. 213-218.

@inproceedings{HunTag3,
 title       = {{HunTag3:} a general-purpose, modular sequential tagger -- chunking phrases in {English and maximal NPs and NER for Hungarian}},
 author      = {Endr\'edy, Istv\'an and Indig, Bal\'azs},
 booktitle   = {7th {L}anguage \& {T}echnology {C}onference, {Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC '15)}},
 year        = {2015},
 month       = {November},
 publisher   = {Pozna\'n: {U}niwersytet im. {Adama Mickiewicza w Poznaniu}},
 isbn        = {978-83-932640-8-7},
 pages       = {213-218},
 address     = {{P}ozna\'n, {P}oland}
}

If you use some specialized version for Hungarian, please also cite the following paper:

Dóra Csendes, János Csirik, Tibor Gyimóthy and András Kocsor (2005): The Szeged Treebank. In: Text, Speech and Dialogue. Lecture Notes in Computer Science Volume 3658/2005, Springer: Berlin. pp. 123-131.

@inproceedings{Csendes:2005,
 author      = {Csendes, D{\'o}ra and Csirik, J{\'a}nos and Gyim{\'o}thy, Tibor and Kocsor, Andr{\'a}s},
 title       = {The {S}zeged {T}reebank},
 booktitle   = {Lecture Notes in Computer Science: Text, Speech and Dialogue},
 year        = {2005},
 pages       = {123-131},
 publisher   = {Springer}
}