emIOBUtils
A sequential labeling (IOB format) converter, corrector and evaluation package
emIOBUtils is a Python rewrite of CoreNLP's IOBUtils, which is written in Java. It takes any (possibly ill-formed) IOB span input and converts or corrects it according to the specified output style.
The program is useful for checking whether the input contains valid spans or is ill-formed. It can also reduce or refine the set of possible labels for a specific purpose.
The supported formats are the following: iob[12], ioe[12], bio/iob, io, sbieo/iobes, noprefix.
The sequence evaluation metrics follow the naming convention of scikit-learn and contain all metrics from the current state of seqeval, plus a few new ones. For more complex evaluation, we recommend using PyCM and scikit-learn.
On IOB formats/styles
The documentation of the original class explains the idea clearly:
A 4-way representation of all entities, like S-PERS, B-PERS, I-PERS, E-PERS for a single word, beginning, internal, and end of an entity (IOBES or SBIEO); always marking the first word of an entity (IOB2 or BIO); only marking specially the beginning of non-first items of an entity sequence with B-PERS (IOB1); the reverse IOE1 and IOE2; IO where everything is I-tagged; and NOPREFIX, where no prefixes are written on category labels. The last two representations are deficient in not allowing adjacent entities of the same class to be represented, but nevertheless convenient. Note that the background label (e.g. O) is never given a prefix. This code is very specific to the particular CoNLL way of labelling classes for IOB-style encoding, but this notation is quite widespread. It will work on any of these styles of input. It will also recognize BILOU/IOBE1 format (B=B, I=I, L=E, O=O, U=S=1).
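To make the difference between two of these styles concrete, the following is a minimal, self-contained sketch that converts a well-formed IOB2 (BIO) sequence to IOBES. It does not call emIOBUtils, and the function name `iob2_to_iobes` is hypothetical; the library's own conversion handles more styles and ill-formed input.

```python
def iob2_to_iobes(labels):
    """Convert a well-formed IOB2 (BIO) label sequence to IOBES.

    Illustrative sketch only, not the emIOBUtils implementation.
    """
    converted = []
    for i, label in enumerate(labels):
        if label == "O":
            converted.append(label)  # the background label never gets a prefix
            continue
        prefix, category = label.split("-", 1)
        next_label = labels[i + 1] if i + 1 < len(labels) else "O"
        # The entity continues only if the next label is I- of the same category
        continues = next_label == f"I-{category}"
        if prefix == "B":
            converted.append(f"B-{category}" if continues else f"S-{category}")
        else:  # prefix == "I"
            converted.append(f"I-{category}" if continues else f"E-{category}")
    return converted


labels = ["B-PERS", "I-PERS", "I-PERS", "O", "B-LOC", "B-LOC", "I-LOC"]
print(iob2_to_iobes(labels))
# ['B-PERS', 'I-PERS', 'E-PERS', 'O', 'S-LOC', 'B-LOC', 'E-LOC']
```

Note how the two adjacent LOC entities stay distinguishable: the first becomes a single-word S-LOC, the second a B-LOC ... E-LOC span. In the IO and NOPREFIX styles this boundary would be lost.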
Requirements
- Python 3 (tested with 3.6)
- Pip to install the additional requirements in requirements.txt
Install on a local machine
- Clone the repository:
git clone https://github.com/dlt-rilmta/emiobutils
sudo pip3 install dist/*.whl
- Use from Python
Usage
It is recommended to use the program as part of the e-magyar language processing framework.
If all input columns already exist, one can run python3 -m emiobutils with the unified xtsv CLI API.
Mandatory CLI arguments
To use this library as a standalone tool the following CLI arguments must be supplied:
- --input-field-name: the name of the column to be processed in the input TSV file
- --output-field-name: the name of the column to write the output to
- --output-style: the IOB format that the output must comply with
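As a rough illustration, the sketch below assembles a standalone invocation from the three mandatory arguments. The flag names and the module name are taken from this README; the column names NER-BIO and NER-IOBES are hypothetical, and the exact spelling accepted for the style value (here iobes) is an assumption. How the input and output files are passed is governed by the xtsv CLI and is deliberately left out.

```python
import shlex

# Flag names come from the README; all values are hypothetical examples.
command = [
    "python3", "-m", "emiobutils",
    "--input-field-name", "NER-BIO",      # hypothetical input column name
    "--output-field-name", "NER-IOBES",   # hypothetical output column name
    "--output-style", "iobes",            # assumed spelling of a supported style
]

# Quote each part so the line can be pasted into a shell safely
command_line = " ".join(shlex.quote(part) for part in command)
print(command_line)
```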
Available library functions
Conversion related:
- EmIOBUtils: The converter class
- EmIOBUtils.convert_format(): An alternative constructor for one-liner conversions
- EmIOBUtils.labels_to_entities(): An alternative constructor for a one-liner entity generator from an input label sequence
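The signatures of these functions are not documented here, so the following sketch does not call them. Instead it shows, in plain Python, one common way an ill-formed IOB2 sequence can be detected and repaired: an I- label that does not continue an entity of the same category is rewritten to B-. The helper name `repair_iob2` is hypothetical, and the correction rules that emIOBUtils actually applies may differ.

```python
def repair_iob2(labels):
    """Rewrite dangling I- labels to B- so the sequence is valid IOB2.

    Illustrative sketch only, not the emIOBUtils implementation.
    """
    repaired = []
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            category = label[2:]
            # An I- label is valid only right after B- or I- of the same category
            if previous not in (f"B-{category}", f"I-{category}"):
                label = f"B-{category}"
        repaired.append(label)
        previous = label
    return repaired


labels = ["I-PERS", "I-PERS", "O", "B-LOC", "I-ORG"]
repaired = repair_iob2(labels)
print(repaired)
# ['B-PERS', 'I-PERS', 'O', 'B-LOC', 'B-ORG']

# A sequence is well-formed IOB2 exactly when the repair changes nothing
print(repaired == labels)
# False
```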
Evaluation related:
- label_format_accuracy_score(): This score counts the ratio of good vs. misfit labels identified by the EmIOBUtils converter
- precision_recall_fscore_support(): Compute precision, recall, f_beta-score and support for entities
- f_score(): Compute f_beta-score for entities
- f1_score(): Same as f_score(), but beta is fixed to 1
- precision_score(): Compute precision for entities
- recall_score(): Compute recall for entities
- support(): Compute support for entities
- label_measures(): Compute true positive, false positive, false negative, true negative, accuracy and specificity for labels
- accuracy_score(): Compute accuracy for labels
- classification_error(): Compute classification error for labels
- specificity(): Compute specificity for labels
- performance_measure(): Compute confusion matrix (true positive, false positive, false negative, true negative in dict format)
- classification_report_vars(): Compute the classification report metrics (precision, recall, f1-score, support for each label, micro, macro and weighted average) and return them as variables
- classification_report(): Like classification_report_vars(), but returns a formatted text report
- test_metrics(): Simple tests for the above metrics
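The list distinguishes metrics "for entities" from metrics "for labels". The following self-contained sketch illustrates that difference on a single IOB2 sequence. It does not use the emIOBUtils functions (their exact signatures are not shown in this README); `extract_entities` and `entity_prf` are hypothetical helpers, and the input is assumed to be well-formed.

```python
def extract_entities(labels):
    """Return the set of (category, start, end) spans in a well-formed IOB2 sequence."""
    entities = set()
    start = category = None
    # The extra "O" is a sentinel that closes an entity ending at the last token
    for i, label in enumerate(list(labels) + ["O"]):
        continues = category is not None and label == f"I-{category}"
        if category is not None and not continues:
            entities.add((category, start, i - 1))
            start = category = None
        if label.startswith("B-"):
            start, category = i, label[2:]
    return entities


def entity_prf(y_true, y_pred, beta=1.0):
    """Entity-level precision, recall and f_beta: a span counts only on exact match."""
    gold, predicted = extract_entities(y_true), extract_entities(y_pred)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score


y_true = ["B-PERS", "I-PERS", "O", "B-LOC", "O"]
y_pred = ["B-PERS", "I-PERS", "O", "B-LOC", "I-LOC"]

# Label-level accuracy: 4 of the 5 labels agree
label_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(label_accuracy)
# 0.8

# Entity-level scores: the LOC span has the wrong end, so only PERS matches
print(entity_prf(y_true, y_pred))
# (0.5, 0.5, 0.5)
```

A single wrong label costs one fifth of the label accuracy but half of the entity-level score, which is why the two groups of metrics are reported separately.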
License
This program is licensed under the GPL 3.0 license.
Acknowledgement
The authors gratefully acknowledge the efforts of the CoreNLP developers in developing the algorithm and releasing their code under a free license.
We dedicate this library to everyone who has ever started to write such a converter on their own.