# Concraft-pl 2.0
This repository provides Concraft-pl, a morphosyntactic tagger for the Polish language based on conditional random fields [1,2]. The tool is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG).
This is the new, 2.0 version of Concraft-pl. The previous version, now obsolete, can be found at https://github.com/kawu/concraft-pl/tree/maca.
For now, the tagger doesn't provide any lemmatisation capabilities. As a result, it may output multiple interpretations (all related to the same morphosyntactic tag, but with different lemmas) for some known words, while for out-of-vocabulary words it simply outputs orthographic forms as lemmas.
## Installation
First you will need to download and install the Haskell Tool Stack. Then use the following script:
```
git clone https://github.com/kawu/concraft-pl.git
cd concraft-pl
stack install
```
### Known installation issues
- Ubuntu: if installation fails with the message that the `tinfo` library is missing, install the `libtinfo-dev` package (`sudo apt install libtinfo-dev`) and then run `stack install` again in the cloned repository.
## Data format
Concraft-pl works with tab-separated values (`.tsv`) files, with the individual paragraphs followed by blank lines. Each non-blank line corresponds to an edge in the paragraph DAG and contains the following 11 columns:
- ID of the start node
- ID of the end node
- word form
- base form (lemma)
- morphosyntactic tag
- commonness (common word, named entity)
- qualifiers
- probability of the edge
- interpretation-related meta information
- end-of-sentence (eos) marker
- segment-related meta information
For the moment, the tool ignores (i.e. rewrites) the values of commonness, qualifiers, and meta-information (both interpretation- and segment-related), but we plan to exploit them in the future.
An example of a file following the above specification can be found in `example/test.dag`.
## Training
The `train` command can be used to train the model based on a given `.dag` file. The following example relies on the files available in the `example` directory.
```
concraft-pl train train.dag -c config.dhall --tagsetpath=tagset.cfg -e test.dag -o model.gz
```
where:

- `train.dag` is the training file, based on which the model parameters are estimated
- `test.dag` is the evaluation file (optional; allows tracking tagging quality during training)
- `config.dhall` is the general configuration (e.g., disambiguation tiers)
- `tagset.cfg` is the tagset configuration
- `model.gz` is the output model (optional)
Run `concraft-pl train --help` to learn more about the program arguments and possible training options.
## Pre-trained models
A model pre-trained on the National Corpus of Polish can be downloaded from the homepage (see the Downloads section). This model is compatible with the current version of Morfeusz SGJP, which should be used for the morphosyntactic analysis preceding tagging.
## Runtime options
Consider using runtime system options. You can speed up processing on multiple cores with the `-N` option. The `-s` option will produce runtime statistics, such as the time spent in the garbage collector. If the program is spending too much time collecting garbage, you can try to increase the allocation area size with the `-A` option.
For example, to train the model using four threads and 256M allocation area size, run:
```
concraft-pl train train.dag -c config.dhall --tagsetpath=tagset.cfg -e test.dag -o model.gz +RTS -N4 -A256M -s
```
## Probabilities
During the process of training, you may encounter a warning like this one:
```
===== Train sentence segmentation model =====
Discarded 49/18484 elements from the training dataset
```
This means that some of the graphs (paragraphs, sentences) in the training dataset are either ill-formed (e.g. have cycles) or have incorrectly assigned probabilities. You can use the following command to identify such graphs:
```
concraft-pl check -j tagset.cfg train.dag
```
The probabilities assigned to the individual interpretations in the DAG should follow certain rules. Let `in(v)` be the sum of the probabilities assigned to the arcs incoming to `v` and `out(v)` be the sum of the probabilities assigned to the arcs outgoing from `v`. Let us also assume that:

- `in(s) = 1` for each source node `s` (with no incoming arcs)
- `out(t) = 1` for each target node `t` (with no outgoing arcs)

Then, the following constraint must be satisfied for any node `v` in the DAG:

```
in(v) = out(v)
```
For instance, the following DAG (which contains four different paths, each with probability 0.25) is structured properly:
```
0	1	co	co:s	subst	0.25
0	1	co	co:c	comp	0.25
0	2	coś	coś:s	subst	0.25
0	2	coś	coś:q	part	0.25
1	2	ś	być	aglt	0.5
2	3	jadł	jeść	praet	1.0
```
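As a sanity check, the `in(v) = out(v)` constraint can be verified directly. The sketch below, which assumes edges given as `(start, end, probability)` triples, is similar in spirit to what `concraft-pl check` does (the exact checks performed by the tool may differ):

```python
# Sketch: verify that in(v) = out(v) for every node of a DAG, given edges
# as (start, end, probability) triples. Following the assumptions above,
# a source node satisfies in(s) = 1 and the target node out(t) = 1.
from collections import defaultdict

def check_flow(edges, tolerance=1e-6):
    incoming, outgoing = defaultdict(float), defaultdict(float)
    for start, end, prob in edges:
        outgoing[start] += prob
        incoming[end] += prob
    for v in sorted(set(incoming) | set(outgoing)):
        in_v = incoming.get(v, 1.0)   # source nodes: in(s) = 1 by assumption
        out_v = outgoing.get(v, 1.0)  # target node: out(t) = 1 by assumption
        if abs(in_v - out_v) > tolerance:
            print(f"node {v}: in = {in_v}, out = {out_v}")

# The example DAG above: four paths, each with probability 0.25.
check_flow([(0, 1, 0.25), (0, 1, 0.25), (0, 2, 0.25),
            (0, 2, 0.25), (1, 2, 0.5), (2, 3, 1.0)])
```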
## Tagging
Once you have a Concraft-pl model, you can use the following command to tag:
```
concraft-pl tag model.gz -i input.dag -o output.dag
```
Run `concraft-pl tag --help` to learn more about the possible tagging options.
### Blacklist
You can provide a list of blacklisted tags using the `-b` (`--blackfile`) option. Blacklisted tags are guaranteed not to be selected by the guesser. The blacklisted tags provided on input (i.e., resulting from morphosyntactic analysis) can still be selected by the disambiguation module, though. The list of blacklisted tags should be provided in a separate file, one tag per line.
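For instance, a blacklist file could look as follows (the tags below are merely illustrative; list whichever tags from your tagset should never be guessed):

```
interj
burk
```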
### Marginals and performance considerations
By default, Concraft-pl outputs the marginal probabilities of the individual interpretations on top of the standard `disamb` markers. Calculating marginals, however, is more computationally intensive than determining those markers. If you wish to speed up tagging and you don't care about the disambiguation-related probabilities, you can use the `-p guess` option. With this option, Concraft-pl outputs the marginal probabilities originating from the guessing model instead.
## Server
Concraft-pl also provides a client/server mode. It is handy when, for example, you need to tag a large collection of small files, since loading a Concraft-pl model from disk takes a considerable amount of time.
To start the Concraft-pl server on port `3000`, run:
```
concraft-pl server --port=3000 -i model.gz
```
To use the server in a multi-threaded environment, you need to specify the `-N` RTS option. A set of options which yields good server performance is presented in the following example:
```
concraft-pl server --port=3000 -i model.gz +RTS -N -A64M
```
The `-A<size>` option specifies the allocation area size of the garbage collector. You can increase its value (e.g. `-A256M`), which may further improve performance, but at the cost of higher memory consumption.
Run `concraft-pl server --help` to learn more about possible server-mode options.
## Haskell Client
The client mode works just like the tagging mode. The difference is that, instead of supplying the client with a model, you need to specify the server:
```
concraft-pl client -s "http://localhost" --port=3000 -i input.dag -o output.dag
```
**NOTE**: you can use `stdin` and `stdout` instead of the `-i` and `-o` options, respectively.
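For example, following the note above, the client can be run in a pipeline:

```
concraft-pl client -s "http://localhost" --port=3000 < input.dag > output.dag
```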
Run `concraft-pl client --help` to learn more about possible client-mode options.
## Python Client
Python client code is also provided, which allows communicating with the Concraft-pl server directly from Python. Clients in other programming languages can be written in a similar manner.
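The bundled client code defines the actual request format. As a rough, hand-written illustration only (the `/parse` endpoint path and the JSON payload shape below are assumptions of this sketch, not the documented API; consult the bundled client for the real request format), communicating with a running server might look like this:

```python
# Rough illustration of talking to a running Concraft-pl server over HTTP.
# NOTE: the endpoint path ("/parse") and the payload shape are assumptions
# made for this sketch; consult the bundled Python client code for the
# actual request format expected by the server.
import requests

def tag_dag(dag_text: str, host: str = "http://localhost", port: int = 3000) -> str:
    response = requests.post(
        f"{host}:{port}/parse",        # hypothetical endpoint
        json={"dag": dag_text},        # hypothetical payload shape
        timeout=60,
    )
    response.raise_for_status()
    return response.text

with open("input.dag", encoding="utf-8") as handle:
    print(tag_dag(handle.read()))
```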
## References
[1] Jakub Waszczuk. Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2789–2804, Mumbai, India, 2012.
[2] Jakub Waszczuk, Witold Kieraś, and Marcin Woliński. Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech, and Dialogue: 21st International Conference, TSD 2018, Brno, Czech Republic, September 11-14, 2018, Proceedings. Springer, 2018.