This repository will no longer be maintained; it will remain as it is. As a replacement, please use the new version of the e-magyar toolchain: https://github.com/dlt-rilmta/emtsv. The new version does not support GATE directly, but its modules communicate efficiently via simple TSV, which makes command-line use much more convenient, and it also provides a REST API.
hunlp-GATE
Sources for the Lang_Hungarian GATE plugin containing Hungarian processing resources (wrappers around already existing Hungarian NLP tools) developed by the Department of Language Technology at RIL-MTA.
Developers: Péter Kundráth, Márton Miháltz, Bálint Sass, Mátyás Gerőcs
Contents
The plugin contains the following GATE Processing Resources.
Firstly, the Lang_Hungarian plugin contains the e-magyar toolchain:
- emToken Hungarian Sentence Splitter and Tokenizer (QunToken)
- emMorph+emLem Hungarian Morphological Analyzer and Lemmatizer (HFST)
- emTag Hungarian POS Tagger and Lemmatizer (PurePOS in magyarlanc3.0)
- emDep Hungarian Dependency Parser (magyarlanc3.0)
- emCons Hungarian Constituency Parser (magyarlanc3.0)
- Preverb Identifier Tool
- emChunk Hungarian NP Chunker (HunTag3)
- emNer Hungarian Named Entity Recognizer (HunTag3)
- IOB2Annotation Converter Tool
Some older tools are also integrated:
- HunPos Hungarian POS Tagger (Linux)
- HunMorph Hungarian Morphological Analyzer (Linux, OS X)
- magyarlanc3.0 Hungarian Sentence Splitter and Tokenizer
- magyarlanc2.0 Hungarian Morphological Analyzer [KR code]
- magyarlanc2.0 Hungarian Morphological Analyzer And Guesser [MSD code]
XXX You will also find the following ready made applications in GATE Developer (to access, in the menu click File -> Ready Made Applications -> Hungarian, or right-click Applications in the GATE Resources tree):
- Magyarlánc Morphparse (Sentence Splitter and Tokenizer + Pos Tagger and Lemmatizer)
- Magyarlánc Depparse (Morphparse + Dependency Parser)
- NP chunking with HunTag3 and Magyarlánc MorphParse
XXX Please see this Wiki page for more information on what tools are expected to be integrated and their statuses.
Installing under GATE Developer
Requirements:
- 64-bit operating system
- 16GB RAM (8GB may be enough)
- 64-bit Java runtime (JRE or JDK) version 1.8 or later
- GATE Developer 8.0, 8.1 or 8.2. Note: do not use GATE 8.4. It is buggy: see #24. Get version 8.2 instead. Other GATE versions are not tested.
- When launching GATE Developer, request 4GB of heap space.
On Linux or OS X, please use the following command:
<your_gate_installation_directory>/bin/gate.sh -Xmx4g -Xms2g
On Windows, please set the _JAVA_OPTIONS environment variable to -Xmx4g -Xms2g, restart the computer, and then launch GATE Developer.
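Before launching, it can help to confirm that the Java runtime on the PATH is a 64-bit one. The following is a minimal sketch, assuming a HotSpot/OpenJDK-style runtime whose java -version banner contains the string "64-Bit". The helper name is_64bit_java_banner and the GATE_HOME environment variable are illustrative assumptions, not part of the plugin.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text on stdin looks like the
# version banner of a 64-bit JVM (assumes the banner contains "64-Bit",
# as HotSpot/OpenJDK builds print it).
is_64bit_java_banner() {
    grep -q '64-Bit'
}

# Only attempt the launch when GATE_HOME (an assumed variable pointing at
# your GATE installation directory) is set and java is on the PATH.
if [ -n "${GATE_HOME:-}" ] && command -v java >/dev/null 2>&1; then
    if java -version 2>&1 | is_64bit_java_banner; then
        "$GATE_HOME/bin/gate.sh" -Xmx4g -Xms2g
    else
        echo "A 64-bit Java runtime is required." >&2
    fi
fi
```

If GATE_HOME is not set, the script does nothing, so it is safe to source it just to reuse the helper.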
Method 1 (for users):
This is the default recommended install method for users. Only GATE Developer and internet access are required.
Follow these steps to install the plugin directly into GATE Developer using the ready-made online GATE plugin repository hosted at corpus.nytud.hu
(Note: the whole plugin complete with model files requires 1GB of space and may take a couple of minutes to download):
- Start GATE Developer.
- In the menu click: File / Manage CREOLE Plugins...
- Click on the "Configuration" tab.
- If you haven't already done so, set your User Plugin Directory (e.g. /home/username/My_GATE_plugins/).
- Click the "+" sign button to the right of the Plugin Repositories list to add a new repository.
- Enter:
  - Name: RIL-MTA
  - URL: http://corpus.nytud.hu/GATE/gate-update-site.xml
- Click OK.
- Click the "Apply All" button at the bottom.
- Click on the "Available to Install" tab. (If you have already installed the plugin earlier, check the "Available Updates" tab for a newer version.)
- You should now see Lang_Hungarian in the list of plugins available to install. Tick the checkbox to the left of its name in the "Install" column.
- Click on the "Apply All" button to install the plugin.
- You should now see Lang_Hungarian in the list of installed plugins on the "Installed Plugins" tab.
- Enable the "Load Now" checkbox for Lang_Hungarian and click "Apply All" to load the plugin. Several new PRs become available when you right-click "Processing Resources" in the left-hand side panel and select "New".
- (Skip this step on Windows.) Open a terminal and issue the sh xperm.sh command in the Lang_Hungarian directory under your GATE User Plugin Directory to add the necessary execute permissions.
- In order to use the HunTag3-based processing resources (emChunk, emNer), install the required Python environment. On Linux (Debian or Ubuntu), run Lang_Hungarian/resources/huntag3/setup_linux.sh (with superuser privileges). On Windows, see Lang_Hungarian/resources/huntag3/setup_windows.cmd.
Method 2 (for developers):
This method gives more control over the installation process; it uses a clone of this GitHub repository.
- Clone this git repository to your machine. (Optional: first build the plugin (see Building the Lang_Hungarian plugin), or just use the version already included in the repository.)
- Obtain all necessary resources not included in this repository by running complete.sh (on Linux) or obtain these resources one by one:
  - to use HFST morphological analyser, see the corresponding README.md about obtaining binaries
  - to use Magyarlánc, see the corresponding README.md about obtaining binaries
  - to use HunTag3 (emChunk, emNer):
    - Run Lang_Hungarian/resources/huntag3/setup_linux.sh (on Ubuntu or Debian Linux) to install required dependencies for HunTag3 (superuser privileges required).
    - See the README.md for obtaining trained models for HunTag3.
- Copy the whole directory Lang_Hungarian into your GATE user plugin directory (see Plugin command-line installation).
- Restart GATE Developer. You should now see Lang_Hungarian in the list of installed plugins. If it's not there, check if your user plugin directory is set (see steps 2-4 in Method 1 above).
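On Linux, the resource-fetching and copying steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: it is meant to be run from the root of your clone, it assumes complete.sh sits in that root, and it uses the make local_install target described under Plugin command-line installation for the copy step. The run helper, the install_from_clone function and the DRY_RUN switch are illustrative names, not part of the repository.

```shell
#!/bin/sh
# Sketch of the Method 2 steps, to be run from the root of your clone.
# Assumptions: complete.sh sits in the repository root, and the directory
# below is a placeholder that you must replace with your own.
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"
GATE_USER_PLUGINS_DIR="${GATE_USER_PLUGINS_DIR:-/your/gate/user/plugin/directory}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_from_clone() {
    # Fetch the resources missing from the repository, then copy the plugin.
    run sh complete.sh &&
    run make local_install GATE_USER_PLUGINS_DIR="$GATE_USER_PLUGINS_DIR"
}

install_from_clone
```

Set DRY_RUN=0 and a real GATE_USER_PLUGINS_DIR once the printed commands look right for your setup.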
Files
- Lang_Hungarian: directory tree for the Lang_Hungarian GATE plugin
- src: Java sources of the included Processing Resources. See Javadocs for details.
- resources: non-Java binaries, sources and resource files for the included tools
- hungarian.jar: plugin Java binaries in a jar file
- build.xml: use this to build the jar from sources using Apache Ant
- creole.xml: this tells GATE how to use hungarian.jar as a CREOLE plugin
- .classpath, .project: use these to import the project into the Eclipse Java IDE
- Makefile: use to rebuild, install etc. the plugin from the command line
Building the Lang_Hungarian plugin (for developers)
To build the GATE plugin from the Java sources (and add the necessary metadata), run make build.
A working GATE installation is necessary.
The GATE installation directory should be given to make as GATE_HOME:
make build GATE_HOME=/your/gate/installation/directory
This will create hungarian.jar in the directory Lang_Hungarian.
(A precompiled hungarian.jar is also accessible directly from the repository.)
Plugin command-line installation (for developers)
If you have rebuilt the plugin, it is also possible to install it to your GATE user plugin directory with the following command:
make local_install GATE_USER_PLUGINS_DIR=/your/gate/user/plugin/directory
This will copy the whole directory tree under Lang_Hungarian/ from this repository to your GATE user plugin directory. Alternatively, you can also make a symbolic link using the following command:
make link_devdir GATE_USER_PLUGINS_DIR=/your/gate/user/plugin/directory
Updating the GATE plugin repository (only for maintainers)
To update the GATE plugin repository hosted at http://corpus.nytud.hu/GATE, first be sure that you have a fully functional plugin (see Method 2), and then run make upload specifying your user name on corpus.nytud.hu:
make upload CORPUSUSER=yourusername
This will upload your local hungarian.jar, creole.xml and resources directory to the update server.
This enables users to use Method 1 for installation.
Using or embedding the Lang_Hungarian plugin as a client-server system (for power users)
The Lang_Hungarian GATE Processing Resources can be run not just from the GATE GUI (called GATE Developer) but also from the Linux command line using GATE Embedded technology.
The recommended method is to use the so-called gate-server, which is an optimized solution for running GATE Processing Resources.
Preparation
- A working GATE installation and a clone of this GitHub repository are needed.
- Obtain all necessary resources not included in this repository (see step 2. in Method 2 above).
Using the Lang_Hungarian plugin from the command line (for power users)
A second option for using the Lang_Hungarian GATE Processing Resources from the Linux command line is the simple method described here.
Preparation
- A working GATE installation and a clone of this GitHub repository are needed.
- Obtain all necessary resources not included in this repository (see step 2. in Method 2 above).
What is it?
This functionality, implemented in Pipeline.java, allows any combination of PRs in the Lang_Hungarian plugin to be run with arbitrary parameter settings.
How to use?
Just type:
make GATE_HOME=/your/gate/installation/directory pipeline
By default, texts/peldak.txt is used as the input file, but it can be changed using the PIPELINE_INPUT parameter, e.g. to the XML version of the default input file:
make GATE_HOME=/your/gate/installation/directory PIPELINE_INPUT=texts/peldak.xml pipeline
Configuration
The PRs to be run should be specified in a config file. Lines of this config file should contain either only the name of a PR:
hu.nytud.gate.parsers.MagyarlancDependencyParser
... or the name of a PR followed by parameters for this PR, given as parameterName parameterValue pairs in the following format:
hu.nytud.gate.parsers.MagyarlancDependencyParser addPosTags true addMorphFeatures true
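To make the line format concrete, here is a small sketch that splits one config line into the PR name and its parameter pairs. It is not how Pipeline.java reads the file; describe_config_line is a hypothetical helper, and it assumes that fields are separated by whitespace and that parameter values contain no spaces.

```shell
#!/bin/sh
# Hypothetical checker, not part of the plugin: assumes whitespace-separated
# fields and parameter values that contain no spaces.
# Prints the PR name and its parameters; fails if the line is empty or if a
# parameter name has no value.
describe_config_line() (
    set -f              # do not glob-expand fields while splitting
    set -- $1           # split the line into positional parameters
    [ "$#" -ge 1 ] || exit 1
    echo "PR: $1"
    shift
    while [ "$#" -gt 0 ]; do
        [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; exit 1; }
        echo "  $1 = $2"
        shift 2
    done
)

# The example line shown above: one PR name followed by two name/value pairs.
describe_config_line "hu.nytud.gate.parsers.MagyarlancDependencyParser addPosTags true addMorphFeatures true"
```

A check like this can catch a parameter name left without a value before a long pipeline run is started.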
The default config file is Lang_Hungarian/resources/pipeline/pipeline.config, which runs the full Lang_Hungarian plugin; it can be overridden using the CONFIG parameter:
make GATE_HOME=/your/gate/installation/directory CONFIG=/path/to/config/file pipeline
There are some ready-made config files in the Lang_Hungarian/resources/pipeline directory for some usage scenarios.
Others
For converting GATE-XML coming from e-magyar to TSV, use emconv.py.