Neural Network Models for Joint POS Tagging and Dependency Parsing

<img width="750" alt="jptdpv2" src="https://user-images.githubusercontent.com/2412555/48745055-ef25b500-ecbd-11e8-8f83-7160e42e61f7.png">

Implementations of joint models for POS tagging and dependency parsing, as described in my papers:

  1. Dat Quoc Nguyen and Karin Verspoor. 2018. An improved neural network model for joint POS tagging and dependency parsing. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 81-91. [.bib] (jPTDP v2.0)
  2. Dat Quoc Nguyen, Mark Dras and Mark Johnson. 2017. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 134-142. [.bib] (jPTDP v1.0)

This GitHub project currently supports jPTDP v2.0, while v1.0 can be found in the release section. Please cite paper [1] when jPTDP is used to produce published results or is incorporated into other software. I would greatly appreciate your bug reports, comments and suggestions about jPTDP. As a free open-source implementation, jPTDP is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Installation

jPTDP requires the following software packages:

Once you have installed the prerequisite packages above, you can clone or download (and then unzip) jPTDP. The next sections show how to train a new joint model for POS tagging and dependency parsing, and then how to use a pre-trained model.

NOTE: jPTDP has also been ported to Python 3.4+ by Santiago Castro. The pre-trained models provided in the last section will not work with this ported version (see a discussion), so you may want to retrain jPTDP if you use the port.

Train a joint model

Below, SOURCE_DIR denotes the source code directory. The training and development files must follow the 10-column data format, as do the files train.conllu and dev.conllu in folder SOURCE_DIR/sample and the treebanks in the Universal Dependencies (UD) project. For training, jPTDP uses only columns 1 (ID), 2 (FORM), 4 (coarse-grained POS tags, UPOSTAG), 7 (HEAD) and 8 (DEPREL).

To train a joint model for POS tagging and dependency parsing, run:
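To make the column selection concrete, here is a minimal Python sketch that reads 10-column lines and keeps only the five columns listed above. It is not part of jPTDP: the names `Token` and `read_sentences` are hypothetical, and skipping comment lines and non-integer IDs is an assumption about how such files are usually handled, not a description of jPTDP's own reader. It also assumes HEAD is always an integer, which holds for annotated training data but not for unannotated input.

```python
from collections import namedtuple

# Hypothetical container for the five columns used in training.
Token = namedtuple("Token", ["id", "form", "upos", "head", "deprel"])


def read_sentences(lines):
    """Yield one list of Token per sentence from 10-column lines."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry no token information (assumption).
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns, got %d" % len(cols))
        if not cols[0].isdigit():
            # Skip IDs such as 1-2 or 1.1 (assumption, not jPTDP behavior).
            continue
        # Columns 1, 2, 4, 7 and 8 are at zero-based indices 0, 1, 3, 6 and 7.
        sentence.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


sample = [
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n",
    "2\tcat\tcat\tNOUN\tNN\t_\t3\tnsubj\t_\t_\n",
    "3\tsleeps\tsleep\tVERB\tVBZ\t_\t0\troot\t_\t_\n",
    "\n",
]
sentences = list(read_sentences(sample))
print(sentences[0][1].form, sentences[0][1].head)
```

With a real file, the same function could be called as `read_sentences(open("sample/train.conllu"))`.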

SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 [--dynet-mem <int>] [--epochs <int>] [--lstmdims <int>] [--lstmlayers <int>] [--hidden <int>] [--wembedding <int>] [--cembedding <int>] [--pembedding <int>] [--prevectors <path-to-pre-trained-word-embedding-file>] [--model <String>] [--params <String>] --outdir <path-to-output-directory> --train <path-to-train-file>  --dev <path-to-dev-file>

where hyper-parameters in [] are optional:

For example:

SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 --dynet-mem 1000 --epochs 30 --lstmdims 128 --lstmlayers 2 --hidden 100 --wembedding 100 --cembedding 50 --pembedding 100  --model trialmodel --params trialmodel.params --outdir sample/ --train sample/train.conllu --dev sample/dev.conllu

will produce model files trialmodel and trialmodel.params in folder SOURCE_DIR/sample.
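If you launch training from a script, the command line can be assembled programmatically. The following is only a sketch: `build_train_command` and `OPTIONAL_FLAGS` are hypothetical names, and the function uses only the flags shown in the usage line above. It builds the argument list without running anything.

```python
# Optional flags, in the order they appear in the usage line above.
OPTIONAL_FLAGS = (
    "dynet-mem", "epochs", "lstmdims", "lstmlayers", "hidden",
    "wembedding", "cembedding", "pembedding", "prevectors",
    "model", "params",
)


def build_train_command(train, dev, outdir, seed=123456789, **options):
    """Return the argument list for a jPTDP training run (hypothetical helper)."""
    # Allow Python-style keyword names such as dynet_mem for --dynet-mem.
    normalized = {name.replace("_", "-"): value for name, value in options.items()}
    unknown = sorted(set(normalized) - set(OPTIONAL_FLAGS))
    if unknown:
        raise ValueError("unsupported options: " + ", ".join(unknown))
    command = ["python", "jPTDP.py", "--dynet-seed", str(seed)]
    for flag in OPTIONAL_FLAGS:
        if flag in normalized:
            command += ["--" + flag, str(normalized[flag])]
    command += ["--outdir", outdir, "--train", train, "--dev", dev]
    return command


command = build_train_command(
    "sample/train.conllu", "sample/dev.conllu", "sample/",
    dynet_mem=1000, epochs=30, model="trialmodel", params="trialmodel.params",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.call(command, cwd=SOURCE_DIR)`, where `SOURCE_DIR` is the path to your own checkout.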

If you would like to use the fine-grained language-specific POS tags in the 5th column instead of the coarse-grained POS tags in the 4th column, you should use swapper.py in folder SOURCE_DIR/utils to swap contents in the 4th and 5th columns:

SOURCE_DIR$ python utils/swapper.py <path-to-train-(and-dev)-file>

For example:

SOURCE_DIR$ python utils/swapper.py sample/train.conllu
SOURCE_DIR$ python utils/swapper.py sample/dev.conllu

will generate two new files for training: train.conllu.ux2xu and dev.conllu.ux2xu in folder SOURCE_DIR/sample.
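For readers who want to see what the swap amounts to, here is a minimal sketch of exchanging the 4th and 5th columns of a 10-column line. It is an illustration only, not the code of utils/swapper.py; the function name `swap_pos_columns` is hypothetical, and passing non-token lines through unchanged is an assumption.

```python
def swap_pos_columns(line):
    """Return the line with its 4th and 5th tab-separated columns exchanged."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        # Blank lines and comments are passed through unchanged (assumption).
        return line
    # Zero-based indices 3 and 4 hold the coarse- and fine-grained POS tags.
    cols[3], cols[4] = cols[4], cols[3]
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


original = "2\tcat\tcat\tNOUN\tNN\t_\t3\tnsubj\t_\t_\n"
swapped = swap_pos_columns(original)
print(swapped.split("\t")[3:5])
```

Applying the function twice restores the original line, which is a quick way to check that nothing else in the row was altered.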

Utilize a pre-trained model

Suppose you want to use a pre-trained model to annotate a corpus in which each line is a tokenized/word-segmented sentence. Use converter.py in folder SOURCE_DIR/utils to convert this corpus to the 10-column data format:

SOURCE_DIR$ python utils/converter.py <file-path>

For example:

SOURCE_DIR$ python utils/converter.py sample/test

will generate in folder SOURCE_DIR/sample a file named test.conllu which can be used later as input to the pre-trained model.
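The conversion can be pictured with the following sketch, which turns one whitespace-tokenized sentence per line into 10-column rows. It is not the code of utils/converter.py: the function names are hypothetical, and filling the eight unannotated columns with `_` is an assumption about the placeholder convention.

```python
def sentence_to_rows(sentence):
    """Turn one tokenized sentence into a list of 10-column rows."""
    rows = []
    for index, form in enumerate(sentence.split(), start=1):
        # ID and FORM are known; the other eight columns are placeholders
        # (using "_" here is an assumption).
        rows.append("\t".join([str(index), form] + ["_"] * 8))
    return rows


def convert(lines):
    """Convert an iterable of sentences into 10-column text."""
    out = []
    for line in lines:
        rows = sentence_to_rows(line)
        if rows:
            out.extend(rows)
            # A blank line separates consecutive sentences.
            out.append("")
    if not out:
        return ""
    return "\n".join(out) + "\n"


converted = convert(["The cat sleeps .\n", "It purrs .\n"])
print(converted)
```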

To use a pre-trained model for POS tagging and dependency parsing, run:

SOURCE_DIR$ python jPTDP.py --predict --model <path-to-model-parameters-file> --params <path-to-model-hyper-parameters-file> --test <path-to-10-column-input-file> --outdir <path-to-output-directory> --output <String>

For example:

SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/test.conllu --outdir sample/ --output test.conllu.pred
SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/dev.conllu --outdir sample/ --output dev.conllu.pred

will produce output files test.conllu.pred and dev.conllu.pred in folder SOURCE_DIR/sample.
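To get a rough idea of how well the predictions in dev.conllu.pred match the gold annotations in dev.conllu, you can compare the two files token by token. The sketch below computes POS accuracy, UAS (correct heads) and LAS (correct heads and labels) over all tokens. The helper names are hypothetical, and this simple count may differ from the evaluation used in paper [1], for example in how punctuation or multiword tokens are treated.

```python
def token_rows(lines):
    """Keep only token rows: 10 columns with an integer ID."""
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            rows.append(cols)
    return rows


def score(gold_lines, pred_lines):
    """Return POS accuracy, UAS and LAS as percentages over all tokens."""
    gold = token_rows(gold_lines)
    pred = token_rows(pred_lines)
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted files must contain the same tokens")
    pos = uas = las = 0
    for g, p in zip(gold, pred):
        pos += int(g[3] == p[3])          # column 4: POS tag
        same_head = g[6] == p[6]          # column 7: HEAD
        uas += int(same_head)
        las += int(same_head and g[7] == p[7])  # column 8: DEPREL
    total = float(len(gold))
    return {"POS": 100 * pos / total, "UAS": 100 * uas / total, "LAS": 100 * las / total}


def row(idx, form, upos, head, deprel):
    """Build one 10-column row for the toy example below."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


gold_example = [row(1, "The", "DET", 2, "det"), row(2, "cat", "NOUN", 3, "nsubj"),
                row(3, "sleeps", "VERB", 0, "root")]
pred_example = [row(1, "The", "DET", 2, "det"), row(2, "cat", "NOUN", 3, "obj"),
                row(3, "sleeps", "VERB", 1, "root")]
result = score(gold_example, pred_example)
print(result)
```

For real files, pass open file objects instead of the toy lists, for example `score(open("sample/dev.conllu"), open("sample/dev.conllu.pred"))`.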

Pre-trained models

Pre-trained jPTDP v2.0 models, which were trained on the English WSJ Penn treebank, GENIA and UD v2.2 treebanks, can be found HERE. Results on test sets (as detailed in paper [1]) are as follows:

| Treebank | Model name | POS | UAS | LAS |
|---|---|---|---|---|
| English WSJ Penn treebank | model256 | 97.97 | 94.51 | 92.87 |
| English WSJ Penn treebank | model | 97.88 | 94.25 | 92.58 |

model256 and model denote the pre-trained models which use 256- and 128-dimensional LSTM hidden states, respectively, i.e. model256 is more accurate but slower.

| Treebank | Code | UPOS | UAS | LAS |
|---|---|---|---|---|
| UD_Afrikaans-AfriBooms | af_afribooms | 95.73 | 82.57 | 78.89 |
| UD_Ancient_Greek-PROIEL | grc_proiel | 96.05 | 77.57 | 72.84 |
| UD_Ancient_Greek-Perseus | grc_perseus | 88.95 | 65.09 | 58.35 |
| UD_Arabic-PADT | ar_padt | 96.33 | 86.08 | 80.97 |
| UD_Basque-BDT | eu_bdt | 93.62 | 79.86 | 75.07 |
| UD_Bulgarian-BTB | bg_btb | 98.07 | 91.47 | 87.69 |
| UD_Catalan-AnCora | ca_ancora | 98.46 | 90.78 | 88.40 |
| UD_Chinese-GSD | zh_gsd | 93.26 | 82.50 | 77.51 |
| UD_Croatian-SET | hr_set | 97.42 | 88.74 | 83.62 |
| UD_Czech-CAC | cs_cac | 98.87 | 89.85 | 87.13 |
| UD_Czech-FicTree | cs_fictree | 97.98 | 88.94 | 85.64 |
| UD_Czech-PDT | cs_pdt | 98.74 | 89.64 | 87.04 |
| UD_Czech-PUD | cs_pud | 96.71 | 87.62 | 82.28 |
| UD_Danish-DDT | da_ddt | 96.18 | 82.17 | 78.88 |
| UD_Dutch-Alpino | nl_alpino | 95.62 | 86.34 | 82.37 |
| UD_Dutch-LassySmall | nl_lassysmall | 95.21 | 86.46 | 82.14 |
| UD_English-EWT | en_ewt | 95.48 | 87.55 | 84.71 |
| UD_English-GUM | en_gum | 94.10 | 84.88 | 80.45 |
| UD_English-LinES | en_lines | 95.55 | 80.34 | 75.40 |
| UD_English-PUD | en_pud | 95.25 | 87.49 | 84.25 |
| UD_Estonian-EDT | et_edt | 96.87 | 85.45 | 82.13 |
| UD_Finnish-FTB | fi_ftb | 94.53 | 86.10 | 82.45 |
| UD_Finnish-PUD | fi_pud | 96.44 | 87.54 | 84.60 |
| UD_Finnish-TDT | fi_tdt | 96.12 | 86.07 | 82.92 |
| UD_French-GSD | fr_gsd | 97.11 | 89.45 | 86.43 |
| UD_French-Sequoia | fr_sequoia | 97.92 | 89.71 | 87.43 |
| UD_French-Spoken | fr_spoken | 94.25 | 79.80 | 73.45 |
| UD_Galician-CTG | gl_ctg | 97.12 | 85.09 | 81.93 |
| UD_Galician-TreeGal | gl_treegal | 93.66 | 77.71 | 71.63 |
| UD_German-GSD | de_gsd | 94.07 | 81.45 | 76.68 |
| UD_Gothic-PROIEL | got_proiel | 93.45 | 79.80 | 71.85 |
| UD_Greek-GDT | el_gdt | 96.59 | 87.52 | 84.64 |
| UD_Hebrew-HTB | he_htb | 96.24 | 87.65 | 82.64 |
| UD_Hindi-HDTB | hi_hdtb | 96.94 | 93.25 | 89.83 |
| UD_Hungarian-Szeged | hu_szeged | 92.07 | 76.18 | 69.75 |
| UD_Indonesian-GSD | id_gsd | 93.29 | 84.64 | 77.71 |
| UD_Irish-IDT | ga_idt | 89.74 | 75.72 | 65.78 |
| UD_Italian-ISDT | it_isdt | 98.01 | 92.33 | 90.20 |
| UD_Italian-PoSTWITA | it_postwita | 95.41 | 84.20 | 79.11 |
| UD_Japanese-GSD | ja_gsd | 97.27 | 94.21 | 92.02 |
| UD_Japanese-Modern | ja_modern | 70.53 | 66.88 | 49.51 |
| UD_Korean-GSD | ko_gsd | 93.35 | 81.32 | 76.58 |
| UD_Korean-Kaist | ko_kaist | 93.53 | 83.59 | 80.74 |
| UD_Latin-ITTB | la_ittb | 98.12 | 82.99 | 79.96 |
| UD_Latin-PROIEL | la_proiel | 95.54 | 74.95 | 69.76 |
| UD_Latin-Perseus | la_perseus | 82.36 | 57.21 | 46.28 |
| UD_Latvian-LVTB | lv_lvtb | 93.53 | 81.06 | 76.13 |
| UD_North_Sami-Giella | sme_giella | 87.48 | 65.79 | 58.09 |
| UD_Norwegian-Bokmaal | no_bokmaal | 97.73 | 89.83 | 87.57 |
| UD_Norwegian-Nynorsk | no_nynorsk | 97.33 | 89.73 | 87.29 |
| UD_Norwegian-NynorskLIA | no_nynorsklia | 85.22 | 64.14 | 54.31 |
| UD_Old_Church_Slavonic-PROIEL | cu_proiel | 93.69 | 80.59 | 73.93 |
| UD_Old_French-SRCMF | fro_srcmf | 95.12 | 86.65 | 81.15 |
| UD_Persian-Seraji | fa_seraji | 96.66 | 88.07 | 84.07 |
| UD_Polish-LFG | pl_lfg | 98.22 | 95.29 | 93.10 |
| UD_Polish-SZ | pl_sz | 97.05 | 90.98 | 87.66 |
| UD_Portuguese-Bosque | pt_bosque | 96.76 | 88.67 | 85.71 |
| UD_Romanian-RRT | ro_rrt | 97.43 | 88.74 | 83.54 |
| UD_Russian-SynTagRus | ru_syntagrus | 98.51 | 91.00 | 88.91 |
| UD_Russian-Taiga | ru_taiga | 85.49 | 65.52 | 56.33 |
| UD_Serbian-SET | sr_set | 97.40 | 89.32 | 85.03 |
| UD_Slovak-SNK | sk_snk | 95.18 | 85.88 | 81.89 |
| UD_Slovenian-SSJ | sl_ssj | 97.79 | 88.26 | 86.10 |
| UD_Slovenian-SST | sl_sst | 89.50 | 66.14 | 58.13 |
| UD_Spanish-AnCora | es_ancora | 98.57 | 90.30 | 87.98 |
| UD_Swedish-LinES | sv_lines | 95.51 | 83.60 | 78.97 |
| UD_Swedish-PUD | sv_pud | 92.10 | 79.53 | 74.53 |
| UD_Swedish-Talbanken | sv_talbanken | 96.55 | 86.53 | 83.01 |
| UD_Turkish-IMST | tr_imst | 92.93 | 70.53 | 62.55 |
| UD_Ukrainian-IU | uk_iu | 95.24 | 83.47 | 79.38 |
| UD_Urdu-UDTB | ur_udtb | 93.35 | 86.74 | 80.44 |
| UD_Uyghur-UDT | ug_udt | 87.63 | 76.14 | 63.37 |
| UD_Vietnamese-VTB | vi_vtb | 87.63 | 67.72 | 58.27 |