Treebank format converter
Version 1.1
A Python module for converting bracket-parsed PPCHE-format treebanks to the Universal Dependencies framework. It is heavily based on existing NLTK packages.
The module is specifically configured to convert treebanks in the IcePaHC format, which is based on PPCME.
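As a rough illustration of what "bracket-parsed" input means, the sketch below reads a bracketed tree string into nested tuples using only the standard library. It is not the converter's own code (the module relies on NLTK for this), and the example tree is an invented, simplified one: real IcePaHC trees carry additional wrappers, sentence IDs and lemma information that this sketch does not handle. The names `parse_brackets` and `leaves` are hypothetical.

```python
def parse_brackets(text):
    """Parse one bracketed tree string into nested (label, children) tuples."""
    # Pad parentheses with spaces so a plain split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return (tag, word) pairs for the terminals, in sentence order."""
    label, children = node
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
        else:
            pairs.append((label, child))
    return pairs


# Invented, simplified example tree (not taken from the corpus).
tree = parse_brackets("(IP-MAT (NP-SBJ (PRO-N hann)) (VBDI kom))")
tagged_words = leaves(tree)
```

Here `tagged_words` pairs each word with its part-of-speech tag, which is the kind of information a dependency conversion starts from.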
The converter has been used to create two Icelandic UD treebanks, UD_Icelandic-IcePaHC and UD_Icelandic-Modern, and one Faroese UD treebank, UD_Faroese-FarPaHC.
Version 1.1 achieves a labeled attachment score (LAS) of 82.87.
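For readers unfamiliar with the metric, LAS is the percentage of tokens whose head and dependency relation both match the gold standard. The sketch below shows that standard definition on invented data. This README does not state how the reported score was measured (which gold standard or evaluation script), so the function name `attachment_scores` and the sample values are assumptions for illustration only.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    Both arguments are aligned lists of (head, deprel) pairs, one per token.
    UAS counts tokens with the correct head; LAS also requires the correct
    dependency relation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("no tokens to score")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_ok / total, 100.0 * both_ok / total


# Invented four-token example: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct and two of four tokens have both head and relation correct.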
Setup
Install all requirements by running:
pip install -r requirements.txt
Usage
The runnable scripts are in the scripts folder.
In all examples below, the --output flag is used to write to files in the /CoNLLU/ output folder. Without it, the result is printed to standard output.
Convert a single file or a directory of files:
convert.py -N -i path/to/corpus/file.psd --output --post_process
convert.py -N -i path/to/corpus/* --output --post_process
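The converted output follows the CoNLL-U format, in which each token is a line of ten tab-separated columns, comment lines start with `#`, and sentences are separated by blank lines. A minimal sketch for reading such output back into Python is shown below. The function `read_conllu` is a hypothetical helper, not part of this module, and the sample sentence is invented rather than actual converter output.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences = []
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # skip comment/metadata lines
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError("expected 10 columns, got %d" % len(columns))
        tokens.append(dict(zip(FIELDS, columns)))
    if tokens:
        sentences.append(tokens)
    return sentences


# Invented sample, not real converter output.
sample = "\n".join([
    "# text = Hann kom .",
    "\t".join(["1", "Hann", "hann", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "kom", "koma", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])
sentences = read_conllu(sample)
root_forms = [t["form"] for t in sentences[0] if t["deprel"] == "root"]
```

To inspect a converted file, the same function can be applied to the contents of a file in the output folder, read with UTF-8 encoding.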
The remaining commands require the input files to be placed in a folder within the corpora folder:
Convert a single tree in the treebank by sentence ID (only prints to standard output):
convert.py -C FOLDER_NAME -id SENTENCE_ID
Convert a single file in the treebank:
convert.py -C FOLDER_NAME -f FILE_NAME --output --post_process
Also included is a script that converts only the IcePaHC corpus (icepahc-v0.9), with pre- and post-processing:
convert_icepahc.py
Acknowledgements
This converter is part of the UniTree project for IcePaHC, funded by The Strategic Research and Development Programme for Language Technology, grant no. 180020-5301. Thanks are due to Örvar Kárason, whose previous work was used as a basis for the conversion.
This converter was improved as part of the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur (https://almannaromur.is/), is funded by the Icelandic Ministry of Education, Science and Culture.