MGAD: Multilingual Generation of Analogy Datasets

Submitted to LREC2018

Description

We present a novel, minimally supervised method for generating word embedding evaluation datasets for a large number of languages. Our approach uses existing dependency treebanks and parsers to create language-specific syntactic analogy datasets that do not rely on translation or human annotation. As part of this work, we offer syntactic analogy datasets for three previously unexplored languages: Arabic, Hindi, and Russian. These can be found in the data/ subdirectory.

Usage

Before running extract.py, we recommend providing a feature template for generating syntactic analogies in-file. The following is a sample template written for Hindi:

NOUN|Number=Plur|Case=Nom   NOUN|Number=Sing|Case=Nom
VERB|Aspect=Perf|Gender=Masc|Number=Sing|VerbForm=Part  VERB|Case=Nom|VerbForm=Inf

The part-of-speech tags and morphological features used here are documented on the Universal Dependencies website.
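
To make the template format concrete, here is a minimal sketch of how one template line could be split into a pair of specifications, each a part-of-speech tag plus a feature dictionary. It is not taken from extract.py: the function names `parse_spec` and `parse_template_line` are hypothetical, and the sketch assumes the two specifications on a line are separated by whitespace.

```python
def parse_spec(spec):
    """Split 'NOUN|Number=Plur|Case=Nom' into a tag and a feature dict.

    Hypothetical helper, not part of extract.py.
    """
    upos, *feats = spec.split("|")
    return upos, dict(feat.split("=", 1) for feat in feats)


def parse_template_line(line):
    """Parse one template line into two (tag, features) specifications.

    Assumes the two specifications are separated by whitespace.
    """
    first, second = line.split()
    return parse_spec(first), parse_spec(second)


pair = parse_template_line("NOUN|Number=Plur|Case=Nom   NOUN|Number=Sing|Case=Nom")
print(pair)
```

Keeping the features in a dictionary makes it easy to compare a specification against the features of a treebank token later on.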

To run extract.py, enter the following command in a terminal, where the corpus is a UD treebank in CoNLL-U format:

cat corpus.conllu | python extract.py --all > output_file.txt
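
For readers unfamiliar with the input format, the sketch below shows one way to read CoNLL-U token lines and select the word forms that match a specification from the template. It is an illustration only, not the logic of extract.py: `iter_tokens` and `matches` are hypothetical names, the two-token sample is made up, and treating a match as "same tag, and every listed feature present with the same value" is an assumption.

```python
def iter_tokens(lines):
    """Yield (form, lemma, upos, feats) for each word line of CoNLL-U input."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        feats = cols[5]
        feat_dict = {} if feats == "_" else dict(
            feat.split("=", 1) for feat in feats.split("|")
        )
        yield cols[1], cols[2], cols[3], feat_dict


def matches(upos, feats, spec_upos, spec_feats):
    """Assumed rule: tag is equal and every template feature is present."""
    return upos == spec_upos and all(
        feats.get(key) == value for key, value in spec_feats.items()
    )


# Made-up sample; a real run would iterate over sys.stdin instead.
sample = [
    "# text = cats sleep",
    "1\tcats\tcat\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tsleep\tsleep\tVERB\tVBP\tMood=Ind|Number=Plur\t0\troot\t_\t_",
    "",
]
tokens = list(iter_tokens(sample))
hits = [form for form, lemma, upos, feats in tokens
        if matches(upos, feats, "NOUN", {"Number": "Plur"})]
print(hits)
```

The ten tab-separated columns follow the CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); only the form, lemma, tag, and features are used here.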