Silver Data for Coreference Resolution in Ukrainian: Translation, Alignment, and Projection

This repository contains two resources, described below: the Ukrainian OntoNotes dataset and the Ukrainian WSC dataset.

Ukrainian OntoNotes Dataset

The experiments were conducted on OntoNotes 5.0 data. The corpus (LDC2013T19) is distributed by the Linguistic Data Consortium; registration is required to download it.

Preprocessing

The provided code expects data in jsonlines format, so some preprocessing is necessary.

  1. Extract the OntoNotes 5.0 archive. If it is in the repo's root directory:

     tar -xzvf ontonotes-release-5.0_LDC2013T19.tgz
     
    
  2. Switch to a Python 2.7 environment (one where the python command runs version 2.7). The CoNLL scripts need this to run correctly. To do it with conda:

     conda create -y --name py27 python=2.7 && conda activate py27
     
    
  3. Run the CoNLL data preparation scripts:

     sh preprocessing/get_conll_data.sh ontonotes-release-5.0 ontonotes-ua
     
    
  4. Download the CoNLL scorers and Stanford Parser:

     sh preprocessing/get_third_party.sh
     
    
  5. Prepare your environment. To do it with conda:

     conda create -y --name ua-coref-data python=3.7 openjdk perl
     conda activate ua-coref-data
     python -m pip install -r requirements.txt
     
    
  6. Build the corpus in jsonlines format:

     python preprocessing/convert_to_jsonlines.py ontonotes-ua/conll-2012/ --out-dir ontonotes-ua
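
Once step 6 has finished, a quick sanity check of the output can be useful. The sketch below only assumes that each non-empty line of a jsonlines file holds one JSON object; it does not assume any particular field names, and the helper names `read_jsonlines` and `summarize` are hypothetical, not part of this repository.

```python
import json
from pathlib import Path


def read_jsonlines(path):
    """Yield one parsed JSON object per non-empty line of a jsonlines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path):
    """Return the number of documents and the sorted set of top-level keys.

    Assumption: every line is a JSON object (a dict), as is usual for
    jsonlines corpora; the actual schema is not inspected here.
    """
    count = 0
    keys = set()
    for document in read_jsonlines(path):
        count += 1
        keys.update(document)  # adds the dict's top-level keys
    return count, sorted(keys)
```

Calling `summarize` on one of the files written to ontonotes-ua (the exact file names depend on the conversion script) gives the document count, which can be compared with the figures in the Statistics section.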
     
    

Building the silver Ukrainian version

Run the scripts to translate the sentences, align the mentions, and project the annotations from English to Ukrainian:

    python scripts/build_silver_data.py -train -dev -test

Processing the whole corpus may take a while because of how the MT model is currently invoked, so you may leave out the splits you do not need.

The machine translation model can be specified with the --translation_model flag. In our experiments, the Helsinki-NLP/opus-mt-en-uk model was used, and alignment is based on the cross-attention of the 0th head of the 1st layer. A different model may need a different head and layer for alignment.
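
To illustrate the idea of attention-based alignment and annotation projection, here is a minimal sketch. It assumes a cross-attention matrix for one head has already been extracted as an array of shape (target_len, source_len), aligns each Ukrainian token to the English token with the highest weight, and projects a mention as the smallest span covering the aligned tokens. Both the argmax rule and the covering-span rule are assumptions made for illustration, and the function names are hypothetical; the logic in scripts/build_silver_data.py may differ.

```python
import numpy as np


def align_from_attention(attention):
    """Map each target token index to the source token it attends to most.

    `attention` has shape (target_len, source_len); row t holds the
    cross-attention weights of target token t over the source tokens.
    Assumption: a plain argmax per row is used as the alignment rule.
    """
    attention = np.asarray(attention, dtype=float)
    return attention.argmax(axis=1).tolist()


def project_span(source_span, alignment):
    """Project an inclusive source span (start, end) onto the target side.

    Returns the smallest inclusive target span that covers every target
    token aligned to a source token inside the span, or None if no
    target token is aligned to it.
    """
    start, end = source_span
    targets = [t for t, s in enumerate(alignment) if start <= s <= end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy example: 4 target tokens attending over 4 source tokens.
toy_attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
toy_alignment = align_from_attention(toy_attention)
toy_projection = project_span((1, 2), toy_alignment)
```

In the toy example, source tokens 1 and 2 are reordered in the translation, yet the projected span still covers both of them, which is why the projection takes the minimum and maximum aligned positions instead of mapping the two endpoints separately.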

Statistics

The dataset contains:

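| Split | Documents | Sentences | Tokens    | Mentions | Clusters |
|-------|-----------|-----------|-----------|----------|----------|
| train | 2,802     | 75,187    | 1,158,965 | 161,010  | 35,025   |
| dev   | 343       | 9,603     | 146,210   | 20,168   | 4,533    |
| test  | 348       | 9,479     | 151,542   | 20,522   | 4,513    |
| TOTAL | 3,493     | 94,269    | 1,456,717 | 201,700  | 44,071   |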
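
The TOTAL row can be re-derived from the per-split figures. The short sketch below copies the numbers from the table above and sums them column by column; the variable names are illustrative only.

```python
# Per-split counts copied from the statistics table:
# (documents, sentences, tokens, mentions, clusters)
SPLIT_STATS = {
    "train": (2802, 75187, 1158965, 161010, 35025),
    "dev": (343, 9603, 146210, 20168, 4533),
    "test": (348, 9479, 151542, 20522, 4513),
}

# Sum each column across the three splits.
totals = tuple(sum(column) for column in zip(*SPLIT_STATS.values()))

# Average cluster size implied by the totals (mentions per cluster).
mentions_per_cluster = totals[3] / totals[4]
```

The column sums match the TOTAL row, and the totals imply between four and five mentions per cluster on average.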

Ukrainian WSC Dataset

wsc-ua contains manual translations of 263 Winograd schemas from the WSC dataset in csv and jsonlines formats.

Format

No equivalent translations were found for 22 original schemas, so they were excluded:

87-88, 217-218, 221-222, 231-232, 233-234, 237-238, 243-244, 245-246, 247-248, 274-275, 276-277
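
When comparing against the original WSC data, it can help to expand the excluded ranges into individual schema ids. The sketch below parses the list above as inclusive ranges; the helper name `expand_ranges` is hypothetical, and whether the ids are zero- or one-based in a given copy of the original dataset is not something this snippet can tell.

```python
# Excluded schema ranges, copied verbatim from the list above.
EXCLUDED = (
    "87-88, 217-218, 221-222, 231-232, 233-234, 237-238, "
    "243-244, 245-246, 247-248, 274-275, 276-277"
)


def expand_ranges(spec):
    """Expand a comma-separated list of inclusive 'a-b' ranges into sorted ids."""
    ids = set()
    for part in spec.split(","):
        first, last = (int(value) for value in part.strip().split("-"))
        ids.update(range(first, last + 1))
    return sorted(ids)


excluded_ids = expand_ranges(EXCLUDED)
```

The expansion yields 22 ids, matching the number of excluded schemas stated above; together with the 263 translated schemas this accounts for 285 schemas.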

Contributing

Data and code improvements are welcome. Please submit a pull request.

Citation

@inproceedings{kuchmiichuk-2023-silver,
    title = "Silver Data for Coreference Resolution in {U}krainian: Translation, Alignment, and Projection",
    author = "Kuchmiichuk, Pavlo",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.8",
    pages = "62--72",
    abstract = "Low-resource languages continue to present challenges for current NLP methods, and multilingual NLP is gaining attention in the research community. One of the main issues is the lack of sufficient high-quality annotated data for low-resource languages. In this paper, we show how labeled data for high-resource languages such as English can be used in low-resource NLP. We present two silver datasets for coreference resolution in Ukrainian, adapted from existing English data by manual translation and machine translation in combination with automatic alignment and annotation projection. The code is made publicly available.",
}

Contacts

pavlo.kuchmiichuk@rochester.edu