awesome-ukrainian-nlp
Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
News
2024/01 -- UNLP 2024 shared task has been announced
1. Datasets / Corpora
Monolingual
- Malyuk — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News.
- Brown-UK — carefully curated corpus of modern Ukrainian language with disambiguated tokens, 1 million words
- UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
- Wikipedia
- OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. Ukrainian portion of it is 28GB deduplicated.
- CC-100 — documents extracted from Common Crawl, automatically classified and filtered. Ukrainian part is 200M sentences or 10GB of deduplicated text.
- mC4 — another filtered Common Crawl extract, 196GB of Ukrainian text.
- Ukrainian Twitter corpus - Ukrainian Twitter corpus for toxic text detection.
- Ukrainian forums — 250k sentences scraped from forums.
- Ukrainian news headlines — 5.2M news headlines.
Parallel
- OPUS
- Tatoeba MT Challenge data sets
- Polish-Ukrainian Parallel Corpus
- Back-translated monolingual Wiki data
- Wiki Edits — 5M sentence edits extracted from the Ukrainian Wikipedia revision history.
See Helsinki-NLP/UkrainianLT for more data and machine translation resources links.
Labeled
- ZNO — ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO).
- UA-GEC — grammatical error correction (GEC) and fluency corpus.
- NER-uk — Brown-UK labeled for named entities.
- Yakaboo Book Reviews — book reviews, ratings and descriptions.
- Universal Dependencies — dependency trees corpus.
- ua-news — 150k news articles in 5 categories.
- UA-SQuAD — Ukrainian version of Stanford Question Answering Dataset.
- Ukrainian Winograd schema challenge (WSC) Dataset — manually translated.
- Ukrainian OntoNotes Dataset — scripts to build large silver dataset for coreference resolution.
Dictionaries
- ВЕСУМ — POS tag dictionary. Can generate a list of all word forms valid for spelling.
- Tonal dictionary
- Multilingualsentiment, includes Ukrainian - a list of positive/negative words
- obscene-ukr — profanity dictionary
- Word stress dictionary — word stress for 2.7M word forms. See ukrainian-word-stress
- Heteronyms — words that share the same spelling but have different meaning/pronunciation.
- Abbreviations — map abbreviation to expansion
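
A mapping from abbreviation to expansion, like the one listed above, is typically used to normalize text before further processing. The snippet below is a minimal sketch of that idea. The sample entries and the `expand_abbreviations` helper are hypothetical, and the actual dictionary's file format may differ.

```python
import re

# Hypothetical sample entries; the real dictionary's format may differ.
ABBREVIATIONS = {
    "ЗНО": "зовнішнє незалежне оцінювання",
    "грн": "гривень",
}


def expand_abbreviations(text, mapping):
    """Replace whole-word abbreviations in `text` with their expansions."""
    if not mapping:
        return text
    # Try longer keys first so a longer abbreviation wins over its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # \b is Unicode-aware in Python 3, so it also works for Cyrillic words.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(expand_abbreviations("Квиток коштує 50 грн", ABBREVIATIONS))
```

Matching on word boundaries keeps words that merely contain an abbreviation (for example "ЗНОВУ") unchanged. Abbreviations that end with a period would need a different pattern.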
Prompts
- Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts.
2. Tools
- tree_stem — stemmer
- pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
- LanguageTool — grammar, style and spell checker
- Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
- nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
- NLP-Cube - Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing.
3. Pretrained models
Language models
Autoregressive:
- aya-101 — massively multilingual LM, 13B parameters
- pythia-uk — mT5 fine-tuned on wiki and oasst1 for chats in Ukrainian.
- UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
- XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
- Tereveni-AI/GPT-2
- uk4b and haloop inference toolkit - GPT-2 small, medium and large-style models trained on UberText 2.0 Wikipedia, news and books.
Masked:
- xlm-roberta-base-uk — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left.
- youscan/ukr-roberta-base
Mixed:
Machine translation
- Helsinki-NLP / OPUS-MT models — Ukrainian to/from 25 languages.
- M2M-100 — Ukrainian to/from 100 languages.
- Uk-En folktale corpus — small sentence-aligned corpus of fairy tales.
See Helsinki-NLP/UkrainianLT for more.
Sequence-to-sequence models
Named-entity recognition (NER)
Part-of-speech tagging (POS)
Word embeddings
- fastText
- Official fastText trained on CommonCrawl and Wiki — 157 languages, including Ukrainian.
- Older official fastText trained on Wiki — 294 languages, including Ukrainian.
- fastText_multilingual — 78 languages, aligned to the same vector space.
- fasttext_uk (2023) and cbow — trained on UberText 2.0
- Word2Vec
- GloVe
- LexVec
- BPEmb: Subword Embeddings, includes Ukrainian - easy to use with Flair
- Flair — Ukrainian added in 2022.
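
Several of the fastText releases above are distributed in a plain-text `.vec` format: a header line with the vocabulary size and dimensionality, then one word per line followed by its vector components. The sketch below parses that format using only the standard library and compares words by cosine similarity. The `load_vec` and `cosine` helpers and the toy vectors are illustrative assumptions, not part of any listed project.

```python
import io
import math


def load_vec(lines, limit=None):
    """Parse fastText text format: a 'count dim' header, then 'word v1 ... v_dim' rows."""
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data with made-up values, only to show the file layout.
SAMPLE = "3 2\nкіт 1.0 0.0\nкішка 0.9 0.1\nстіл 0.0 1.0\n"

vectors = load_vec(io.StringIO(SAMPLE))
print(cosine(vectors["кіт"], vectors["кішка"]) > cosine(vectors["кіт"], vectors["стіл"]))
```

With a real file, the same function can be given an open file handle, for example `load_vec(open(path, encoding="utf-8"), limit=100_000)`, where `path` points to a decompressed `.vec` file. The `limit` argument keeps memory use manageable, since plain Python lists are far less compact than the arrays a dedicated library would use.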
Other
- uk-punctcase — punctuation and case restoration model based on XLM-RoBERTa-Uk.
- punctuation_uk_bert — another punctuation and case restoration model based on bert-base-multilingual-cased.
- ukrainian-word-stress — adds word stress.
4. Paid
- LORELEI Ukrainian Representative Language Pack - Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities
5. Other resources and links
- Helsinki-NLP/UkrainianLT — another collection of links to Ukrainian language tools.
- egorsmkv / speech-recognition-uk — speech recognition and text-to-speech models and datasets
6. Workshops and conferences
- Ukrainian Natural Language Processing Workshop
- UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian
- UNLP 2024 shared task — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian