NLPdeLUX

This repo collects and documents NLP tools and resources for Luxembourgish. If you know of any that should be featured here, let us know.

NLP tools

LuxemBERT

Description: First Luxembourgish BERT model trained from scratch, with optimizations for several downstream tasks such as text classification, NER, or intent classification.

Link: huggingface.co/lothritz/LuxemBERT

spaCy

Description: Starting with version 2.2.2, spaCy has language support for Luxembourgish, including tokenization and POS tagging.

Link: github.com/explosion/spaCy

spellux

Description: Automatic text normalization tool for Luxembourgish (spelling correction, lemmatization). Currently in development for training and evaluation.

Link: github.com/questoph/spellux

Syllabifier-for-Luxembourgish

Description: Implementation of the Penn Phonetics Toolkit for Luxembourgish, developed by Peter Gilles. Allows the phonetic syllabification of transcribed words.

Link: github.com/PeterGilles/Syllabifier-for-Luxembourgish

Automatic voice recognition

wav2vec2-large-xls-r-LUXEMBOURGISH2

Description: A first experimental build of an automatic voice recognition system by Peter Gilles, trained on a custom data set (~8 hours of Luxembourgish audio and transcript data).

Link: https://huggingface.co/pgilles/wav2vec2-large-xls-r-LUXEMBOURGISH2

OCR

tesseract

Description: Starting with version 4.0, Tesseract has language support for Luxembourgish for Optical Character Recognition.

Link: github.com/tesseract-ocr/tesseract

Resources

Luxembourgish language resources

Description: Phonetic transcriptions of the lemma lists from spellchecker.lu and lod.lu

Link: github.com/PeterGilles/Luxembourgish-language-resources

Luxembourgish dictionaries

Description: HunSpell dictionary and MyThes thesaurus for the Luxembourgish language based on spellchecker.lu

Link: github.com/spellchecker-lu/dictionary-lb-lu

Luxembourgish word embedding

Description: A word embedding model trained on Luxembourgish user comments from the media platform RTL.lu. The training data comprises roughly 544k Luxembourgish texts published between December 2008 and December 2018.

Link: https://zenodo.org/record/3978066

Universal Dependencies

Description: Repository for Luxembourgish as part of the Universal Dependencies project, with POS-annotated data.

Link: github.com/UniversalDependencies/UD_Luxembourgish-LuxBank
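
Universal Dependencies treebanks are normally distributed as CoNLL-U files: one token per line, ten tab-separated columns, and a blank line between sentences. The sketch below shows one way to pull word forms and POS tags out of such data using only the Python standard library. The sample sentence and the function name `read_pos` are made up for illustration and are not taken from the LuxBank repository; the actual file names and annotation details should be checked there.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample in CoNLL-U layout (NOT taken from the treebank).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "# text = Ech schwätze Lëtzebuergesch.\n"
    "1\tEch\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tschwätze\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\tLëtzebuergesch\t_\tPROPN\t_\t_\t_\t_\t_\tSpaceAfter=No\n"
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_pos(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (FORM, UPOS) pairs."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skip them here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; ignore it in this sketch.
            continue
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append((cols[1], cols[3]))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


sentences = list(read_pos(SAMPLE.splitlines()))
for form, upos in sentences[0]:
    print(f"{form}\t{upos}")
```

To read a downloaded treebank file instead of the sample, an open file handle (opened with `encoding="utf-8"`) can be passed to `read_pos` in place of `SAMPLE.splitlines()`.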