NLPdeLUX
This repo collects and documents NLP tools and resources for Luxembourgish. If you know of any that should be featured here, let us know.
NLP tools
LuxemBERT
Description: First Luxembourgish BERT model trained from scratch, with optimizations for several downstream tasks such as text classification, NER, or intent classification.
Link: huggingface.co/lothritz/LuxemBERT
spaCy
Description: Starting with version 2.2.2, spaCy has language support for Luxembourgish. This includes tokenization and part-of-speech (POS) tagging.
Link: github.com/explosion/spaCy
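A minimal sketch of tokenizing Luxembourgish text with spaCy, assuming spaCy 2.2.2 or later is installed. It uses a blank pipeline created from the language code `lb`, so it only exercises the rule-based tokenizer and does not assume a pretrained model; the example sentence is made up for illustration.

```python
import spacy

# Create a blank Luxembourgish pipeline (tokenizer only, no trained components).
nlp = spacy.blank("lb")

# Illustrative sentence, not taken from any of the listed resources.
doc = nlp("Ech sinn e Student.")

# Collect the surface form of every token.
tokens = [token.text for token in doc]
print(tokens)
```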
spellux
Description: Automatic text normalization tool for Luxembourgish (spelling correction, lemmatization). Currently in development for training and evaluation.
Link: github.com/questoph/spellux
Syllabifier-for-Luxembourgish
Description: Implementation of the Penn Phonetics Toolkit for Luxembourgish, developed by Peter Gilles. Allows the phonetic syllabification of transcribed words.
Link: github.com/PeterGilles/Syllabifier-for-Luxembourgish
Automatic voice recognition
wav2vec2-large-xls-r-LUXEMBOURGISH2
Description: This is a first experimental build of an automatic voice recognition system by Peter Gilles, trained on a custom data set (~8 hours of Luxembourgish audio and transcript data).
Link: https://huggingface.co/pgilles/wav2vec2-large-xls-r-LUXEMBOURGISH2
OCR
tesseract
Description: Starting with version 4.0, tesseract has language support for Luxembourgish for Optical Character Recognition (OCR).
Link: github.com/tesseract-ocr/tesseract
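A rough sketch of calling the tesseract command line from Python for Luxembourgish, assuming the `tesseract` executable and its Luxembourgish trained data (language code `ltz`) are installed. The helper names `build_tesseract_cmd` and `run_ocr` and the file names are hypothetical and only serve the illustration.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="ltz"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base):
    """Run tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base), check=True)
    # tesseract appends ".txt" to the output base by default.
    return output_base + ".txt"
```

With a scanned page saved as `scan.png` (a placeholder name), `run_ocr("scan.png", "scan")` would write the recognized text to `scan.txt`.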
Resources
Luxembourgish language resources
Description: Phonetic transcriptions of the lemma lists from spellchecker.lu and lod.lu
Link: github.com/PeterGilles/Luxembourgish-language-resources
Luxembourgish dictionaries
Description: HunSpell dictionary and MyThes thesaurus for the Luxembourgish language based on spellchecker.lu
Link: github.com/spellchecker-lu/dictionary-lb-lu
Luxembourgish word embedding
Description: A word embedding model trained on Luxembourgish user comments from the media platform RTL.lu. It is based on roughly 544k Luxembourgish texts published between December 2008 and December 2018.
Link: https://zenodo.org/record/3978066
Universal Dependencies
Description: Repository for Luxembourgish as part of the Universal Dependencies project, with POS-annotated data.
Link: github.com/UniversalDependencies/UD_Luxembourgish-LuxBank