Awesome
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
This list primarily includes resources and tools that are currently maintained and can be used off-the-shelf or with minor adjustments. It is deliberately biased toward usability and user-friendliness.
Community support is needed to keep this list up to date; pull requests and suggestions are welcome! See contributing guidelines.
Table of Contents
- Text corpora
- Generic resources
- Linguistic processing
- Semantic analysis
- Speech NLP
- Machine Translation
- Large Language Models
- Teaching resources and tutorials
- More lists
Text corpora
General-purpose
- Araneum Germanicum
- CEHugeWebCorpus
- COW
- Digitales Wörterbuch der deutschen Sprache (DWDS)
- GC4 Corpus (CommonCrawl)
- IDS Corpora
- Leipzig Corpora Collection
- SdeWaC
Historical
- Anselm (14th-16th centuries)
- Austrian Newspapers (19th C. NewsEye / READ OCR training dataset)
- Deutsches Textarchiv
- Elektronische Texte (Thomas Gloning)
- GerManC (1650-1800)
- German Drama Corpus (GerDraCor)
- German Novels
- German Poetry Corpus (DLK)
- Lesekorpus Altdeutsch (750-1050)
- LiederCorpus
- Referenzkorpus Altdeutsch (750-1050)
- Referenzkorpus Mittelhochdeutsch (1050-1350)
- Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650)
- Referenzkorpus Frühneuhochdeutsch (1350-1650)
- Thesaurus Indogermanischer Text- und Sprachmaterialien (TITUS)
- Transkriptionen von Fibeln (19. Jahrhundert)
Specialized
- AGB-DE
- arg-microtexts
- auto-hMDS (multi-document summarization)
- DFKI MobIE
- DIRNDL -- (D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis
- Dortmunder Chat Korpus
- Feidegger (Fashion Images and Descriptions)
- Foodblog-Korpus
- Fußballlinguistik
- German EUROPARL data w/ NE annotation
- German Job Reference Corpus
- German Political Speeches Corpus
- German Recipes Dataset
- GermaParl (Bundestag)
- German Parliamentary Corpus (GerParCor)
- German Wikipedia Text Corpus
- GRAIN corpus -- (G)erman-(RA)dio-(IN)terviews
- Legal Entity Recognition
- One Million Posts Corpus
- Open Legal Data Corpus (German laws and court decisions)
- Pegida Facebook Comments
- Potsdam Commentary Corpus (PCC)
- Songkorpus
- Survey of Corpora for Germanic Low-Resource Languages and Dialects
- [TeCoPhy: A Text Corpus of German Physics Texts](https://zenodo.org/records/8316079)
- Ten Thousand German News Articles Dataset
- TTLab StadtWiki Corpus
- GSM-1k-de (German translation of the first 1000 items of GSM8K)
Swiss German
- ArchiMob Corpus
- NOAH's Corpus: Part-of-Speech Tagging for Swiss German
- SpinningBytes Swiss German Sentiment Corpus
- Swiss SMS Corpus
Learner and Error Corpora
Word lists
- Analogies in German Particle Verb Meaning Shifts
- Degree of Grammaticalization for German Prepositions
- DWDS lemma list
- DeReWo
- Diachronic Usage Relatedness (DURel)
- DiMLex (lexicon of German discourse markers)
- German Compound Database
- German derivational lexicons
- German nouns from Wiktionary
- german_stopwords
- German Wiktionary Lexicon Graph
- German word list for GNU Aspell
- Metaphoric Change (annotated lexemes)
- Morphological Dictionaries (DEMorphy)
- OpenThesaurus
- Stopwords German (DE)
- VulGer
- wiktextract
- wiktionary-de-parser
Data acquisition
- bundestag
- bundestweets
- DKPro C4Corpus
- german-reddit
- news-crawler
- news-please
- pattern
- scrape-gutenberg-de
- SwigSpot Schwyzertuutsch-Spotting
- trafilatura
Lists of corpora
- CLARIN-D list
- Corpora at the IMS
- CorpusExplorer's list of corpora
- Korpusarchiv (IDS Mannheim)
- Laudatio (Long-term Access and Usage of Deeply Annotated Information)
- Parallel corpora (see below)
- Treebanks (see below)
- ZAS list
Generic resources
Frameworks
- AmbiverseNLU
- CLARIN-D web tools
- CorpusExplorer
- DKPro Core
- DKPro Similarity
- DKPro Text Classification (TC)
- DKPro Word Sense Disambiguation (WSD)
- flair
- FreeLing
- ixa pipes
- Mate Tools, webservice via WebLicht
- NLP-Cube
- nlptasks
- spaCy
- Sparv
- Stanford CoreNLP
- textblob-de
- TextImager
Treebanks
- German Universal Dependency Treebank/UD German GSD
- Hamburg Dependency Treebank
- NEGRA
- TIGER Corpus
- TGermaCorp (literary texts)
- TüBa-D/Z
Deep learning models and transformers
- LAION LeoLM Llama v2 German Foundation Language Model 7B Parameters
- LAION LeoLM Llama v2 German Foundation Language Model 13B Parameters
- dbmdz BERT models
- Deepset German BERT model
- Evaluating German Transformer Language Models with Syntactic Agreement Tests
- German ELMo Model
- german-transformer-training
- GermLM (NER exploration)
- GerPT2
- Sentence Transformers
Annotation
Standards
Linguistic processing
Preprocessing
Tokenization / Sentence boundary detection
- Cutter
- Datok
- deep-eos (sentence boundary detection only)
- FullStop (sentence boundary detection only)
- JTok
- KorAP-Tokenizer
- nnsplit (sentence boundary detection only)
- SoMaJo
- syntok
- waste
- german-abbreviations (resource)
Stemming
Lemmatization
Morphological analysis
- CharSplit
- DEMorphy
- dehyphen
- deep-german (classification of nouns by genders)
- Durm Lemmatizer
- german_compound_splitter
- GermanNumerus
- HypheNN-de
- jwordsplitter
- lang-deu
- Low German morphology and tools
- MarMoT
- MOP Compound Splitter
- Morphy
- morphisto
- nnsplit
- SECOS (unsupervised compound splitter)
- SFST
- SMOR, webservice via WebLicht
- timur
- zmorge
Normalization
Phonology
POS-tagging
- clevertagger
- HanTa
- hunpos
- LemmaTag
- moot
- pattern.de
- RFTagger, webservice via WebLicht
- RNNTagger
- SoMeWeTa
- TnT
- TreeTagger (including models)
Syntactical parsing
- Berkeley Parser
- BitPar, webservice via WebLicht
- CDG
- IMSTrans (dependency parser)
- ParZu
- Stanford Parser
- STEPS Parser
Named Entity Recognition
- AmbiverseNLU KnowNER
- flair
- GermaNER
- GERNERMED
- historic-ner
- LSTM+CRF+FastText with models for (historic) German
- microNER
- Named Entity Recognition (LSTM + CRF + FastText) with models for [historic] German
- ner-corpora
- NER-datasets
- (Faruqui & Pado 2010) Components and evaluation data
- Towards Robust Named Entity Recognition for Historic German
Misc
Text generation
Industry/Applications
- German Decompounder for Apache Lucene / Apache Solr / Elasticsearch
- holmes-extractor
- LanguageTool
- Plenum First Said
Evaluation
Semantic analysis
Datasets
- Complex Word Identification (DE, EN, ES, FR)
- Distributional memories: DM.de TransDM.de
- Distributional thesauri (includes German)
- Downloads page of the Interest Group on German Sentiment Analysis
- Lexical Chains
- Logical metonymy database
- schulteimwalde.de/resources.html
- Semantic Relations in Context
- UKP Darmstadt data list
Word embeddings and senses
- disco (semantic similarity)
- GermaNet
- german2vec
- GermanWordEmbeddings
- German ELMo model
- Open German WordNet
- sensegram
- SpinningBytes word embeddings (tweets)
- UBY Linked Lexical Resource
- WECHSEL (subword embeddings)
Sentiment analysis datasets / polarity clues
- Affective norms: abstractness, arousal, imageability and valence ratings
- German Sentiment Classification Model for Dialog Systems
- GermanPolarityClues
- HeiST – Heidelberg Sentiment Treebank
- (Non-)Literalness Ratings for complex verbs
- Potsdam Twitter Sentiment Corpus (PotTS)
- Sentiment dictionary for German political language
- Sentiment Lexicon (Univ. Zurich)
- SentimentWortschatz
- SpinningBytes Swiss German Sentiment Corpus
Sentiment detection
- 3x8emotions
- EmotiKLUE
- germansentiment: A simple python package for sentiment classification
- LT-ABSA: Aspect-based Sentiment Analysis
- sentiment-analyser
- spacy-sentiws
GermEval
(category to improve)
- Official GermEval tools list
- GermEval 2015 data (Lexical Substitution)
- GermEval Task 2017
- GermEval-2018 data
- germeval-rug
- IWG_hatespeech_public
- jpadillamontani/germeval2018
- uhh-lt/GermEval2017-Baseline
- UKP embeddings for GermEval 2017
Discourse
- Bilingual formality (T/V) corpus (EN/DE)
- Bilingual FrameNet frame embeddings (EN/DE)
- Bilingual parallel frame-semantic annotation (EN/DE)
- Coreferee
- CorZu (coreference resolution)
- Discourse Segmenter
- Frame Identification
- German social media textual entailment dataset
- HotCoref DE (coreference resolution)
- PropS-DE (proposition structures)
- Tense-mood-voice annotation system
Summarization and Simplification
- DEPlain
- Klexikon (Joint Summarization and Simplification)
- Tools and corpora for summarization of German texts
Psycholinguistics
Speech NLP
- Archiv für gesprochenes Deutsch
- BAS resources
- Bochumer Korpus der gesprochenen Sprache im Ruhrgebiet
- Database for Spoken German (IDS Mannheim)
- deepspeech-german
- (D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis
- Hamburger Zentrum für Sprachkorpora
- kaldi-tuda-de
- Open Speech Data Corpus
- Thorsten (Emotional) - Open German Voice Dataset
- Thorsten (Neutral) - Open German Voice Dataset
Machine Translation
(category to improve)
Parallel corpora
Large Language Models
- EM_German
- German Alpaca Dataset
- German Benchmark Datasets
- German Language Models
- GermanRAG
- German Text Embedding Clustering Benchmark
- Swiss German Text Encoders
- Vox Populi, Vox AI
Teaching resources and tutorials
- bubenhofer.com/korpuslinguistik/kurs/
- CorpusExplorer v2.0 – Seminartauglich in einem halben Tag
- deeplearning4nlp-tutorial
- deutsch-nlp (text classification)
- German Text Classification Tutorial Series
- Statistics for linguists (S. Vasishth)
- Stilometrie
- Uni Zürich: Sprachtechnologie in den Digital Humanities – MOOC Youtube & Coursera
More lists
German
- CLARIN VLO (DE+public)
- computerlinguistik.org
- Learn German as a foreign language
- LRE Map
- MetaShare Language Resources
- Peter Kolb's list
- Swiss German Language Processing
General
- GitHub topics corpus-linguistics & nlp
- nlp-datasets
- NLP-progress
- /r/LanguageTechnology/
Comparable lists
- awesome-nlp
- Awesome Community-Curated NLP List
- awesome-chinese-nlp
- awesome-danish
- awesome-hungarian-nlp
- awesome Information Retrieval
- Indonesian NLP
- Norwegian NLP resources
- awesome-nlp-polish
- awesome-spanish-nlp
- NLP-Pandect
- M. Weisser's list of NLP/Computational Linguistics Resources
- NLP tools (Saarland University)
- W. Roberts' Computational Linguistics Links
Larger institutional GitHub groups
- DFKI-NLP
- Language Technology Group, Universität Hamburg
- Saarland University Spoken Language Systems Group
- Ubiquitous Knowledge Processing Lab, TU Darmstadt
- Webis
Contributors
See the list of contributors.