german2vec
Overview
This repository contains documentation and code for building a German Language Model using the fastai library and applying it to a variety of NLP tasks such as text classification. The language model is based on a 3-layer AWD-LSTM architecture first published by Salesforce Research.
The backbone of the model is trained on the German Wikipedia Corpus and transfer learning is used to apply it to text classification tasks (as described in Universal Language Model Fine-tuning for Text Classification).
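For orientation, here is a minimal sketch of this ULMFiT-style workflow using the fastai v1 text API. File names, column layout, encoder name, and hyperparameters are placeholders, not the exact settings used in the notebooks:

```python
from fastai.text import *  # fastai v1 text API

# Build a language-model DataBunch from a CSV of German Wikipedia text
# (the file name `wiki_de.csv` is a placeholder)
data_lm = TextLMDataBunch.from_csv('data/', 'wiki_de.csv')

# 3-layer AWD-LSTM language model backbone
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)

# Save the encoder so the classifier can reuse the learned representations
learn_lm.save_encoder('german_wiki_enc')

# Text classifier on top of the fine-tuned encoder (ULMFiT-style transfer)
data_clas = TextClasDataBunch.from_csv('data/', 'sb10k.csv', vocab=data_lm.vocab)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('german_wiki_enc')
learn_clas.fit_one_cycle(1, 1e-2)
```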
Update:
A pre-trained Language Model using the German Wikipedia Corpus is available from this website: https://lernapparat.de/german-lm/. Thanks for sharing, Thomas!
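The downloaded weights could be plugged into a fastai language model learner roughly as follows. This is a sketch assuming the fastai v1 `pretrained_fnames` argument; the file names below are placeholders for the weight and vocabulary (itos) files placed under `data/models/`:

```python
from fastai.text import *

# Target-corpus DataBunch; 'texts_de.csv' is a placeholder
data_lm = TextLMDataBunch.from_csv('data/', 'texts_de.csv')

# Load the pre-trained German weights and vocabulary (placeholder file names)
learn = language_model_learner(
    data_lm, AWD_LSTM,
    pretrained_fnames=['german_wiki_lm', 'german_wiki_itos'],
    drop_mult=0.3)

# Fine-tune the pre-trained backbone on the target corpus
learn.fit_one_cycle(1, 1e-3)
```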
Project structure
data/
-- language model for German language (available from https://lernapparat.de/german-lm/)
doc/
-- documentation and implementation notes
sb-10k_german_sentiment_classification/
-- raw data for the SB-10k Corpus
scr/
-- notebooks used for various experiments on NLP classification
Notebook | Task |
---|---|
sb-10k-use_pretrained_language_model.ipynb | classifier for SB-10k Corpus (built on pre-trained language model) |
sb-10k_small_wikipedia_corpus.ipynb | classifier for SB-10k Corpus (built on self-trained language model using German Wikipedia) |
sb-10k-data_preprocessing.ipynb | data pre-processing steps for SB-10k: German Sentiment Corpus |
TODO
- fine-tune and evaluate classifier using SB-10k: German Sentiment Corpus (see the sketch below)
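A possible sketch of this step, continuing from the classifier outlined in the Overview. Paths, the encoder name, learning rates, and epoch counts are illustrative placeholders:

```python
from fastai.text import *

# Classifier data for the SB-10k corpus ('sb10k.csv' is a placeholder)
data_clas = TextClasDataBunch.from_csv('data/', 'sb10k.csv')
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5,
                                     metrics=[accuracy])
learn_clas.load_encoder('german_wiki_enc')

# Gradual unfreezing as described in the ULMFiT paper
learn_clas.fit_one_cycle(1, 1e-2)
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-3, 1e-2))
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-4, 1e-3))

# Validation loss and accuracy
print(learn_clas.validate())
```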
Future research
to be updated
Contact
For more information, please feel free to contact me via e-mail (bachfischer.matthias@googlemail.com)