transnormer
A lexical normalizer for historical spelling variants using a transformer architecture.
This project provides code for training encoder-decoder transformer models, applying a trained model, and inspecting and evaluating a model's performance.
Project Organization
├── LICENSE
├── Makefile               <- Makefile with commands like `make data` or `make train` [not in use yet]
├── README.md
├── requirements.txt
├── requirements-dev.txt
├── pyproject.toml         <- makes project pip installable
│
├── data                   [not in use yet]
│   ├── external           <- Data from third-party sources.
│   ├── interim            <- Intermediate data that has been transformed.
│   ├── processed          <- The final, canonical data sets for modeling.
│   └── raw                <- The original, immutable data dump.
│
├── models                 <- Trained models
│
├── notebooks              <- Jupyter notebooks
│
├── references             <- Manuals and other explanatory materials.
│
├── reports                <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures            <- Generated graphics and figures to be used in reporting
│
└── src                    <- Source code for use in this project. [TODO]
    └── transnormer
        ├── __init__.py    <- Makes transnormer a Python package
        │
        ├── data           <- Scripts to download or generate data
        │   └── make_dataset.py
        │
        ├── features       <- Scripts to turn raw data into features for modeling
        │   └── build_features.py
        │
        ├── models         <- Scripts to train models and then use trained models to make
        │   │                 predictions
        │   ├── predict_model.py
        │   └── train_model.py
        │
        ├── tests          <- Testing facilities for source code
        │
        └── visualization  <- Scripts to create exploratory and results-oriented visualizations
            └── visualize.py
Project structure is based on the cookiecutter data science project template.
Installation
Create a conda environment and install dependencies.
# Install conda environment
conda install -y pip
conda create -y --name <environment-name> python=3.9 pip
conda activate <environment-name>
conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# and/or install the transnormer package (-e for editable mode)
pip install -e .
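After installation, a quick sanity check (not part of the repository, just an illustration) can confirm that the CUDA-enabled PyTorch build actually sees a GPU:

```python
# Sanity check (illustrative, not part of this repository):
# confirm that the CUDA setup from the steps above worked.
import torch

print(torch.__version__)          # should report a +cu113 build
print(torch.cuda.is_available())  # True if a GPU is visible
```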
Hints
- Do `export TOKENIZERS_PARALLELISM=false` to get rid of parallelism warning messages (see `stdout-with-erros.md`).
- If you use the `Trainer()`, do `export CUDA_VISIBLE_DEVICES=1` and set `gpu = "cuda:0"` in the config file. Otherwise it will use both GPUs automatically.
Required resources
- A dataset of historical language documents with (gold-)normalized labels
- Encoder and decoder models (available on the Hugging Face Model Hub)
- Tokenizers that belong to the models (also available via Hugging Face; see the loading sketch below)
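As a rough illustration of the last two points (the checkpoint names below are placeholders, not the models this project actually uses), pre-trained models and their tokenizers can be fetched from the Hugging Face Hub like this:

```python
# Placeholder checkpoint names -- not the models this project actually uses.
from transformers import AutoTokenizer

encoder_name = "some-org/bert-base-historical-german"  # hypothetical encoder for historical German
decoder_name = "bert-base-german-cased"                # a modern-German checkpoint

# Each model on the Hub comes with a matching tokenizer.
encoder_tokenizer = AutoTokenizer.from_pretrained(encoder_name)
decoder_tokenizer = AutoTokenizer.from_pretrained(decoder_name)
```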
Config file
To run the training script, you also need a TOML file called `training_config.toml` with the training configuration parameters. The file `training_config_TEMPLATE.toml` is a template for this.
NOTE: This will probably be changed in later versions, see issues 6 and 7
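As a rough sketch of how such a config file can be consumed (purely illustrative: only the `gpu` key is mentioned in the Hints above; the other details and the exact library used by the project are assumptions):

```python
# Illustrative sketch only: read the training config with the `toml` package.
# The `gpu` key is the one mentioned in the Hints above; everything else
# depends on training_config_TEMPLATE.toml.
import toml

config = toml.load("training_config.toml")
device = config.get("gpu", "cuda:0")  # e.g. "cuda:0"
print(device)
```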
Usage
Run the training script:
cd src/transnormer/models
python3 model_train.py
Intuition
Historical text normalization is treated as a seq2seq task, like machine translation. We use a transformer encoder-decoder model: the encoder-decoder is warm-started from pre-trained models and fine-tuned on a dataset for lexical normalization (see the sketch after the list below).
- Encoder for historic German
- Decoder for modern German
- Encoder-decoder wired together
- Supervised learning with labeled data
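A minimal sketch of this wiring with the Hugging Face `transformers` library (checkpoint names are placeholders, not the project's actual configuration):

```python
# Sketch only: warm-start an encoder-decoder model from two pre-trained
# checkpoints. The checkpoint names below are placeholders.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "some-org/bert-base-historical-german",  # hypothetical encoder for historical German
    "bert-base-german-cased",                # decoder for modern German
)

# The cross-attention weights between encoder and decoder are newly initialized;
# they are learned during supervised fine-tuning on gold-normalized pairs.
```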
Motivation
- Transformers are state of the art in NLP
- By using pre-trained transformer models we can leverage linguistic knowledge from large quantities of data
- There is far more historical text without normalization than with (gold-)normalized labels
- An encoder (language model) can be trained on historical text without a normalization layer
- Intuition: We take an encoder that knows a lot about historical language and a decoder that knows a lot about modern language, and plug the two together by training them on gold-normalized data.
Background
[Perhaps this should go somewhere else, e.g. into `./references`]
References
- For a blog post on warm-starting encoder-decoder models, see here
- Corresponding colab notebook
- Paper "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks"