transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.

This project provides code for training encoder-decoder transformer models, applying a trained model, and inspecting and evaluating a model's performance.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train` [not in use yet]
├── README.md          
├── requirements.txt   
├── requirements-dev.txt   
├── pyproject.toml     <- makes project pip installable 
│
├── data                  [not in use yet]
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained models
│
├── notebooks          <- Jupyter notebooks
│
├── references         <- Manuals and other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
└── src                <- Source code for use in this project. [TODO]
    └── transnormer        
        ├── __init__.py    <- Makes transnormer a Python package
        │
        ├── data           <- Scripts to download or generate data
        │   └── make_dataset.py
        │
        ├── features       <- Scripts to turn raw data into features for modeling
        │   └── build_features.py
        │
        ├── models         <- Scripts to train models and then use trained models to make
        │   │                 predictions
        │   ├── predict_model.py
        │   └── train_model.py
        │
        ├── tests          <- Testing facilities for source code
        │
        └── visualization  <- Scripts to create exploratory and results-oriented visualizations
            └── visualize.py

Project structure is based on the cookiecutter data science project template.

Installation

Create a conda environment and install dependencies.

# Create and activate a conda environment
conda install -y pip
conda create -y --name <environment-name> python=3.9 pip
conda activate <environment-name>

# Install CUDA libraries and a matching PyTorch build
conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html

# Install dependencies 
pip install -r requirements.txt
pip install -r requirements-dev.txt

# and/or install the transnormer package (-e for editable mode)
pip install -e . 
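After installation, a quick sanity check (a minimal sketch, assuming the torch==1.12.1+cu113 install above) confirms that the CUDA-enabled PyTorch build is active:

# Sanity check: confirm that PyTorch sees the GPU.
import torch

print(torch.__version__)          # e.g. 1.12.1+cu113
print(torch.cuda.is_available())  # True if the CUDA setup worked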

Hints

Required resources

Config file

To run the training script, you also need a TOML file called training_config.toml that contains the training configuration parameters. The file training_config_TEMPLATE.toml serves as a template for this.
NOTE: This will probably be changed in later versions, see issues 6 and 7
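
As a rough illustration only (the section and key names below are invented placeholders, not the template's actual parameters), the training script might read such a file like this:

# Hypothetical sketch of loading training_config.toml; all keys shown
# are placeholders, not the template's actual parameter names.
import tomli  # third-party TOML parser (tomllib is stdlib from Python 3.11)

with open("training_config.toml", "rb") as f:
    config = tomli.load(f)

batch_size = config["training"]["batch_size"]  # placeholder key
num_epochs = config["training"]["num_epochs"]  # placeholder key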

Usage

Run the training script:

cd src/transnormer/models
python3 train_model.py
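
Once a model has been trained and saved, applying it could look like the following sketch. This assumes the model was saved in Hugging Face transformers format; the checkpoint path models/transnormer and the example sentence are placeholders:

# Hedged sketch of applying a trained model to a historical sentence.
# Checkpoint path and input are placeholders; the transformers API is an assumption.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("models/transnormer")
model = EncoderDecoderModel.from_pretrained("models/transnormer")

inputs = tokenizer("Jch hab es nit gewuſt.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))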

Intuition

Historical text normalization is treated as a sequence-to-sequence (seq2seq) task, analogous to machine translation. We use a transformer encoder-decoder model: encoder and decoder are warm-started from pre-trained models, and the combined model is fine-tuned on a labeled dataset for lexical normalization.

  1. Encoder for historical German
  2. Decoder for modern German
  3. Encoder-decoder wired together (see the sketch below)
    • Supervised learning with labeled data
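
A minimal sketch of this warm-starting step, assuming Hugging Face transformers (the checkpoint name is a placeholder, not necessarily the one used by transnormer):

# Warm-start an encoder-decoder model from two pretrained checkpoints.
# The checkpoint name below is a placeholder, not the project's actual choice.
from transformers import AutoTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dbmdz/bert-base-german-cased",  # initializes the encoder
    "dbmdz/bert-base-german-cased",  # initializes the decoder (cross-attention is added)
)
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

# Token ids the seq2seq generation API expects to be set explicitly
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

The wired-together model is then fine-tuned on pairs of historical and modern sentences.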

Motivation

Background

[Perhaps this should go somewhere else, e.g. into ./references]

References