Awesome

Hareef

Hareef is an implementation of state-of-the-art models for diacritics restoration for Arabic language.

Features

Training using pytorch-lightning
Standardized calculation of diacritization evaluation metrics
Export trained models to onnx
Easy to use scripts for preprocessing, cleaning, tokenizing, and post-processing text and outputs
Support for extracting sentences from any diacritized corpus

Currently implemented models

Implementation of the following models is considered complete:

Sarf: our own model that uses deep GRU network and transformer encoder layers
CBHG model from the paper Effective Deep Learning Models for Automatic Diacritization of Arabic Text

Planned models

The following models will be implemented in the near future:

2SDiac from the paper Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text
D2/D3 models from the paper Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization

Usage

Here's how to train the CBHG model. The process is very similar for the other models.

Review model config

Every command requires passing a --config argument. The config contains model hyper parameters and data paths.

For CBHG model this is the file config/cbhg/config.json.

Please review the keys and change them based on your environment and needs. For instance, if you have abundant vram, you can increase batch_size or max_len, both of which may improve the model's predictions.

Install packages

Make sure you have Python 3.10 or later.

Then clone this repo:

git clone https://github.com/mush42/hareef

After this cd to the repo, create a virtualenv, and install required packages:

cd ./hareef
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip setuptools
python3 -m pip install -r requirements.txt

Prepareing the dataset

Training the models included in this repo requires a large corpus of fully diacritized Arabic text. You can download such a corpus from this drive link and unzip it to a location of your choice.

After downloading and unzipping the corpus, run the following command from the root of the repo:

python3 -m hareef.cbhg.process_corpus --config ./config/cbhg/config.json --validate [/path/to/extracted/arabic-diacritization-corpus.txt]

This will create train.txt, val.txt, and test.txt in the ./data/cbhg/CA_MSA/ directory (or the path you configured in config.json)

Training

Lightning is used for training. Run the following command to start the training loop.

python3 -m hareef.cbhg.train --config ./config/cbhg/config.json

By default the model will train for 100 epochs. Early stop criteria will stop training earlier if the loss metric does not improve for 5 consecutive epochs.

Evaluation

To calculate WER/DER metrics with and without case-endings, use the following command:

python3 -m hareef.cbhg.error_rates --config ./config/cbhg/config.json

Testing

To test the model using the test data split, use the following command:

python3 -m hareef.cbhg.train --test --config ./config/cbhg/config.json

Inference

Use the following command to diacritize some passage of Arabic text using the last checkpoint:

python -m hareef.cbhg.infer --config ./config/cbhg/config.json --text "الجو جميل، والهواء عليل."

If you exported the model to ONNX, you can use the ONNX model instead of torch checkpoint by passing the --onnx argument to the script.

Exporting to ONNX

To export the last checkpoint to ONNX, use the following command:

python3 -m hareef.cbhg.export_onnx --config ./config/cbhg/config.json --output ./model.onnx

License

MIT License