Home

Awesome

Handwritten Text Recognition with TensorFlow

Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. The model takes images of single words or text lines (multiple words) as input and outputs the recognized text. 3/4 of the words from the validation-set are correctly recognized, and the character error rate is around 10%.

htr

Run demo

The input images, and the expected outputs are shown below when the text line model is used.

test

> python main.py
Init with stored values from ../model/snapshot-13
Recognized: "word"
Probability: 0.9806370139122009

test

> python main.py --img_file ../data/line.png
Init with stored values from ../model/snapshot-13
Recognized: "or work on line level"
Probability: 0.6674373149871826

Command line arguments

Integrate word beam search decoding

The word beam search decoder can be used instead of the two decoders shipped with TF. Words are constrained to those contained in a dictionary, but arbitrary non-word character strings (numbers, punctuation marks) can still be recognized. The following illustration shows a sample for which word beam search is able to recognize the correct text, while the other decoders fail.

decoder_comparison

Follow these instructions to integrate word beam search decoding:

  1. Clone repository CTCWordBeamSearch
  2. Compile and install by running pip install . at the root level of the CTCWordBeamSearch repository
  3. Specify the command line option --decoder wordbeamsearch when executing main.py to actually use the decoder

The dictionary is automatically created in training and validation mode by using all words contained in the IAM dataset (i.e. also including words from validation set) and is saved into the file data/corpus.txt. Further, the manually created list of word-characters can be found in the file model/wordCharList.txt. Beam width is set to 50 to conform with the beam width of vanilla beam search decoding.

Train model on IAM dataset

Prepare dataset

Follow these instructions to get the IAM dataset:

Run training

The pretrained word model was trained with this command on a GTX 1050 Ti:

python main.py --mode train --fast --data_dir path/to/iam  --batch_size 500 --early_stopping 15

And the line model with:

python main.py --mode train --fast --data_dir path/to/iam  --batch_size 250 --early_stopping 10

Fast image loading

Loading and decoding the png image files from the disk is the bottleneck even when using only a small GPU. The database LMDB is used to speed up image loading:

The dataset should be located on an SSD drive. Using the --fast option and a GTX 1050 Ti training on single words takes around 3h with a batch size of 500. Training on text lines takes a bit longer.

Information about model

The model is a stripped-down version of the HTR system I implemented for my thesis. What remains is the bare minimum to recognize text with an acceptable accuracy. It consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer. For more details see this Medium article.

References