Home

Awesome

Document classification model training

This repository contains code that can be used for training a model to classify input documents into distinct classes. Best results can be achieved with input that contains clearly distinguishable document formats, which differ structurally from each other.

In the National Archives of Finland, the code has been used for training a model to classify documents relating to Finnish inheritance taxation. These consist of a four-page form and an appendix, which can vary in length and format. The model was trained to classify these documents into five classes, with one class for each page of the form and a separate class for the document belonging to the appendix. With this data, the model was able to reach a high level of classification accuracy:

ClassTraining samplesValidation samplesValidation accuracy
Page 1379942299.95%
Page 2379942299.95%
Page 3380142299.88%
Page 4380142299.98%
Appendix450050099.56%

The code can be used for training a model to classify a varying number of document types in the input data. However, the following 'rules of thumb' should be kept in mind:

The code has been built using the Pytorch library, and the model training is done by fine-tuning an existing Densenet neural network model.

The code is split into two files:

Running the code in a virtual environment

These instructions use a conda virtual environment, and as a precondition you should have Miniconda or Anaconda installed on your operating system. More information on the installation is available here.

Create and activate conda environment using the following commands:

conda create -n doc_classification_env python=3.7

conda activate doc_classification_env

Install dependencies listed in the requirements.txt file:

pip install -r requirements.txt

Run the training code

When using the default values for all of the model parameters, the training can be initiated from the command line by typing

python train.py

The different model parameters are explained in more detail below.

Model parameters

Parameters related to training and validation data

By default, the code expects the following folder structure, where training and validation data for each document class/type is placed in a separate folder named with a numeric value (1,2,3...):

├──doc_classification 
      ├──models
      ├──results 
      ├──data
      |   ├──train
      |   |   ├──1
      |   |   ├──2
      |   |   ├──3 ...
      |   └──validation
      |   |   ├──1
      |   |   ├──2
      |   |   ├──3 ...
      ├──train.py
      ├──utils.py
      └──requirements.txt

Parameters:

The parameter values can be set in command line when initiating training:

python train.py --tr_data_folder ./data/train/ --val_data_folder ./data/validation/

The accepted input image file types are .jpg, .png and .tiff. Pdf files should be transformed into one of these images formats before used as an input to the model.

Parameters related to saving the model and the training and validation results

The training performance is measured using training and validation loss, accuracy and F1 score (more information on the F1 score can be found for example here). The average of these values is saved each epoch, and the resulting values are plotted and saved in the folder defined by the user. The confusion matrix containing the classification accuracy scores for each class is also plotted and saved after each epoch.

The trained model is saved by default after each epoch when the validation F1 score improves the previous top score. The model is saved by default to the ./models folder as densenet_date.pth. Date refers to the current date, so that a model trained on 4.7.2023 would be saved as densenet_04072023.pth.

Parameters:

The parameter values can be set in command line when initiating training:

python train.py --results_folder ./results --save_model_path ./models

Parameters related to model training

A Number of parameters are used for defining the conditions for model training.

The code allows the fine-tuning of the base model to be performed in two stages. In the first stage, the parameters of the base model are frozen, so that the training impacts only the parameters of the final classification layer. In the second stage, all parameters of the model are 'unfrozen' for training. In the code, the num_epochs parameter defines the number of epochs used in the first stage of training, and unfreeze_epochs defines the number of epochs used in the second stage of training.

Learning rate defines how much the model weights are tuned after each iteration based on the gradient of the loss function. In the code, the lr parameter defines the learning rate for the second stage of training, while the learning rate used for the classification layer's parameters in the first training stage is automatically set to be 10 times larger than lr.

Number of document types/classes used in the classification task is set using the num_classes parameter. This should correspond with the number of data folders used for the training and validation data.

Batch size (batch_size) sets the number of images that are processed before the model weights are updated. Early stopping (early_stop_threshold) is a method used for reducing overfitting by stopping training after a specific learning metric (loss, accuracy etc.) has not improved during a defined number of epochs.

Random seed (random_seed) parameter is used for setting the seed for initializing random number generation. This makes the training results reproducible when using the same seed, model and data.

The device parameters defines whether cpu or gpu is used for model training. Currently the code does not support multi-gpu training.

Parameters:

The parameter values can be set in command line when initiating training:

python train.py --lr 0.0001 --batch_size 16 --num_epochs 5 --unfreeze_epochs 5 --num_classes 5 --early_stop_threshold 2 --random_seed 8765 --device cpu