Table segmentation

<img src="table_example.jpg" width="40%" height="40%">

The code, provided by Alpha Logos Software Oy, contains several functions for detecting table lines and content elements in digitized document images. It was developed specifically for Finnish ship logbooks from the late 19th and early 20th centuries (see the example images in the sample_logbook_data folder), and the default parameter values are optimized for that dataset. Achieving good segmentation results on other datasets will most likely require adjusting the parameter values.

The code is documented in a separate documentation.pdf file, and the images in the example_images folder illustrate the output and the different processing stages of the included functions. More detailed information on many of the functions and parameters can be found in the documentation of the OpenCV library.

Installation

conda create -n table_seg_env python=3.11

conda activate table_seg_env

pip install -r requirements.txt

Running the code

Image segmentation is performed by running the run_main_tests.py file from the command line. By default, the code expects the input images to be located in subfolders of the sample_logbook_data folder, which contains examples of the ship logbooks used for developing the code. The results of the analysis are placed in subfolders of the results folder. When these default folder names are used and the output is chosen to include table line images, progress images, table element images and table element cell position images, the following folder structure is expected before running the code (the folder and image names are examples):

├──Table_segmentation
      ├──results 
      ├──sample_logbook_data
      |   ├──image_folder_1
      |   |   ├──img_1_1.jpg...
      |   └──image_folder_2
      |       ├──img_2_1.jpg...
      ├──run_main_tests.py
      ├──main_test_functions.py
      ├──utilities.py
      ├──requirements.txt
      ...
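
Before starting a run, it can help to confirm that the input folder follows the layout above. The following is a minimal sketch, not part of the provided code: the helper name list_input_images and the set of accepted image extensions are assumptions made for illustration only.

```python
from pathlib import Path

# Assumption: the image formats accepted by the code are not stated in this
# README, so this set of extensions is only a guess for illustration.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def list_input_images(input_dir):
    """Map each subfolder name of input_dir to its sorted image file names.

    Hypothetical helper: it only inspects the folder layout and does not
    call any function of the table segmentation code.
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Input folder not found: {root}")
    layout = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        layout[sub.name] = sorted(
            f.name
            for f in sub.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return layout
```

Calling list_input_images("sample_logbook_data") from the Table_segmentation folder returns a dictionary with one entry per image subfolder. An empty list for a subfolder suggests that its images are missing or use an extension not included in the assumed set.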

After running the code, the results folder should contain the following subfolders, which hold the result files:

├──Table_segmentation
      ├──results 
      |   ├──image_folder_1
      |   |   ├──arrays
      |   |   ├──images
      |   |   ├──progress_images
      |   |   └──numbers_of_table_elements.npy
      |   └──image_folder_2
      |       ...
      ...
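
A quick way to see which outputs were produced for each image folder is to compare the results folder against the entries shown in the tree above. This is a hedged sketch: the helper name summarize_results is hypothetical, and which entries actually appear depends on the output options chosen for the run.

```python
from pathlib import Path

# Entry names taken from the folder tree above. Some of them are only
# written when the corresponding output option is enabled.
EXPECTED_ENTRIES = (
    "arrays",
    "images",
    "progress_images",
    "numbers_of_table_elements.npy",
)


def summarize_results(results_dir):
    """Report, per results subfolder, which expected entries exist.

    Hypothetical helper: it checks names on disk only and does not read
    or interpret the contents of the result files.
    """
    root = Path(results_dir)
    summary = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[sub.name] = {
            name: (sub / name).exists() for name in EXPECTED_ENTRIES
        }
    return summary
```

For example, summarize_results("results") returns one dictionary per image folder, with True or False for each expected entry.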

Parameters

A variety of parameters can be provided as input arguments to the code. The arguments are listed in the run_segment.py file, and explanations of the technical parameters can be found in the documentation.pdf file. The parameters relating to the directory paths, output files and processing type are listed below:

The following example shows how to run the code with the default values of the above arguments:

python run_main_tests.py

For example, to change the input folder name to ./input and exclude progress images from the results, type:

python run_main_tests.py --INPUT_DIR ./input --CONSTRUCT_PROGRESS_IMAGES
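
The same command can also be launched from a Python script, for instance when the input folder is chosen programmatically. The sketch below is an illustration under stated assumptions: the helper names build_command and run_segmentation are hypothetical, only the --INPUT_DIR argument mentioned above is used, and the script is assumed to be started from the Table_segmentation folder.

```python
import subprocess
import sys


def build_command(input_dir, script="run_main_tests.py"):
    """Build the argument list for one run of the segmentation script.

    Hypothetical helper: only the --INPUT_DIR argument from this README
    is passed, all other arguments keep their default values.
    """
    return [sys.executable, script, "--INPUT_DIR", str(input_dir)]


def run_segmentation(input_dir):
    """Run the script and raise CalledProcessError if it exits with an error."""
    # Assumption: the current working directory contains run_main_tests.py.
    subprocess.run(build_command(input_dir), check=True)
```

Using sys.executable keeps the run inside the currently active environment (for example table_seg_env), so the packages installed from requirements.txt are available to the script.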