

Zoning Aggregated Hypercolumns for Keyword Spotting

This work introduces Zoning Aggregated Hypercolumn features (ZAH features). Motivated by recent research in machine vision, we use an appropriately pretrained convolutional network as a feature extraction tool. The convolutional network is trained with MatConvNet on a large collection of word images. The resulting local cues are subsequently aggregated to form fixed-length, word-level descriptors.

The Euclidean distance can then be used to compare the descriptors of different word images, so that a query word image can be matched against a collection (Query-by-Example keyword spotting).
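
As a minimal sketch of the comparison step, the snippet below ranks a small database of descriptors by Euclidean distance to a query descriptor. The toy values and the variable names (database, query, ranking) are illustrative assumptions and not part of this repository; real descriptors would come from extractAggregatedHypercolumns_zoning and are assumed here to be stored one per row.

```matlab
% Toy database: 4 descriptors of length 3, one per row.
% These values are placeholders, not real ZAH descriptors.
database = [0   0 0;
            1   1 1;
            0.1 0 0.1;
            2   0 2];
query = [0 0 0];                          % descriptor of the query word image

% Euclidean distance between the query and every database row.
diffs = bsxfun(@minus, database, query);  % subtract the query from each row
distances = sqrt(sum(diffs.^2, 2));

% Smallest distance first: ranking(1) is the best match.
[sortedDist, ranking] = sort(distances, 'ascend');
disp(ranking');
```

The first entries of ranking index the word images most similar to the query.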

If you find this work useful, please read and cite the related paper:

  title={Zoning Aggregated Hypercolumns for Keyword Spotting},
  author={Sfikas, Giorgos and Retsinas, Giorgos and Gatos, Basilis},
  booktitle={15th International Conference on Frontiers in Handwriting Recognition (ICFHR)},

ZAH Workflow

The workflow is:

  1. The (normalised) image is split into zones
  2. Hypercolumn features are computed for each of the zones, using a pretrained convolutional neural network
  3. Hypercolumns are aggregated into a single feature vector per zone
  4. Per-zone features are concatenated into a single feature vector, which therefore describes the whole word image
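
The four steps above can be sketched on a toy example as follows. This is an illustration under stated assumptions, not the repository's implementation: fakeHypercolumns is a hypothetical placeholder for the pretrained CNN, the zones are assumed to be equal-width vertical strips, and plain averaging stands in for the actual aggregation rule.

```matlab
% Toy 4x6 "word image" with values in (0, 1].
img = reshape(1:24, 4, 6) / 24;
numZones = 3;                                    % hypothetical zone count

% Placeholder for the CNN: maps a zone to an H x W x C stack of
% per-pixel features (here C = 2: intensity and squared intensity).
fakeHypercolumns = @(zone) cat(3, zone, zone.^2);

% Column boundaries of the zones (assumed equal-width vertical strips).
edges = round(linspace(0, size(img, 2), numZones + 1));

descriptor = [];
for z = 1:numZones
    zone = img(:, edges(z) + 1 : edges(z + 1));  % step 1: split into zones
    hc = fakeHypercolumns(zone);                 % step 2: per-pixel features
    zoneFeature = reshape(mean(mean(hc, 1), 2), 1, []);  % step 3: aggregate
    descriptor = [descriptor, zoneFeature];      % step 4: concatenate
end
disp(numel(descriptor));                         % numZones * C values
```

The descriptor length depends only on the number of zones and the number of feature channels, not on the image size, which is why the result has a fixed length.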

The workflow is summarized in the following figure. A word image is given as input (top), and a descriptor vector is returned as output (bottom):


Before running the code

First, you will have to compile some of the code with MATLAB MEX, and optionally enable GPU support:

At the MATLAB prompt, add all repo subfolders to the path by running the following:
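
One common way to add a folder and all of its subfolders to the path is sketched below. It assumes that the current working directory is the root of this repository; repoRoot is an illustrative variable name, and the exact command intended by the authors may differ.

```matlab
% Assumption: MATLAB was started in (or has cd'ed to) the repository root.
repoRoot = pwd;

% genpath lists repoRoot and all of its subfolders; addpath puts that
% whole list on the MATLAB search path.
addpath(genpath(repoRoot));
```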

Note that you should execute addpath only after you have finished compiling the necessary items with MEX.

Running the code

In order to compute the ZAH descriptor of an input image, run

descriptor = extractAggregatedHypercolumns_zoning('img/1/1.jpg');

After the input file argument, the parameters are:

For example, the following command will extract a ZAH descriptor using only the unigram-trained CNN model, use the activations of layers 3 and 6, apply a centerprior with precision equal to 3, and resize the input to a height of 24 pixels:

descriptor = extractAggregatedHypercolumns_zoning('img/1/1.jpg', 0, [3 6], 3, 24);

Batch extraction

Multiple images can be processed with batch_extract_zoning.m. For example:


All files with extension '.jpg' that are found in the given folder will be processed.

If batchExtract_zoning is run without arguments, three files will be created, containing the result:


The file dimensions.txt contains a single integer value. That is the dimensionality of the extracted per-word descriptors. The file distance.txt contains one descriptor on each line. The file filenames.txt gives the correspondence between lines in distance.txt and filenames.
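
The three files can be read back into MATLAB roughly as follows. This is a sketch under assumptions: the descriptor values on each line are assumed to be whitespace-separated, and the first part of the snippet only writes tiny stand-in files to a temporary folder so that the example runs on its own. With real output, point outdir at the folder that holds the files and skip that part.

```matlab
% --- Stand-in files with illustrative values only -----------------------
outdir = tempname; mkdir(outdir);
fid = fopen(fullfile(outdir, 'dimensions.txt'), 'w');
fprintf(fid, '%d\n', 3); fclose(fid);
fid = fopen(fullfile(outdir, 'distance.txt'), 'w');
fprintf(fid, '%g %g %g\n', [0.1 0.2 0.3; 0.4 0.5 0.6]'); fclose(fid);
fid = fopen(fullfile(outdir, 'filenames.txt'), 'w');
fprintf(fid, '%s\n', 'a.jpg', 'b.jpg'); fclose(fid);

% --- Read the results back ----------------------------------------------
fid = fopen(fullfile(outdir, 'dimensions.txt'), 'r');
D = fscanf(fid, '%d', 1);                 % descriptor dimensionality
fclose(fid);

fid = fopen(fullfile(outdir, 'distance.txt'), 'r');
X = fscanf(fid, '%f', [D Inf])';          % one descriptor per row
fclose(fid);

fid = fopen(fullfile(outdir, 'filenames.txt'), 'r');
names = textscan(fid, '%s', 'Delimiter', '\n');
names = names{1};                         % names{i} corresponds to X(i, :)
fclose(fid);

fprintf('%d descriptors of dimension %d\n', size(X, 1), D);
```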


In the current work we make use of this third-party code/material: