Fine-grained classification with textual cues

Implementation based on our paper: "Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features"

https://arxiv.org/pdf/2001.04732.pdf

Install

Create Conda environment

$ conda env create -f environment.yml

Activate the environment

$ conda activate finegrained

Train from scratch

$ python3 train.py

(Refer to the code for the arguments available when training the model.)

Datasets

Con-Text dataset can be downloaded from: https://staff.fnwi.uva.nl/s.karaoglu/datasetWeb/Dataset.html

Drink-Bottle dataset: https://drive.google.com/file/d/10BZN5_BGg21olZA857SMvF0TPgukmVI4/view?usp=sharing

Textual Features

The results reported in the paper were obtained using the Fisher Vector of the set of PHOCs extracted from each image. To extract the PHOCs, the following two repos can be used:

https://github.com/DreadPiratePsyopus/Pytorch-yolo-phoc (PyTorch)
https://github.com/lluisgomez/single-shot-str (TensorFlow)

Finally, the Fisher Vectors computed from the obtained PHOCs are used at training/inference time.
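For readers unfamiliar with PHOCs (Pyramidal Histograms Of Characters), the sketch below shows how such a descriptor can be built from a known word. It is only an illustration of the representation: in this pipeline the PHOCs are predicted from images by the repos listed above, not computed from strings. The function name phoc, the 36-character alphabet and the pyramid levels 2 to 5 are assumptions made for the example, and the bigram bins that the original PHOC formulation also includes are omitted.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 unigrams (assumed)
LEVELS = (2, 3, 4, 5)                              # pyramid levels (assumed)


def phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Return a binary, unigram-only PHOC for `word` as a list of 0/1."""
    word = word.lower()
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            # Work in integer units of 1 / (n * level) to avoid float rounding:
            # character i spans [i * level, (i + 1) * level],
            # region r spans   [r * n,     (r + 1) * n].
            r_start, r_end = region * n, (region + 1) * n
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue  # characters outside the alphabet are ignored
                c_start, c_end = i * level, (i + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                # A character is assigned to a region when at least half
                # of its span lies inside that region.
                if 2 * overlap >= level:
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector


v = phoc("beer")
print(len(v))  # 504 = 36 * (2 + 3 + 4 + 5)
```

At level 2, the first half of "beer" sets the bins for 'b' and 'e' and the second half sets the bins for 'e' and 'r', so the descriptor encodes both which characters occur and roughly where.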

The Fisher Vector implementation was taken from: https://gist.github.com/danoneata/9927923

In the folder 'preproc' there is a script which does the following:

Create a PHOC dictionary.
Apply scaling, normalization, and PCA to the PHOC dictionary.
Train a GMM on the PHOC data.
Read each .json file of PHOC predictions in a given PHOC results path and build the Fisher Vector used to train the model.
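The last step of the list above, turning the set of PHOCs of one image into a single Fisher Vector, can be sketched as follows. This is a minimal NumPy illustration of the commonly used improved Fisher Vector encoding (gradients with respect to the GMM means and standard deviations, followed by power and L2 normalization). It is not the code in 'preproc' or in the gist linked above, which may differ in detail. The function name fisher_vector and its arguments are hypothetical, the GMM parameters are assumed to come from a diagonal-covariance GMM already fitted on the PCA-reduced PHOCs, and the numbers in the usage lines are random placeholders.

```python
import numpy as np


def fisher_vector(x, weights, means, variances):
    """Encode descriptors x (N, D) with a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D).
    Returns a vector of length 2 * K * D.
    """
    x = np.atleast_2d(np.asarray(x, dtype=np.float64))
    weights = np.asarray(weights, dtype=np.float64)
    means = np.asarray(means, dtype=np.float64)
    variances = np.asarray(variances, dtype=np.float64)
    n = x.shape[0]
    sigma = np.sqrt(variances)

    # Standardized differences to every component, shape (N, K, D).
    diff = (x[:, None, :] - means[None, :, :]) / sigma[None, :, :]

    # Log of w_k * N(x_n; mu_k, sigma_k^2), shape (N, K).
    log_prob = (
        np.log(weights)[None, :]
        - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
        - 0.5 * np.sum(diff ** 2, axis=2)
    )

    # Posteriors via a numerically stable softmax over components.
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Gradients w.r.t. means and standard deviations, each of shape (K, D).
    g_mu = np.einsum("nk,nkd->kd", gamma, diff) / (n * np.sqrt(weights)[:, None])
    g_sigma = np.einsum("nk,nkd->kd", gamma, diff ** 2 - 1.0) / (
        n * np.sqrt(2.0 * weights)[:, None]
    )

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power normalization followed by L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv


# Toy usage: 5 descriptors of dimension 4, a 2-component GMM (placeholders).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5, 4))
fv = fisher_vector(
    descriptors,
    weights=np.array([0.5, 0.5]),
    means=rng.normal(size=(2, 4)),
    variances=np.ones((2, 4)),
)
print(fv.shape)  # (16,) = 2 * K * D
```

The output length depends only on the GMM size and the descriptor dimension, not on how many PHOCs were detected, which is what allows images with different amounts of text to share one fixed-size textual feature.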

In the script, edit the path that contains the PHOC predictions and the path where the Fisher Vectors will be saved. The latter is the path that the Dataloader uses to load the textual features at training/inference time. Finally, run:

$ python2 phocs_to_FV.py

Precomputed textual features for the Drink-Bottle and Con-Text datasets used in the paper can be provided, but if you want to train/test the model with another dataset you will have to generate the textual features yourself.

Classification Results

[Image: classification results]

Reference

If you found this code useful, please cite the following paper:

@inproceedings{mafla2020fine,
  title={Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features},
  author={Mafla, Andres and Dey, Sounak and Biten, Ali Furkan and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={The IEEE Winter Conference on Applications of Computer Vision},
  pages={2950--2959},
  year={2020}
}

License

Apache License 2.0