
Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (ECCV 2022)

Official PyTorch implementation for Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features (MATRN) in ECCV 2022.

Byeonghu Na, Yoonsik Kim, and Sungrae Park

This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance.

Datasets

We use lmdb datasets for training, validation, and evaluation. The datasets can be downloaded from clova (for validation and evaluation) and ABINet (for training and evaluation).
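As a rough illustration of what reading one sample from such an lmdb dataset could look like, the sketch below assumes the key layout commonly used by the clova deep-text-recognition-benchmark tooling (`image-%09d` and `label-%09d` with 1-based indices). That layout and the helper names `sample_keys` and `read_sample` are assumptions made for illustration; check them against the data you actually downloaded.

```python
import io


def sample_keys(index):
    """Return the (image, label) keys for a 1-based sample index.

    Assumed key layout: b'image-000000001' / b'label-000000001'.
    """
    return f"image-{index:09d}".encode(), f"label-{index:09d}".encode()


def read_sample(lmdb_dir, index):
    """Read one (PIL image, label string) pair from an lmdb directory."""
    # Third-party imports stay local so sample_keys works without them.
    import lmdb
    from PIL import Image

    env = lmdb.open(lmdb_dir, readonly=True, lock=False,
                    readahead=False, meminit=False)
    try:
        with env.begin(write=False) as txn:
            image_key, label_key = sample_keys(index)
            image_bytes = txn.get(image_key)
            label_bytes = txn.get(label_key)
            if image_bytes is None or label_bytes is None:
                raise KeyError(f"sample {index} not found in {lmdb_dir}")
            # Decode the stored image bytes fully before the env is closed.
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            return image, label_bytes.decode("utf-8")
    finally:
        env.close()
```

With the data directory laid out as described below, a call such as `read_sample("data/evaluation/CUTE80", 1)` would return the first image and its label, provided the assumed key layout holds.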

Training datasets
- MJSynth (MJ)
- SynthText (ST)
Validation datasets
- The union of the training set of ICDAR2013, ICDAR2015, IIIT5K, and Street View Text
Evaluation datasets
- Regular datasets
 - IIIT5K (IIIT)
 - Street View Text (SVT)
 - ICDAR2013: IC13S with 857 images, IC13L with 1015 images
- Irregular datasets
 - ICDAR2015: IC15S with 1811 images, IC15L with 2077 images
 - Street View Text Perspective (SVTP)
 - CUTE80 (CUTE)

Tree structure of data directory

data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IC15_2077
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── validation
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
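Before training, it can help to confirm that the directory matches the tree above. The following is a minimal sketch using only the standard library; the helper name `missing_entries` is hypothetical, and the list of expected entries is copied from the tree.

```python
from pathlib import Path

# Entries taken from the data directory tree, relative to the data root.
EXPECTED_ENTRIES = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC13_1015",
    "evaluation/IC15_1811",
    "evaluation/IC15_2077",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "validation",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]


def missing_entries(data_root="data"):
    """Return the expected files or directories that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries:")
        for entry in missing:
            print(f"  data/{entry}")
    else:
        print("Data directory looks complete.")
```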

Requirements

pip install torch==1.7.1 torchvision==0.8.2 fastai==1.0.60 lmdb pillow opencv-python tensorboardX editdistance
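Since three of the packages are pinned to exact versions, a quick check of the installed environment can catch mismatches early. This is a sketch that assumes Python 3.8 or later (for `importlib.metadata`); the helper name `version_mismatches` is hypothetical.

```python
from importlib import metadata

# Versions pinned in the install command above.
PINNED = {"torch": "1.7.1", "torchvision": "0.8.2", "fastai": "1.0.60"}


def version_mismatches(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as '1.7.1+cu110' still count as a match.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in version_mismatches().items():
        print(f"{name}: expected {expected}, found {installed}")
```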

Pretrained Models

Download the pretrained MATRN model from this link. The accuracy of the pretrained model on each evaluation dataset is:

Model	IIIT	SVT	IC13S	IC13L	IC15S	IC15L	SVTP	CUTE
MATRN	96.7	94.9	97.9	95.8	86.6	82.9	90.5	94.1

If you want to train with pretrained vision and language models, download them from ABINet.

Training and Evaluation

Training

python main.py --config=configs/train_matrn.yaml

Evaluation

python main.py --config=configs/train_matrn.yaml --phase test --image_only

Additional flags:

--checkpoint /path/to/checkpoint: set the path of the model to evaluate
--test_root /path/to/dataset: set the path of the evaluation dataset
--model_eval [alignment|vision|language]: choose which sub-model to evaluate
--image_only: disable dumping visualizations of attention masks
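To evaluate one checkpoint on every benchmark in turn, the flags above can be combined in a small driver script. The sketch below is one possible way to do it, not part of the repository: the helper names are hypothetical, the dataset directory names come from the data tree, and it assumes that `--test_root` accepts a single dataset directory.

```python
import subprocess
import sys

# Directory names under data/evaluation, as listed in the data tree.
EVAL_DATASETS = [
    "IIIT5k_3000", "SVT", "IC13_857", "IC13_1015",
    "IC15_1811", "IC15_2077", "SVTP", "CUTE80",
]


def build_eval_command(checkpoint, test_root, config="configs/train_matrn.yaml"):
    """Build the evaluation command for one dataset as an argument list."""
    return [
        sys.executable, "main.py",
        f"--config={config}",
        "--phase", "test",
        "--image_only",
        "--checkpoint", checkpoint,
        "--test_root", test_root,
    ]


def evaluate_all(checkpoint, data_root="data/evaluation"):
    """Run the evaluation once per dataset, stopping at the first failure."""
    for name in EVAL_DATASETS:
        command = build_eval_command(checkpoint, f"{data_root}/{name}")
        print(" ".join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    evaluate_all(sys.argv[1])
```

Saved under a name of your choice and run from the repository root with a checkpoint path as its only argument, the script prints each command before executing it.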

Acknowledgements

This implementation is based on ABINet.

Citation

Please cite this work in your publications if it helps your research.

@inproceedings{na2022multi,
 title={Multi-modal text recognition networks: Interactive enhancements between visual and semantic features},
 author={Na, Byeonghu and Kim, Yoonsik and Park, Sungrae},
 booktitle={European Conference on Computer Vision},
 pages={446--463},
 year={2022},
 organization={Springer}
}