ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

<img src="figs/overall.png" width="100%">

This is the PyTorch implementation of the paper ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer (ICCV 2023). The paper is available at this link.

News

2024.04.09 We released a new text spotting pipeline, Bridge Text Spotting, which combines the advantages of end-to-end and two-step text spotting. Code

2023.07.21 Code is available.

Getting Started

Python 3.8 + PyTorch 1.10.0 + torchvision 0.11.0 + CUDA 11.3 + Detectron2 (v0.2.1) + OpenCV for visualization

git clone https://github.com/mxin262/ESTextSpotter.git
cd ESTextSpotter
conda create -n ESTS python=3.8 -y
conda activate ESTS
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
cd detectron2-0.2.1
python setup.py build develop
pip install opencv-python
cd ../models/ests/ops
sh make.sh
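After building, a quick sanity check such as the following can confirm that PyTorch, CUDA, Detectron2, and the compiled deformable-attention ops are importable. The `MultiScaleDeformableAttention` module name follows the Deformable-DETR/DINO convention this codebase builds on and is an assumption here, not something stated in this README:

```python
# sanity_check.py -- illustrative environment check after installation
import torch
import torchvision
import cv2
import detectron2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
print("opencv:", cv2.__version__)

# The custom CUDA ops built by make.sh are typically exposed under this module name
# (assumed from the Deformable-DETR/DINO lineage of the ops directory).
try:
    import MultiScaleDeformableAttention  # noqa: F401
    print("deformable attention ops: OK")
except ImportError as err:
    print("deformable attention ops not found:", err)
```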

Data Preparation

Please download TotalText, CTW1500, MLT, ICDAR2013, ICDAR2015, and CurvedSynText150k according to the guide provided by SPTS v2: README.md.

Please download the MLT 2019 dataset via the Images / Annotations links.

Extract all the datasets and make sure you organize them as follows:

- datasets
  | - CTW1500
  |   | - annotations
  |   | - ctwtest_text_image
  |   | - ctwtrain_text_image
  | - totaltext (or icdar2015)
  |   | - test_images
  |   | - train_images
  |   | - test.json
  |   | - train.json
  | - mlt2017 (or syntext1, syntext2)
      | - annotations
      | - images
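If it helps, a small script like the one below (not part of the repository) can check that the folders and JSON files listed above are in place before training:

```python
# check_datasets.py -- verify the dataset layout described above (illustrative)
from pathlib import Path

ROOT = Path("datasets")  # adjust to where you extracted the data

EXPECTED = {
    "CTW1500": ["annotations", "ctwtest_text_image", "ctwtrain_text_image"],
    "totaltext": ["test_images", "train_images", "test.json", "train.json"],
    "mlt2017": ["annotations", "images"],
}

for dataset, entries in EXPECTED.items():
    base = ROOT / dataset
    if not base.exists():
        print(f"[missing] {base}")
        continue
    for entry in entries:
        status = "ok" if (base / entry).exists() else "MISSING"
        print(f"[{status}] {base / entry}")
```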

Model Zoo

| Dataset | Det-P | Det-R | Det-F1 | E2E-None | E2E-Full | Weights |
|---|---|---|---|---|---|---|
| Pretrain | 90.7 | 85.3 | 87.9 | 73.8 | 85.5 | OneDrive |
| Total-Text | 91.8 | 88.2 | 90.0 | 80.9 | 87.1 | OneDrive |
| CTW1500 | 91.3 | 88.6 | 89.9 | 65.0 | 83.9 | OneDrive |

| Dataset | Det-P | Det-R | Det-F1 | E2E-S | E2E-W | E2E-G | Weights |
|---|---|---|---|---|---|---|---|
| ICDAR2015 | 95.1 | 88 | 91.4 | 88.5 | 83.1 | 78.1 | OneDrive |

| Dataset | H-mean | Weights |
|---|---|---|
| VinText | 73.6 | OneDrive |

| Dataset | Det-P | Det-R | Det-H | 1-NED | Weights |
|---|---|---|---|---|---|
| ICDAR 2019 ReCTS | 94.1 | 91.3 | 92.7 | 78.1 | OneDrive |

| Dataset | R | P | H | AP | Arabic | Latin | Chinese | Japanese | Korean | Bangla | Hindi | Weights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MLT | 75.5 | 83.37 | 79.24 | 72.52 | 52.00 | 77.34 | 48.20 | 48.42 | 63.56 | 38.26 | 50.83 | OneDrive |
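A downloaded checkpoint can be inspected with plain PyTorch before evaluation. The `"model"` key below is a common checkpoint convention and an assumption about these files, which is why the snippet falls back to the raw dictionary:

```python
import torch

ckpt = torch.load("totaltext_checkpoint.pth", map_location="cpu")
# Checkpoints often wrap the weights under a 'model' key; otherwise use the dict as-is.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} parameter tensors")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```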

Training

We use 8 GPUs for training and 2 images per GPU by default.

1. Pre-train the model:
bash scripts/Pretrain.sh /path/to/your/dataset
2. Fine-tune the model on the mixed real dataset:
bash scripts/Joint_train.sh /path/to/your/dataset
3. Fine-tune the model on the target dataset (e.g., Total-Text):
bash scripts/TT_finetune.sh /path/to/your/dataset

Evaluation

Set the task flag to 0 for text detection or 1 for text spotting.

bash scripts/test.sh config/ESTS/ESTS_5scale_tt_finetune.py /path/to/your/dataset 1 /path/to/your/checkpoint /path/to/your/test_dataset

e.g.:

bash scripts/test.sh config/ESTS/ESTS_5scale_tt_finetune.py ../datasets 1 totaltext_checkpoint.pth totaltext_val

Visualization

Visualize the detection and recognition results

python vis.py
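vis.py renders the model's outputs on the input images. As a rough idea of what that involves, the sketch below draws polygon detections and recognized strings with OpenCV; the `results` format here is assumed for illustration and is not the actual interface of vis.py:

```python
import cv2
import numpy as np

def draw_spotting_results(image_path, results, out_path="vis_result.jpg"):
    """Draw (polygon, text) pairs on an image.
    `results` is assumed to be a list of (Nx2 point list, recognized string) tuples."""
    img = cv2.imread(image_path)
    for polygon, text in results:
        pts = np.asarray(polygon, dtype=np.int32).reshape(-1, 1, 2)
        cv2.polylines(img, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
        x, y = pts[0, 0]
        cv2.putText(img, text, (int(x), int(y) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    cv2.imwrite(out_path, img)

# Example usage with dummy data:
# draw_spotting_results("demo.jpg", [([[10, 10], [120, 10], [120, 50], [10, 50]], "ESTS")])
```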

Example Results:

<img src="figs/results.png" width="100%">

Copyright

This repository can only be used for non-commercial research purposes.

For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).

Copyright 2023, Deep Learning and Vision Computing Lab, South China University of Technology.

Acknowledgement

AdelaiDet, DINO, Detectron2, TESTR

Citation

If our paper helps your research, please cite it in your publications:

@InProceedings{Huang_2023_ICCV,
    author    = {Huang, Mingxin and Zhang, Jiaxin and Peng, Dezhi and Lu, Hao and Huang, Can and Liu, Yuliang and Bai, Xiang and Jin, Lianwen},
    title     = {ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {19495-19505}
}