<h1 align="center">I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection</h1> <p align="center"> <a href="https://arxiv.org/abs/2108.01343v3"><img src="https://img.shields.io/badge/arXiv-Paper-<color>"></a> <a href="https://link.springer.com/article/10.1007/s11263-022-01616-6"><img src="https://img.shields.io/badge/publication-Paper-<color>"></a> </p> <p align="center"> <a href="#updates">Updates</a> | <a href="#introduction">Introduction</a> | <a href="#results">Results</a> | <a href="#usage">Usage</a> | <a href="#citation">Citation</a> | <a href="#acknowledgment">Acknowledgment</a> </p> This is the repo for [IJCV'22] "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection". I3CL with ViTAEv2, ResNet-50, and ResNet-50 w/ RegionCL backbones are included.

Updates

[2022/04/13] Published links to the training datasets.

[2022/04/11] Added SSL (semi-supervised learning) training code to this implementation.

[2022/04/09] Uploaded the training code for the ICDAR2019 ArT dataset.

[2021/07/05] I3CL ranked first on the ICDAR2019 ArT leaderboard.

Other applications of ViTAE Transformer: Image Classification | Object Detection | Semantic Segmentation | Animal Pose Estimation | Matting | Remote Sensing

Introduction

Existing methods for arbitrary-shaped text detection in natural scenes face two critical issues, i.e., 1) fracture detections at the gaps in a text instance; and 2) inaccurate detections of arbitrary-shaped text instances with diverse background context. To address these issues, we propose a novel method named Intra- and Inter-Instance Collaborative Learning (I3CL). Specifically, to address the first issue, we design an effective convolutional module with multiple receptive fields, which is able to collaboratively learn better character and gap feature representations at local and long ranges inside a text instance. To address the second issue, we devise an instance-based transformer module to exploit the dependencies between different text instances and a global context module to exploit the semantic context from the shared background, which are able to collaboratively learn more discriminative text feature representation. In this way, I3CL can effectively exploit the intra- and inter-instance dependencies together in a unified end-to-end trainable framework. Besides, to make full use of the unlabeled data, we design an effective semi-supervised learning method to leverage the pseudo labels via an ensemble strategy. Without bells and whistles, experimental results show that the proposed I3CL sets new state-of-the-art results on three challenging public benchmarks, i.e., an F-measure of 77.5% on ArT, 86.9% on Total-Text, and 86.4% on CTW-1500. Notably, our I3CL with the ResNeSt-101 backbone ranked the 1st place on the ArT leaderboard.


Results

Example detection results from the paper.


Evaluation results of I3CL with different backbones on ArT. Note that: (1) in this repo, I3CL with the ViTAE backbone adopts only one training stage, using the LSVT+MLT19+ArT training datasets. The ResNet series adopt three training stages, i.e., pre-training on SynthText, mix-training on ReCTS+RCTW+LSVT+MLT19+ArT, and finally fine-tuning on LSVT+MLT19+ArT. (2) The original implementation of the ResNet series is based on Detectron2. The results and model links for ResNet-50 will be updated soon in this implementation.

| Backbone | Model Link | Training Data | Recall | Precision | F-measure |
|:---|:---|:---|:---:|:---:|:---:|
| ViTAEv2-S [this repo] | OneDrive / Baidu Netdisk (pw: w754) | LSVT, MLT19, ArT | 75.4 | 82.8 | 78.9 |
| ResNet-50 [paper] | - | SynthText 800K; ReCTS, RCTW, LSVT, MLT19, ArT | 71.3 | 82.7 | 76.6 |
| ResNet-50 [this repo] | OneDrive / Baidu Netdisk (pw: acy0) | SynthText 150K; ReCTS, RCTW, LSVT, MLT19, ArT | 73.7 | 81.2 | 77.3 |
| ResNet-50 w/ RegionCL (finetuning) [paper] | - | SynthText 800K; ReCTS, RCTW, LSVT, MLT19, ArT | 72.6 | 81.9 | 77.0 |
| ResNet-50 w/ RegionCL (finetuning) [this repo] | OneDrive / Baidu Netdisk (pw: k13v) | SynthText 150K; ReCTS, RCTW, LSVT, MLT19, ArT | 75.4 | 80.6 | 77.9 |
| ResNet-50 w/ RegionCL (w/o finetuning) [paper] | - | SynthText 800K; ReCTS, RCTW, LSVT, MLT19, ArT | 73.5 | 81.6 | 77.3 |
| ResNet-50 w/ RegionCL (w/o finetuning) [this repo] | OneDrive / Baidu Netdisk (pw: 7k84) | SynthText 150K; ReCTS, RCTW, LSVT, MLT19, ArT | 75.1 | 80.6 | 77.8 |

Usage

Install

Prerequisites:

  1. Create a conda virtual environment and activate it. Note that this implementation is based on mmdetection v2.20.0.
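
  For example (a minimal sketch; the environment name `i3cl` and Python 3.7 are illustrative choices, not requirements of this repo):

    conda create -n i3cl python=3.7 -y
    conda activate i3cl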

  2. Install PyTorch and torchvision following the official instructions.
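
  For example, to match the mmcv-full wheel for CUDA 11.1 / PyTorch 1.9.0 used in the next step (a sketch; pick the build matching your CUDA version from the official PyTorch instructions):

    pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html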

  3. Install mmcv-full and timm. Please refer to mmcv to install the proper version. For example:

    pip install mmcv-full==1.4.3 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
    pip install timm
    
  4. Clone this repository and then install it:

    git clone https://github.com/ViTAE-Transformer/ViTAE-Transformer-Scene-Text-Detection.git
    cd ViTAE-Transformer-Scene-Text-Detection
    pip install -r requirements/build.txt
    pip install -r requirements/runtime.txt
    pip install -v -e .
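
  As a quick sanity check that the editable install succeeded, the following should print the installed mmdet version (a minimal sketch):

    python -c "import mmdet; print(mmdet.__version__)"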
    

Preparation

Model:

Data:

Training

To train I3CL with the ViTAEv2-S backbone (single-stage training on LSVT+MLT19+ArT), run:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_vitae_fpn/i3cl_vitae_fpn_ms_train.py --launcher pytorch --work-dir ./out_dir/${your_dir}

For the ResNet-50 backbone, training proceeds in three stages.

Stage 1 (pre-train on SynthText):

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_fpn/i3cl_r50_fpn_ms_pretrain.py --launcher pytorch --work-dir ./out_dir/art_r50_pretrain/

Stage 2 (mix-train on ReCTS+RCTW+LSVT+MLT19+ArT):

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_fpn/i3cl_r50_fpn_ms_mixtrain.py --launcher pytorch --work-dir ./out_dir/art_r50_mixtrain/

Stage 3 (finetune on LSVT+MLT19+ArT):

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_fpn/i3cl_r50_fpn_ms_finetune.py --launcher pytorch --work-dir ./out_dir/art_r50_finetune/

For the ResNet-50 w/ RegionCL backbone, the same three stages apply.

Stage 1 (pre-train on SynthText):

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_regioncl_fpn/i3cl_r50_fpn_ms_pretrain.py --launcher pytorch --work-dir ./out_dir/art_r50_regioncl_pretrain/

Stage 2 (mix-train on ReCTS+RCTW+LSVT+MLT19+ArT):

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_regioncl_fpn/i3cl_r50_fpn_ms_mixtrain.py --launcher pytorch --work-dir ./out_dir/art_r50_regioncl_mixtrain/

Stage 3 (finetune on LSVT+MLT19+ArT):

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29500 tools/train.py \
configs/i3cl_r50_regioncl_fpn/i3cl_r50_fpn_ms_finetune.py --launcher pytorch --work-dir ./out_dir/art_r50_regioncl_finetune/
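
If only a single GPU is available, mmdetection's training script can also be launched without the distributed wrapper (a sketch, using the ViTAE config as an example):

python tools/train.py configs/i3cl_vitae_fpn/i3cl_vitae_fpn_ms_train.py --work-dir ./out_dir/${your_dir}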

Note:

Inference

For example, to use our trained I3CL model to obtain inference results on the ICDAR2019 ArT test set, together with visualization images, txt-format records, and the JSON file for test submission, run:

python demo/art_demo.py --checkpoint pretrained_model/I3CL/vitae_epoch_12.pth --score-thr 0.45 --json_file art_submission.json
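
The same script should work with the other released checkpoints by swapping the --checkpoint path; for example (a sketch; the ResNet-50 checkpoint filename below is hypothetical and should match the file downloaded from the table above):

python demo/art_demo.py --checkpoint pretrained_model/I3CL/r50_epoch_12.pth --score-thr 0.45 --json_file art_submission.json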

Note:

Citation

This project is for research purposes only.

If you find I3CL useful in your research, please consider citing:

@article{du2022i3cl,
  title={I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection},
  author={Du, Bo and Ye, Jian and Zhang, Jing and Liu, Juhua and Tao, Dacheng},
  journal={International Journal of Computer Vision},
  volume={130},
  number={8},
  pages={1961--1977},
  year={2022},
  publisher={Springer}
}

Acknowledgment

Thanks to mmdetection, on which this implementation is built.