<div align="center">

# Revisiting Scene Text Recognition: A Data Perspective

<img src='https://github.com/open-mmlab/mmocr/assets/65173622/4544c4ff-0f30-46b2-ae04-7bd795694df4' width=600>

<p>Union14M is a large scene text recognition (STR) dataset collected from 17 publicly available datasets. It contains 4M labeled images (Union14M-L) and 10M unlabeled images (Union14M-U), and is intended to enable a more profound analysis for the STR community.</p>

</div>

<p align="center">
<strong><a href="#1-introduction">Introduction</a></strong> •
<strong><a href="#34-download">Download</a></strong> •
<strong><a href="#5-maerec">MAERec</a></strong>
</p>

## What's New
- 2023/7/20: Both MAERec and Union14M will soon be migrated to the official MMOCR repo.
- 2023/7/19: We posted a Zhihu blog for this work: ICCV2023 | A New Take on Scene Text Recognition: Revisiting Scene Text from a Data Perspective.
- 2023/7/15: Our work is accepted by ICCV 2023 🥳
## 1. Introduction
- Scene Text Recognition (STR) is a fundamental task in computer vision that aims to recognize text in natural images. STR has developed rapidly in recent years, and recent state-of-the-art methods show a trend of accuracy saturation on the six commonly used benchmarks (IC13, IC15, SVT, IIIT5K, SVTP, CUTE80). This is a promising result, but it also raises a question: are we done with STR? Or is it simply that the current benchmarks lack the challenges that would expose the drawbacks of existing methods in real-world scenarios?
- To explore the challenges that STR models still face, we consolidate a large-scale STR dataset for analysis and identify seven open challenges. Furthermore, we propose a challenge-driven benchmark to facilitate the future development of STR. Additionally, we reveal that utilizing massive unlabeled data through self-supervised pre-training can remarkably enhance the performance of STR models in real-world scenarios, suggesting a practical solution for STR from a data perspective. We hope this work can spark future research beyond the realm of existing data paradigms.
## 2. Contents
- Revisiting Scene Text Recognition: A Data Perspective
## 3. Union14M Dataset
### 3.1. Union14M-L
- Union14M-L contains 4M images collected from 14 publicly available datasets. See Source Datasets for details on the 14 datasets. We adopt several strategies to refine the naive concatenation of the 14 datasets, including:
  - Cropping: We crop each text instance with its minimal axis-aligned bounding box (see the sketch after this list).
  - De-duplication: Some datasets contain duplicate images; we remove them.
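Below is a minimal sketch of the cropping step, assuming polygon annotations given as a flat `[x1, y1, x2, y2, ...]` list (the exact annotation format varies across the 14 source datasets):

```python
# Minimal sketch of the cropping strategy: take the smallest axis-aligned
# box enclosing the annotated polygon and crop the image to it.
import numpy as np

def crop_min_aabb(image: np.ndarray, polygon: list[float]) -> np.ndarray:
    pts = np.asarray(polygon, dtype=np.int64).reshape(-1, 2)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    # Clamp to the image so malformed annotations cannot crash the crop.
    h, w = image.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w), min(y1, h)
    return image[y0:y1, x0:x1]
```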
- We also categorize the images in Union14M-L into five difficulty levels using an error-voting method, sketched below.
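The idea behind error voting: the more baseline recognizers misread an image, the harder it is rated. The following is a schematic only; the level names come from the annotation files, but the exact voter pool and vote-to-level mapping here are illustrative assumptions:

```python
# Schematic of error voting: a sample's difficulty is the number of
# baseline recognizers that get it wrong. `voters` are placeholders for
# trained STR models, each mapping an image to a predicted string.
from typing import Callable

LEVELS = ["easy", "normal", "medium", "hard", "challenging"]

def difficulty(image, gt: str, voters: list[Callable]) -> str:
    errors = sum(voter(image).lower() != gt.lower() for voter in voters)
    # 0 errors -> easy, ..., (almost) all voters wrong -> challenging.
    return LEVELS[min(errors, len(LEVELS) - 1)]
```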
### 3.2. Union14M-U
- The most direct way to improve STR performance in real-world scenarios is to train on more data. However, labeling text images is both costly and time-intensive, given that it involves annotating sequences and requires specialized language expertise. It is therefore desirable to investigate the potential of utilizing unlabeled data via self-supervised learning for STR. To this end, we collect 10M unlabeled images from 3 large datasets using an IoU voting method, sketched below.
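A schematic of the IoU-voting idea: keep a candidate text region only when detections from the other detectors agree with it. Boxes are `(x0, y0, x1, y1)`; the 0.5 threshold and the "every other detector must agree" rule are illustrative assumptions, not the exact recipe:

```python
# Keep a candidate box only if each other detector produced at least one
# box overlapping it with IoU above a threshold.
def iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def vote(candidates, other_detector_boxes, thr=0.5):
    kept = []
    for box in candidates:
        if all(any(iou(box, other) >= thr for other in boxes)
               for boxes in other_detector_boxes):
            kept.append(box)
    return kept
```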
### 3.3. Union14M-Benchmark
- We raise seven open challenges for STR in real-world scenarios and propose a challenge-driven benchmark to facilitate future development.
### 3.4. Download
| Datasets | One Drive | Baidu Yun |
| --- | --- | --- |
| Union14M-L & Union14M-Benchmark (12GB) | One Drive | Baidu Yun |
| Union14M-U (36.63GB) | One Drive | Baidu Yun |
| 6 Common Benchmarks (17.6MB) | One Drive | Baidu Yun |
- The structure of Union14M is organized as follows:
<details close>
<summary><strong>Structure of Union14M-L & Union14M-Benchmark</strong></summary>

```text
|--Union14M-L
|  |--full_images
|  |  |--art_curve                     # Images collected from the 14 datasets
|  |  |--art_scene
|  |  |--COCOTextV2
|  |  |--...
|  |--train_annos
|  |  |--mmocr-0.x                     # annotation in mmocr0.x format
|  |  |  |--train_challenging.jsonl    # challenging subset
|  |  |  |--train_easy.jsonl           # easy subset
|  |  |  |--train_hard.jsonl           # hard subset
|  |  |  |--train_medium.jsonl         # medium subset
|  |  |  |--train_normal.jsonl         # normal subset
|  |  |  |--val_annos.jsonl            # validation subset
|  |  |--mmocr1.0.x                    # annotation in mmocr1.0 format
|  |  |  |--...
|--Union14M-Benchmarks
|  |--artistic
|  |  |--imgs
|  |  |--annotation.json               # annotation in mmocr1.0 format
|  |  |--annotation.jsonl              # annotation in mmocr0.x format
|  |--...
```

</details>

<details close>
<summary><strong>Structure of Union14M-U</strong></summary>

We store images in LMDB format; the structure of Union14M-U is organized as below. A minimal reading snippet follows the structure blocks.

```text
|--Union14M-U
|  |--book32_lmdb
|  |--cc_lmdb
|  |--openvino_lmdb
```

</details>
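A minimal sketch for reading samples out of one of the LMDB folders. The key layout (`num-samples` plus `image-%09d` entries) is an assumption borrowed from the convention common in STR codebases; verify it against the actual files before relying on it:

```python
import io
import lmdb                     # pip install lmdb
from PIL import Image           # pip install Pillow

env = lmdb.open("Union14M-U/book32_lmdb", readonly=True, lock=False)
with env.begin() as txn:
    n = int(txn.get(b"num-samples"))            # total image count
    img_bytes = txn.get(b"image-%09d" % 1)      # first sample
    img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    print(n, img.size)
```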
## 4. STR Models trained on Union14M-L
- We train several STR models on Union14M-L using MMOCR-1.0.
### 4.1. Checkpoints
- Evaluated on both the six common benchmarks and Union14M-Benchmark. Accuracy (WAICS: word accuracy ignoring case and symbols; a sketch of this metric follows the table) in $\color{grey}{grey}$ is the original implementation (trained on synthetic datasets), and accuracy in $\color{green}{green}$ is obtained by training on Union14M-L. All the re-trained models are trained to predict upper- and lower-case text, symbols, and space.
| Models | Checkpoint | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CUTE80 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASTER | Google Drive / BaiduYun / OneDrive | $\color{grey}{93.57}$ \ $\color{green}{94.37}$ | $\color{grey}{89.49}$ \ $\color{green}{89.03}$ | $\color{grey}{92.81}$ \ $\color{green}{93.60}$ | $\color{grey}{76.65}$ \ $\color{green}{78.57}$ | $\color{grey}{80.62}$ \ $\color{green}{80.93}$ | $\color{grey}{85.07}$ \ $\color{green}{90.97}$ | $\color{grey}{86.37}$ \ $\color{green}{88.07}$ |
| ABINet | Google Drive / BaiduYun / OneDrive | $\color{grey}{95.23}$ \ $\color{green}{97.30}$ | $\color{grey}{90.57}$ \ $\color{green}{96.45}$ | $\color{grey}{93.69}$ \ $\color{green}{95.52}$ | $\color{grey}{78.86}$ \ $\color{green}{85.36}$ | $\color{grey}{84.03}$ \ $\color{green}{89.77}$ | $\color{grey}{84.37}$ \ $\color{green}{94.79}$ | $\color{grey}{87.79}$ \ $\color{green}{93.20}$ |
| NRTR | Google Drive / BaiduYun / OneDrive | $\color{grey}{91.50}$ \ $\color{green}{96.73}$ | $\color{grey}{88.25}$ \ $\color{green}{93.20}$ | $\color{grey}{93.69}$ \ $\color{green}{95.57}$ | $\color{grey}{72.32}$ \ $\color{green}{80.74}$ | $\color{grey}{77.83}$ \ $\color{green}{83.57}$ | $\color{grey}{75.00}$ \ $\color{green}{92.01}$ | $\color{grey}{83.09}$ \ $\color{green}{90.30}$ |
| SATRN | Google Drive / BaiduYun / OneDrive | $\color{grey}{96.00}$ \ $\color{green}{97.27}$ | $\color{grey}{91.96}$ \ $\color{green}{95.36}$ | $\color{grey}{96.06}$ \ $\color{green}{96.85}$ | $\color{grey}{80.31}$ \ $\color{green}{87.14}$ | $\color{grey}{88.37}$ \ $\color{green}{90.39}$ | $\color{grey}{89.93}$ \ $\color{green}{96.18}$ | $\color{grey}{90.43}$ \ $\color{green}{93.89}$ |
| SAR | Google Drive / BaiduYun / OneDrive | $\color{grey}{95.33}$ \ $\color{green}{97.07}$ | $\color{grey}{88.41}$ \ $\color{green}{93.66}$ | $\color{grey}{93.69}$ \ $\color{green}{95.76}$ | $\color{grey}{76.02}$ \ $\color{green}{82.19}$ | $\color{grey}{83.26}$ \ $\color{green}{86.98}$ | $\color{grey}{90.28}$ \ $\color{green}{92.01}$ | $\color{grey}{87.83}$ \ $\color{green}{91.27}$ |
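A minimal sketch of the WAICS metric used above: strip everything except letters and digits, lowercase, then compare exactly. The precise symbol set filtered by the official evaluation may differ slightly:

```python
import re

def waics(preds: list[str], gts: list[str]) -> float:
    norm = lambda s: re.sub(r"[^0-9a-zA-Z]", "", s).lower()
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, gts))
    return hits / max(len(gts), 1)

print(waics(["Hello!", "wor1d"], ["hello", "WORLD"]))  # 0.5
```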
## 5. MAERec
- MAERec is a scene text recognition model composed of a ViT backbone and an autoregressive Transformer decoder. It shows outstanding performance in scene text recognition, especially when the backbone is pre-trained on Union14M-U via MAE. A sketch of its decoding loop follows the figure.

<div align=center> <img src='https://github.com/open-mmlab/mmocr/assets/65173622/f0ed4487-8064-452e-8657-b12e6f90792d' width=400 > </div>
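A minimal sketch of the autoregressive decoding loop at inference time: the ViT encodes the image once, then the decoder predicts one character per step, conditioned on the tokens emitted so far. `vit`, `decoder`, and the special token ids are placeholders, not the actual MMOCR API:

```python
import torch

def greedy_decode(vit, decoder, image, bos_id=0, eos_id=1, max_len=30):
    memory = vit(image)                          # (1, num_patches, dim)
    tokens = torch.tensor([[bos_id]])            # start symbol
    for _ in range(max_len):
        logits = decoder(tokens, memory)         # (1, t, vocab_size)
        next_id = logits[0, -1].argmax().item()  # greedy pick
        if next_id == eos_id:
            break
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
    return tokens[0, 1:]                         # drop BOS
```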
- Results of MAERec on six common benchmarks and Union14M-Benchmarks:
<div align=center> <img src='https://github.com/open-mmlab/mmocr/assets/65173622/e697ebe0-22e5-4998-ac0d-38e543c7b400' width=800 > </div>
- Predictions of MAERec on some challenging examples:
<div align=center> <img src='https://github.com/open-mmlab/mmocr/assets/65173622/465f45b4-5f4e-4b08-962d-67f094f674e2' width=800 > </div>
### 5.1. Pre-training
- ViT pre-trained on Union14M-U (a quick sanity check on the ViT-S row is sketched below):

| Variants | Input Size | Patch Size | Embedding | Depth | Heads | Parameters | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-S | 32x128 | 4x4 | 384 | 12 | 6 | 21M | GoogleDrive / BaiduYun / OneDrive |
| ViT-B | 32x128 | 4x4 | 768 | 12 | 12 | 85M | GoogleDrive / BaiduYun / OneDrive |
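A quick sanity check on the ViT-S row, using timm's generic `VisionTransformer` rather than the exact MAERec backbone: a 32x128 input with 4x4 patches gives (32/4) * (128/4) = 256 tokens, and the 384-dim, 12-layer, 6-head configuration lands near the 21M parameters listed:

```python
from timm.models.vision_transformer import VisionTransformer  # pip install timm

vit_s = VisionTransformer(img_size=(32, 128), patch_size=4,
                          embed_dim=384, depth=12, num_heads=6,
                          num_classes=0)  # no classification head
print(vit_s.patch_embed.num_patches)                     # 256
print(sum(p.numel() for p in vit_s.parameters()) / 1e6)  # roughly 21M
```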
- If you want to pre-train the ViT backbone on your own dataset, check pre-training.
### 5.2. Fine-tuning
- MAERec fine-tuned on Union14M-L:

| Variants | Acc on Common Benchmarks | Acc on Union14M-Benchmarks | Download |
| --- | --- | --- | --- |
| MAERec-S | 95.1 | 78.6 | GoogleDrive / BaiduYun / OneDrive |
| MAERec-B | 96.2 | 85.2 | GoogleDrive / BaiduYun / OneDrive |
- If you want to fine-tune MAERec on your own dataset, check fine-tuning.
### 5.3. Evaluation
- If you want to evaluate MAERec on benchmarks, check evaluation
### 5.4. Inference
- If you want to run inference with MAERec on your own images, check inferencing.
### 5.5. Demo
- We also provide a Gradio app for MAERec, which can be used to run inference on your own images. You can run it locally or play with it on 🤗HuggingFace Spaces.
- To run it locally:
  - Install Gradio and download the pre-trained weights:

    ```bash
    pip install gradio
    wget https://download.openmmlab.com/mmocr/textdet/dbnetpp/dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015/dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015_20221101_124139-4ecb39ac.pth -O dbnetpp.pth
    wget https://github.com/Mountchicken/Union14M/releases/download/Checkpoint/maerec_b_union14m.pth -O maerec_b.pth
    ```
  - Run the Gradio app:

    ```bash
    python tools/gradio_app.py \
      --rec_config mmocr-dev-1.x/configs/textrecog/maerec/maerec_b_union14m.py \
      --rec_weight ${PATH_TO_MAEREC_B} \
      --det_config mmocr-dev-1.x/configs/textdet/dbnetpp/dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015.py \
      --det_weight ${PATH_TO_DBNETPP}
    ```
## 6. License
- This repository is released under the MIT license.
## 7. Acknowledgment
- We sincerely thank the creators of the 17 datasets used in Union14M, as well as the developers of MMOCR.
## 8. Citation

```bibtex
@inproceedings{jiang2023revisiting,
  title={Revisiting Scene Text Recognition: A Data Perspective},
  author={Qing Jiang and Jiapeng Wang and Dezhi Peng and Chongyu Liu and Lianwen Jin},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
```