Awesome Scene Text Recognition Recommendations

<h1 align="center"> <br> <img src="img/head.JPG" > </h1> <h4 align="center">Everything about Scene Text Recognition</h4> <p align="center"> <strong><a href="#sota">SOTA </a></strong> • <strong><a href="./papers.md">Papers </a></strong> • <strong><a href="./datasets.md">Datasets </a></strong> • <strong><a href="#code">Code </a></strong>• <strong><a href="Framework/main.md">Our Framework </a></strong> </p>



Check Out Our New Work!

Revisiting Scene Text Recognition: A Data Perspective

<h1 align="center"> <br> <img src="img/ours.png" width=800> </h1>

1. Papers

All Papers Can Be Found Here

<details open> <summary><strong>up to (2023-11-29)</strong></summary> </details> <details open> <summary><strong>up to (2023-8-11)</strong></summary> </details> <details open> <summary><strong>up to (2023-7-25)</strong></summary> </details> <details open> <summary><strong>up to (2023-7-20)</strong></summary> </details> <details open> <summary><strong>up to (2023-6-1)</strong></summary> </details> <details open> <summary><strong>up to (2023-5-16)</strong></summary> </details> <details close> <summary><strong>up to (2023-3-16)</strong></summary> </details> <details close> <summary><strong>up to (2022-12-29)</strong></summary> </details> <details close> <summary><strong>up to (2022-11-1)</strong></summary> </details> <details close> <summary><strong>up to (2022-9-20)</strong></summary> </details> <details close> <summary><strong>up to (2022-8-9)</strong></summary> </details> <details close> <summary><strong>up to (2022-7-24)</strong></summary> </details> <details close> <summary><strong>up to (2022-7-9)</strong></summary> </details> <details close> <summary><strong>up to (2022-5-12)</strong></summary> </details> <h2 id='datasets'>2. Datasets</h2>

All Datasets Can Be Found Here

2.1 Synthetic Training Datasets

| Dataset | Description | Examples | BaiduNetdisk link |
|---|---|---|---|
| SynthText | 9 million synthetic text instance images from a set of 90k common English words. Words are rendered onto natural images with random transformations | SynthText | Scene text datasets (code: emco) |
| MJSynth | 6 million synthetic text instances, generated in a manner similar to SynthText | MJText | Scene text datasets (code: emco) |

2.2 Benchmarks

| Dataset | Description | Examples | BaiduNetdisk link |
|---|---|---|---|
| IIIT5K-Words (IIIT5K) | 3000 test image instances, taken from street scenes and from originally digital images | IIIT5K | Scene text datasets (code: emco) |
| Street View Text (SVT) | 647 test image instances. Some images are severely corrupted by noise, blur, and low resolution | SVT | Scene text datasets (code: emco) |
| Street View Text-Perspective (SVT-P) | 639 test image instances, specifically designed to evaluate perspective-distorted text recognition. It is built from the original SVT dataset by selecting images at the same address on Google Street View but with different view angles; therefore, most text instances are heavily distorted by the non-frontal view angle | SVTP | Scene text datasets (code: emco) |
| ICDAR 2003 (IC03) | 867 test image instances | IC03 | Scene text datasets (code: mfir) |
| ICDAR 2013 (IC13) | 1015 test image instances | IC13 | Scene text datasets (code: emco) |
| ICDAR 2015 (IC15) | 2077 test image instances. As the text images were taken by Google Glass without ensuring image quality, most of the text is very small, blurred, and multi-oriented | IC15 | Scene text datasets (code: emco) |
| CUTE80 (CUTE) | 288 test image instances, focusing on curved text recognition. Most images in CUTE have a complex background, perspective distortion, and poor resolution | CUTE | Scene text datasets (code: emco) |

2.3 Other Real Datasets

| Dataset | Description | BaiduNetdisk link |
|---|---|---|
| COCO-Text | 39K instances created from the MS COCO dataset. Since MS COCO was not intended to capture text, COCO-Text contains many occluded or low-resolution texts | Others (code: DLVC) |
| RCTW | 8186 instances in English. RCTW was created for the Reading Chinese Text in the Wild competition; only the English instances are selected | Others (code: DLVC) |
| Uber-Text | 92K instances collected from Bing Maps Streetside. Many are house numbers, and some are text on signboards | Others (code: DLVC) |
| ArT | 29K instances. ArT was created to recognize arbitrary-shaped text; many are perspective or curved texts. It also includes Total-Text and CTW1500, which contain many rotated or curved texts | Others (code: DLVC) |
| LSVT | 34K instances in English. LSVT is a Large-scale Street View Text dataset collected from streets in China; only the English instances are selected | Others (code: DLVC) |
| MLT19 | 46K instances in English. MLT19 was created to recognize multi-lingual text in seven languages: Arabic, Latin, Chinese, Japanese, Korean, Bangla, and Hindi; only the English instances are selected | Others (code: DLVC) |
| ReCTS | 23K instances in English. ReCTS was created for the Reading Chinese Text on Signboard competition; it contains many irregular texts arranged in various layouts or written with unique fonts. Only the English instances are selected | Others (code: DLVC) |
<h2 id='code'>3. Public Code</h2>

3.1 Frameworks

PaddleOCR (百度)


MMOCR (OpenMMLab)


Deep Text Recognition Benchmark (ClovaAI)


DAVAR-Lab-OCR (海康威视)


3.2 Algorithms

CRNN


ASTER


MORANv2
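
CRNN, listed above, is the canonical CTC-based recognizer; at inference its per-timestep class scores are usually turned into a string with best-path (greedy) CTC decoding: take the argmax at each timestep, collapse consecutive repeats, then drop blanks. A minimal plain-Python sketch (the toy charset and blank index are illustrative assumptions; real implementations run on framework tensors and beam search is also common):

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Best-path CTC decoding: argmax each timestep, collapse
    consecutive repeats, then remove blank symbols."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in logits]
    chars, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            chars.append(charset[idx - 1])  # class i > 0 maps to charset[i-1]
        prev = idx
    return "".join(chars)

# Toy example: 3 classes (blank, 'a', 'b') over 4 timesteps
scores = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.7, 0.2],    # 'a' again -> collapsed with the previous step
    [0.9, 0.05, 0.05],  # blank separates repeated characters
    [0.1, 0.2, 0.7],    # 'b'
]
print(ctc_greedy_decode(scores, "ab"))  # -> "ab"
```

Note that a blank between two identical argmax classes is what allows genuinely repeated letters (e.g. "ll") to survive the collapse step.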


<h2 id='sota'>4. SOTAs</h2>

All models are evaluated in a lexicon-free manner, i.e., a prediction counts as correct only if it exactly matches the ground truth, without snapping to a candidate word list.
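
The word-accuracy numbers below follow this exact-match protocol. A minimal sketch of the metric, assuming the common (but not universal) convention of lowercasing and stripping non-alphanumeric characters before comparison:

```python
def word_accuracy(preds, gts):
    """Lexicon-free word accuracy (%): exact string match after
    lowercasing and dropping non-alphanumeric characters."""
    def norm(s):
        return "".join(c for c in s.lower() if c.isalnum())
    correct = sum(norm(p) == norm(g) for p, g in zip(preds, gts))
    return 100.0 * correct / len(gts)

print(word_accuracy(["Hello", "w0rld"], ["hello", "world"]))  # -> 50.0
```

Because papers differ on case sensitivity and punctuation handling, reported numbers are only strictly comparable when the normalization matches.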

<table border="0" cellpadding="0" cellspacing="0" width="840" style="border-collapse: collapse;table-layout:fixed;width:629pt"> <colgroup><col width="95" style="mso-width-source:userset;mso-width-alt:3384;width:71pt"> <col width="64" span="2" style="width:48pt"> <col width="80" style="mso-width-source:userset;mso-width-alt:2844;width:60pt"> <col width="74" style="mso-width-source:userset;mso-width-alt:2616;width:55pt"> <col width="82" style="mso-width-source:userset;mso-width-alt:2929;width:62pt"> <col width="83" style="mso-width-source:userset;mso-width-alt:2958;width:62pt"> <col width="82" style="mso-width-source:userset;mso-width-alt:2901;width:61pt"> <col width="77" style="mso-width-source:userset;mso-width-alt:2730;width:58pt"> <col width="75" style="mso-width-source:userset;mso-width-alt:2673;width:56pt"> <col width="64" style="width:48pt"> </colgroup><tbody><tr height="21" style="height:15.6pt"> <td height="21" width="95" style="height:15.6pt;width:71pt"></td> <td width="64" style="width:48pt"></td> <td colspan="4" class="xl66" width="300" style="width:225pt">Regular Dataset</td> <td colspan="4" class="xl66" width="317" style="width:237pt">Irregular<span style="mso-spacerun:yes">&nbsp; </span>dataset</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt">Model</td> <td>Year</td> <td class="xl65">IIIT</td> <td class="xl65">SVT</td> <td class="xl65">IC13(857)</td> <td class="xl65">IC13(1015)</td> <td class="xl65">IC15(1811)</td> <td class="xl65">IC15(2077)</td> <td class="xl65">SVTP</td> <td class="xl65">CUTE</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://ieeexplore.ieee.org/abstract/document/7801919">CRNN</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2015</td> <td class="xl65">78.2</td> <td class="xl65">80.8</td> <td class="xl65">-</td> <td class="xl65">86.7</td> <td class="xl65">-</td> <td class="xl65">-</td> <td class="xl65">-</td> <td 
class="xl65">-</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://ieeexplore.ieee.org/abstract/document/8395027">ASTER(L2R)</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2018</td> <td class="xl65">92.67</td> <td class="xl65">91.16</td> <td class="xl65">-</td> <td class="xl65">90.74</td> <td class="xl65">76.1</td> <td class="xl65">-</td> <td class="xl65">78.76</td> <td class="xl65">76.39</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content_ICCV_2019/html/Baek_What_Is_Wrong_With_Scene_Text_Recognition_Model_Comparisons_Dataset_ICCV_2019_paper.html">CombBest</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2019</td> <td class="xl65">87.9</td> <td class="xl65">87.5</td> <td class="xl65">93.6</td> <td class="xl65">92.3</td> <td class="xl65">77.6</td> <td class="xl65">71.8</td> <td class="xl65">79.2</td> <td class="xl65">74</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content_CVPR_2019/html/Zhan_ESIR_End-To-End_Scene_Text_Recognition_via_Iterative_Image_Rectification_CVPR_2019_paper.html">ESIR</a></td> <td align="right">2019</td> <td class="xl65">93.3</td> <td class="xl65">90.2</td> <td class="xl65">-</td> <td class="xl65">91.3</td> <td class="xl65">-</td> <td class="xl65">76.9</td> <td class="xl65">79.6</td> <td class="xl65">83.3</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content_CVPR_2020/html/Qiao_SEED_Semantics_Enhanced_Encoder-Decoder_Framework_for_Scene_Text_Recognition_CVPR_2020_paper.html">SE-ASTER</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2020</td> <td class="xl65">93.8</td> <td class="xl65">89.6</td> <td class="xl65">-</td> <td class="xl65">92.8</td> <td class="xl65">80</td> <td
class="xl65"></td> <td class="xl65">81.4</td> <td class="xl65">83.6</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://ojs.aaai.org/index.php/AAAI/article/view/6903">DAN</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2020</td> <td class="xl65">94.3</td> <td class="xl65">89.2</td> <td class="xl65">-</td> <td class="xl65">93.9</td> <td class="xl65">-</td> <td class="xl65">74.5</td> <td class="xl65">80</td> <td class="xl65">84.4</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://link.springer.com/chapter/10.1007/978-3-030-58529-7_9">RobustScanner</a><span style="display:none"> </span></td> <td align="right">2020</td> <td class="xl65">95.3</td> <td class="xl65">88.1</td> <td class="xl65">-</td> <td class="xl65">94.8</td> <td class="xl65">-</td> <td class="xl65">77.1</td> <td class="xl65">79.5</td> <td class="xl65">90.3</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://link.springer.com/content/pdf/10.1007/978-3-030-58586-0_44.pdf">AutoSTR</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2020</td> <td class="xl65">94.7</td> <td class="xl65">90.9</td> <td class="xl65">-</td> <td class="xl65">94.2</td> <td class="xl65">81.8</td> <td class="xl65">-</td> <td class="xl65">81.7</td> <td class="xl65">-</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://www.sciencedirect.com/science/article/abs/pii/S0925231220311176">Yang et al.</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2020</td> <td class="xl65">94.7</td> <td class="xl65">88.9</td> <td class="xl65">-</td> <td class="xl65">93.2</td> <td class="xl65">79.5</td> <td class="xl65">77.1</td> <td class="xl65">80.9</td> <td class="xl65">85.4</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a 
href="https://openaccess.thecvf.com/content_CVPRW_2020/html/w34/Lee_On_Recognizing_Texts_of_Arbitrary_Shapes_With_2D_Self-Attention_CVPRW_2020_paper.html">SATRN</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2020</td> <td class="xl65">92.8</td> <td class="xl65">91.3</td> <td class="xl65">-</td> <td class="xl65">94.1</td> <td class="xl65">-</td> <td class="xl65">79</td> <td class="xl65">86.5</td> <td class="xl65">87.8</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content_CVPR_2020/html/Yu_Towards_Accurate_Scene_Text_Recognition_With_Semantic_Reasoning_Networks_CVPR_2020_paper.html">SRN</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2020</td> <td class="xl65">94.8</td> <td class="xl65">91.5</td> <td class="xl65">95.5</td> <td class="xl65">-</td> <td class="xl65">82.7</td> <td class="xl65">-</td> <td class="xl65">85.1</td> <td class="xl65">87.8</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://arxiv.org/abs/2005.13117">GA-SPIN</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2021</td> <td class="xl65">95.2</td> <td class="xl65">90.9</td> <td class="xl65">-</td> <td class="xl65">94.8</td> <td class="xl65">82.8</td> <td class="xl65">79.5</td> <td class="xl65">83.2</td> <td class="xl65">87.5</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content/CVPR2021/html/Yan_Primitive_Representation_Learning_for_Scene_Text_Recognition_CVPR_2021_paper.html">PREN2D</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2021</td> <td class="xl65">95.6</td> <td class="xl65">94</td> <td class="xl65">96.4</td> <td class="xl65">-</td> <td class="xl65">83</td> <td class="xl65">-</td> <td class="xl65">87.6</td> <td class="xl65">91.7</td> </tr> <tr height="18" style="height:13.8pt"> <td 
height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content/ICCV2021/html/Bhunia_Joint_Visual_Semantic_Reasoning_Multi-Stage_Decoder_for_Text_Recognition_ICCV_2021_paper.html">Bhunia et al.</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2021</td> <td class="xl65">95.2</td> <td class="xl65">92.2</td> <td class="xl65">-</td> <td class="xl65">95.5</td> <td class="xl65">-</td> <td class="xl65"><strong>84</strong></td> <td class="xl65">85.7</td> <td class="xl65">89.7</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://link.springer.com/article/10.1007/s11263-020-01411-1">Luo et al.</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2021</td> <td class="xl65">95.6</td> <td class="xl65">90.6</td> <td class="xl65">-</td> <td class="xl65"> <strong>96.0</strong> </td> <td class="xl65">83.9</td> <td class="xl65">81.4</td> <td class="xl65">85.1</td> <td class="xl65">91.3</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content/ICCV2021/html/Wang_From_Two_to_One_A_New_Scene_Text_Recognizer_With_ICCV_2021_paper.html">VisionLAN</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2021</td> <td class="xl65">95.8</td> <td class="xl65">91.7</td> <td class="xl65">95.7</td> <td class="xl65">-</td> <td class="xl65">83.7</td> <td class="xl65">-</td> <td class="xl65">86</td> <td class="xl65">88.5</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://openaccess.thecvf.com/content/CVPR2021/html/Fang_Read_Like_Humans_Autonomous_Bidirectional_and_Iterative_Language_Modeling_for_CVPR_2021_paper.html">ABINet</a><span style="mso-spacerun:yes">&nbsp;</span></td> <td align="right">2021</td> <td class="xl65">96.2</td> <td class="xl65">93.5</td> <td class="xl65">97.4</td> <td class="xl65">-</td> <td class="xl65">86.0</td> <td 
class="xl65">-</td> <td class="xl65">89.3</td> <td class="xl65">89.2</td> </tr> <tr height="18" style="height:13.8pt"> <td height="18" style="height:13.8pt"><a href="https://arxiv.org/abs/2111.15263">MATRN</a></td> <td align="right">2021</td> <td class="xl65"><strong>96.7</strong></td> <td class="xl65"><strong>94.9</strong></td> <td class="xl65"><strong>97.9</strong></td> <td class="xl65"><strong>95.8</strong></td> <td class="xl65"><strong>86.6</strong></td> <td class="xl65">82.9</td> <td class="xl65"><strong>90.5</strong></td> <td class="xl65"><strong>94.1</strong></td> </tr> <!--[if supportMisalignedColumns]--> <tr height="0" style="display:none"> <td width="95" style="width:71pt"></td> <td width="64" style="width:48pt"></td> <td width="64" style="width:48pt"></td> <td width="80" style="width:60pt"></td> <td width="74" style="width:55pt"></td> <td width="82" style="width:62pt"></td> <td width="83" style="width:62pt"></td> <td width="82" style="width:61pt"></td> <td width="77" style="width:58pt"></td> <td width="75" style="width:56pt"></td> <td width="64" style="width:48pt"></td> </tr> <!--[endif]--> </tbody></table>

Baek's Reimplementation Version
