# TESTR: Text Spotting Transformers
This repository is the official implementation of the following paper:
Xiang Zhang, Yongwen Su, Subarna Tripathi, and Zhuowen Tu, *Text Spotting Transformers*, CVPR 2022
<img src="figures/arch.svg" />

## Getting Started
We use the following environment in our experiments. It is recommended to install the dependencies via Anaconda; a minimal setup sketch follows the list below.
- CUDA 11.3
- Python 3.8
- PyTorch 1.10.1
- Official Pre-Built Detectron2
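For reference, one possible way to create such an environment with Anaconda is sketched below. The exact channels, package versions, and wheel index URL are assumptions based on the versions listed above; consult the official PyTorch and Detectron2 installation guides for your system.

```bash
# Sketch of an environment setup matching the versions above (adjust as needed).
conda create -n testr python=3.8 -y
conda activate testr
conda install pytorch=1.10.1 torchvision cudatoolkit=11.3 -c pytorch -c conda-forge -y
# Pre-built Detectron2 wheel for CUDA 11.3 / PyTorch 1.10; see the official
# Detectron2 installation page if this index does not match your setup.
python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
```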
## Installation
Please refer to the Installation section of AdelaiDet: README.md.
If you have not installed Detectron2, follow the official guide: INSTALL.md.
After that, build this repository with
python setup.py build develop
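As an optional sanity check, you can verify that the build succeeded by importing the package. The package name `adet` is an assumption inherited from AdelaiDet; adjust if it differs in your checkout.

```bash
# Should print the message without an ImportError if the build worked.
python -c "import adet, detectron2; print('TESTR build OK')"
```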
## Preparing Datasets
Please download TotalText, CTW1500, MLT, and CurvedSynText150k according to the guide provided by AdelaiDet: README.md.
The ICDAR2015 dataset can be downloaded via this link.
Extract all the datasets and make sure you organize them as follows
- datasets
| - CTW1500
| | - annotations
| | - ctwtest_text_image
| | - ctwtrain_text_image
| - totaltext (or icdar2015)
| | - test_images
| | - train_images
| | - test.json
| | - train.json
| - mlt2017 (or syntext1, syntext2)
| | - annotations
| | - images
After that, download the polygonal annotations along with the evaluation files, and extract them under the `datasets` folder.
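As a quick sanity check (not part of the official tooling), you can confirm that the extracted folders match the layout shown above before training, for example:

```bash
# Verify that the expected dataset directories exist; paths follow the tree above.
for d in datasets/totaltext/train_images datasets/totaltext/test_images \
         datasets/CTW1500/ctwtrain_text_image datasets/CTW1500/ctwtest_text_image; do
  [ -d "$d" ] && echo "OK      $d" || echo "MISSING $d"
done
```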
## Visualization Demo
You can try to visualize the predictions of the network using the following command:
python demo/demo.py --config-file <PATH_TO_CONFIG_FILE> --input <FOLDER_TO_INPUT_IMAGES> --output <OUTPUT_FOLDER> --opts MODEL.WEIGHTS <PATH_TO_MODEL_FILE> MODEL.TRANSFORMER.INFERENCE_TH_TEST 0.3
You may want to adjust `INFERENCE_TH_TEST` to filter out predictions with lower scores.
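For example, a run on the TotalText test images might look like the following. The config path and the weight file name are illustrative placeholders; substitute your own.

```bash
# Illustrative only: config path assumes the configs/TESTR layout described below,
# and the weight file name is a placeholder for a trained or downloaded checkpoint.
python demo/demo.py \
  --config-file configs/TESTR/TotalText/TESTR_R_50_Polygon.yaml \
  --input datasets/totaltext/test_images \
  --output vis_results/ \
  --opts MODEL.WEIGHTS weights/totaltext_testr_polygon.pth \
         MODEL.TRANSFORMER.INFERENCE_TH_TEST 0.3
```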
## Training
You can train from scratch or fine-tune the model by putting pretrained weights in the `weights` folder.
Example commands:
python tools/train_net.py --config-file <PATH_TO_CONFIG_FILE> --num-gpus 8
All configuration files can be found in `configs/TESTR`, excluding those named `Base-xxxx.yaml`. `TESTR_R_50.yaml` is the config for TESTR-Bezier, while `TESTR_R_50_Polygon.yaml` is for TESTR-Polygon.
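For instance, a fine-tuning run starting from a pretrained checkpoint placed under `weights` could look like this (the checkpoint file name is a placeholder):

```bash
# Illustrative only: the checkpoint file name under weights/ is a placeholder.
python tools/train_net.py \
  --config-file configs/TESTR/TotalText/TESTR_R_50_Polygon.yaml \
  --num-gpus 8 \
  MODEL.WEIGHTS weights/pretrain_testr_polygon.pth
```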
## Evaluation
python tools/train_net.py --config-file <PATH_TO_CONFIG_FILE> --eval-only MODEL.WEIGHTS <PATH_TO_MODEL_FILE>
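For example, evaluating a downloaded TotalText checkpoint could look like this (the config path and weight file name are illustrative):

```bash
# Illustrative only: the weight path is a placeholder for a downloaded checkpoint.
python tools/train_net.py \
  --config-file configs/TESTR/TotalText/TESTR_R_50_Polygon.yaml \
  --eval-only \
  MODEL.WEIGHTS weights/totaltext_testr_polygon.pth
```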
## Pretrained Models
<table>
<thead>
<tr><th>Dataset</th><th>Annotation Type</th><th>Lexicon</th><th>Det-P</th><th>Det-R</th><th>Det-F</th><th>E2E-P</th><th>E2E-R</th><th>E2E-F</th><th>Link</th></tr>
</thead>
<tbody>
<tr><td rowspan="2">Pretrain</td><td>Bezier</td><td>None</td><td>88.87</td><td>76.47</td><td>82.20</td><td>63.58</td><td>56.92</td><td>60.06</td><td><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/EU5VLFSLcfJFm_gFYDf4HX4BloAAlq1nshsJaPoZSJDxWw?e=WwzQFv">OneDrive</a></td></tr>
<tr><td>Polygonal</td><td>None</td><td>88.18</td><td>77.51</td><td>82.50</td><td>66.19</td><td>61.14</td><td>63.57</td><td><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/EW4ewmHyPaJFqPbf_iEoQGEBfVVtlPtoK5XjgVCuXxQWpA?e=M4RSkq">OneDrive</a></td></tr>
<tr><td rowspan="4">TotalText</td><td rowspan="2">Bezier</td><td>None</td><td>92.83</td><td>83.65</td><td>88.00</td><td>74.26</td><td>69.05</td><td>71.56</td><td rowspan="2"><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/EVAgHo_OdOFPqkIFcVh7w3EByIgZD3PS3wCYXxW7Qizn7A?e=lgl6Q2">OneDrive</a></td></tr>
<tr><td>Full</td><td>-</td><td>-</td><td>-</td><td>86.42</td><td>80.35</td><td>83.28</td></tr>
<tr><td rowspan="2">Polygonal</td><td>None</td><td>93.36</td><td>81.35</td><td>86.94</td><td>76.85</td><td>69.98</td><td>73.25</td><td rowspan="2"><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/ESwSFxppsplEiEaUphJB0TABkIKoRvIljkVIazPUNEXI7g?e=Q8zJ0Q">OneDrive</a></td></tr>
<tr><td>Full</td><td>-</td><td>-</td><td>-</td><td>88.00</td><td>80.13</td><td>83.88</td></tr>
<tr><td rowspan="4">CTW1500</td><td rowspan="2">Bezier</td><td>None</td><td>89.71</td><td>83.07</td><td>86.27</td><td>55.44</td><td>51.34</td><td>53.31</td><td rowspan="2"><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/EU17mK38HT1DvsfeylJiYloBpZyhehG2rfw_IZiPrfgPYw?e=oa3gtR">OneDrive</a></td></tr>
<tr><td>Full</td><td>-</td><td>-</td><td>-</td><td>83.05</td><td>76.90</td><td>79.85</td></tr>
<tr><td rowspan="2">Polygonal</td><td>None</td><td>92.04</td><td>82.63</td><td>87.08</td><td>59.14</td><td>53.09</td><td>55.95</td><td rowspan="2"><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/ETkgFej-l39Gr7GYqwZt6LQBH9r2snHlidb3pTEGjiWZPw?e=9I5plv">OneDrive</a></td></tr>
<tr><td>Full</td><td>-</td><td>-</td><td>-</td><td>86.16</td><td>77.34</td><td>81.51</td></tr>
<tr><td rowspan="4">ICDAR15</td><td rowspan="4">Polygonal</td><td>None</td><td>90.31</td><td>89.70</td><td>90.00</td><td>65.49</td><td>65.05</td><td>65.27</td><td rowspan="4"><a href="https://ucsdcloud-my.sharepoint.com/:u:/g/personal/xiz102_ucsd_edu/ETwRegsVcwtNgnbjm-79XqQBkQgjsRwIUedJysUz8Fm6wA?e=yKR2mN">OneDrive</a></td></tr>
<tr><td>Strong</td><td>-</td><td>-</td><td>-</td><td>87.11</td><td>83.29</td><td>85.16</td></tr>
<tr><td>Weak</td><td>-</td><td>-</td><td>-</td><td>80.36</td><td>78.38</td><td>79.36</td></tr>
<tr><td>Generic</td><td>-</td><td>-</td><td>-</td><td>73.82</td><td>73.33</td><td>73.57</td></tr>
</tbody>
</table>

The Lite models only use the image feature from the last stage of ResNet.
Method | Annotation Type | Lexicon | Det-P | Det-R | Det-F | E2E-P | E2E-R | E2E-F | Link |
---|---|---|---|---|---|---|---|---|---|
Pretrain (Lite) | Polygonal | None | 90.28 | 72.58 | 80.47 | 59.49 | 50.22 | 54.46 | OneDrive |
TotalText (Lite) | Polygonal | None | 92.16 | 79.09 | 85.12 | 66.42 | 59.06 | 62.52 | OneDrive |
## Citation
@InProceedings{Zhang_2022_CVPR,
author = {Zhang, Xiang and Su, Yongwen and Tripathi, Subarna and Tu, Zhuowen},
title = {Text Spotting Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {9519-9528}
}
## License

This repository is released under the Apache License 2.0. The license text can be found in the LICENSE file.
## Acknowledgement
Thanks to AdelaiDet for a standardized training and inference framework, and Deformable-DETR for the implementation of multi-scale deformable cross-attention.