

Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

The official code of LPV

LPV proposes a Cascade Position Attention (CPA) strategy and a Global Linguistic Reconstruction Module to aggregate linguistic information in both query and features. The pipeline is shown in the following figure.


ToDo List

Install requirements


Pretrained Models

Available model weights:



Training is divided into two stages. This implementation uses four RTX 3090 GPUs.

Stage 1 (w/o mask in GLRM)

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_port 29501 train_final_dist.py \
--isrand_aug --backbone svtr_tiny --trans_ln 2 --exp_name svtr-tiny-exp \
--batch_size 96 --num_iter 413940 --drop_iter 240000

Stage 2 (with mask in GLRM)

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_port 29501 train_final_dist.py \
--isrand_aug --backbone svtr_tiny --trans_ln 2 --exp_name svtr-tiny-exp-mask \
--batch_size 96 --num_iter 413940 --drop_iter 240000 \
--mask --saved_model [dir_to_checkpoint_of_the_first_stage]

Explanation of parameters:

--backbone:	One of [svtr_tiny, svtr_small, svtr_base].
--trans_ln:	The number of layers in GLRM. We set it to 2 for LPV-Tiny and 3 for LPV-Small and LPV-Base.
--exp_name:	The name of experiment folder to save logs and checkpoints.
--batch_size:	The batch size of each GPU. Default is 96.
--num_iter:	The total number of training steps. Default is 413940, which equals 10 epochs when training on MJ and ST.
--drop_iter:	The drop position. Default is 240000.
--mask:	Whether to use mask in GLRM.
--saved_model:	Path to a saved checkpoint from which to resume training.
--imgH:		The height of input image.
--imgW:		The width of input image.
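
The relation between the default --num_iter and the 10 epochs mentioned above can be checked with a little arithmetic. The sketch below assumes the 4-GPU setup described in this README and that each step consumes one batch of 96 images per GPU; the implied dataset size is a derived figure, not a number taken from the repository.

```python
# Values taken from the parameter list above; NUM_GPUS is the 4-GPU setup in this README.
BATCH_SIZE_PER_GPU = 96
NUM_GPUS = 4
NUM_ITER = 413940
DROP_ITER = 240000
EPOCHS = 10

# Assumption: in distributed training each step processes one batch on every GPU.
samples_per_step = BATCH_SIZE_PER_GPU * NUM_GPUS
iters_per_epoch = NUM_ITER // EPOCHS
samples_per_epoch = iters_per_epoch * samples_per_step

# Position of the drop point, expressed in epochs.
drop_epoch = DROP_ITER / iters_per_epoch

print(samples_per_step)   # 384
print(iters_per_epoch)    # 41394
print(samples_per_epoch)  # 15895296
print(f"{drop_epoch:.2f}")
```

Under these assumptions one epoch covers about 15.9 million images, and --drop_iter falls a little before the sixth epoch ends. If you train with a different number of GPUs or a different batch size, --num_iter and --drop_iter would need rescaling to keep the same schedule.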

The image size is set to 48×160 for LPV-Base, so add the two parameters --imgH 48 and --imgW 160 when training that variant.
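
The per-variant settings and the two-stage procedure can be collected in one place. The following is a minimal sketch, not part of the repository: `VARIANTS` and `build_train_args` are hypothetical names, the experiment names and the checkpoint path are placeholders, and only flags documented above are used.

```python
import shlex

# Per-variant settings as stated in this README: trans_ln is 2 for LPV-Tiny and
# 3 for LPV-Small and LPV-Base; only LPV-Base needs the explicit 48x160 input size.
VARIANTS = {
    "svtr_tiny": {"trans_ln": 2},
    "svtr_small": {"trans_ln": 3},
    "svtr_base": {"trans_ln": 3, "imgH": 48, "imgW": 160},
}


def build_train_args(backbone, exp_name, mask=False, saved_model=None):
    """Hypothetical helper: build the argument list for train_final_dist.py."""
    cfg = VARIANTS[backbone]
    args = [
        "--isrand_aug", "--backbone", backbone,
        "--trans_ln", str(cfg["trans_ln"]), "--exp_name", exp_name,
        "--batch_size", "96", "--num_iter", "413940", "--drop_iter", "240000",
    ]
    if "imgH" in cfg:
        args += ["--imgH", str(cfg["imgH"]), "--imgW", str(cfg["imgW"])]
    if mask:
        # Stage 2 enables the mask in GLRM.
        args.append("--mask")
    if saved_model is not None:
        # Stage 2 starts from the stage 1 checkpoint.
        args += ["--saved_model", saved_model]
    return args


# Placeholder experiment names and checkpoint path, for illustration only.
stage1 = build_train_args("svtr_base", "svtr-base-exp")
stage2 = build_train_args("svtr_base", "svtr-base-exp-mask",
                          mask=True, saved_model="path/to/stage1_checkpoint")
print(shlex.join(stage1))
print(shlex.join(stage2))
```

The printed strings can be appended to the `python3 -m torch.distributed.launch ... train_final_dist.py` prefix shown in the training commands above.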


CUDA_VISIBLE_DEVICES=0 python test_final.py --benchmark_all_eval \
--exp_name [the_exp_name] --backbone svtr_tiny --trans_ln 2 \
--model_dir [dir_to_your_checkpoint] --eval_data [dir_to_your_evaluated_data] \
--batch_size 96 --mask --show attn --fast_acc

Explanation of parameters:

--exp_name:	The name of experiment folder.
--backbone:	One of [svtr_tiny, svtr_small, svtr_base].
--trans_ln:	The number of layers in GLRM. We set it to 2 for LPV-Tiny and 3 for LPV-Small and LPV-Base.
--model_dir:	The path to the checkpoint.
--eval_data:	The path to the evaluation data.
--fast_acc:	Test on six benchmarks.
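
To make the types of the evaluation flags listed above concrete, here is a small `argparse` sketch. It is an illustration only and is not the parser used by test_final.py: `make_eval_parser` is a hypothetical name, the defaults and required markers are assumptions, and the example paths are placeholders.

```python
import argparse


def make_eval_parser():
    """Hypothetical parser mirroring the documented flags (not the repository's code)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented evaluation flags")
    parser.add_argument("--exp_name", required=True,
                        help="name of the experiment folder")
    parser.add_argument("--backbone", default="svtr_tiny",
                        choices=["svtr_tiny", "svtr_small", "svtr_base"])
    parser.add_argument("--trans_ln", type=int, default=2,
                        help="number of layers in GLRM")
    parser.add_argument("--model_dir", required=True,
                        help="path to the checkpoint")
    parser.add_argument("--eval_data", required=True,
                        help="path to the evaluation data")
    parser.add_argument("--fast_acc", action="store_true",
                        help="test on six benchmarks")
    return parser


# Placeholder values, for illustration only.
args = make_eval_parser().parse_args([
    "--exp_name", "my-exp", "--backbone", "svtr_small", "--trans_ln", "3",
    "--model_dir", "path/to/checkpoint", "--eval_data", "path/to/eval_data",
    "--fast_acc",
])
print(args.backbone, args.trans_ln, args.fast_acc)
```

Restricting `--backbone` with `choices` means a misspelled backbone name is rejected at parse time rather than failing later during model construction.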


If you find our method useful for your research, please cite

  title={Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition},
  author={Zhang, Boqiang and Xie, Hongtao and Wang, Yuxin and Xu, Jianjun and Zhang, Yongdong},
  journal={arXiv preprint arXiv:2305.05140},


This implementation is based on the following repositories: CLOVA AI: deep text recognition benchmark, and Advanced Literate Machinery: MGP-STR.


Suggestions and discussions are very welcome. Please contact the authors by email at cyril@mail.ustc.edu.cn