Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
This is a PyTorch implementation of the paper TMIM.
Installation
1. Requirements
- Python==3.8.12
- PyTorch==1.11.0
- CUDA==11.3
conda create -n tmim python==3.8.12
conda activate tmim
pip install --upgrade pip
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install -r requirements.txt
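After installation, a quick check can confirm that the environment matches the pinned versions. The following is a minimal sketch, not part of the repository: the helper names `version_matches` and `check_environment` are hypothetical, and the expected versions are copied from the install commands above.

```python
import importlib

# Versions taken from the pip install command in this README.
REQUIRED = {"torch": "1.11.0", "torchvision": "0.12.0"}


def version_matches(installed: str, required: str) -> bool:
    """Compare versions while ignoring a local build suffix such as '+cu113'."""
    return installed.split("+", 1)[0] == required


def check_environment(required=REQUIRED):
    """Return {package: message} for every package that is missing or mismatched."""
    problems = {}
    for name, wanted in required.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            problems[name] = "not installed"
            continue
        found = getattr(module, "__version__", "unknown")
        if not version_matches(found, wanted):
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for name, message in issues.items():
        print(f"{name}: {message}")
    if not issues:
        print("Environment matches the versions listed above.")
```

Running the script inside the `tmim` environment prints one line per mismatched package, or a confirmation if everything lines up.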
2. Datasets
- Create a "data" folder. Download the text removal dataset (SCUT-EnsText) and the text detection datasets (TextOCR, Total-Text, ICDAR2015, COCO-Text, MLT19, ArT, LSVT (fully annotated), ReCTS).
- Create the COCO-style annotations for the text detection datasets with the code in utils/prepare_dataset/ (or download them from here (data.zip)).
- The structure of the data folder is shown below.
data
├── text_det
│   ├── art
│   │   ├── train_images
│   │   └── annotation.json
│   ├── cocotext
│   │   ├── train2014
│   │   └── cocotext.v2.json
│   ├── ic15
│   │   ├── train_images
│   │   └── annotation.json
│   ├── lsvt
│   │   ├── train_images
│   │   └── annotation.json
│   ├── mlt19
│   │   ├── train_images
│   │   └── annotation.json
│   ├── rects
│   │   ├── img
│   │   └── annotation.json
│   ├── textocr
│   │   ├── train_images
│   │   ├── TextOCR_0.1_train.json
│   │   └── TextOCR_0.1_val.json
│   └── totaltext
│       ├── train_images
│       └── annotation.json
└── text_rmv
    └── SCUT-EnsText
        ├── train
        │   ├── all_images
        │   ├── all_labels
        │   └── mask
        └── test
            ├── all_images
            ├── all_labels
            └── mask
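Because a misplaced folder usually surfaces only once training starts, it can help to validate the layout first. The sketch below is not part of the repository; `EXPECTED` and `missing_paths` are hypothetical names, and the path list is copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree in this README.
EXPECTED = [
    "text_det/art/train_images",
    "text_det/art/annotation.json",
    "text_det/cocotext/train2014",
    "text_det/cocotext/cocotext.v2.json",
    "text_det/ic15/train_images",
    "text_det/ic15/annotation.json",
    "text_det/lsvt/train_images",
    "text_det/lsvt/annotation.json",
    "text_det/mlt19/train_images",
    "text_det/mlt19/annotation.json",
    "text_det/rects/img",
    "text_det/rects/annotation.json",
    "text_det/textocr/train_images",
    "text_det/textocr/TextOCR_0.1_train.json",
    "text_det/textocr/TextOCR_0.1_val.json",
    "text_det/totaltext/train_images",
    "text_det/totaltext/annotation.json",
    "text_rmv/SCUT-EnsText/train/all_images",
    "text_rmv/SCUT-EnsText/train/all_labels",
    "text_rmv/SCUT-EnsText/train/mask",
    "text_rmv/SCUT-EnsText/test/all_images",
    "text_rmv/SCUT-EnsText/test/all_labels",
    "text_rmv/SCUT-EnsText/test/mask",
]


def missing_paths(root):
    """Return the expected entries that do not exist under the given data root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("The data folder matches the expected layout.")
```

The check only tests for existence; it does not verify that the image folders are populated or that the annotation files are valid JSON.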
Models
Model | Method | PSNR | MSSIM | MSE | AGE | Download |
---|---|---|---|---|---|---|
Uformer-B | Pretrained | 36.66 | 97.66 | 0.0637 | 1.70 | uformer_b_tmim.pth |
Uformer-B | Finetuned | 37.42 | 97.70 | 0.0459 | 1.52 | uformer_b_tmim_str.pth |
PERT | Pretrained | 34.51 | 96.63 | 0.1231 | 2.11 | pert_tmim.pth |
PERT | Finetuned | 35.66 | 97.18 | 0.0729 | 1.76 | pert_tmim_str.pth |
EraseNet | Pretrained | 34.25 | 97.03 | 0.1141 | 2.23 | erasenet_tmim.pth |
EraseNet | Finetuned | 35.47 | 97.30 | 0.0765 | 1.95 | erasenet_tmim_str.pth |
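For readers unfamiliar with the metrics in the table, the sketch below gives the textbook definitions of MSE and PSNR on flattened pixel values scaled to [0, 1]. It is not the repository's evaluation code: the scaling, color handling, and per-image averaging behind the numbers above are not stated here and may differ. MSSIM and AGE are left out. The function names are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(pred) != len(target):
        raise ValueError("inputs must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the ground truth."""
    error = mse(pred, target)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


if __name__ == "__main__":
    output = [0.0, 0.0]        # toy "text-removed" image with two pixels
    ground_truth = [0.0, 0.5]  # toy ground-truth background
    print(f"MSE:  {mse(output, ground_truth):.4f}")
    print(f"PSNR: {psnr(output, ground_truth):.2f} dB")
```

Lower MSE and higher PSNR both indicate that the output is closer to the ground-truth background, which is the direction in which the finetuned rows improve over the pretrained ones.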
Inference
- Download the pretrained models and run the following command for inference.
python -m torch.distributed.launch --master_port 29501 --nproc_per_node=1 demo.py --cfg configs/uformer_b_str.py --resume path/to/uformer_b_tmim_str.pth --test-dir path/to/image/folder --visualize-dir path/to/result/folder
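To run inference over several image folders, the command above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_demo_command` is a hypothetical helper, the flags are exactly those shown in the command above, and the folder paths are placeholders.

```python
import shlex
import sys


def build_demo_command(cfg, checkpoint, test_dir, visualize_dir, master_port=29501):
    """Return the demo.py launch command from this README as an argument list."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--master_port", str(master_port),
        "--nproc_per_node=1",
        "demo.py",
        "--cfg", cfg,
        "--resume", checkpoint,
        "--test-dir", str(test_dir),
        "--visualize-dir", str(visualize_dir),
    ]


if __name__ == "__main__":
    # Placeholder (input, output) folder pairs; replace with real paths.
    jobs = [("path/to/image/folder", "path/to/result/folder")]
    for test_dir, visualize_dir in jobs:
        cmd = build_demo_command(
            "configs/uformer_b_str.py",
            "path/to/uformer_b_tmim_str.pth",
            test_dir,
            visualize_dir,
        )
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to launch it.
        print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems when folder names contain spaces.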
Training and Testing
- Set "snapshot_dir" (the location for saving checkpoints) and "dataroot" (the location of the datasets) in configs/*.py.
- EraseNet and PERT require four 1080 Ti GPUs; Uformer requires eight 1080 Ti GPUs.
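As an illustration of the two settings, a config excerpt might look like the sketch below. This is an assumption about the file layout, not a copy of the repository's configs: the real files may define these keys inside a dict or class. The values are inferred from this README, which uses a "data" folder for datasets and a checkpoint path starting with `ckpt/`.

```python
# Excerpt of a config such as configs/uformer_b_str.py (assumed layout).

# Where checkpoints are written; the finetuning command in this README
# resumes from a path that starts with this folder.
snapshot_dir = "ckpt"

# Root of the dataset folder described in the Datasets section.
dataroot = "data"
```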
1. Pretraining
- Run the following command to pretrain the model on text detection datasets.
python -m torch.distributed.launch --master_port 29501 --nproc_per_node=8 train.py --cfg configs/uformer_b_tmim.py --ckpt-name uformer_b_tmim --save-log
- Run the following command to test the performance of the pretrained model.
python test.py --cfg configs/uformer_b_tmim.py --ckpt-name uformer_b_tmim/latest.pth --save-log --visualize
2. Finetuning
- Run the following command to finetune the model on text removal datasets.
python -m torch.distributed.launch --master_port 29501 --nproc_per_node=8 train.py --cfg configs/uformer_b_str.py --ckpt-name uformer_b_tmim_str --save-log --resume 'ckpt/uformer_b_tmim/latest.pth'
- Run the following command to test the performance of the finetuned model.
python test.py --cfg configs/uformer_b_str.py --ckpt-name uformer_b_tmim_str/latest.pth --save-log --visualize
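The two stages are linked by the checkpoint path: finetuning resumes from the `latest.pth` saved under the pretraining run's `--ckpt-name`. The sketch below makes that dependency explicit for the Uformer-B commands in this README. The helper names are hypothetical, and it assumes that checkpoints are stored as `<snapshot_dir>/<ckpt-name>/latest.pth` with `snapshot_dir` set to `ckpt`, as the resume path in the finetuning command suggests.

```python
import shlex
from pathlib import PurePosixPath


def train_command(cfg, ckpt_name, nproc=8, resume=None, master_port=29501):
    """Build a train.py launch command using the flags shown in this README."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--master_port", str(master_port),
        f"--nproc_per_node={nproc}",
        "train.py",
        "--cfg", cfg,
        "--ckpt-name", ckpt_name,
        "--save-log",
    ]
    if resume is not None:
        cmd += ["--resume", str(resume)]
    return cmd


def two_stage_commands(snapshot_dir="ckpt", nproc=8):
    """Return (pretrain, finetune) commands, with finetuning resuming from pretraining."""
    pretrain_name = "uformer_b_tmim"
    pretrain = train_command("configs/uformer_b_tmim.py", pretrain_name, nproc)
    # Assumed checkpoint location: <snapshot_dir>/<ckpt-name>/latest.pth
    resume = PurePosixPath(snapshot_dir) / pretrain_name / "latest.pth"
    finetune = train_command(
        "configs/uformer_b_str.py", "uformer_b_tmim_str", nproc, resume=resume
    )
    return pretrain, finetune


if __name__ == "__main__":
    for stage in two_stage_commands():
        print(shlex.join(stage))
```

If `snapshot_dir` is changed in the configs, the `--resume` path for finetuning has to change with it.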