ViTEraser (AAAI 2024)

This is the official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024). ViTEraser revisits the conventional single-step, one-stage framework for scene text removal and improves it with Vision Transformers (ViTs) for feature modeling and with the proposed SegMIM pretraining.

[Figures: framework of ViTEraser; framework of SegMIM]

Environment

We recommend using Anaconda to manage environments. Run the following commands to install dependencies.

conda create -n viteraser python=3.7 -y
conda activate viteraser
pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
git clone https://github.com/shannanyinxiang/ViTEraser.git
cd ViTEraser
pip install -r requirements.txt
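
After installation, a quick check can confirm that the pinned versions from the pip command above are the ones actually importable. The following is a minimal sketch, not part of the repository; the helper name `installed_versions` is hypothetical, and it only assumes that each package exposes a `__version__` attribute.

```python
import importlib

# Versions pinned by the pip command above.
EXPECTED = {"torch": "1.8.2", "torchvision": "0.9.2", "torchaudio": "0.8.2"}


def installed_versions(packages=tuple(EXPECTED)):
    """Map each package name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        if version is None:
            status = "not installed"
        elif version.startswith(EXPECTED[name]):
            # Installed wheels may carry a suffix such as "+cu111".
            status = "ok"
        else:
            status = "expected " + EXPECTED[name]
        print(name, version, status)
    try:
        import torch
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("torch is not installed; skipping the CUDA check")
```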

Datasets

1. Text Removal Dataset

SCUT-EnsText

2. SegMIM Pretraining Datasets (optional, only required by SegMIM pretraining)

ArT, ICDAR2013, ICDAR2015, LSVT, MLT2017, ReCTS, TextOCR

Place the above datasets in the data folder, following the file structure below.

data
├─TextErase
│  └─SCUT-EnsText
│     ├─train
│     │  ├─image
│     │  ├─label
│     │  └─mask
│     └─test
│        ├─image
│        ├─label
│        └─mask
└─SegMIMDatasets
   ├─ArT
   ├─ICDAR2013
   ├─ICDAR2015
   ├─LSVT
   ├─MLT2017
   ├─ReCTS
   └─TextOCR
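
Before launching training or inference, it can help to verify that the folder layout matches the tree above. The sketch below is an illustration only: the helper `missing_dirs` is hypothetical and simply checks that each directory shown in the tree exists under a given data root. It does not validate file contents.

```python
from pathlib import Path

# Directories taken from the file structure above, relative to the data root.
TEXT_ERASE_DIRS = [
    Path("TextErase") / "SCUT-EnsText" / split / kind
    for split in ("train", "test")
    for kind in ("image", "label", "mask")
]
SEGMIM_DIRS = [
    Path("SegMIMDatasets") / name
    for name in ("ArT", "ICDAR2013", "ICDAR2015", "LSVT", "MLT2017", "ReCTS", "TextOCR")
]


def missing_dirs(data_root="data", include_segmim=False):
    """Return the expected directories (as POSIX-style paths) missing under data_root."""
    root = Path(data_root)
    expected = TEXT_ERASE_DIRS + (SEGMIM_DIRS if include_segmim else [])
    return [p.as_posix() for p in expected if not (root / p).is_dir()]


if __name__ == "__main__":
    # Set include_segmim=True only when preparing for SegMIM pretraining.
    for path in missing_dirs("data", include_segmim=False):
        print("missing:", path)
```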

Models

The download links of pre-trained ViTEraser weights are provided in the following table.

| Name | BaiduNetDisk | GoogleDrive |
| --- | --- | --- |
| ViTEraser-Tiny | link | link |
| ViTEraser-Small | link | link |
| ViTEraser-Base | link | link |

Inference

An example inference command for ViTEraser-Tiny is:

CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch \
        --master_port=3151 \
        --nproc_per_node 1 \
        --use_env \
        main.py \
        --eval \
        --data_root data/TextErase/ \
        --val_dataset scutens_test \
        --batch_size 1 \
        --encoder swinv2 \
        --decoder swinv2 \
        --pred_mask false \
        --intermediate_erase false \
        --swin_enc_embed_dim 96 \
        --swin_enc_depths 2 2 6 2 \
        --swin_enc_num_heads 3 6 12 24 \
        --swin_enc_window_size 16 \
        --swin_dec_depths 2 6 2 2 2 \
        --swin_dec_num_heads 24 12 6 3 2 \
        --swin_dec_window_size 16 \
        --output_dir path/to/save/output/ \
        --resume path/to/weights/

For the other ViTEraser scales, change the following arguments:

| Argument | Tiny | Small | Base |
| --- | --- | --- | --- |
| swin_enc_embed_dim | 96 | 96 | 128 |
| swin_enc_depths | 2 2 6 2 | 2 2 18 2 | 2 2 18 2 |
| swin_enc_num_heads | 3 6 12 24 | 3 6 12 24 | 4 8 16 32 |
| swin_enc_window_size | 16 | 16 | 8 |
| swin_dec_depths | 2 6 2 2 2 | 2 18 2 2 2 | 2 18 2 2 2 |
| swin_dec_num_heads | 24 12 6 3 2 | 24 12 6 3 2 | 32 16 8 4 2 |
| swin_dec_window_size | 16 | 8 | 8 |
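
To avoid editing seven flags by hand when switching scales, the values can be kept in one place and expanded into command-line tokens. This is a sketch under the assumption that the flags are passed to main.py exactly as in the inference command above; the `SCALE_ARGS` dictionary and the `scale_flags` helper are hypothetical and not part of the repository.

```python
# Scale-specific values for Tiny, Small, and Base as listed in this section.
SCALE_ARGS = {
    "tiny": {
        "swin_enc_embed_dim": [96],
        "swin_enc_depths": [2, 2, 6, 2],
        "swin_enc_num_heads": [3, 6, 12, 24],
        "swin_enc_window_size": [16],
        "swin_dec_depths": [2, 6, 2, 2, 2],
        "swin_dec_num_heads": [24, 12, 6, 3, 2],
        "swin_dec_window_size": [16],
    },
    "small": {
        "swin_enc_embed_dim": [96],
        "swin_enc_depths": [2, 2, 18, 2],
        "swin_enc_num_heads": [3, 6, 12, 24],
        "swin_enc_window_size": [16],
        "swin_dec_depths": [2, 18, 2, 2, 2],
        "swin_dec_num_heads": [24, 12, 6, 3, 2],
        "swin_dec_window_size": [8],
    },
    "base": {
        "swin_enc_embed_dim": [128],
        "swin_enc_depths": [2, 2, 18, 2],
        "swin_enc_num_heads": [4, 8, 16, 32],
        "swin_enc_window_size": [8],
        "swin_dec_depths": [2, 18, 2, 2, 2],
        "swin_dec_num_heads": [32, 16, 8, 4, 2],
        "swin_dec_window_size": [8],
    },
}


def scale_flags(scale):
    """Return the scale-specific flags as a flat list of command-line tokens."""
    tokens = []
    for name, values in SCALE_ARGS[scale].items():
        tokens.append("--" + name)
        tokens.extend(str(v) for v in values)
    return tokens


if __name__ == "__main__":
    # Paste the printed flags into the inference command in place of the Tiny values.
    print(" ".join(scale_flags("base")))
```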

Evaluation

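The commands for calculating metrics are given below. The first runs the repository's evaluation script, and the second computes FID with pytorch_fid.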

python eval/evaluation.py \
    --gt_path data/TextErase/SCUT-EnsText/test/label/ \
    --target_path path/to/model/output/

python -m pytorch_fid \
    data/TextErase/SCUT-EnsText/test/label/ \
    path/to/model/output/ \
    --device cuda:0
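
Both commands compare a ground-truth folder against a model output folder, so a mismatch between the two file sets is worth catching first. The sketch below assumes, without confirmation from the repository, that each output image shares its file stem with the corresponding ground-truth label; the helper `unmatched_files` and the suffix list are hypothetical.

```python
from pathlib import Path

# Assumption: images use one of these extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def image_stems(folder):
    """Collect the file stems of all images directly inside folder."""
    return {
        p.stem
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }


def unmatched_files(gt_path, target_path):
    """Return (stems only in gt_path, stems only in target_path), each sorted."""
    gt = image_stems(gt_path)
    out = image_stems(target_path)
    return sorted(gt - out), sorted(out - gt)


if __name__ == "__main__":
    missing, extra = unmatched_files(
        "data/TextErase/SCUT-EnsText/test/label/", "path/to/model/output/"
    )
    print("ground-truth images without an output:", missing)
    print("outputs without a ground-truth image:", extra)
```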

ViTEraser Training

1. Training without SegMIM pretraining

bash scripts/viteraser-training-wosegmim/viteraser-tiny-train.sh

2. Training with SegMIM pretraining

bash scripts/viteraser-training-withsegmim/viteraser-tiny-train-withsegmim.sh

SegMIM Pretraining

# end-to-end encoder-decoder pretraining
bash scripts/segmim/viteraser-tiny-segmim.sh

# standalone encoder finetuning
bash scripts/segmim/viteraser-tiny-encoder-finetune.sh

Citation

@inproceedings{peng2024viteraser,
  title={ViTEraser: Harnessing the power of vision transformers for scene text removal with SegMIM pretraining},
  author={Peng, Dezhi and Liu, Chongyu and Liu, Yuliang and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={5},
  pages={4468--4477},
  year={2024}
}

Copyright

This repository can only be used for non-commercial research purposes.

For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).

Copyright 2024, Deep Learning and Vision Computing Lab, South China University of Technology.