CLIP4STR

This is a dedicated re-implementation of CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model.


Introduction


This is a third-party implementation of the paper [CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model](https://arxiv.org/abs/2305.14014).

![The framework of CLIP4STR](misc/overall.png)

The framework of CLIP4STR. It has a visual branch and a cross-modal branch. The cross-modal branch refines the prediction of the visual branch for the final output. The text encoder is partially frozen.

CLIP4STR aims to build a scene text recognizer on top of a pre-trained vision-language model. In this re-implementation, we try to reproduce the performance of the original paper and evaluate the effectiveness of pre-trained VL models in the STR area.

Installation

Prepare data

First, download the STR datasets.

Generally, directories are organized as follows:

${ABSOLUTE_ROOT}
├── dataset
│   │
│   ├── str_dataset_ub
│   └── str_dataset           
│       ├── train
│       │   ├── real
│       │   └── synth
│       ├── val     
│       └── test
│
├── code              
│   │
│   └── clip4str
│
├── output (save the output of the program)
│
│
├── pretrained
│   └── clip (download the CLIP pre-trained weights and put them here)
│       └── ViT-B-16.pt
│
...
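If you are starting from scratch, here is a minimal shell sketch that creates this layout. The root path is a placeholder you should adjust, and the ViT-B-16.pt download link is the one published in the official OpenAI CLIP repository:

```bash
# Create the expected directory layout (adjust ROOT to your setup).
ROOT=/path/to/ABSOLUTE_ROOT
mkdir -p "$ROOT"/dataset/str_dataset_ub \
         "$ROOT"/dataset/str_dataset/train/real \
         "$ROOT"/dataset/str_dataset/train/synth \
         "$ROOT"/dataset/str_dataset/val \
         "$ROOT"/dataset/str_dataset/test \
         "$ROOT"/code "$ROOT"/output "$ROOT"/pretrained/clip

# Fetch the CLIP ViT-B/16 weights (URL taken from the official OpenAI CLIP repo).
wget -P "$ROOT"/pretrained/clip \
  https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
```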

Dependency

Requires Python >= 3.8 and PyTorch >= 1.12. The following commands were tested on a Linux machine with NVIDIA driver 525.105.17 and CUDA 11.3.

conda create --name clip4str python=3.8.5
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 -c pytorch
pip install -r requirements.txt 
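A quick sanity check of the environment (a generic PyTorch check, not specific to this repo):

```bash
# Should print the PyTorch version and True if CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```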

If you encounter problems when resuming training from an intermediate checkpoint, try upgrading PyTorch:

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

Results

CLIP4STR pre-trained on OpenAI WIT-400M

CLIP4STR-B uses CLIP-ViT-B/16 as the backbone; CLIP4STR-L uses CLIP-ViT-L/14.

| Method | Train data | IIIT5K (3,000) | SVT (647) | IC13 (1,015) | IC15 (1,811) | IC15 (2,077) | SVTP (645) | CUTE (288) | HOST (2,416) | WOST (2,416) |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4STR-B | MJ+ST | 97.70 | 95.36 | 96.06 | 87.47 | 84.02 | 91.47 | 94.44 | 80.01 | 86.75 |
| CLIP4STR-L | MJ+ST | 97.57 | 95.36 | 96.75 | 88.02 | 84.40 | 91.78 | 94.44 | 81.08 | 87.38 |
| CLIP4STR-B | Real(3.3M) | 99.20 | 98.30 | 98.23 | 91.44 | 90.61 | 96.90 | 99.65 | 77.36 | 87.87 |
| CLIP4STR-L | Real(3.3M) | 99.43 | 98.15 | 98.52 | 91.66 | 91.14 | 97.36 | 98.96 | 79.22 | 89.07 |

| Method | Train data | COCO (9,825) | ArT (35,149) | Uber (80,551) | Checkpoint |
|---|---|---|---|---|---|
| CLIP4STR-B | MJ+ST | 66.69 | 72.82 | 43.52 | a5e3386222 |
| CLIP4STR-L | MJ+ST | 67.45 | 73.48 | 44.59 | 3544c362f0 |
| CLIP4STR-B | Real(3.3M) | 80.80 | 85.74 | 86.70 | d70bde1f2d |
| CLIP4STR-L | Real(3.3M) | 81.97 | 85.83 | 87.36 | f125500adc |

CLIP4STR pre-trained on DataComp-1B, LAION-2B, and DFN-5B

All models are trained on RBU(6.5M).

| Method | Pre-train | Train | IIIT5K (3,000) | SVT (647) | IC13 (1,015) | IC15 (1,811) | IC15 (2,077) | SVTP (645) | CUTE (288) | HOST (2,416) | WOST (2,416) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP4STR-B | DC-1B | RBU | 99.5 | 98.3 | 98.6 | 91.4 | 91.1 | 98.0 | 99.0 | 79.3 | 88.8 |
| CLIP4STR-L | DC-1B | RBU | 99.6 | 98.6 | 99.0 | 91.9 | 91.4 | 98.1 | 99.7 | 81.1 | 90.6 |
| CLIP4STR-H | LAION-2B | RBU | 99.7 | 98.6 | 98.9 | 91.6 | 91.1 | 98.5 | 99.7 | 80.6 | 90.0 |
| CLIP4STR-H | DFN-5B | RBU | 99.5 | 99.1 | 98.9 | 91.7 | 91.0 | 98.0 | 99.0 | 82.6 | 90.9 |

| Method | Pre-train | Train | COCO (9,825) | ArT (35,149) | Uber (80,551) | Log | Checkpoint |
|---|---|---|---|---|---|---|---|
| CLIP4STR-B | DC-1B | RBU | 81.3 | 85.8 | 92.1 | 6e9fe947ac_log | 6e9fe947ac, BaiduYun |
| CLIP4STR-L | DC-1B | RBU | 82.7 | 86.4 | 92.2 | 3c9d881b88_log | 3c9d881b88, BaiduYun |
| CLIP4STR-H | LAION-2B | RBU | 82.5 | 86.2 | 91.2 | 5eef9f86e2_log | 5eef9f86e2, BaiduYun |
| CLIP4STR-H | DFN-5B | RBU | 83.0 | 86.4 | 91.7 | 3e942729b1_log | 3e942729b1, BaiduYun |

Training

For CLIP4STR with CLIP-ViT-B, refer to

bash scripts/vl4str_base.sh

Training requires 8 NVIDIA GPUs, each with more than 24GB of memory. If you have fewer GPUs, change trainer.gpus=A, trainer.accumulate_grad_batches=B, and model.batch_size=C in the bash scripts so that A * B * C = 1024 (the effective batch size). Do not modify the code; PyTorch Lightning handles the rest.
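For example, a hypothetical 4-GPU configuration (the values below are illustrative; set them where the script already defines these options):

```bash
# 4 GPUs x 2 accumulation steps x 128 per-GPU batch = 1024 effective batch.
# In scripts/vl4str_base.sh, set:
#   trainer.gpus=4
#   trainer.accumulate_grad_batches=2
#   model.batch_size=128
bash scripts/vl4str_base.sh
```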

For CLIP4STR with CLIP-ViT-L, refer to

bash scripts/vl4str_large.sh

We also provide a training script for CLIP4STR + Adapter, as described in the original paper:

bash scripts/str_adapter.sh

Inference

bash scripts/test.sh {gpu_id} {subpath_of_ckpt}

For example,

bash scripts/test.sh 0 clip4str_base16x16_d70bde1f2d.ckpt

If you want to read characters from an image, try:

bash scripts/read.sh {gpu_id} {subpath_of_ckpt} {image_folder_path}

For example,

bash scripts/read.sh 0 clip4str_base16x16_d70bde1f2d.ckpt misc/test_images

Output:
image_1576.jpeg: Chicken

Citations

@article{zhao2023clip4str,
  title={Clip4str: A simple baseline for scene text recognition with pre-trained vision-language model},
  author={Zhao, Shuai and Quan, Ruijie and Zhu, Linchao and Yang, Yi},
  journal={arXiv preprint arXiv:2305.14014},
  year={2023}
}
