

Optical Character Recognition with Segment Anything (OCR-SAM)

🍇 Introduction 🍙

Can SAM be applied to OCR? We make a simple attempt to combine two off-the-shelf OCR models from MMOCR with SAM to build several OCR-related application demos: SAM for Text, Text Removal, and Text Inpainting. We also provide a gradio WebUI for easier interaction.

📅 Updates 👀

📸 Demo Zoo 🔥

This project includes:

🚧 Installation 🛠️


Environment Setup

Clone this repo:

git clone https://github.com/yeungchenwa/OCR-SAM.git

Step 0: Download and install Miniconda from the official website.

Step 1: Create a conda environment and activate it.

conda create -n ocr-sam python=3.8 -y
conda activate ocr-sam

Step 2: Install a PyTorch version that matches your CUDA setup, following the instructions here.

# Suggested
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

Step 3: Install mmengine, mmcv, mmdet, mmcls, and mmocr.

pip install -U openmim
mim install mmengine
mim install mmocr
# On Windows, replace the single quotes ' below with double quotes "
mim install 'mmcv==2.0.0rc4'
mim install 'mmdet==3.0.0rc5'
mim install 'mmcls==1.0.0rc5'

# Install sam
pip install git+https://github.com/facebookresearch/segment-anything.git

# Install required packages
pip install -r requirements.txt

Step 4: Install the dependencies for diffusers and latent-diffusion.

# Install Gradio
pip install gradio

# Install the diffusers
pip install diffusers

# Install the pytorch_lightning for ldm
pip install pytorch-lightning==2.0.1.post0
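
After the steps above, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that uses only the standard library. The `PINNED` table simply copies the pins from Steps 2 to 4, and the helper names (`installed_version`, `find_mismatches`) are illustrative, not part of this repo.

```python
from importlib import metadata

# Pins copied from the install steps above.
PINNED = {
    "torch": "1.12.1+cu113",
    "torchvision": "0.13.1+cu113",
    "mmcv": "2.0.0rc4",
    "mmdet": "3.0.0rc5",
    "mmcls": "1.0.0rc5",
    "pytorch-lightning": "2.0.1.post0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pins.items():
        found = installed_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

A mismatch is not necessarily an error (a different CUDA build of PyTorch is expected on other setups), so treat the output as a checklist, not a hard failure.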

📒 Model checkpoints 🖥

We retrained DBNet++ with Swin Transformer V2 as the backbone on a combination of multiple scene text datasets (e.g. HierText, TextOCR). The DBNet++ checkpoint is available on Google Drive (1 GB).

Create the following directories and move the downloaded checkpoint into checkpoints/mmocr:

mkdir checkpoints
mkdir checkpoints/mmocr
mkdir checkpoints/sam
mkdir checkpoints/ldm
mv db_swin_mix_pretrain.pth checkpoints/mmocr

Download the remaining checkpoints to their corresponding paths (skip this step if you have already done so):

# mmocr recognizer ckpt
wget -O checkpoints/mmocr/abinet_20e_st-an_mj_20221005_012617-ead8c139.pth https://download.openmmlab.com/mmocr/textrecog/abinet/abinet_20e_st-an_mj/abinet_20e_st-an_mj_20221005_012617-ead8c139.pth

# sam ckpt, more details: https://github.com/facebookresearch/segment-anything#model-checkpoints
wget -O checkpoints/sam/sam_vit_h_4b8939.pth https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# ldm ckpt
wget -O checkpoints/ldm/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1
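
Interrupted downloads are easy to miss, so a quick check that every checkpoint is in place can save a confusing failure later. The sketch below lists the paths from the commands above and reports any that are missing or empty. `missing_checkpoints` is a hypothetical helper name, not something this repo ships.

```python
from pathlib import Path

# Paths taken from the mkdir / mv / wget commands above.
EXPECTED_CHECKPOINTS = [
    "checkpoints/mmocr/db_swin_mix_pretrain.pth",
    "checkpoints/mmocr/abinet_20e_st-an_mj_20221005_012617-ead8c139.pth",
    "checkpoints/sam/sam_vit_h_4b8939.pth",
    "checkpoints/ldm/last.ckpt",
]


def missing_checkpoints(root=".", expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint paths that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_checkpoints():
        print(f"missing or empty: {rel}")
```

Run it from the repository root. No output means all four files were found.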

🏃🏻‍♂️ Run Demo 🏊‍♂️

SAM for Text🧐

Run the following script:

python mmocr_sam.py \
    --inputs /YOUR/INPUT/IMG_PATH \
    --outdir /YOUR/OUTPUT_DIR \
    --device cuda
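
One common way to combine a text detector with SAM is to turn each detected text polygon into an axis-aligned box and pass that box to SAM as a prompt. The sketch below assumes that approach and a flat `[x0, y0, x1, y1, ...]` polygon layout. It is not taken from `mmocr_sam.py`, and `polygon_to_xyxy` and `segment_text_boxes` are hypothetical helper names. The `segment_anything` calls (`sam_model_registry`, `SamPredictor.set_image`, `SamPredictor.predict`) come from that library's public API and are imported lazily, so the box helper can be tried without loading the model.

```python
def polygon_to_xyxy(polygon):
    """Convert a flat polygon [x0, y0, x1, y1, ...] to an [x_min, y_min, x_max, y_max] box."""
    if len(polygon) < 6 or len(polygon) % 2:
        raise ValueError("polygon needs at least 3 (x, y) pairs")
    xs, ys = polygon[0::2], polygon[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def segment_text_boxes(image_rgb, polygons,
                       checkpoint="checkpoints/sam/sam_vit_h_4b8939.pth",
                       device="cuda"):
    """Return one boolean mask per polygon, prompting SAM with each polygon's bounding box."""
    # Heavy imports are kept local so polygon_to_xyxy works without them.
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint).to(device)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # HxWx3 uint8 array in RGB order
    masks = []
    for polygon in polygons:
        box = np.array(polygon_to_xyxy(polygon))
        mask, _, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(mask[0])  # with multimask_output=False the masks have shape (1, H, W)
    return masks
```

Box prompts keep the example simple. Whether the script also uses point prompts or batched prediction is not stated here.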


Erasing

In this demo, text is erased with either latent-diffusion inpainting or Stable Diffusion inpainting guided by a text prompt; select one with the --diffusion_model parameter. The --use_sam parameter controls whether the SAM output mask is used for erasing. More implementation details are listed here.
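
The `--dilate_iteration` option in the command for this demo suggests that the text mask is dilated before inpainting, so that the erased region also covers soft stroke edges. To illustrate what repeated dilation does, here is a pure-Python sketch with a 3x3 square structuring element. The kernel shape and the `dilate_mask` name are assumptions, and the repo's own implementation may differ (for example by using OpenCV).

```python
def dilate_mask(mask, iterations=1):
    """Dilate a binary mask (list of rows of 0/1) with a 3x3 square kernel."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    current = [list(row) for row in mask]  # copy so the input is not modified
    for _ in range(iterations):
        grown = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                # A pixel is set if any pixel in its 3x3 neighbourhood is set.
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    if any(current[ny][max(0, x - 1):min(width, x + 2)]):
                        grown[y][x] = 1
                        break
        current = grown
    return current


if __name__ == "__main__":
    mask = [[0] * 7 for _ in range(7)]
    mask[3][3] = 1
    for row in dilate_mask(mask, iterations=2):
        print("".join("#" if v else "." for v in row))
```

Each iteration grows the mask by one pixel in every direction, so a larger value erases a wider margin around the text at the cost of repainting more background.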

Run the following script:

python mmocr_sam_erase.py \
    --inputs /YOUR/INPUT/IMG_PATH \
    --outdir /YOUR/OUTPUT_DIR \
    --device cuda \
    --use_sam True \
    --dilate_iteration 2 \
    --diffusion_model YOUR_DIFFUSION_MODEL \
    --sd_ckpt None \
    --prompt None \
    --img_size "(512, 512)"

Run the WebUI: see here

Note: The first run may take a while because the Stable Diffusion checkpoint has to be downloaded, so please wait patiently 👀


Inpainting

More implementation details are listed here.

Run the following script:

python mmocr_sam_inpainting.py \
    --img_path /YOUR/INPUT/IMG_PATH \
    --outdir /YOUR/OUTPUT_DIR \
    --device cuda \
    --prompt YOUR_PROMPT \
    --select_index 0

Run WebUI

This repo also provides a WebUI (built with gradio) for both Erasing and Inpainting.

Before running the script, you should install the gradio package:

pip install gradio


python mmocr_sam_erase_app.py

Detector and Recognizer WebUI Result

Erasing WebUI Result

In our WebUI, users can interactively choose whether to use the SAM output and which diffusion model to run. In particular, users can choose which text to erase.


python mmocr_sam_inpainting_app.py

Inpainting WebUI Result

Note: The web page may take some time to become available after launch, so please wait patiently 👀

💗 Acknowledgement