oCLIP

This repository is the official implementation of the following paper:

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Chuhui Xue, Wenqing Zhang, Yu Hao, Shijian Lu, Philip Torr, Song Bai, ECCV 2022 (Oral)

Part of the code is inherited from open_clip.

Models

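| Backbone | Pre-train Data | Pre-train Model | Fine-tune Data | Fine-tune Model (PSENet) | Precision | Recall | F-score |
|---|---|---|---|---|---|---|---|
| ResNet-50 | SynthText | Link | Total-Text | Link | 89.9 | 81.6 | 85.5 |
| ResNet-101 | SynthText | Link | Total-Text | Link | 89.9 | 82.2 | 85.9 |
| ResNet-50 | Web Image | Link | Total-Text | Link | 90.1 | 83.5 | 86.7 |

| Backbone | Pre-train Data | Pre-train Model |
|---|---|---|
| ResNet-50 | LSVT-Weak Annotation | Link |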

Training oCLIP

Conda

git clone https://github.com/bytedance/oclip.git
cd oclip

conda create -n oclip python=3.7
conda activate oclip
pip install -r requirement.txt
export PYTHONPATH="$PYTHONPATH:$PWD/src"

Data

Download SynthText and place it under ./data (the commands below expect data/SynthText/).

Use the provided script to generate the annotations for pre-training:

python tools/convert_synthtext_csv.py --data_dir data/SynthText/ --save_dir data/SynthText/
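
Before launching training, it can help to confirm that the generated CSV exposes the two columns the training command refers to (filepath for --csv-img-key and title for --csv-caption-key). The following is a minimal sketch and not part of the repository: the helper name check_annotation_csv is hypothetical, and the tab delimiter is an assumption carried over from open_clip's default CSV separator, so pass a different one if the conversion script writes another format.

```python
import csv


def check_annotation_csv(path, img_key="filepath", caption_key="title", delimiter="\t"):
    """Return the number of data rows after checking the required columns exist.

    The delimiter is an assumption (tab, as in open_clip's defaults);
    pass delimiter="," if the file is comma-separated.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        columns = reader.fieldnames or []
        missing = [key for key in (img_key, caption_key) if key not in columns]
        if missing:
            raise ValueError(f"{path} is missing column(s): {missing}; found {columns}")
        # Count rows so an empty annotation file is noticed early.
        return sum(1 for _ in reader)
```

For example, check_annotation_csv("data/SynthText/train_char.csv") would return the number of annotated samples, or raise a ValueError naming the missing column.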

Train

Sample running code for training:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u src/training/main.py \
    --save-frequency 3 \
    --report-to tensorboard \
    --train-data="data/SynthText/train_char.csv"  \
    --char-dict-pth="data/SynthText/char_dict" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --warmup 10000 \
    --batch-size=64 \
    --lr=1e-4 \
    --wd=0.1 \
    --epochs=100 \
    --workers=8 \
    --model RN50 \
    --logs='output/RN50_synthtext' 
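
The --warmup 10000 and --lr=1e-4 flags above shape the learning-rate schedule. As an illustration only, the sketch below computes a linear warmup followed by cosine decay, which is the schedule used by open_clip; that oCLIP keeps this schedule unchanged is an assumption, and the function name warmup_cosine_lr and the default total_steps value are hypothetical.

```python
import math


def warmup_cosine_lr(step, base_lr=1e-4, warmup_steps=10000, total_steps=100000):
    """Learning rate at a given optimizer step (illustrative sketch).

    total_steps is a placeholder; in practice it would be
    epochs * batches_per_epoch.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be larger than warmup_steps")
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr
```

Under these assumptions the rate rises linearly during the first 10000 steps, reaches 1e-4 at the end of warmup, and then decays smoothly towards zero by the final step.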

Visualization

We also provide a script for visualization of attention maps in the pre-trained model.

Download the pre-trained model to ./pretrained.

python3 tools/visualize_attn.py \
    --model_path pretrained/RN50_synthtext.pt \
    --char_dict_path data/SynthText/char_dict \
    --model_config_file src/training/model_configs/RN50.json \
    --im_fn demo/sample.jpg \
    --text_list "ST LING" "STRLIN " "A GYLL'S" " ODGINGS" \
    --demo_path demo/

[Figure: input image, image attention map, and character attention maps 0 to 3 for the text inputs "ST LING", "STRLIN ", "A GYLL'S", " ODGINGS"]

Fine-tune in MMOCR

We provide a script that converts the model parameter names so that the pre-trained weights can be used in the dev-1.x branch of MMOCR.

# first modify the model_path and save_path in tools/convert2mmocr.py
python tools/convert2mmocr.py
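
Converting parameter names amounts to rewriting the keys of a checkpoint's state dict. The sketch below shows the general idea on a plain dictionary. It is not the logic of tools/convert2mmocr.py: the function name rename_keys and the prefixes "module.visual." and "backbone." are hypothetical placeholders chosen for illustration, so inspect the actual checkpoint keys and the target model before relying on any mapping.

```python
def rename_keys(state_dict, old_prefix="module.visual.", new_prefix="backbone."):
    """Return a new dict with old_prefix replaced by new_prefix on matching keys.

    Keys that do not start with old_prefix are dropped, on the assumption
    that only the image encoder is needed when fine-tuning a detector.
    Both prefixes are hypothetical examples.
    """
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            renamed[new_prefix + key[len(old_prefix):]] = value
    return renamed
```

With a real checkpoint, the input would be the dictionary loaded from the .pt file at model_path and the result would be saved to save_path.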

Citation

@article{xue2022language,
  title={Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting},
  author={Xue, Chuhui and Zhang, Wenqing and Hao, Yu and Lu, Shijian and Torr, Philip and Bai, Song},
  journal={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}