CoText

Real-Time End-to-End Video Text Spotting with Contrastive Representation Learning

License: MIT

Note: This repository is the official code for Real-Time End-to-End Video Text Spotting with Contrastive Representation Learning, an arXiv paper currently under review.

Introduction

Real-Time End-to-End Video Text Spotting with Contrastive Representation Learning | YouTube Demo

Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text instances in a video. Existing video text spotting methods typically rely on sophisticated pipelines and multiple models, which makes them poorly suited to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) For the first time, we simultaneously address the three tasks (i.e., text detection, tracking, and recognition) in a real-time end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing, a CTC-based recognition head with Masked RoI, and a track head with contrastive learning. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 35.2 FPS on ICDAR2015video [13], improvements of 10.5% and 26.2 FPS over the previous best method.
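To make the idea of a contrastive track head more concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is not the CoText implementation: the abstract does not specify the exact loss, and the function names, the cosine similarity, and the `temperature` value are all assumptions chosen for illustration. The intuition is that the embedding of a text instance should sit closer to the embedding of the same instance in another frame than to the embeddings of other instances.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single query embedding (illustrative only).

    query:     embedding of a text instance in the current frame
    positive:  embedding of the same text instance in another frame
    negatives: embeddings of different text instances
    """
    logits = [cosine_similarity(query, positive) / temperature]
    logits += [cosine_similarity(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    # Negative log-probability of selecting the positive among all candidates.
    return log_sum_exp - logits[0]


# Toy 2-D embeddings (made up for this example).
query = [1.0, 0.0]
same_text = [0.9, 0.1]
other_texts = [[0.0, 1.0], [-1.0, 0.2]]

loss_correct_pair = info_nce_loss(query, same_text, other_texts)
loss_wrong_pair = info_nce_loss(query, other_texts[0], [same_text, other_texts[1]])
print(loss_correct_pair < loss_wrong_pair)
```

The loss is small when the query is paired with its true match and large when it is paired with a different instance, which is the signal that pulls embeddings of the same text together across frames.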

Link to our new benchmark BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting

Updates

Performance

ICDAR2015(video) Tracking challenge

| Methods | MOTA | MOTP | IDF1 | Mostly Matched | Mostly Lost | FPS |
|---|---|---|---|---|---|---|
| CoText(640) | 47.4 | 72.3 | 65.3 | 41.4 | 31.5 | 59.5 |
| CoText(832) | 51.4 | 73.6 | 68.6 | 49.6 | 23.5 | 41.0 |

ICDAR2015(video) Video Text Spotting challenge

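| Methods | MOTA | MOTP | IDF1 | Mostly Matched | Mostly Lost | FPS |
|---|---|---|---|---|---|---|
| CoText(640) | 53.6 | 72.4 | 67.6 | 40.2 | 32.8 | 59.5 |
| CoText(736) | 57.8 | 74.2 | 70.3 | 45.9 | 28.6 | 49.6 |
| CoText(832) | 59.0 | 74.5 | 72.0 | 48.6 | 26.4 | 41.0 |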
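For readers unfamiliar with the MOTA and IDF1 columns, the sketch below computes both from raw counts, assuming the tables use the standard CLEAR MOT and identity-metric definitions (this README does not spell them out). The function names and the example counts are made up for illustration and are not taken from the paper or the evaluation scripts.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT.

    num_gt is the total number of ground-truth boxes over all frames.
    The result can be negative when errors outnumber ground-truth boxes.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical counts, only to show how the functions are called.
print(f"MOTA: {100 * mota(200, 100, 20, 1000):.1f}")
print(f"IDF1: {100 * idf1(700, 150, 300):.1f}")
```

MOTA penalizes misses, false alarms, and identity switches frame by frame, while IDF1 measures how consistently each predicted trajectory keeps the identity of its ground-truth text instance over the whole video.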

Notes

Demo

<img src="demo.gif" width="400"/> <img src="demo1.gif" width="400"/>

Installation

The codebase is built on top of PAN++.

Usage

Dataset preparation

  1. Download the ICDAR2015 and COCOTextV2 datasets.

  2. Modify the corresponding paths in cov_ICDAR15video_to_ICDAR15.py, then run the following script to generate the txt files:

cd utils
python cov_ICDAR15video_to_ICDAR15.py  

Training and Evaluation

Training on single node

Before training, change the dataset paths in the dataloader to point to your own copies of the data.

The whole training pipeline takes two steps.

  1. Train the detection and recognition branches:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py config/CoText_r18_ic15_detrec.py

  2. Train the tracking branch:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py config/CoText_r18_ic15_desc.py
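If you prefer to launch both steps from one place, the two commands above can be wrapped in a small Python helper. This is only a convenience sketch: `build_commands` and `run_stages` are hypothetical names that do not exist in the repository, and the script assumes it is run from the repository root, where `train.py` and the `config/` directory live. It only repeats the two commands in order and stops if the first one fails.

```python
import os
import subprocess
import sys

# Config files taken from the two training commands above.
STAGE_CONFIGS = [
    "config/CoText_r18_ic15_detrec.py",  # step 1: detection and recognition branches
    "config/CoText_r18_ic15_desc.py",    # step 2: tracking branch
]


def build_commands(python=sys.executable):
    """Return one `python train.py <config>` command per training step."""
    return [[python, "train.py", cfg] for cfg in STAGE_CONFIGS]


def run_stages(gpus="0,1,2,3,4,5,6,7", dry_run=False):
    """Run the training steps in order, stopping at the first failure."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits with an error.
            subprocess.run(cmd, env=env, check=True)


# Print the commands without launching training.
run_stages(dry_run=True)
```

Call `run_stages()` without `dry_run=True` to actually start training, and pass a different `gpus` string (for example `"0,1"`) if fewer GPUs are available.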

Evaluation on ICDAR15 for tracking task

You can download the pretrained CoText model (the link is in the "Main Results" section), then run the following command to evaluate it on the ICDAR2015 dataset:

python track_icd15.py

Evaluation on ICDAR15 for e2e text spotting task

You can download the pretrained CoText model (the link is in the "Main Results" section), then run the following command to evaluate it on the ICDAR2015 dataset:

python spotting_icd15_.py

Visualization

Change the dataset path in "vis_video.py" to your own path, then run:

cd eval
python vis_video.py

License

CoText is released under the MIT License.

Citing

If you use CoText in your research or wish to refer to the baseline results published here, please use the following BibTeX entries: