TransDETR: End-to-end Video Text Spotting with Transformer

License: MIT

Introduction

End-to-end Video Text Spotting with Transformer | Youtube Demo

Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text instances in video. Recent methods typically build sophisticated pipelines that rely on Intersection over Union (IoU) or appearance similarity between adjacent frames. In this paper, rooted in Transformer sequence modeling, we propose a novel video text DEtection, Tracking, and Recognition framework (TransDETR), which treats the VTS task as a direct long-sequence temporal modeling problem.

Link to our new benchmark BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting

Updates

Performance

ICDAR2015(video) Tracking challenge

| Method | MOTA | MOTP | IDF1 | Mostly Matched | Partially Matched | Mostly Lost |
|---|---|---|---|---|---|---|
| TransDETR | 47.5 | 74.2 | 65.5 | 832 | 484 | 600 |

Models are also available in Google Drive.
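For readers unfamiliar with the tracking metrics in the table above, the sketch below uses the standard CLEAR-MOT definition of MOTA and the standard identity-metric definition of IDF1. It is illustrative only: the function names and the counts are hypothetical, and the official ICDAR evaluation may differ in how detections are matched to ground truth.

```python
def mota(num_gt, false_negatives, false_positives, id_switches):
    """CLEAR-MOT accuracy as a fraction: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1 as a fraction: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Made-up counts for illustration; these are NOT the numbers behind the table.
print(f"MOTA = {100 * mota(1000, 250, 120, 30):.1f}")
print(f"IDF1 = {100 * idf1(600, 200, 200):.1f}")
```

Both values are usually reported as percentages, as in the table.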

ICDAR2015(video) Video Text Spotting challenge

| Method | MOTA | MOTP | IDF1 | Mostly Matched | Partially Matched | Mostly Lost |
|---|---|---|---|---|---|---|
| TransDETR | 58.4 | 75.2 | 70.4 | 614 | 326 | 427 |
| TransDETR(aug) | 60.9 | 74.6 | 72.8 | 644 | 323 | 400 |

Models are also available in Google Drive.

Notes

Demo

<img src="demo.gif" width="400"/> <img src="demo1.gif" width="400"/>

Installation

The codebase is built on top of Deformable DETR and MOTR.

Usage

Dataset preparation

  1. Please download the ICDAR2015, COCOTextV2, and [DSText](https://rrc.cvc.uab.es/?ch=22&com=downloads) datasets and organize them in the same layout as FairMOT, as shown below.

After downloading the video data, use ExtractFrame_FromVideo.py to extract the frames and copy them to the images path. The labels_with_ids path is generated automatically by the scripts in tools/gen_labels.

./Data
    ├── COCOText
    │   ├── images
    │   └── labels_with_ids
    ├── ICDAR15
    │   ├── images
    │   │   └── track
    │   │       ├── train
    │   │       │   ├── Video_10_1_1
    │   │       │   │   ├── 1.jpg
    │   │       │   │   ├── 2.jpg
    │   │       │   ├── Video_13_4_1
    │   │       └── val
    │   │           ├── Video_11_4_1
    │   └── labels
    │       └── track
    │           ├── train
    │           └── val
    └── DSText
        ├── images
        │   ├── train
        │   │   ├── Activity
        │   │   ├── Driving
        │   │   ├── Game
        │   │   ├── ....
        │   └── test
        │       ├── Activity
        │       ├── Driving
        │       ├── Game
        │       ├── ....
        └── labels_with_ids
            └── train
                ├── Activity
                ├── Driving
                ├── Game
                ├── ....
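Before moving on, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the directory list is simply copied from the tree. Note that the labels_with_ids folders will be reported as missing until the label generation scripts have been run.

```python
from pathlib import Path

# Directories copied from the tree above, relative to the data root.
EXPECTED_DIRS = [
    "COCOText/images",
    "COCOText/labels_with_ids",
    "ICDAR15/images/track/train",
    "ICDAR15/images/track/val",
    "ICDAR15/labels/track/train",
    "ICDAR15/labels/track/val",
    "DSText/images/train",
    "DSText/images/test",
    "DSText/labels_with_ids/train",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "./Data" is the data root used in the tree above.
    for d in missing_dirs("./Data"):
        print(f"missing: {d}")
```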

  2. You can also use the following scripts to generate the label (.txt) files:
cd tools/gen_labels
python3 gen_labels_COCOTextV2.py
python3 gen_labels_15.py
python3 gen_labels_YVT.py
cd ../../

(These scripts do two things: 1) generate the ground truth in the labels_with_ids path; 2) generate the training image list (*.txt) for each dataset's training set in ./datasets/data_path.)
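To make the second task concrete, here is a rough sketch of how a training image list could be built. It is not the code in tools/gen_labels: `write_image_list` is a hypothetical helper, and it assumes the list holds one image path per line, relative to the data root. Check the output of the real scripts for the exact format they use.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def write_image_list(data_root, images_subdir, out_txt):
    """Write one image path per line (relative to data_root); return the count."""
    root = Path(data_root)
    # Note: sorting is lexicographic, so "10.jpg" is listed before "2.jpg".
    paths = sorted(
        p.relative_to(root).as_posix()
        for p in (root / images_subdir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    out = Path(out_txt)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(paths) + ("\n" if paths else ""))
    return len(paths)


# Example call (the output file name is an assumption, not a repository path):
# write_image_list("./Data", "ICDAR15/images/track/train",
#                  "./datasets/data_path/ICDAR15_train.txt")
```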

Note: Before running the corresponding script, you need to modify the paths in the .py file to your own paths. Specifically, you should modify the following paths:

Training and Evaluation

Training on single node

Before training, update the following paths in the .sh file to match your setup: mot_path, which is your data path (e.g., ./Data), and data_txt_path_train, which is the training image list file (.txt) generated during data preparation.
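A quick pre-flight check can catch a wrong path before a long training job starts. The sketch below is an illustration rather than repository code: `check_training_paths` is a hypothetical helper, and it assumes each line of the image list is an image path relative to mot_path.

```python
from pathlib import Path


def check_training_paths(mot_path, data_txt_path_train, max_report=5):
    """Return a list of problems; an empty list means the paths look usable."""
    root, list_file = Path(mot_path), Path(data_txt_path_train)
    problems = []
    if not root.is_dir():
        problems.append(f"mot_path is not a directory: {root}")
    if not list_file.is_file():
        problems.append(f"image list not found: {list_file}")
        return problems
    entries = [ln.strip() for ln in list_file.read_text().splitlines() if ln.strip()]
    if not entries:
        problems.append(f"image list is empty: {list_file}")
    # Assumption: each entry is an image path relative to mot_path.
    missing = [e for e in entries if not (root / e).is_file()]
    problems += [f"listed image not found: {e}" for e in missing[:max_report]]
    return problems
```

An empty return value only means the files exist; it does not validate the label contents.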

You can download the COCOTextV2-pretrained TransDETR weights from Google Drive, or run the pretraining yourself:

sh configs/r50_TransDETR_pretrain_COCOText.sh

Then train on ICDAR2015 with 8 GPUs as follows:

sh configs/r50_TransDETR_train_ICDAR15video.sh

Or train on DSText with 8 GPUs as follows:

sh configs/r50_TransDETR_train_DSText.sh

Evaluation on ICDAR13 and ICDAR15

You can download the pretrained TransDETR model (the link is in the "Performance" section), then run the following command to evaluate it on the ICDAR2015 dataset:

sh configs/r50_TransDETR_eval_ICDAR2015.sh

Evaluate on ICDAR13:

python tools/Evaluation_ICDAR13/evaluation.py --groundtruths "./tools/Evaluation_ICDAR13/gt" --tests "./exps/e2e_TransVTS_r50_ICDAR15/jons"

Evaluate on ICDAR15:

cd exps/e2e_TransVTS_r50_ICDAR15
zip -r preds.zip ./preds/*

Then submit preds.zip to the ICDAR2015 online evaluation server.

Evaluation on DSText

For inference, we also provide the trained weights on Google Drive:

sh configs/r50_TransDETR_eval_BOVText.sh

Then zip the result files and submit them to the DSText online evaluation server:

cd exps/e2e_TransVTS_r50_DSText/preds
zip -r ../preds.zip ./*
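The two zip commands in this section produce different archive layouts: the ICDAR15 command keeps the preds/ folder inside the archive, while the DSText command stores the files at the archive root. On systems without the zip tool, the same archives can be built with Python's standard library. This is a minimal sketch; `zip_predictions` and its `keep_top_dir` flag are hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def zip_predictions(pred_dir, zip_path, keep_top_dir):
    """Zip every file under pred_dir and return the sorted archive member names.

    keep_top_dir=True  stores members as "preds/<file>" (like `zip -r preds.zip ./preds/*`).
    keep_top_dir=False stores members as "<file>" (like zipping from inside preds/).
    """
    pred_dir = Path(pred_dir)
    base = pred_dir.parent if keep_top_dir else pred_dir
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(pred_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(base).as_posix())
        return sorted(zf.namelist())


# Example calls using the experiment folders named in this section:
# zip_predictions("exps/e2e_TransVTS_r50_ICDAR15/preds",
#                 "exps/e2e_TransVTS_r50_ICDAR15/preds.zip", keep_top_dir=True)
# zip_predictions("exps/e2e_TransVTS_r50_DSText/preds",
#                 "exps/e2e_TransVTS_r50_DSText/preds.zip", keep_top_dir=False)
```

Write the archive outside pred_dir, as in the examples, so it is not picked up while the folder is being scanned.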

Visualization

To visualize the results as a demo video, enable 'vis=True' in eval.py, for example:

--show

Then run the script:

python tools/vis.py

License

TransDETR is released under MIT License.

Citing

If you use TransDETR in your research or wish to refer to the baseline results published here, please use the following BibTeX entries:

@article{wu2022transdetr,
  title={End-to-End Video Text Spotting with Transformer},
  author={Weijia Wu and Chunhua Shen and Yuanqiang Cai and Debing Zhang and Ying Fu and Ping Luo and Hong Zhou},
  journal={arxiv},
  year={2022}
}

If you have any questions, please contact me at: weijiawu@zju.edu.cn

This code builds on code from MOTR, TransVTSpotter, and EAST. Many thanks to the authors for their wonderful work. Consider citing them as well:

@inproceedings{zeng2021motr,
  title={MOTR: End-to-End Multiple-Object Tracking with TRansformer},
  author={Zeng, Fangao and Dong, Bin and Zhang, Yuang and Wang, Tiancai and Zhang, Xiangyu and Wei, Yichen},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}

@article{wu2021bilingual,
  title={A bilingual, OpenWorld video text dataset and end-to-end video text spotter with transformer},
  author={Wu, Weijia and Cai, Yuanqiang and Zhang, Debing and Wang, Sibo and Li, Zhuang and Li, Jiahong and Tang, Yejun and Zhou, Hong},
  journal={arXiv preprint arXiv:2112.04888},
  year={2021}
}

@inproceedings{zhou2017east,
  title={East: an efficient and accurate scene text detector},
  author={Zhou, Xinyu and Yao, Cong and Wen, He and Wang, Yuzhi and Zhou, Shuchang and He, Weiran and Liang, Jiajun},
  booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  pages={5551--5560},
  year={2017}
}