
# E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation

The official repository for the ICDAR 2023 conference paper *E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation*.

## 1. Introduction

Text image machine translation (TIMT) aims to translate texts embedded in images from a source language into a target language. Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues. Cascade models can benefit from large-scale optical character recognition (OCR) and MT datasets, but the two-stage architecture is redundant. End-to-end models are efficient but suffer from a shortage of training data. To this end, our paper proposes an end-to-end TIMT model that fully exploits the knowledge in existing OCR and MT datasets to pursue a framework that is both effective and efficient. Specifically, we build a novel modal adapter that effectively bridges the OCR encoder and the MT decoder (see the sketch after the figure below). An end-to-end TIMT loss and a cross-modal contrastive loss are applied jointly to align the feature distributions of the OCR and MT tasks. Extensive experiments show that the proposed method outperforms existing two-stage cascade models and one-stage end-to-end models with a lighter and faster architecture. Furthermore, ablation studies verify the generalization of our method: the proposed modal adapter is effective at bridging various OCR and MT models.

<img src="./Figures/model.jpg" style="zoom:100%;" />
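For intuition, here is a minimal PyTorch sketch of how a modal adapter might bridge a frozen OCR encoder and an MT decoder. The dimensions, layer counts, and module names are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ModalAdapter(nn.Module):
    """Illustrative adapter that maps OCR encoder features into the
    embedding space expected by the MT decoder (dimensions assumed)."""

    def __init__(self, ocr_dim=512, mt_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        # Linear projection to align the OCR and MT feature dimensions.
        self.proj = nn.Linear(ocr_dim, mt_dim)
        # A small Transformer stack to adapt the feature distribution.
        layer = nn.TransformerEncoderLayer(
            d_model=mt_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, ocr_features):
        # ocr_features: (batch, seq_len, ocr_dim) from a frozen OCR encoder.
        return self.blocks(self.proj(ocr_features))
```

In such a setup, the adapted features would serve as the cross-attention memory for the MT decoder, e.g. `mt_decoder(tgt_tokens, memory=adapter(ocr_encoder(images)))` (all names hypothetical).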

## 2. Usage

### 2.1 Requirements

### 2.2 Train the Model

```bash
bash ./train_model_guide.sh
```
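As described in the introduction, training combines an end-to-end TIMT loss with a cross-modal contrastive loss. Below is a minimal sketch of such a joint objective, assuming mean-pooled sentence embeddings and an InfoNCE-style contrastive term; the weighting `alpha`, the `temperature`, and the pooling are assumptions, and the training script may implement the losses differently:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, tgt_tokens, img_feats, txt_feats,
               temperature=0.1, alpha=1.0, pad_id=0):
    """Sketch: end-to-end TIMT cross-entropy plus an InfoNCE-style
    cross-modal contrastive loss (weighting and pooling are assumed)."""
    # TIMT loss: predict target tokens from the adapted image features.
    timt = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_tokens.reshape(-1),
        ignore_index=pad_id)

    # Contrastive loss: pull paired image/text embeddings together and
    # push apart the other pairs in the batch.
    img = F.normalize(img_feats.mean(dim=1), dim=-1)  # (batch, dim)
    txt = F.normalize(txt_feats.mean(dim=1), dim=-1)  # (batch, dim)
    sim = img @ txt.t() / temperature                 # (batch, batch)
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = (F.cross_entropy(sim, labels) +
                   F.cross_entropy(sim.t(), labels)) / 2

    return timt + alpha * contrastive
```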

### 2.3 Evaluate the Model

```bash
bash ./test_model_guide.sh
```

### 2.4 Datasets

We use the dataset released in E2E_TIT_With_MT.

## 3. Acknowledgement

Our implementation references code from the following repositories:

We thank all the researchers who have made their code publicly available.

## 4. Citation

If you want to cite our paper, please use the following BibTeX entry:

If you have any questions, please contact us by email.