XLM-Align

Code and models for the paper Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment.

Update: released the aligner for word alignment on translation pairs. See this page.

The XLM-Align pretraining code has been uploaded to the unilm repo.

Introduction

XLM-Align is a pretrained cross-lingual language model that supports 94 languages. See details in our paper.

Our Cross-Lingual Language Models

Example Application Scenarios

How to Use

From huggingface model hub

We provide the models in Hugging Face format, so you can use them directly with the Hugging Face API:

XLM-Align

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("microsoft/xlm-align-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/xlm-align-base")

Note: We have moved the XLM-Align model from CZWin32768/xlm-align to microsoft/xlm-align-base. We will also preserve the original repo for compatibility, so there is no difference between the two repositories.

InfoXLM-base

model = AutoModel.from_pretrained("microsoft/infoxlm-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/infoxlm-base")

InfoXLM-large

model = AutoModel.from_pretrained("microsoft/infoxlm-large")
tokenizer = AutoTokenizer.from_pretrained("microsoft/infoxlm-large")

Finetuning on end tasks

Our models use the same vocabulary, tokenizer, and architecture as XLM-RoBERTa, so you can directly reuse existing code for finetuning XLM-R, simply replacing the model name xlm-roberta-base with microsoft/xlm-align-base, microsoft/infoxlm-base, or microsoft/infoxlm-large.

For example, you can evaluate our model with xTune on the XTREME benchmark.
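
As a concrete illustration, here is a minimal sketch (not taken from this repository; the num_labels value assumes an XNLI-style three-class task) of the drop-in replacement for a sequence-classification setup:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any fine-tuning script written for "xlm-roberta-base" can be reused as-is;
# only the model name changes, since XLM-Align shares XLM-R's tokenizer,
# vocabulary, and architecture.
model_name = "microsoft/xlm-align-base"  # or microsoft/infoxlm-base / microsoft/infoxlm-large

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=3 assumes an XNLI-style task (entailment / neutral / contradiction)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# From here, fine-tune exactly as you would fine-tune XLM-R, e.g. with the
# Hugging Face Trainer or an existing training loop.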

Evaluation Results

XTREME cross-lingual understanding tasks (XQuAD, MLQA, and TyDiQA results are F1 / EM):

| Model | POS | NER | XQuAD | MLQA | TyDiQA | XNLI | PAWS-X | Avg |
|---|---|---|---|---|---|---|---|---|
| XLM-R_base | 75.6 | 61.8 | 71.9 / 56.4 | 65.1 / 47.2 | 55.4 / 38.3 | 75.0 | 84.9 | 66.4 |
| InfoXLM_base | - | - | - | 68.1 / 49.6 | - | 76.5 | - | - |
| XLM-Align_base | 76.0 | 63.7 | 74.7 / 59.0 | 68.1 / 49.8 | 62.1 / 44.8 | 76.2 | 86.8 | 68.9 |

(The models are finetuned under the cross-lingual transfer setting, i.e., finetuning only with English training data and evaluating directly on the target languages.)

Pretraining XLM-Align

We have uploaded the pretraining code to the unilm repo.

Here is an example command for pretraining XLM-Align-base:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--task xlm_align --criterion dwa_mlm_tlm \
--tlm_data ${TLM_DATA_DIR} \
--arch xlm_align_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 \
--clip-norm 1.0 --lr-scheduler polynomial_decay --lr 0.0002 \
--warmup-updates 10000 --total-num-update 200000 --max-update 200000 \
--dropout 0.0 --attention-dropout 0.0 --weight-decay 0.01 \
--max-sentences 16 --update-freq 16 --log-format simple \
--log-interval 1 --disable-validation --save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR} \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--roberta-model-path /path/to/model.pt \
--num-workers 2 --ddp-backend=c10d --distributed-no-spawn \
--wa_layer 10 --wa_max_count 2 --sinkhorn_iter 2

See more details at the InfoXLM page.

References

Please cite the paper if you find the resources in this repository useful.

[1] XLM-Align (ACL 2021, paper, repo, model) Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

@inproceedings{xlmalign,
  title = "Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment",
  author = {Zewen Chi and Li Dong and Bo Zheng and Shaohan Huang and Xian-Ling Mao and Heyan Huang and Furu Wei},
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-long.265",
  doi = "10.18653/v1/2021.acl-long.265",
  pages = "3418--3430",
}

[2] InfoXLM (NAACL 2021, paper, repo, model) InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training.

@inproceedings{chi-etal-2021-infoxlm,
  title = "{I}nfo{XLM}: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training",
  author = {Chi, Zewen and Dong, Li and Wei, Furu and Yang, Nan and Singhal, Saksham and Wang, Wenhui and Song, Xia and Mao, Xian-Ling and Huang, Heyan and Zhou, Ming},
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = jun,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-main.280",
  doi = "10.18653/v1/2021.naacl-main.280",
  pages = "3576--3588",
}

Contact Information

Zewen Chi (chizewen@outlook.com)