Adaptive-MT-LLM-Fine-tuning
Code and data for the paper Fine-tuning Large Language Models for Adaptive Machine Translation
The paper presents the outcomes of fine-tuning Mistral 7B, a general-purpose large language model (LLM), for adaptive machine translation (MT). The fine-tuning process involves utilizing a combination of zero-shot and one-shot translation prompts within the medical domain. Zero-shot prompts represent regular translation without any context, while one-shot prompts augment the new source with a similar translation pair, i.e. a fuzzy match, to improve adherence to the terminology and style of the domain. The primary objective is to enhance the real-time adaptive MT capabilities of Mistral 7B, enabling it to adapt translations to the required domain at inference time. Our experiments demonstrate that, with a relatively small dataset of 20,000 segments that incorporate a mix of zero-shot and one-shot prompts, fine-tuning significantly enhances Mistral's in-context learning ability, especially for real-time adaptive MT.
Dependencies
You might want to install the latest versions of the required libraries, but if you run into issues, try the versions pinned in the requirements file.
pip3 install -r requirements.txt
Data (training and test)
The original dataset is a mix of medical datasets from OPUS, namely ELRC, EMEA, SciELO, and TICO-19.
Training data (small)
- Fine-tuning data - small [ES][EN]: Data for actual fine-tuning: 10,000 translation pairs
- Context Dataset [ES][EN]: Data for fuzzy match retrieval for training: 50,000 translation pairs
- Retrieved data: Data after retrieval for training: 10,000 entries (format: {score} ||| {fuzzy_src_sent} ||| {new_src_sent} ||| {fuzzy_tgt_sent})
Test Data
- Test dataset [ES][EN]: Data used for actual inference/translation: 10,000 translation pairs
- Context Dataset [ES][EN]: Data for fuzzy match retrieval for testing: 50,000 translation pairs
- Retrieved data: Data after retrieval for testing: 10,000 entries (format: {score} ||| {fuzzy_src_sent} ||| {new_src_sent} ||| {fuzzy_tgt_sent})
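Each line of the retrieved-data files follows the format above, so it can be read by splitting on the ||| separator. A minimal parsing sketch (the file name below is a placeholder):

# Parse the retrieved data: {score} ||| {fuzzy_src_sent} ||| {new_src_sent} ||| {fuzzy_tgt_sent}
with open("retrieved_test.txt", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        score, fuzzy_src, new_src, fuzzy_tgt = [field.strip() for field in line.split("|||")]
        # score: similarity of the fuzzy match; fuzzy_src/fuzzy_tgt: the retrieved translation pair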
Data Processing
The original dataset is a mix of medical datasets from OPUS, namely ELRC, EMEA, SciELO, and TICO-19. The pre-processing step mainly removes duplicates and overly long sentences. The code for data pre-processing is at Data-Processing-Adaptive-MT.ipynb
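As a rough sketch of the kind of filtering the notebook applies (the file name and length threshold below are only illustrative; see the notebook for the actual steps):

import pandas as pd

# Load the raw bilingual data (placeholder file name and column layout)
df = pd.read_csv("medical_es_en.tsv", sep="\t", names=["src", "tgt"])

# Remove duplicate translation pairs
df = df.drop_duplicates(subset=["src", "tgt"])

# Drop overly long segments (the threshold is illustrative)
max_words = 100
keep = (df["src"].str.split().str.len() <= max_words) & (df["tgt"].str.split().str.len() <= max_words)
df = df[keep].reset_index(drop=True)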
Fuzzy Match Retrieval
We use Sentence-Transformers with a multilingual model, namely Microsoft’s “Multilingual-MiniLM-L12-H384”, to generate embeddings for the datasets. For indexing, we use Faiss. Then we retrieve fuzzy matches through semantic search. You can find more details about the retrieval process in our paper. The code for this fuzzy match retrieval process is at Retrieve-Fuzzy-Matches-Faiss-Adaptive-MT.ipynb
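A minimal sketch of this retrieval idea with Sentence-Transformers and Faiss, assuming the embedding model is loaded directly from its Hugging Face Hub id; the notebook handles batching, file I/O, and the full datasets, while the snippet below only shows the core steps on tiny placeholder lists:

import faiss
from sentence_transformers import SentenceTransformer

# Multilingual embedding model named in the paper
model = SentenceTransformer("microsoft/Multilingual-MiniLM-L12-H384")

# Context dataset (fuzzy-match pool) and new source sentences -- tiny placeholders
context_src = ["Tome el comprimido con agua.", "La dosis recomendada es de 10 mg."]
context_tgt = ["Take the tablet with water.", "The recommended dose is 10 mg."]
new_src = ["La dosis recomendada es de 20 mg."]

# Embed and L2-normalize so that inner product equals cosine similarity
ctx_emb = model.encode(context_src, normalize_embeddings=True, convert_to_numpy=True)
new_emb = model.encode(new_src, normalize_embeddings=True, convert_to_numpy=True)

# Index the context embeddings and retrieve the top fuzzy match per new source
index = faiss.IndexFlatIP(ctx_emb.shape[1])
index.add(ctx_emb)
scores, ids = index.search(new_emb, 1)

for i, (score, j) in enumerate(zip(scores[:, 0], ids[:, 0])):
    print(f"{score:.4f} ||| {context_src[j]} ||| {new_src[i]} ||| {context_tgt[j]}")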
Fine-tuning Mistral 7B
We used QLoRA for efficient fine-tuning with 4-bit quantization, via Hugging Face Transformers. You can find more details in the paper and the notebook Mistral-Fine-Tuning-Adaptive-MT.ipynb
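In outline, the QLoRA setup looks roughly like the following; the hyperparameter values are illustrative rather than the ones used in the paper, and the full training loop is in the notebook:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit (NF4) quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections (values are illustrative)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)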
Prompts are created in this notebook using the create_prompt() function. If one_shot=False, it creates a zero-shot translation prompt; otherwise, it creates a one-shot translation prompt. Please check out the notebook itself for actual examples.
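For illustration only, the logic is roughly as follows; the exact signature and prompt wording are defined in the notebook:

def create_prompt(new_src, fuzzy_src=None, fuzzy_tgt=None, one_shot=False):
    # Illustrative re-creation of the prompt logic; the real template lives in the notebook.
    if one_shot:
        # One-shot: prepend the retrieved fuzzy match as an example translation pair
        return (f"Spanish: {fuzzy_src}\nEnglish: {fuzzy_tgt}\n"
                f"Spanish: {new_src}\nEnglish:")
    # Zero-shot: translate the new source without any context
    return f"Spanish: {new_src}\nEnglish:"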
Inference
Conversion to the CTranslate2 format
- Mistral 7B (baseline): To convert the Mistral baseline (before fine-tuning) to the CTranslate2 format:
ct2-transformers-converter --model mistralai/Mistral-7B-v0.1 --quantization int8 --output_dir ct2-mistral-7B-v0.1
- Mistral 7B (fine-tuned): To convert Mistral after fine-tuning to the CTranslate2 format, check the steps at Convert-Mistral-Finetuned-CTranslate2.ipynb
- NLLB-200: To convert NLLB-200 to the CTranslate2 format:
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir ct2/nllb-200-distilled-600M-int8
Tokenizers
- Mistral 7B: You can directly use the tokenizers from the Transformers library as illustrated in the notebook Mistral-CTranslate2-Adaptive-MT.ipynb
- NLLB-200: Download the SentencePiece model for NLLB-200; then use it as illustrated in the notebook NLLB-200-CTranslate2-Adaptive-MT.ipynb
!wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
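A minimal sketch of using that SentencePiece model together with the converted CTranslate2 model; the language codes, decoding settings, and model directory below are illustrative, and the notebook has the full version:

import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("ct2/nllb-200-distilled-600M-int8", device="auto")
sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

src_lang, tgt_lang = "spa_Latn", "eng_Latn"
sources = ["La dosis recomendada es de 10 mg."]

# NLLB expects the source language tag, the subword pieces, and the end-of-sentence token
source_tokens = [[src_lang] + sp.encode(s, out_type=str) + ["</s>"] for s in sources]
target_prefix = [[tgt_lang]] * len(sources)

results = translator.translate_batch(source_tokens, target_prefix=target_prefix, beam_size=4)
# Drop the leading target language token before detokenizing
translations = [sp.decode(r.hypotheses[0][1:]) for r in results]
print(translations[0])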
Translation
- Mistral 7B (baseline and fine-tuned): Translation code with CTranslate2 is at Mistral-CTranslate2-Adaptive-MT.ipynb (a minimal sketch follows after this list)
- NLLB-200: Translation code with CTranslate2 is at NLLB-200-CTranslate2-Adaptive-MT.ipynb
- ChatGPT: Translation via the official API; the code is at ChatGPT-Adaptive-MT.ipynb
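For Mistral (baseline or fine-tuned), generation with CTranslate2 looks roughly like this; the prompt, decoding settings, and model directory are illustrative, and the notebook contains the version used in the experiments:

import ctranslate2
from transformers import AutoTokenizer

generator = ctranslate2.Generator("ct2-mistral-7B-v0.1", device="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

prompt = "Spanish: La dosis recomendada es de 10 mg.\nEnglish:"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=128,
    include_prompt_in_result=False,
)
# Decode only the newly generated tokens
translation = tokenizer.decode(results[0].sequences_ids[0])
print(translation)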
Evaluation
Evaluation was conducted with the BLEU, chrF++, TER, and COMET metrics. The code is available at Evaluation-Adaptive-MT.ipynb. The full evaluation scores are available in the paper under the Results section, and a detailed version is at Evaluation-Scores-Adaptive-MT.csv
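A small sketch of how these metrics can be computed with sacrebleu and the COMET library; the COMET checkpoint name below is an assumption, so check the paper and the notebook for the exact setup:

import sacrebleu
from comet import download_model, load_from_checkpoint

srcs = ["La dosis recomendada es de 10 mg."]   # source sentences
hyps = ["The recommended dose is 10 mg."]      # system translations
refs = ["The recommended dosage is 10 mg."]    # reference translations

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)  # word_order=2 gives chrF++
ter = sacrebleu.corpus_ter(hyps, [refs])
print(f"BLEU {bleu.score:.2f} | chrF++ {chrf.score:.2f} | TER {ter.score:.2f}")

# COMET (checkpoint name is an assumption, not necessarily the one used in the paper)
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(f"COMET {comet_model.predict(data, batch_size=8, gpus=0).system_score:.4f}")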
Questions
If you have questions, please feel free to contact me.
Citations
- Fine-tuning Large Language Models for Adaptive Machine Translation
@ARTICLE{Moslem2023-Finetuning-LLM-AdaptiveMT,
title = "{Fine-tuning Large Language Models for Adaptive Machine Translation}",
author = "Moslem, Yasmin and Haque, Rejwanul and Way, Andy",
month = dec,
year = 2023,
url = "http://arxiv.org/abs/2312.12740",
archivePrefix = "arXiv",
primaryClass = "cs.CL",
eprint = "2312.12740"
}
- Adaptive Machine Translation with Large Language Models
@INPROCEEDINGS{Moslem2023-AdaptiveMT,
title = "{Adaptive Machine Translation with Large Language Models}",
booktitle = "{Proceedings of the 24th Annual Conference of the European Association
for Machine Translation}",
author = "Moslem, Yasmin and Haque, Rejwanul and Kelleher, John D and Way, Andy",
publisher = "European Association for Machine Translation",
pages = "227--237",
month = jun,
year = 2023,
url = "https://aclanthology.org/2023.eamt-1.22",
address = "Tampere, Finland"
}