Lost in the Source Language

This is the repository for the paper "Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation", accepted to the Findings of ACL 2024. We provide our source code, data, and results for easy reproduction.

Requirements

We recommend using vLLM to accelerate inference.
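
If vLLM is not already in your environment, it can be installed with pip (any recent version should work; pin a specific version if you need exact reproducibility):

pip install vllm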

Coarse-grained Score Prediction (GEMBA)

We use GEMBA's source code to predict scores. The results of our experiments are in the gemba_results folder. We compute the correlations between the metric scores and human scores using mt-metrics-eval.
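
Both tools live in their own repositories. A minimal setup sketch is shown below; the URLs point to the upstream GEMBA and mt-metrics-eval projects, and you should follow their READMEs for the exact installation and data-download steps:

git clone https://github.com/MicrosoftTranslator/GEMBA.git
git clone https://github.com/google-research/mt-metrics-eval.git
pip install ./mt-metrics-eval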

Fine-grained Error Detection (AutoMQM)

We implement AutoMQM for fine-grained error detection. For example, if you want to use GPT-3.5 in the S-R-T mode, simply run automqm.py as follows:

python automqm.py --model-name gpt-3.5-turbo-0613 --lang-pair en-de --prefix gpt3.5-turbo_ref_stratified_wmt22_ende_3200 --example-selector stratified --has-source --has-reference --prompt-path prompts/prompt_ref_sample.json

To evaluate the output of AutoMQM, use evaluate.py with the corresponding subcommand, such as sf1_mf1 or mcc, or just use the test_all subcommand. If you want to convert the results to MQM scores, use the save_scores subcommand of evaluate.py.
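
As a sketch, an evaluation run might look like the following. The subcommand names come from evaluate.py, but the invocation pattern and the --prefix and --lang-pair flags are assumptions that mirror the automqm.py command above; run evaluate.py --help for the actual arguments.

# hypothetical flags, shown for illustration only
python evaluate.py test_all --prefix gpt3.5-turbo_ref_stratified_wmt22_ende_3200 --lang-pair en-de
python evaluate.py save_scores --prefix gpt3.5-turbo_ref_stratified_wmt22_ende_3200 --lang-pair en-de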

Fine-tune Llama2

The processed training data is in the data folder; it is derived from the WMT21 MQM data. The output format is similar to that of InstructScore. To fine-tune the Llama2 model, simply run finetune_llama2.sh. Don't forget to configure parameters such as $MODEL_PATH_OR_NAME.
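
As an illustration, one way to launch fine-tuning is shown below. The Hugging Face model id is only an example value for $MODEL_PATH_OR_NAME, and depending on how the script consumes the variable you may need to edit it inside finetune_llama2.sh instead of setting it on the command line:

# example value for $MODEL_PATH_OR_NAME; other variables in the script may also need to be set
MODEL_PATH_OR_NAME=meta-llama/Llama-2-7b-hf bash finetune_llama2.sh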

After training, use inference.py to generate answers on the test set.

Finally, use postprocess_inference.py to compute the MQM scores of the answers.
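
A minimal sketch of these two post-training steps follows; the flag names are assumptions for illustration only, so run each script with --help to see its actual arguments:

# hypothetical flags, shown for illustration only
python inference.py --model-path /path/to/finetuned-llama2 --test-file data/test.json --output-file outputs/answers.json
python postprocess_inference.py --input-file outputs/answers.json --output-file outputs/mqm_scores.json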