uncertainties_MT_eval
Code and data for the paper: Disentangling Uncertainty in Machine Translation Evaluation
Quick Installation
We are using Python 3.8.
Detailed usage examples and instructions for the COMET metric can be found in the Full Documentation.
To develop locally:
git clone https://github.com/deep-spin/uncertainties_MT_eval.git
pip install -r requirements.txt
pip install -e .
TL;DR
This repository is an extension of the original COMET metric, providing different options to enhance it with uncertainty predictors. It includes code for heteroscedastic losses (HTS and KL), as well as the option to use the same architecture for direct uncertainty prediction (DUP). We used COMET v1.0 as the basis for this extension.
Important commands
- To train a new metric, use:
comet-train --cfg config/models/model_config.yaml
- To score a triplet consisting of a source file <src.txt>, a translation file <mt.txt> and a reference file <ref.txt> with a trained metric and obtain predictions, use:
comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt
Description of configurations and command options
COMET configuration
To train a plain COMET model on your data without using the uncertainty-related code, use the configuration file: uncertainties_MT_eval/configs/models/regression_metric_comet_plain.yaml
This model will use an MSE loss and will produce a single output for each segment, corresponding to the predicted quality score.
COMET with MC Dropout configuration
After training (any) COMET model, you can apply MC Dropout during inference by passing the --mc_dropout option to comet-score, specifying the desired number N of stochastic forward runs, as follows:
comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt --mc_dropout N
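Conceptually, MC Dropout keeps dropout active at inference time and aggregates the N stochastic runs into a mean score and a variance per segment. A minimal sketch of that aggregation (illustrative only; the actual aggregation is implemented inside comet-score):

```python
import numpy as np

def mc_dropout_stats(scores):
    """Aggregate N stochastic forward passes into a mean quality score
    and a variance (epistemic uncertainty) per segment.

    scores: array-like of shape (N, num_segments), one row per
    stochastic run with dropout kept active at inference time.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)  # predicted quality score per segment
    var = scores.var(axis=0)    # uncertainty of that score per segment
    return mean, var

# Toy example: N=4 stochastic runs over 2 segments.
runs = [[0.80, 0.30],
        [0.82, 0.25],
        [0.78, 0.35],
        [0.80, 0.30]]
mean, var = mc_dropout_stats(runs)
```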
This option can be used with models trained using any of the three loss options: hts, kl, mse.
If the option is used with a model trained with the MSE loss, the model will produce a second output for each segment, corresponding to the variance/uncertainty of that segment's quality score prediction.
If the option is used in combination with either of the two heteroscedastic losses, the model will generate four outputs for each segment in total:
- The predicted quality score
- The estimated variance for the quality score
- The predicted aleatoric uncertainty
- The estimated variance of the aleatoric uncertainty
Then the total uncertainty value for the segment can be calculated as indicated in Eq. 4 in the paper.
Note that we used N=100 for all experiments in the paper. To reproduce other related works, this number may need to be reduced.
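For the heteroscedastic case, the four outputs can be combined into a single total uncertainty per segment. The exact formula is Eq. 4 in the paper; the sketch below assumes the standard decomposition (epistemic variance of the predicted means plus mean of the predicted aleatoric variances), so check it against Eq. 4 before relying on it:

```python
import numpy as np

def total_uncertainty(mu_runs, sigma2_runs):
    """Combine MC Dropout outputs of a heteroscedastic model into one
    uncertainty value per segment.

    mu_runs:     (N, num_segments) predicted quality scores per run
    sigma2_runs: (N, num_segments) predicted aleatoric variances per run

    Assumed decomposition (verify against Eq. 4 in the paper):
    total = epistemic (variance of means) + aleatoric (mean of variances).
    """
    mu_runs = np.asarray(mu_runs, dtype=float)
    sigma2_runs = np.asarray(sigma2_runs, dtype=float)
    epistemic = mu_runs.var(axis=0)
    aleatoric = sigma2_runs.mean(axis=0)
    return epistemic + aleatoric
```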
COMET with aleatoric uncertainty predictions
There are two options to train COMET with aleatoric uncertainty prediction.
- Heteroscedastic uncertainty (HTS), which can be used with any labelled dataset. It only requires setting the loss to "hts" in the configuration file; see uncertainties_MT_eval/configs/models/regression_metric_comet_heteroscedastic.yaml for an example.
- KL-divergence minimisation based uncertainty (KL). Training a model with the KL setup requires access to labelled data with multiple annotations per segment, providing either (a) multiple human judgements per segment, or (b) the standard deviation of the annotator scores per segment. See the file uncertainties_MT_eval/data/mqm2020/mqm.train.z_score.csv for an example. To train a model on such data, set the loss to "kl" in the configuration file; see uncertainties_MT_eval/configs/models/regression_metric_comet_kl.yaml for an example.
COMET-based direct uncertainty prediction (COMET-DUP)
It is possible to train a COMET model to predict the uncertainty of a given prediction (casting uncertainty as the error/distance to the human judgement), henceforth referred to as COMET-DUP.
Training Setup:
To train a COMET-DUP model it is necessary to:
- Have access to human judgements $q^*$ on a train dataset $\mathcal{D}$
- Run a MT Evaluation or MT Quality Estimation model to obtain quality predictions $\hat{q}$ over $\mathcal{D}$
- Calculate $\epsilon = |q^*-\hat{q}|$ for $\mathcal{D}$
- Use $\epsilon$ as the target for the uncertainty-predicting COMET, instead of the human quality judgements, which are the default target
Provide the training data in a csv file with a column f1 that holds the predicted quality scores $\hat{q}$ and a column score that contains the computed $\epsilon$ (target) for each <src, mt, ref> instance.
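The steps above can be sketched as a small script that computes the $\epsilon$ targets and writes them in the expected csv layout. The column names f1 and score come from this README; everything else (input dict keys, file layout) is illustrative:

```python
import csv

def build_dup_training_file(rows, out_path):
    """Write a COMET-DUP training csv: for each (src, mt, ref) triple,
    'f1' holds the base model's quality prediction q_hat and 'score'
    holds the DUP target epsilon = |q_star - q_hat|.

    rows: iterable of dicts with keys src, mt, ref, q_star, q_hat
          (hypothetical keys for this sketch).
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["src", "mt", "ref", "f1", "score"])
        writer.writeheader()
        for r in rows:
            writer.writerow({
                "src": r["src"],
                "mt": r["mt"],
                "ref": r["ref"],
                "f1": r["q_hat"],                        # predicted quality score
                "score": abs(r["q_star"] - r["q_hat"]),  # epsilon target
            })
```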
Losses
Once $\epsilon$ is calculated as above, three different losses can be used for COMET-DUP training:
- Typical MSE loss: $\mathcal{L}^\mathrm{E}_{\mathrm{ABS}}(\hat{\epsilon}; \epsilon^*) = (\epsilon^* - \hat{\epsilon})^2$
  Specify loss: "mse" in the yaml configuration file to use it
- MSE loss with squared values: $\mathcal{L}^\mathrm{E}_{\mathrm{SQ}}(\hat{\epsilon}; \epsilon^*) = ((\epsilon^*)^2 - \hat{\epsilon}^2)^2$
  Specify loss: "squared" in the yaml configuration file to use it
- Heteroscedastic approximation loss: $\mathcal{L}^\mathrm{E}_{\mathrm{HTS}}(\hat{\epsilon}; \epsilon^*) = \frac{(\epsilon^*)^2}{2 \hat{\epsilon}^2} + \frac{1}{2}\log\hat{\epsilon}^2$
  Specify loss: "hts_approx" in the yaml configuration file to use it
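As a quick reference, the three formulas can be written down in a few lines. This is a numpy sketch, not the repo's actual implementation, which operates on torch tensors inside comet-train:

```python
import numpy as np

def dup_mse(eps_hat, eps):        # loss: "mse"
    """Plain MSE between predicted and observed error."""
    return np.mean((eps - eps_hat) ** 2)

def dup_squared(eps_hat, eps):    # loss: "squared"
    """MSE between the squares of predicted and observed error."""
    return np.mean((eps ** 2 - eps_hat ** 2) ** 2)

def dup_hts_approx(eps_hat, eps):  # loss: "hts_approx"
    """Heteroscedastic approximation: treat eps_hat as a predicted
    standard deviation of the error."""
    return np.mean(eps ** 2 / (2.0 * eps_hat ** 2)
                   + 0.5 * np.log(eps_hat ** 2))
```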
Bottleneck:
Unlike COMET, COMET-DUP uses a bottleneck layer to incorporate the initial quality predictions $\hat{q}$ during training. You need to specify the size of the bottleneck layer in the configuration file.
Recommended value: 256
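One plausible way to picture the bottleneck is a low-dimensional projection of the sentence representation with $\hat{q}$ appended before the regression head. The sketch below is purely illustrative (the shapes, activation, and exact wiring are assumptions; the real architecture is defined in the repo's model code):

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_forward(h, q_hat, W, b):
    """Sketch of a bottleneck step: project the pooled sentence
    embedding h down to a small dimension (e.g. 256) and append the
    base model's quality prediction q_hat before the final regressor.
    W, b: weights of the bottleneck projection (hypothetical shapes)."""
    z = np.tanh(h @ W + b)                              # (batch, bottleneck)
    return np.concatenate([z, q_hat[:, None]], axis=1)  # (batch, bottleneck+1)

h = rng.normal(size=(2, 1024))            # pooled embeddings, assumed dim 1024
W = rng.normal(size=(1024, 256)) * 0.01   # bottleneck projection, size 256
b = np.zeros(256)
q_hat = np.array([0.7, 0.3])              # quality predictions from the base model
out = bottleneck_forward(h, q_hat, W, b)
```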
Full Train Configuration:
For an example of a configuration file to train COMET-DUP with $\mathcal{L}^\mathrm{E}_{\mathrm{HTS}}$ see the file uncertainties_MT_eval/configs/models/regression_metric_comet_dup.yaml
Inference
For inference with COMET-DUP, use the same inference command (comet-score) as for the other COMET models, providing a trained COMET-DUP model in the --model option. Remember that the output in this case will be uncertainty scores instead of quality scores.