

IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude’s Variance Matters

  title={{IMAE} for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude’s Variance Matters},
  author={Wang, Xinshao and Hua, Yang and Kodirov, Elyor and Robertson, Neil M},
  journal={arXiv preprint arXiv:1903.12141},

Open Reviews and Discussion

Now that the paper has been released, we also release the ICCV-19 review results for your reference, following the practice of OpenReview

An open question for ML/DL researchers who care about label noise: should the validation set be clean or noisy?

Positive comments we collected

Research questions:

Our work is a further study of robust losses following MAE [1] and GCE [2]. They proved that MAE is more robust than CCE when label noise exists. However, MAE's underfitting has not been exposed or studied in the literature. We analyse it thoroughly and propose a simple solution that achieves both high fitting ability (accuracy) and test stability (robustness).

IMAE is suitable for cases where inputs and labels may be unmatched.

Training DNNs requires rethinking data fitting and generalisation. Our main contribution is a simple analysis and solution from the viewpoint of the gradient magnitude with respect to the logits.


<p float="left"> <img src="./fig/illustration_MAE_IMAE_CCE.png" width="400"> <img src="./fig/illustration_MAE_IMAE_CCE_caption.png" width="400"> </p> <p float="left"> <img src="./fig/introduction_table.png" width="400"> <img src="./fig/introduction_caption.png" width="400"> </p> <p float="left"> <img src="./fig/table2_caption.png" width="800"> <img src="./fig/table2.png" width="800"> </p> <p float="left"> <img src="./fig/figure3.png" width="400"> <img src="./fig/figure3_caption.png" width="400"> </p>

Please see our empirical evidence in the paper.

MAE’s fitting ability is much worse than CCE’s. In other words, CCE overfits to incorrect labels while MAE underfits to correct labels.
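The contrast can be illustrated from the gradient-magnitude viewpoint with a minimal sketch. It is not taken from the paper's code: the function names are hypothetical, and it assumes a softmax output, a one-hot target, and the L1 norm of the gradient with respect to the logits as the measure of how strongly an example drives the update. Under these assumptions, standard softmax calculus gives a magnitude of 2(1 - p_y) for CCE and 4 p_y (1 - p_y) for MAE, where p_y is the predicted probability of the labelled class.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cce_grad(logits, label):
    """Gradient of -log p_y w.r.t. the logits: p_j - 1[j == y]."""
    p = softmax(logits)
    return [p_j - (1.0 if j == label else 0.0) for j, p_j in enumerate(p)]


def mae_grad(logits, label):
    """Gradient of sum_j |p_j - onehot_j| = 2 * (1 - p_y) w.r.t. the logits."""
    p = softmax(logits)
    p_y = p[label]
    return [-2.0 * p_y * ((1.0 if j == label else 0.0) - p_j)
            for j, p_j in enumerate(p)]


def l1_norm(vector):
    return sum(abs(x) for x in vector)


def grad_magnitudes(logits, label):
    """Return (p_y, CCE gradient magnitude, MAE gradient magnitude)."""
    p_y = softmax(logits)[label]
    return p_y, l1_norm(cce_grad(logits, label)), l1_norm(mae_grad(logits, label))


if __name__ == "__main__":
    # Ten classes (an assumption); class 0 is the labelled class.
    for true_logit in (-4.0, 0.0, 4.0):
        logits = [true_logit] + [0.0] * 9
        p_y, g_cce, g_mae = grad_magnitudes(logits, 0)
        print(f"p_y={p_y:.3f}  CCE={g_cce:.3f}  MAE={g_mae:.3f}")
```

In this sketch, an example that the model currently gets wrong (small p_y) receives a near-maximal gradient under CCE, which is what lets CCE fit noisy labels, but a near-zero gradient under MAE, which is consistent with MAE underfitting hard but correctly labelled examples.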


Label noise is one of the most explicit cases where some observations and their labels are not matched in the training data. In this case, it is crucial that models learn meaningful patterns instead of errors.

Synthetic noise

<p float="left"> <img src="./fig/table3.png" width="400"> <img src="./fig/table3_caption.png" width="400"> </p> <p float="left"> <img src="./fig/table4.png" width="400"> </p> <img src="./fig/table6.png" width="800"> <img src="./fig/figure4.png" width="800">

Real-world unknown noise

<img src="./fig/table5.png" width="800">

Hyper-parameter Analysis

<img src="./fig/illustration_IMAE.png" width="400"> <img src="./fig/train_dynamics_T.png" width="800"> <img src="./fig/test_dynamics_T.png" width="800">


1. The idea of this paper is quite close to "Training deep neural-networks using a noise adaptation layer". Both change the weight of each sample before it is sent to softmax, although they do so in different ways. Does this decrease the novelty of this paper?

The critical differences are: 1) Noise Adaptation explicitly estimates latent true labels with an additional softmax layer, while our IMAE reweights examples based on their input-to-label relevance scores; 2) IMAE reweights samples after softmax, i.e., it scales their gradients as shown in Eq. (22) in our paper.

2. Why uniform noise (symmetric/class-independent noise )?

We choose uniform noise because it is more challenging than asymmetric (class-dependent) noise, as verified in [d]: Vahdat et al., "Toward robustness against label noise in training deep discriminative neural networks", NeurIPS, 2017.

3. Why is the performance still okay when noise rate is 80%?

With uniform noise, even at a rate of 80%, the correct label is still the most frequent one: 20% of the examples keep their true label, while the other 80% are spread evenly over the 9 remaining classes, i.e., about 8.9% per wrong class.

Majority voting is natural and intuitive: the most frequent label decides which data patterns are meaningful to learn. We believe that if noisy labels outnumber the correct ones, it is hard for DNNs to learn meaningful patterns. Majority voting is therefore a reasonable assumption.
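The arithmetic behind this answer can be checked with a small sketch. It assumes, as the answer states, that a noisy label is moved to one of the other classes with equal probability; the function names are hypothetical and only serve the illustration.

```python
def label_shares(noise_rate, num_classes):
    """Return (share keeping the true label, share landing on each wrong class).

    Assumes uniform noise: a corrupted label is moved to one of the
    other (num_classes - 1) classes with equal probability.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    correct = 1.0 - noise_rate
    per_wrong_class = noise_rate / (num_classes - 1)
    return correct, per_wrong_class


def break_even_noise_rate(num_classes):
    """Noise rate at which the true label stops being the most frequent one.

    Solves 1 - r = r / (C - 1) for r, which gives r = (C - 1) / C.
    """
    return (num_classes - 1) / num_classes


if __name__ == "__main__":
    correct, per_wrong = label_shares(0.8, 10)
    print(f"true label: {correct:.3f}, each wrong class: {per_wrong:.3f}")
    print(f"break-even noise rate: {break_even_noise_rate(10):.2f}")
```

With 10 classes, the true label stays the single most frequent label for any noise rate below 90% under this noise model, which is why 80% noise still leaves a learnable signal.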

4. The study from the gradient perspective is not new, e.g., Truncated Cauchy Non-Negative Matrix Factorization and GCE [2].

Yes, we agree that the perspective itself is not new. However, we believe that our fundamental analysis, and the simple solution we derive from the gradient viewpoint, are novel.

Truncated Cauchy Non-Negative Matrix Factorization (TPAMI-2017) and GCE [2] truncate large errors to filter out extreme outliers. Instead, our IMAE adjusts weighting variance without dropping any samples.

5. The robustness is not specific for label noise. I think the method works well for general noise, e.g., outliers.

Yes, that is a great point. Our IMAE is suitable for all cases where inputs and their labels are not semantically matched, which may come from noisy data or labels. Since we only evaluated on label noise, we did not exaggerate its efficacy.

We will test more cases in the future.

6. Is the validation data clean or not? If clean, this would greatly reduce the contribution of the paper.

Following the ML literature, a validation set should be clean, as we should not expect an ML model to predict noisy data well. In other words, we cannot evaluate a model’s performance on noisy validation/test data. Our goal is to avoid learning faults from noisy data and to generalise better during inference.

7. More experiments with comparison to prior work and more evaluation on real-world datasets with unknown noise?

Our focus is to analyse why CCE overfits while MAE underfits, as presented in the ablation studies in Table 2. Under unknown real-world noise in Table 3, we only compared with GCE [2], as it is the most related method and has been demonstrated to be the state of the art.

<img src="./fig/table5.png" width="800">


