
Derivative Manipulation: Example Weighting via Emphasis Density Function in the Context of Deep Learning

CVPR 2021 Review Logs

:+1: Glad to know that our recent papers have inspired an ICML 2020 paper: Normalized Loss Functions for Deep Learning with Noisy Labels

<!-- #### :+1: [Code is available now!](https://xinshaoamoswang.github.io/blogs/2020-06-14-code-releasing/) -->

:+1: Selected work partially impacted by our work

Citation

Please kindly cite us if you find our work useful and inspiring.

@article{wang2019derivative,
  title={Derivative Manipulation for General Example Weighting},
  author={Wang, Xinshao and Kodirov, Elyor and Hua, Yang and Robertson, Neil},
  journal={arXiv preprint arXiv:1905.11233},
  year={2019}
}

Questions

The importance of example weighting

Rethinking

Representative questions on why it works (#Tag: from ICML 2020, thanks to the reviewer)

0. Lack of theoretical backing: The idea of using example weighting to train robust models has been gaining a lot of traction, including prior approaches that automatically learn the re-weighting function to enable robustness to noise.

I find the high-level idea of interpreting example re-weighting as manipulation of the loss gradients to be interesting, and that this then allows us to alter the magnitude of the gradients at the example level. Moreover, the paper also does a nice job of designing the gradient manipulation scheme based on the "probability p_i assigned by a model to the true label", and arguing that points that end up with a lower p_i represent noisy/outlier examples and need to be assigned a lower weight.
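For concreteness, here is a minimal PyTorch-style sketch of this kind of per-example gradient-magnitude weighting. The weighting function used below (w_i = exp(beta * (p_i - 1)) with a hypothetical beta) is only an illustration of magnitude scaling based on p_i, not the exact emphasis density function from the paper.

```python
import torch
import torch.nn.functional as F

def weighted_ce_sketch(logits, targets, beta=2.0):
    """Illustrative per-example weighting via gradient-magnitude scaling.

    Each weight w_i is detached (treated as a constant), so scaling example
    i's loss by w_i scales the magnitude of its gradient contribution
    without changing its direction.
    """
    log_p = F.log_softmax(logits, dim=1)
    log_p_true = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_i of the true label
    p_true = log_p_true.exp()                                      # p_i assigned to the true label
    # Hypothetical weighting: damp low-p_i (possibly noisy) examples.
    w = torch.exp(beta * (p_true - 1.0)).detach()
    per_example_ce = -log_p_true                                   # standard cross-entropy per example
    return (w * per_example_ce).mean()

# Usage with hypothetical shapes: logits (N, C), integer targets (N,)
# loss = weighted_ce_sketch(model(x), y); loss.backward()
```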

My main concern is that this scheme is one among several heuristics one could use and does not have a strong theoretical backing. For example, what is the effective optimization problem being solved with the manipulated gradients? Can you say something about the robustness of the resulting trained model under specific assumptions on the noise in the data?
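One way to make the first question concrete (a sketch under the assumption that the weights are frozen at each step, not a result claimed in the paper): if each weight w(p_i) is treated as a constant during backpropagation, the manipulated update equals the gradient of a re-weighted empirical risk evaluated at the current iterate,

$$
\tilde{g}(\theta_t) \;=\; \frac{1}{n}\sum_{i=1}^{n} w\big(p_i(\theta_t)\big)\,\nabla_\theta \ell_i(\theta)\Big|_{\theta=\theta_t}
\;=\; \nabla_\theta\!\left[\frac{1}{n}\sum_{i=1}^{n} w\big(p_i(\theta_t)\big)\,\ell_i(\theta)\right]_{\theta=\theta_t}.
$$

Because the weights themselves change with $\theta$, the procedure is not the exact gradient of a single fixed objective, which is the gap the reviewer is pointing at.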

1. Concerns about the weighting scheme (local changes vs. global effect): I have two main comments/concerns about the re-weighting scheme:

Firstly, changes to the local gradients change the landscape of the objective being optimized. I'm not sure it's obvious that the effective optimization problem being solved does indeed train the model more robustly.

The authors are correct that they only change the magnitude of the gradient per-example and not its direction. However, when averaged over multiple examples, doesn't this effectively alter the direction of the average gradient, and thus alter the landscape of the objective being optimized?
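A tiny worked example (numbers chosen only for illustration, not from the paper) makes this concrete: take per-example gradients g_1 = (1, 0) and g_2 = (0, 1),

$$
\tfrac{1}{2}(g_1 + g_2) = (0.5,\, 0.5),
\qquad
\tfrac{1}{2}(w_1 g_1 + w_2 g_2) = (0.5,\, 0.1)\ \ \text{for } w_1 = 1,\ w_2 = 0.2,
$$

so each individual direction is unchanged, yet the averaged mini-batch gradient does point in a different direction.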

Secondly, the intuition that examples with a low "probability p_i" might be outliers and should be assigned lower weights holds true for a fully trained model. However, early on in the training process, it's quite possible that even easy examples receive a low p_i. Wouldn't your example re-weighting scheme result in such examples being ignored, since it decreases their gradient magnitudes?

2. In other words, it's not entirely clear how local manipulations of the gradients affect the overall objective being optimized, and this is where some theoretical results showcasing the global effects of the proposed local manipulations would lend greater credibility to the method. Feel free to correct me if I have misunderstood your scheme, or if this concern of mine is already addressed in your paper.

Personal Answer (I will provide a more detailed response later):

Introduction

Additional Information

More comments and comparison with related work

Extremely Simple and Effective

Without advanced training strategies: e.g.,

a. Iterative retraining on gradual data correction

b. Training based on carefully-designed curriculums

...

Without using extra networks: e.g.,

a. Decoupling "when to update" from "how to update"

b. Co-teaching: Robust training of deep neural networks with extremely noisy labels

c. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels

...

Without using extra validation sets for model optimisation: e.g.,

a. Learning to reweight examples for robust deep learning

b. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels

c. Toward robustness against label noise in training deep discriminative neural networks

d. Learning from noisy large-scale datasets with minimal supervision.

e. Learning from noisy labels with distillation.

f. Cleannet: Transfer learning for scalable image classifier training with label noise

...

Without data pruning: e.g.,

a. Generalized cross entropy loss for training deep neural networks with noisy labels.
...

Without relabelling: e.g.,

a. A semi-supervised two-stage approach to learning from noisy labels

b. Joint optimization framework for learning with noisy labels

...

Tables and Figures

Please see our paper:

<p float="left"> <img src="./figs/1.png" width="400"> <img src="./figs/2.png" width="400"> <img src="./figs/3.png" width="400"> <img src="./figs/4.png" width="400"> <img src="./figs/6.png" width="400"> <img src="./figs/5.png" width="800"> </p>

References