Summary of the paper “Well-Classified Examples are Underestimated in Classification with Deep Neural Networks” (AAAI 2022)
TL;DR;
- I didn’t understand the energy-related parts.
Paper Link
https://arxiv.org/abs/2110.06537
Different losses and their derivatives w.r.t. $p$ or $\theta$
where $p = \sigma(f(x))$, $\sigma$ is the sigmoid function, and $f(x) \in \mathbb{R}^n$ is the output of the neural network.
MSE with Sigmoid Activation
$L = (p - y)^2$
\(\frac{\partial}{\partial \theta} L = 2(p - y)p(1-p) \nabla_\theta f(x)\) Since $y$ is either 0 or 1, the gradient vanishes quadratically as $p$ converges to $y$.
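A quick sanity check of this formula (my own sketch, not from the paper), comparing PyTorch autograd against the analytic gradient $2(p-y)p(1-p)$ with respect to the logit for $y = 1$ (the $\nabla_\theta f(x)$ factor is common to all losses, so it is dropped here):

```python
import torch

y = 1.0
for logit in [1.0, 3.0, 5.0]:  # larger logit -> p closer to 1
    f = torch.tensor(logit, requires_grad=True)
    p = torch.sigmoid(f)
    loss = (p - y) ** 2          # MSE with sigmoid activation
    loss.backward()              # gradient w.r.t. the logit f
    analytic = 2 * (p - y) * p * (1 - p)
    print(f"p={p.item():.4f}  autograd={f.grad.item():.6f}  analytic={analytic.item():.6f}")
    # the magnitude behaves like 2*(1-p)^2, i.e. it vanishes quadratically
```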
BCE with Sigmoid Activation
$L = -\log p(y \vert x)$
\(\frac{\partial}{\partial \theta} L = -(1-p) \nabla_\theta f(x)\) (for $y = 1$). In BCE, the gradient vanishes linearly as $p(y \vert x) \rightarrow 1$.
Thus, BCE gives steeper gradients than MSE for well-classified examples.
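A quick numerical check of this claim (my own sketch, not from the paper): for a positive example ($y = 1$), the MSE gradient with respect to the logit shrinks roughly like $2(1-p)^2$ while the BCE gradient shrinks like $(1-p)$.

```python
import torch

y = 1.0
for logit in [2.0, 4.0, 6.0]:
    f = torch.tensor(logit, requires_grad=True)
    p = torch.sigmoid(f)
    mse_grad = torch.autograd.grad((p - y) ** 2, f)[0]      # ~ 2*(1-p)^2 in magnitude
    f2 = torch.tensor(logit, requires_grad=True)
    p2 = torch.sigmoid(f2)
    bce_grad = torch.autograd.grad(-torch.log(p2), f2)[0]   # ~ (1-p) in magnitude
    print(f"1-p={1 - p.item():.4f}  |MSE grad|={mse_grad.abs().item():.6f}  "
          f"|BCE grad|={bce_grad.abs().item():.6f}")
```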
MAE with Sigmoid Activation
Note 1: this derivation does not appear in the paper.
\(\frac{\partial}{\partial \theta} L = \text{sign}(p - y)p(1-p) \nabla_\theta f(x)\) See that $p(1 - p) \simeq 1 - p$ if $p \rightarrow 1$ and $p(1-p) \simeq p$ if $p \rightarrow 0$. Therefore, MAE shows similar convergence behavior to BCE, i.e., the gradient vanishes linearly.
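Again a small check (my own sketch, not from the paper): for $y = 1$ the MAE gradient with respect to the logit has magnitude $p(1-p) \simeq 1-p$, the same linear decay as BCE.

```python
import torch

y = 1.0
for logit in [2.0, 4.0, 6.0]:
    f = torch.tensor(logit, requires_grad=True)
    p = torch.sigmoid(f)
    loss = (p - y).abs()   # MAE with sigmoid activation
    loss.backward()
    print(f"1-p={1 - p.item():.4f}  |MAE grad|={f.grad.abs().item():.6f}")  # ~ p*(1-p)
```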
It has been noted recently that smaller gradients for high-confidence samples are harmful in representation learning, and the authors claim that a linear decay of gradients is still not good enough.
Proposed Solution
1. A bonus loss is proposed, symmetric to the log-likelihood function:
\[L_O = -\log p(y \vert x) + \log (1 - p(y \vert x))\]
2. The bonus term is truncated to a linear function:
I think this is done for technical reasons: $\log (1 - p(y \vert x))$ diverges to minus infinity as $p$ goes to one, and driving $p$ to one is the main objective of learning. So, beyond a threshold, $\log (1 - p(y \vert x))$ is replaced by a linear function chosen to be continuous with $\log (1 - p(y \vert x))$.
\(L_{LE} = -\log p(y \vert x) + C - p(y \vert x)\) when $p$ is close to 1. $C$ is determined such that $L_{O}$ and $L_{LE}$ are continuous: if the switch happens at a threshold $p_0$, continuity gives $C = \log(1 - p_0) + p_0$.
Note 2: this does not appear in the paper. When $p$ is high enough that the linear bonus is used, the bonus term contributes exactly the same gradient as MAE with sigmoid activation. In this regime $p$ is close to $1$, so that gradient vanishes linearly.
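Note 3: the following is my own sketch, not the paper’s released code. Putting the two pieces together for the per-class sigmoid case, the loss might look like this; the class name `EncouragingBCELoss`, the `threshold` at which the log bonus switches to the linear one, and its default value are my choices.

```python
import math
import torch
import torch.nn as nn


class EncouragingBCELoss(nn.Module):
    """Per-class sigmoid version of the bonus ("encouraging") loss sketched above."""

    def __init__(self, threshold: float = 0.99):
        super().__init__()
        self.threshold = threshold
        # C chosen so that log(1 - p) and C - p meet at p = threshold (continuity)
        self.C = math.log(1.0 - threshold) + threshold

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(logits)
        # probability assigned to the correct label, with targets in {0, 1}
        p_y = torch.where(targets > 0.5, p, 1.0 - p)
        nll = -torch.log(p_y.clamp_min(1e-12))                # usual BCE term
        log_bonus = torch.log((1.0 - p_y).clamp_min(1e-12))   # L_O bonus
        linear_bonus = self.C - p_y                           # truncated (L_LE) bonus
        bonus = torch.where(p_y < self.threshold, log_bonus, linear_bonus)
        return (nll + bonus).mean()
```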
Implementation
I implemented these losses based on https://github.com/kuangliu/pytorch-cifar and tuned some parameters.
https://gist.github.com/ita9naiwa/49ab8279d3277ab5d8b0795e1eb0ea1d
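A minimal usage sketch (simplified, not the actual code in my gist): where such a loss would plug into a kuangliu/pytorch-cifar style training step. Here `model`, `loader`, and `optimizer` are assumed to be the usual CIFAR-10 objects from that repo, and labels are one-hot encoded to match the per-class sigmoid form above.

```python
import torch.nn.functional as F

criterion = EncouragingBCELoss(threshold=0.99)  # sketch class from Note 3 above

for inputs, labels in loader:
    optimizer.zero_grad()
    logits = model(inputs)                               # (batch, 10) for CIFAR-10
    targets = F.one_hot(labels, num_classes=10).float()  # per-class {0, 1} targets
    loss = criterion(logits, targets)
    loss.backward()
    optimizer.step()
```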
| Loss Function | Accuracy on Test Set (%) |
|---|---|
| CE | 92.67 |
| MAE + MSE | 92.33 |
| CE + bonus CE | 91.13 |
I couldn’t reproduce the experiments in the paper, namely the claim that “on the same hyperparameter set, bonus CE gives better accuracy…”, but I didn’t try to find good hyperparameter sets for bonus CE.
Thoughts
- Even though I failed to reproduce the results, the paper gives a thoughtful view of various loss functions and their gradients.