Understanding the Softmax and NLL (Negative Log-Likelihood) Loss Functions and Their Derivatives

Reposted from https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/
I will translate it into Chinese when I find the time.

In this notebook I will explain the softmax function, its relationship with the negative log-likelihood, and its derivative when doing the backpropagation algorithm. If there are any questions or clarifications, please leave a comment below.

Softmax Activation Function

The softmax activation function is often placed at the output layer of a neural network. It is commonly used in multi-class learning problems where a set of features can be related to one-of-$K$ classes. For example, in the CIFAR-10 image classification problem, given a set of pixels as input, we need to classify if a particular sample belongs to one-of-ten available classes: i.e., cat, dog, airplane, etc.

Its equation is simple: we just have to compute the normalized exponential of all the units in the layer. In this case,

$$S(f_{y_i}) = \dfrac{e^{f_{y_i}}}{\sum_{j} e^{f_j}} \tag{1}$$

Intuitively, what the softmax does is squash a vector of size $K$ so that every entry lies between 0 and 1. Furthermore, because it is a normalization of the exponential, the entries of the whole vector sum to 1. We can then interpret the output of the softmax as the probability that a certain set of features belongs to each class.
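As a concrete illustration (not part of the original post), below is a minimal NumPy sketch of equation (1); the helper name `softmax` and the three-class scores are hypothetical choices of mine, and subtracting the maximum score is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(f):
    """Normalized exponential of a score vector f (equation 1)."""
    exp_f = np.exp(f - np.max(f))  # shift by the max for numerical stability
    return exp_f / np.sum(exp_f)

# Hypothetical raw scores for a 3-class problem (e.g. cat, dog, horse)
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- the outputs form a probability distribution
```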

Thus, given the three-class example below, the scores $y_i$ are computed from the forward propagation of the network. We then take the softmax and obtain the probabilities as shown:
Fig. 1. Softmax Computation for three classes

The output of the softmax describes the probability (or if you may, the confidence) of the neural network that a particular sample belongs to a certain class. Thus, for the first example above, the neural network assigns a confidence of 0.71 that it is a cat, 0.26 that it is a dog, and 0.04 that it is a horse. The same goes for each of the samples above.

We can then see that one advantage of using the softmax at the output layer is that it improves the interpretability of the neural network. By looking at the softmax output in terms of the network’s confidence, we can then reason about the behavior of our model.

Negative Log-Likelihood (NLL)

In practice, the softmax function is used in tandem with the negative log-likelihood (NLL). This loss function is very interesting if we interpret it in relation to the behavior of softmax. First, let’s write down our loss function:

$$L(\mathbf{y}) = -\log(\mathbf{y}) \tag{2}$$
This is summed for all the correct classes.
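As a small sketch of equation (2) (my own illustration, not from the original post), the snippet below sums $-\log$ of the probability assigned to the correct class over a hypothetical batch of three samples; the scores and class indices are made up for the example:

```python
import numpy as np

def softmax(f):
    exp_f = np.exp(f - np.max(f))
    return exp_f / np.sum(exp_f)

# Hypothetical class scores for 3 samples and the index of each correct class
batch_scores = np.array([[2.0, 1.0, 0.1],
                         [0.5, 2.5, 0.3],
                         [1.2, 0.4, 3.0]])
correct_classes = np.array([0, 1, 2])

loss = 0.0
for scores, y in zip(batch_scores, correct_classes):
    p = softmax(scores)
    loss += -np.log(p[y])  # -log of the probability at the correct class
print(loss)                # total NLL summed over the correct classes
```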

Recall that when training a model, we aspire to find the minima of a loss function given a set of parameters (in a neural network, these are the weights and biases). We can interpret the loss as the “unhappiness” of the network with respect to its parameters. The higher the loss, the higher the unhappiness: we don’t want that. We want to make our models happy.

So if we are using the negative log-likelihood as our loss function, when does it become unhappy? And when does it become happy? Let’s try to plot its range:
Fig. 2. The loss function reaches infinity when the input is 0, and reaches 0 when the input is 1.

The negative log-likelihood becomes unhappy at smaller values, where it can reach infinite unhappiness (that's too sad), and becomes less unhappy at larger values. Because we are summing the loss over all the correct classes, what's actually happening is that whenever the network assigns high confidence to the correct class, the unhappiness is low, but when the network assigns low confidence to the correct class, the unhappiness is high.

Fig. 3. When computing the loss, we can then see that higher confidence at the correct class leads to lower loss, and vice-versa.
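A quick numerical illustration of this behavior (my own addition): evaluating $-\log(p)$ for a few confidence values at the correct class shows the loss shrinking as the confidence grows:

```python
import numpy as np

# -log(p) for increasing confidence p at the correct class
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"confidence {p:.2f} -> loss {-np.log(p):.3f}")
# confidence 0.01 -> loss 4.605  (very unhappy)
# confidence 0.99 -> loss 0.010  (almost happy)
```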

Derivative of the Softmax

In this part, we will differentiate the negative log-likelihood with respect to the inputs of the softmax, i.e., the class scores. Following the convention of the CS231n course, we let $f$ be a vector containing the class scores for a single example, that is, the output of the network. Thus $f_k$ is the element for a certain class $k$ among all $j$ classes.

We can then rewrite the softmax output as
$$p_k = \dfrac{e^{f_k}}{\sum_{j} e^{f_j}} \tag{3}$$
and the negative log-likelihood as
$$L_i = -\log(p_{y_i}) \tag{4}$$

Now, recall that when performing backpropagation, the first thing we have to do is to compute how the loss changes with respect to the output of the network. Thus, we are looking for $\dfrac{\partial L_i}{\partial f_k}$.

Because $L_i$ depends on $p_k$, and $p_k$ depends on $f_k$, we can simply relate them via the chain rule:
$$\dfrac{\partial L_i}{\partial f_k} = \dfrac{\partial L_i}{\partial p_k} \dfrac{\partial p_k}{\partial f_k} \tag{5}$$
There are now two parts in our approach. First (the easier one), we solve $\dfrac{\partial L_i}{\partial p_k}$; then we solve $\dfrac{\partial p_k}{\partial f_k}$. The first is simply the derivative of the log; the second is a bit more involved.

Let’s do the first one then,
$$\dfrac{\partial L_i}{\partial p_k} = -\dfrac{1}{p_k} \tag{6}$$
For the second one, we have to recall the quotient rule for derivatives. Letting the derivative be represented by the operator $\mathbf{D}$,
$$\mathbf{D}\left[\dfrac{f(x)}{g(x)}\right] = \dfrac{g(x)\,\mathbf{D} f(x) - f(x)\,\mathbf{D} g(x)}{g(x)^2} \tag{7}$$
We let $\sum_{j} e^{f_j} = \Sigma$, and by substituting, we obtain
$$\begin{aligned} \dfrac{\partial p_k}{\partial f_k} &= \dfrac{\partial}{\partial f_k} \left(\dfrac{e^{f_k}}{\sum_{j} e^{f_j}}\right) \\ &= \dfrac{\Sigma\, \mathbf{D} e^{f_k} - e^{f_k}\, \mathbf{D} \Sigma}{\Sigma^2} \\ &= \dfrac{e^{f_k}(\Sigma - e^{f_k})}{\Sigma^2} \end{aligned} \tag{8}$$

The reason why $\mathbf{D}\Sigma = e^{f_k}$ is that we are always taking the derivative with respect to the $k$-th element of the input array $f$. The derivative of each term in $\Sigma$ with respect to $f_k$ is therefore 0 for every element other than $k$, and $e^{f_k}$ at $k$.

Continuing our derivation,
$$\begin{aligned} \dfrac{\partial p_k}{\partial f_k} &= \dfrac{e^{f_k}(\Sigma - e^{f_k})}{\Sigma^2} \\ &= \dfrac{e^{f_k}}{\Sigma} \cdot \dfrac{\Sigma - e^{f_k}}{\Sigma} \\ &= p_k (1 - p_k) \end{aligned} \tag{9}$$
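Equation (9) can be sanity-checked numerically (a sketch of my own, using a finite-difference approximation on hypothetical scores; `k` and `eps` are arbitrary choices):

```python
import numpy as np

def softmax(f):
    exp_f = np.exp(f - np.max(f))
    return exp_f / np.sum(exp_f)

f = np.array([2.0, 1.0, 0.1])  # hypothetical class scores
k, eps = 0, 1e-5               # class to check and step size

p = softmax(f)
analytic = p[k] * (1 - p[k])                 # equation (9)

f_plus = f.copy()
f_plus[k] += eps
numeric = (softmax(f_plus)[k] - p[k]) / eps  # finite-difference estimate

print(analytic, numeric)  # the two values should agree closely
```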

By combining the two derivatives we’ve computed earlier, we have:
$$\begin{aligned} \dfrac{\partial L_i}{\partial f_k} &= \dfrac{\partial L_i}{\partial p_k} \dfrac{\partial p_k}{\partial f_k} \\ &= -\dfrac{1}{p_k} \, p_k (1 - p_k) \\ &= p_k - 1 \end{aligned} \tag{10}$$

And thus we have differentiated the negative log-likelihood with respect to the softmax layer.
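The derivation above treats the correct class $k = y_i$; repeating it for $k \neq y_i$ yields just $p_k$, so the complete gradient over all classes can be written as $p_k - \mathbb{1}(k = y_i)$. Below is a minimal sketch of how this gradient might be computed in code (my own illustration, reusing the hypothetical `softmax` helper and scores from above):

```python
import numpy as np

def softmax(f):
    exp_f = np.exp(f - np.max(f))
    return exp_f / np.sum(exp_f)

f = np.array([2.0, 1.0, 0.1])  # hypothetical class scores for one sample
y = 0                          # index of the correct class

p = softmax(f)
grad = p.copy()
grad[y] -= 1.0  # subtract 1 only at the correct class: dL_i/df_k = p_k - 1(k == y_i)
print(grad)     # gradient of the NLL with respect to the class scores f
```

This vector is what gets passed backward from the softmax/NLL layer into the rest of the network during backpropagation.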
