Softmax loss function derivation

How to implement a neural network: Intermezzo 2

This page is part of a 5 (+2) part tutorial on how to implement a simple neural network model. You can find the links to the rest of the tutorial here:

Softmax classification function

This intermezzo will cover the softmax function, its derivative, the cross-entropy cost function for softmax outputs, and the derivative of that cost function.

The previous intermezzo described how to perform a classification of 2 classes with the help of the logistic function. For multiclass classification there exists an extension of this logistic function, called the softmax function, which is used in multinomial logistic regression. The following sections explain the softmax function and how to derive it.


Softmax function

The logistic output function described in the previous intermezzo can only be used for the classification between two target classes $t=1$ and $t=0$. This logistic function can be generalized to output a multiclass categorical probability distribution by the softmax function. This softmax function $\varsigma$ takes as input a $C$-dimensional vector $\mathbf{z}$ and outputs a $C$-dimensional vector $\mathbf{y}$ of real values between $0$ and $1$. This function is a normalized exponential and is defined as:

$$ y_c = \varsigma(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{d=1}^C e^{z_d}} \quad \text{for } c = 1 \dots C $$

The denominator $\sum_{d=1}^C e^{z_d}$ acts as a normalizer to make sure that $\sum_{c=1}^C y_c = 1$. As the output layer of a neural network, the softmax function can be represented graphically as a layer with $C$ neurons.

We can write the probabilities that the class is $t=c$ for $c = 1 \dots C$ given input $\mathbf{z}$ as:

$$ \begin{bmatrix} P(t=1 \mid \mathbf{z}) \\ \vdots \\ P(t=C \mid \mathbf{z}) \end{bmatrix} = \begin{bmatrix} \varsigma(\mathbf{z})_1 \\ \vdots \\ \varsigma(\mathbf{z})_C \end{bmatrix} = \frac{1}{\sum_{d=1}^C e^{z_d}} \begin{bmatrix} e^{z_1} \\ \vdots \\ e^{z_C} \end{bmatrix} $$

Where $P(t=c \mid \mathbf{z})$ is thus the probability that the class is $c$ given the input $\mathbf{z}$.

The output probability $P(t=1 \mid \mathbf{z})$ for an example system with 2 classes ($t=1$ and $t=2$) and input $\mathbf{z} = [z_1, z_2]$ is shown in the figure below. The other probability $P(t=2 \mid \mathbf{z})$ will be complementary.

In [2]:
import numpy as np

# Define the softmax function
def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))
[Figure: $P(t=1 \mid \mathbf{z})$ plotted over the input space $(z_1, z_2)$ for the two-class example.]
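
One practical caveat: np.exp can overflow for large inputs. A common remedy is to subtract the maximum of $\mathbf{z}$ before exponentiating, which leaves the output unchanged because the shift cancels in the numerator and denominator. A minimal sketch of this variant (the name softmax_stable is illustrative, not part of the notebook above):

import numpy as np

def softmax_stable(z):
    # Subtracting max(z) does not change the result (the shift cancels
    # in the numerator and denominator) but prevents overflow in np.exp.
    z_shifted = z - np.max(z)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)

print(softmax_stable(np.array([1000., 1001., 1002.])))  # no overflow
print(softmax_stable(np.array([1., 2., 3.])))           # approx. [0.09, 0.245, 0.665]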

Derivative of the softmax function

To use the softmax function in neural networks, we need to compute its derivative. If we define $\Sigma_C = \sum_{d=1}^C e^{z_d}$ so that $y_c = e^{z_c} / \Sigma_C$ for $c = 1 \dots C$, then the derivative $\partial y_i / \partial z_j$ of the output $\mathbf{y}$ of the softmax function with respect to its input $\mathbf{z}$ can be calculated as:

$$ \text{if } i = j: \quad \frac{\partial y_i}{\partial z_i} = \frac{\partial \frac{e^{z_i}}{\Sigma_C}}{\partial z_i} = \frac{e^{z_i}\Sigma_C - e^{z_i}e^{z_i}}{\Sigma_C^2} = \frac{e^{z_i}}{\Sigma_C}\frac{\Sigma_C - e^{z_i}}{\Sigma_C} = \frac{e^{z_i}}{\Sigma_C}\Bigl(1 - \frac{e^{z_i}}{\Sigma_C}\Bigr) = y_i(1 - y_i) $$

$$ \text{if } i \neq j: \quad \frac{\partial y_i}{\partial z_j} = \frac{\partial \frac{e^{z_i}}{\Sigma_C}}{\partial z_j} = \frac{0 - e^{z_i}e^{z_j}}{\Sigma_C^2} = -\frac{e^{z_i}}{\Sigma_C}\frac{e^{z_j}}{\Sigma_C} = -y_i y_j $$

Note that if $i = j$ this derivative is similar to the derivative of the logistic function.
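
The two cases combine into the single expression $\partial y_i / \partial z_j = y_i(\delta_{ij} - y_j)$, i.e. the Jacobian $\operatorname{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^T$. The sketch below (function names are illustrative) builds this Jacobian and checks it against a finite-difference approximation:

import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def softmax_jacobian(z):
    # d y_i / d z_j = y_i * (1 - y_i) if i == j, and -y_i * y_j otherwise,
    # which is diag(y) - outer(y, y) in matrix form.
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

# Finite-difference check of the derived Jacobian.
z = np.array([0.5, -1.0, 2.0])
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(softmax_jacobian(z), numeric, atol=1e-6))  # expect True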

Cross-entropy cost function for the softmax function

To derive the cost function for the softmax function we start from the likelihood that a given set of parameters $\theta$ of the model results in a prediction of the correct class for each input sample, as in the derivation of the logistic cost function. The maximization of this likelihood can be written as:

$$ \underset{\theta}{\text{argmax}}\; \mathcal{L}(\theta \mid \mathbf{t}, \mathbf{z}) $$

The likelihood $\mathcal{L}(\theta \mid \mathbf{t}, \mathbf{z})$ can be rewritten as the joint probability of generating $\mathbf{t}$ and $\mathbf{z}$ given the parameters $\theta$: $P(\mathbf{t}, \mathbf{z} \mid \theta)$, which can be written as a conditional distribution:

$$ P(\mathbf{t}, \mathbf{z} \mid \theta) = P(\mathbf{t} \mid \mathbf{z}, \theta)\, P(\mathbf{z} \mid \theta) $$

Since we are not interested in the probability of $\mathbf{z}$ we can reduce this to $\mathcal{L}(\theta \mid \mathbf{t}, \mathbf{z}) = P(\mathbf{t} \mid \mathbf{z}, \theta)$, which can be written as $P(\mathbf{t} \mid \mathbf{z})$ for a fixed $\theta$. Since each $t_c$ depends on the full $\mathbf{z}$, and only 1 class can be activated in $\mathbf{t}$, we can write:

$$ P(\mathbf{t} \mid \mathbf{z}) = \prod_{c=1}^C P(t_c \mid \mathbf{z})^{t_c} = \prod_{c=1}^C \varsigma(\mathbf{z})_c^{t_c} = \prod_{c=1}^C y_c^{t_c} $$

As was noted during the derivation of the cost function of the logistic function, maximizing this likelihood can also be done by minimizing the negative log-likelihood:

$$ -\log \mathcal{L}(\theta \mid \mathbf{t}, \mathbf{z}) = \xi(\mathbf{t}, \mathbf{z}) = -\log \prod_{c=1}^C y_c^{t_c} = -\sum_{c=1}^C t_c \cdot \log(y_c) $$

Which is the cross-entropy error function $\xi$. Note that for a 2-class system the target $t_2 = 1 - t_1$, and this results in the same error function as for logistic regression: $\xi(\mathbf{t}, \mathbf{y}) = -t_1 \log(y_1) - (1 - t_1)\log(1 - y_1)$.
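
A quick numerical illustration of this equivalence, assuming a two-class target $\mathbf{t} = [t_1, 1-t_1]$ and softmax output $\mathbf{y} = [y_1, 1-y_1]$ (the concrete values below are arbitrary):

import numpy as np

t1, y1 = 1.0, 0.8
t = np.array([t1, 1 - t1])  # two-class target
y = np.array([y1, 1 - y1])  # two-class softmax output

softmax_xent = -np.sum(t * np.log(y))                           # cross-entropy from above
logistic_xent = -t1 * np.log(y1) - (1 - t1) * np.log(1 - y1)    # logistic error function
print(np.isclose(softmax_xent, logistic_xent))  # True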

The cross-entropy error function over a batch of multiple samples of size $n$ can be calculated as:

$$ \xi(T, Y) = \sum_{i=1}^n \xi(\mathbf{t}_i, \mathbf{y}_i) = -\sum_{i=1}^n \sum_{c=1}^C t_{ic} \cdot \log(y_{ic}) $$

Where $t_{ic}$ is 1 if and only if sample $i$ belongs to class $c$, and $y_{ic}$ is the output probability that sample $i$ belongs to class $c$.
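
A minimal sketch of this batch cost, assuming $T$ holds the one-hot targets $t_{ic}$ as rows and $Y$ the corresponding softmax outputs $y_{ic}$ (the example values are made up):

import numpy as np

def cross_entropy(T, Y):
    # xi(T, Y) = - sum over samples i and classes c of T[i, c] * log(Y[i, c])
    return -np.sum(T * np.log(Y))

# Two samples, three classes: rows of T are one-hot targets,
# rows of Y are softmax output probabilities.
T = np.array([[1., 0., 0.],
              [0., 0., 1.]])
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(cross_entropy(T, Y))  # -log(0.7) - log(0.6) ≈ 0.867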

Derivative of the cross-entropy cost function for the softmax function

The derivative $\partial \xi / \partial z_i$ of the cost function with respect to the softmax input $z_i$ can be calculated as:

$$
\begin{aligned}
\frac{\partial \xi}{\partial z_i} &= -\sum_{j=1}^C \frac{\partial\, t_j \log(y_j)}{\partial z_i} = -\sum_{j=1}^C t_j \frac{\partial \log(y_j)}{\partial z_i} = -\sum_{j=1}^C t_j \frac{1}{y_j} \frac{\partial y_j}{\partial z_i} \\
&= -\frac{t_i}{y_i} \frac{\partial y_i}{\partial z_i} - \sum_{j \neq i}^C \frac{t_j}{y_j} \frac{\partial y_j}{\partial z_i} = -\frac{t_i}{y_i} y_i (1 - y_i) - \sum_{j \neq i}^C \frac{t_j}{y_j} (-y_j y_i) \\
&= -t_i + t_i y_i + \sum_{j \neq i}^C t_j y_i = -t_i + \sum_{j=1}^C t_j y_i = -t_i + y_i \sum_{j=1}^C t_j = y_i - t_i
\end{aligned}
$$

where the last step uses that $\sum_{j=1}^C t_j = 1$.

Note that we already derived $\partial y_j / \partial z_i$ for $i = j$ and $i \neq j$ above.

The result that $\partial \xi / \partial z_i = y_i - t_i$ for all $i \in \{1, \dots, C\}$ is the same as the derivative of the cross-entropy for the logistic function, which had only one output node.
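
This result can be checked numerically. The sketch below, with illustrative names, compares the analytic gradient $\mathbf{y} - \mathbf{t}$ against a finite-difference approximation of $\xi(\mathbf{t}, \varsigma(\mathbf{z}))$:

import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

def xent(t, z):
    # Cross-entropy of the softmax output for target t.
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.0, -0.5, 0.3])
t = np.array([0.0, 1.0, 0.0])  # one-hot target

analytic = softmax(z) - t  # gradient derived above: y - t

# Central finite differences on each input z_i.
eps = 1e-6
numeric = np.array([
    (xent(t, z + eps * np.eye(3)[i]) - xent(t, z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # expect True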

This post at peterroelants.github.io is generated from an IPython notebook file. Link to the full IPython notebook file

