The deep learning course I took never clearly walked through the derivation of the softmax gradient used in gradient descent, so I record it here:
Softmax is used in the output layer for multi-class classification. The function is defined as:
$$a_k=g(z_k)=\frac{e^{z_k}}{\sum\limits_{i=1}^{C}{e^{z_i}}}$$
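As a concrete illustration, here is a minimal NumPy sketch of this definition. The function name `softmax` and the shift by `max(z)` are my own additions for numerical stability, not part of the original derivation:

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of logits z: one a_k per class."""
    # Subtracting max(z) leaves the result unchanged (it cancels in the
    # ratio) but avoids overflow in exp() for large logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)          # [0.659 0.242 0.099] (approximately)
print(a.sum())    # 1.0
```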
The loss function is defined as:

$$L(a,y)=-\sum\limits_{j=1}^{C}y_{j}\log a_j$$
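As another illustrative snippet (reusing the hypothetical `softmax` helper and the arrays `z` and `a` from the sketch above), for a one-hot label $y$ the loss simply picks out $-\log a_j$ of the true class:

```python
def cross_entropy(a, y):
    """L(a, y) = -sum_j y_j * log(a_j), with y a one-hot vector."""
    return -np.sum(y * np.log(a))

y = np.array([1.0, 0.0, 0.0])   # true class is class 0
print(cross_entropy(a, y))      # -log(a_0), about 0.417
```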
By the chain rule, $\frac{\partial L}{\partial z_k} = \sum\limits_{j=1}^{C}\frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_k}$; the sum is needed because every output $a_j$ depends on every input $z_k$ through the shared denominator. First compute $\frac{\partial L}{\partial a_j}$, where the subscript $j$ is used mainly to keep it distinct from the subscript $i$ in the softmax definition:

$$\frac{\partial L}{\partial a_j}=-\frac{y_j}{a_j}$$

Then compute $\frac{\partial a_j}{\partial z_k}$. There are two cases. When $j=k$ (note that $\sum\limits_{i=1}^{C}{e^{z_i}}$ also contains $e^{z_k}$):

$$\frac{\partial a_k}{\partial z_k}=\left(\frac{e^{z_k}}{\sum\limits_{i=1}^{C}{e^{z_i}}}\right)'=\left(e^{z_k}\Big(\sum\limits_{i=1}^{C}{e^{z_i}}\Big)^{-1}\right)'=e^{z_k}\Big(\sum\limits_{i=1}^{C}{e^{z_i}}\Big)^{-1}+\frac{e^{z_k}(-1)e^{z_k}}{\Big(\sum\limits_{i=1}^{C}{e^{z_i}}\Big)^2}=a_k-a_k^2$$
When $j \neq k$, the numerator of $a_j$ does not depend on $z_k$ and is treated as a constant:
$$\frac{\partial a_j}{\partial z_k}=\left(\frac{e^{z_j}}{\sum\limits_{i=1}^{C}{e^{z_i}}}\right)'=\left(e^{z_j}\Big(\sum\limits_{i=1}^{C}{e^{z_i}}\Big)^{-1}\right)'=\frac{e^{z_j}(-1)e^{z_k}}{\Big(\sum\limits_{i=1}^{C}{e^{z_i}}\Big)^2}=-a_ka_j$$

So $\frac{\partial L}{\partial z_k}$ likewise separates the $j=k$ term from the $j \neq k$ terms:
$$\begin{array}{ll}\frac{\partial L}{\partial z_k} = \sum\limits_{j=1}^{C}\frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_k} &=-\left[\frac{y_k}{a_k}(a_k-a_k^2)+\sum\limits_{j\neq k}^{C}\frac{y_j}{a_j}(-a_ka_j)\right]\\ &=a_ky_k-y_k+\sum\limits_{j\neq k}^{C}a_ky_j\\ &=a_k\sum\limits_{j=1}^{C}y_j-y_k\\ &=a_k-y_k\end{array}$$

where the last step uses the fact that $y$ is a one-hot vector, so $\sum\limits_{j=1}^{C}y_j=1$.
Therefore: $dZ^{[L]}=A-Y$.
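To sanity-check the derivation, here is a finite-difference sketch (my own addition, reusing the hypothetical `softmax` and `cross_entropy` helpers and the arrays `z` and `y` from above) that compares $a - y$ against a numerical estimate of $\frac{\partial L}{\partial z_k}$:

```python
def loss_from_logits(z, y):
    return cross_entropy(softmax(z), y)

grad_analytic = softmax(z) - y      # the a - y result derived above

# Central-difference estimate of dL/dz_k, one coordinate at a time.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    grad_numeric[k] = (loss_from_logits(z + dz, y) -
                       loss_from_logits(z - dz, y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```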