A question I have been mulling over for a while: why does classification use cross-entropy loss rather than the seemingly simpler quadratic cost (mean squared error)? I happened upon a blog post (see the original post) that cleared it up a little, so here is a short write-up.
Forward propagation
The forward pass is the same in both cases. Take binary classification with a $Sigmoid$ activation; at the last layer of the network:
$$z = w \cdot x + b$$
$$a = \sigma(z) = \mathrm{sigmoid}(z)$$
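As a minimal sketch, the forward pass above can be written in NumPy (the `sigmoid` and `forward` names, and the example values, are my own, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """The logistic activation: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """Single-neuron forward pass: z = w . x + b, a = sigmoid(z)."""
    z = np.dot(w, x) + b
    a = sigmoid(z)
    return z, a

# Illustrative values: z = 0.5*1.0 + (-0.5)*2.0 = -0.5, a = sigmoid(-0.5)
z, a = forward(np.array([1.0, 2.0]), np.array([0.5, -0.5]), 0.0)
```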
Back propagation
For simplicity, take just one sample:
1. Quadratic cost (MSE)
Loss:
$$c = \frac{(a-y)^2}{2}$$
The gradients of $w$ and $b$ are:
$$\begin{aligned} \frac{\partial c}{\partial w} &= (a - y) \cdot a' \\ &= (a - y) \cdot [\sigma(wx + b)]' \\ &= (a - y) \cdot \sigma'(z) \cdot x \end{aligned}$$
$$\begin{aligned} \frac{\partial c}{\partial b} &= (a - y) \cdot a' \\ &= (a - y) \cdot [\sigma(wx + b)]' \\ &= (a - y) \cdot \sigma'(z) \end{aligned}$$
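The closed form above can be sanity-checked numerically: a central finite difference over $w$ should match $(a - y)\,\sigma'(z)\,x$. The scalar values below are arbitrary, chosen just for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(w, b, x, y):
    """Quadratic cost for one sample: (a - y)^2 / 2."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def mse_grad_w(w, b, x, y):
    """Analytic gradient from the derivation: (a - y) * sigma'(z) * x."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1 - a) * x  # sigma'(z) = a * (1 - a)

# Finite-difference check with arbitrary scalar values
w, b, x, y = 0.7, -0.2, 1.5, 1.0
eps = 1e-6
numeric = (mse_loss(w + eps, b, x, y) - mse_loss(w - eps, b, x, y)) / (2 * eps)
analytic = mse_grad_w(w, b, x, y)
```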
2. Cross entropy
Loss:
$$c = -y \log(a) - (1 - y)\log(1 - a)$$
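Concretely, this loss can be computed for a single sample as follows (the small `eps` clamp is my own addition, to avoid evaluating $\log(0)$):

```python
import numpy as np

def cross_entropy(a, y, eps=1e-12):
    """Binary cross-entropy for one sample: -y*log(a) - (1-y)*log(1-a)."""
    a = np.clip(a, eps, 1 - eps)  # keep log() away from 0
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# For y = 1 the loss reduces to -log(a): confident correct predictions
# cost little, confident wrong ones cost a lot.
loss_good = cross_entropy(0.9, 1.0)   # -log(0.9), small
loss_bad = cross_entropy(0.1, 1.0)    # -log(0.1), large
```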
The gradients of $w$ and $b$ are:
$$\begin{aligned} \frac{\partial c}{\partial w} &= -\left(\frac{y}{a} - \frac{1 - y}{1 - a}\right) \frac{\partial \sigma(z)}{\partial w} \\ &= -\left(\frac{y}{a} - \frac{1 - y}{1 - a}\right) \cdot \sigma'(z) \cdot x \\ &= -\left(\frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)}\right) \cdot \sigma(z) \cdot (1 - \sigma(z)) \cdot x \\ &= -\frac{(1 - \sigma(z))\,y - \sigma(z)(1 - y)}{\sigma(z)(1 - \sigma(z))} \cdot \sigma(z) \cdot (1 - \sigma(z)) \cdot x \\ &= (\sigma(z) - y) \cdot x \\ \frac{\partial c}{\partial b} &= \sigma(z) - y \end{aligned}$$
Note: the derivative of the sigmoid is $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$.
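The tidy result $\partial c / \partial w = (\sigma(z) - y) \cdot x$ can likewise be checked against a finite difference (scalar values below are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(w, b, x, y):
    """Binary cross-entropy for one sample, as a function of w and b."""
    a = sigmoid(w * x + b)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# Closed form from the derivation: dc/dw = (sigma(z) - y) * x
w, b, x, y = 0.7, -0.2, 1.5, 1.0
eps = 1e-6
numeric = (ce_loss(w + eps, b, x, y) - ce_loss(w - eps, b, x, y)) / (2 * eps)
analytic = (sigmoid(w * x + b) - y) * x
```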
Conclusion
During back-propagation:
- With the quadratic (MSE) cost, the gradients of $w$ and $b$ are proportional to the gradient of the activation function: the larger $\sigma'(z)$, the faster $w$ and $b$ are adjusted and the faster training converges. But the $Sigmoid$ function is very flat when its value is near 0 or 1, so its gradient there is tiny and learning stalls.
- With cross entropy, the gradients of $w$ and $b$ no longer involve the activation's gradient at all: the $\sigma'(z)$ factor has been cancelled out. What remains, $(\sigma(z) - y)$, is the gap between the prediction and the true label, so the larger the error, the faster $w$ and $b$ are adjusted and the faster training converges.
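A quick numerical comparison makes the saturation point concrete. For a neuron that is confidently wrong (large $z$, target $y = 0$), the MSE gradient collapses because of its $\sigma'(z)$ factor, while the cross-entropy gradient stays large (this comparison is my own illustration, not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

y = 0.0  # target; at large z the neuron is confidently wrong
for z in [0.0, 5.0, 10.0]:
    mse_factor = (sigmoid(z) - y) * sigmoid_prime(z)  # carries sigma'(z)
    ce_factor = sigmoid(z) - y                        # sigma'(z) cancelled
    print(f"z={z:5.1f}  mse={mse_factor:.6f}  cross_entropy={ce_factor:.6f}")
```

At $z = 10$ the MSE factor is on the order of $10^{-5}$ even though the prediction is maximally wrong, while the cross-entropy factor is close to 1 — which is exactly the conclusion above.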