3. Softmax Loss Function
"""
Inputs:
- X: Input data, of shape (N, C) where x[i, j] is the score for the jth
class for the ith input.
- Y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
0 <= y[i] < C
Returns a tuple of:
- L: Scalar giving the loss
- dx: Gradient of the loss with respect to x
"""
$$L_i=-\log{\frac{e^{X_{i,y_i}}}{\sum_{j}e^{X_{i,j}}}}\tag{3.1}$$
$$L=\frac{1}{N}\sum_i{L_i}\tag{3.2}$$
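As a concrete illustration, a direct NumPy translation of (3.1) and (3.2) could look like the following sketch; the scores X and labels y are made-up values, not part of the original text:

```python
import numpy as np

# Hypothetical scores for N=2 inputs over C=3 classes, with labels y.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.5, 0.1]])
y = np.array([2, 0])
N = X.shape[0]

# Direct translation of (3.1)-(3.2). Note that np.exp overflows for
# large scores: np.exp(1000) already evaluates to inf.
exp_scores = np.exp(X)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()
```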
To prevent numerical overflow, implementations typically apply the following transformation:
$$\begin{aligned}L_i&=-\log{\frac{e^{X_{i,y_i}}}{\sum_{j}e^{X_{i,j}}}}\\&=-\log{\frac{e^{\max\{X_{i,\cdot}\}}\,e^{\left(X_{i,y_i}-\max\{X_{i,\cdot}\}\right)}}{e^{\max\{X_{i,\cdot}\}}\sum_{j}e^{\left(X_{i,j}-\max\{X_{i,\cdot}\}\right)}}}\\&=-\log{\frac{e^{\left(X_{i,y_i}-\max\{X_{i,\cdot}\}\right)}}{\sum_{j}e^{\left(X_{i,j}-\max\{X_{i,\cdot}\}\right)}}}\end{aligned}\tag{3.3}$$
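A minimal sketch of the shift in (3.3), again with made-up scores X and labels y, chosen large enough that the naive form above would overflow:

```python
import numpy as np

X = np.array([[1000.0, 2000.0, 3000.0]])  # large scores: np.exp(3000) is inf
y = np.array([2])
N = X.shape[0]

# Shift each row by its maximum score before exponentiating, as in (3.3).
# Every shifted exponent is <= 0, so np.exp cannot overflow.
shifted = X - X.max(axis=1, keepdims=True)
exp_shifted = np.exp(shifted)
probs = exp_shifted / exp_shifted.sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()
```

Since $e^x$ is monotonic, subtracting the row maximum leaves every probability, and hence the loss, unchanged, exactly as (3.3) shows.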
For backpropagation, the derivation is as follows.
Since the transformation in (3.3) does not change the value of the function, the derivation can proceed from the untransformed form.
We consider two cases:
(1) Gradient with respect to $X_{i,y_i}$:
$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,y_i}}}&=\frac{1}{N}\frac{\partial{L_i}}{\partial{X_{i,y_i}}}\\&=-\frac{1}{N}\frac{\sum_j{e^{X_{i,j}}}}{e^{X_{i,y_i}}}\cdot\frac{e^{X_{i,y_i}}\sum_{j}e^{X_{i,j}}-\left(e^{X_{i,y_i}}\right)^2}{\left(\sum_{j}e^{X_{i,j}}\right)^2}\\&=\frac{1}{N}\frac{e^{X_{i,y_i}}-\sum_j{e^{X_{i,j}}}}{\sum_j{e^{X_{i,j}}}}\\&=\frac{1}{N}\left(\frac{e^{X_{i,y_i}}}{\sum_{j}e^{X_{i,j}}}-1\right)\end{aligned}\tag{3.4}$$
(2) Gradient with respect to $X_{i,k}$ $(k\neq y_i)$:
$$\begin{aligned}\frac{\partial{L}}{\partial{X_{i,k}}}&=\frac{1}{N}\frac{\partial{L_i}}{\partial{X_{i,k}}}\\&=\frac{1}{N}\frac{\sum_j{e^{X_{i,j}}}}{e^{X_{i,y_i}}}\cdot\frac{e^{X_{i,y_i}}e^{X_{i,k}}}{\left(\sum_{j}e^{X_{i,j}}\right)^2}\\&=\frac{1}{N}\frac{e^{X_{i,k}}}{\sum_{j}e^{X_{i,j}}}\end{aligned}\tag{3.5}$$
Let
$$p_{i,k}=\frac{e^{X_{i,k}}}{\sum_{j}e^{X_{i,j}}}\tag{3.6}$$
Then we have
$$\frac{\partial{L}}{\partial{X_{i,y_i}}}=\frac{1}{N}\left(p_{i,y_i}-1\right)\tag{3.7}$$
$$\frac{\partial{L}}{\partial{X_{i,k}}}=\frac{1}{N}p_{i,k}\tag{3.8}$$
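Putting (3.2), (3.7), and (3.8) together, the gradient is simply the probability matrix with 1 subtracted at each correct-class entry, scaled by $1/N$. A vectorized sketch matching the docstring interface above (an illustration under those assumptions, not the official assignment solution):

```python
import numpy as np

def softmax_loss(X, y):
    """Softmax loss and gradient, assembled from (3.2), (3.3), (3.7), (3.8).

    Inputs:
    - X: scores, of shape (N, C)
    - y: labels, of shape (N,), with 0 <= y[i] < C
    Returns a tuple of (loss, dx).
    """
    N = X.shape[0]
    # Numerically stable probabilities p_{i,k} from (3.3) and (3.6).
    shifted = X - X.max(axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    probs = exp_shifted / exp_shifted.sum(axis=1, keepdims=True)
    # Loss (3.2): average of -log p_{i, y_i} over the batch.
    loss = -np.log(probs[np.arange(N), y]).mean()
    # Gradient: (3.8) gives p_{i,k} / N at every entry, and (3.7)
    # additionally subtracts 1/N at the correct-class entries.
    dx = probs.copy()
    dx[np.arange(N), y] -= 1.0
    dx /= N
    return loss, dx
```

The analytic dx can then be verified against a centered finite-difference estimate of the loss, which is how the cs231n assignments check gradients.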