Dataset:
We use the MNIST dataset. Each sample is a pair $(x, y)$: $x$ is originally a $28\times 28$ grayscale image; after normalizing the pixel values it is reshaped into a $(28*28, 1)$ column vector. $y$ is originally an integer indicating which class the image belongs to; it is one-hot encoded into a $(10, 1)$ column vector (1 at the position given by the integer, 0 everywhere else).
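A minimal preprocessing sketch in NumPy, assuming the raw images and integer labels are already loaded as arrays (the helper names below are my own, not from the original code):

```python
import numpy as np

def preprocess_image(img):
    """Normalize a 28x28 uint8 grayscale image and reshape it to a (784, 1) column vector."""
    x = img.astype(np.float64) / 255.0   # scale pixel values to [0, 1]
    return x.reshape(28 * 28, 1)

def one_hot(label, num_classes=10):
    """Encode an integer label as a (10, 1) one-hot column vector."""
    y = np.zeros((num_classes, 1))
    y[label] = 1.0
    return y
```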
Some functions:
$$\sigma(x)=\frac{1}{1+e^{-x}}$$

$$\sigma'(x)=\sigma(x)\cdot(1-\sigma(x))$$

$$softmax(x)=\frac{e^{x}}{\sum_{i=1}^{n}e^{x_i}}=\frac{e^{x}}{1^Te^{x}}$$
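These three functions are straightforward to implement in NumPy. The max-subtraction inside `softmax` is a standard numerical-stability trick of my own choosing, not part of the formula; it does not change the result:

```python
import numpy as np

def sigmoid(x):
    """Element-wise sigmoid: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def softmax(x):
    """Column-vector softmax: e^x / (1^T e^x)."""
    e = np.exp(x - np.max(x))   # subtract max(x) for numerical stability
    return e / np.sum(e)
```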
Notes:
- $\odot$ denotes element-wise multiplication of two matrices of the same size.
- Wrapping a scalar in a trace leaves it unchanged.
- The transpose of a scalar is the scalar itself.
- For the derivative of a scalar with respect to a matrix, the relation between derivative and differential gives $df(x) = tr\left(\left(\frac{\partial f(x)}{\partial x}\right)^T d(x)\right)$ (see the worked example right after this list).
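As a quick check of this last identity (my own example, not from the original): take the scalar $f(x)=a^Tx$ with $a$ a constant column vector. Then

$$df = a^T\,dx = tr(a^T\,dx) \quad\Longrightarrow\quad \frac{\partial f}{\partial x} = a,$$

which matches the pattern $df = tr\left(\left(\frac{\partial f}{\partial x}\right)^T dx\right)$.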
Network structure:
$$a_1=w_1\cdot x+b_1$$

$$h_1=\sigma(a_1)$$

$$a_2=w_2\cdot h_1+b_2$$

$$h_2=\sigma(a_2)$$

$$a_3=w_3\cdot h_2+b_3$$

$$y'=softmax(a_3)$$
$y'$ is the network output and has the same shape as the label $y$.
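A forward-pass sketch in NumPy, using the `sigmoid` and `softmax` helpers above. The hidden-layer sizes and the small random initialization are my own assumptions; only the 784-dimensional input and 10-dimensional output are fixed by the dataset:

```python
import numpy as np

# Hypothetical layer sizes: 784 -> 128 -> 64 -> 10.
n_in, n_h1, n_h2, n_out = 784, 128, 64, 10

rng = np.random.default_rng(0)
w1 = rng.normal(0, 0.01, (n_h1, n_in));  b1 = np.zeros((n_h1, 1))
w2 = rng.normal(0, 0.01, (n_h2, n_h1));  b2 = np.zeros((n_h2, 1))
w3 = rng.normal(0, 0.01, (n_out, n_h2)); b3 = np.zeros((n_out, 1))

def forward(x):
    """Forward pass for one (784, 1) column vector, following the equations above."""
    a1 = w1 @ x + b1
    h1 = sigmoid(a1)
    a2 = w2 @ h1 + b2
    h2 = sigmoid(a2)
    a3 = w3 @ h2 + b3
    y_hat = softmax(a3)          # y_hat plays the role of y'
    return a1, h1, a2, h2, a3, y_hat
```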
Loss function:
We use the cross-entropy loss:
$$l(y', y) = -y^T\ln(y')$$
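In code this is a single dot product. The small `eps` below is my own numerical safeguard against $\ln(0)$, not part of the formula:

```python
def cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy loss l(y', y) = -y^T ln(y') for a one-hot label y."""
    return (-(y.T @ np.log(y_hat + eps))).item()
```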
Gradients of the loss $l$ with respect to each parameter:
$$\begin{aligned} l & = -y^T\ln{\frac{e^{a_3}}{1^Te^{a_3}}} \\ & = -y^T\left(a_3 - 1\cdot \ln(1^Te^{a_3})\right) \\ & = -y^Ta_3 + \ln(1^T e^{a_3}) \end{aligned}$$

(the last step uses $y^T 1 = 1$, since $y$ is one-hot)
$$\begin{aligned} dl & = tr(-y^Td(a_3)) + tr\left(\frac{1^T\left(e^{a_3}\odot d(a_3)\right)}{1^Te^{a_3}}\right) \\ & = tr(-y^Td(a_3)) + tr\left(\frac{(1\odot e^{a_3})^Td(a_3)}{1^Te^{a_3}}\right) \\ & = tr(-y^Td(a_3)) + tr\left(\frac{(e^{a_3})^Td(a_3)}{1^Te^{a_3}}\right) \\ & = tr(-y^Td(a_3)) + tr\left(\left(\frac{e^{a_3}}{1^Te^{a_3}}\right)^T d(a_3)\right) \\ & = tr(-y^Td(a_3)) + tr\left((softmax(a_3))^T d(a_3)\right) \\ & = tr\left[\left((softmax(a_3))^T - y^T\right)\, d(a_3)\right] \\ & = tr\left[\left(softmax(a_3) - y\right)^T d(a_3)\right] \end{aligned}$$
Therefore $\frac{\partial l}{\partial a_3} = softmax(a_3) - y$.
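Continuing the NumPy sketch above, this gradient is a single subtraction:

```python
# Gradient of the loss with respect to the pre-softmax activations a3.
grad_a3 = y_hat - y      # shape (10, 1), equals softmax(a3) - y
```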
By the same reasoning, we also have
$$\begin{aligned} dl & = tr\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(a_3)\right] \\ & = tr\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(w_3)\,h_2\right] + tr\left[\left(\frac{\partial l}{\partial a_3}\right)^T w_3\, d(h_2)\right] + tr\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(b_3)\right] \\ & = tr\left[h_2\left(\frac{\partial l}{\partial a_3}\right)^T d(w_3)\right] + tr\left[\left(\frac{\partial l}{\partial a_3}\right)^T w_3\, d(h_2)\right] + tr\left[\left(\frac{\partial l}{\partial a_3}\right)^T d(b_3)\right] \end{aligned}$$
From the first term we get
$$\frac{\partial l}{\partial w_3} = \frac{\partial l}{\partial a_3}h_2^T$$
From the second term we get
$$\frac{\partial l}{\partial h_2} = w_3^T \frac{\partial l}{\partial a_3}$$

and since $\sigma(x)$ is an element-wise function,

$$\frac{\partial l}{\partial a_2} = \frac{\partial l}{\partial h_2} \odot \sigma'(a_2)$$
From the third term we get
$$\frac{\partial l}{\partial b_3} = \frac{\partial l}{\partial a_3}$$
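Translated into the NumPy sketch (variable names continue from the forward pass above; `*` is element-wise multiplication, matching $\odot$):

```python
# Backward step for the output layer.
grad_w3 = grad_a3 @ h2.T                 # dl/dw3 = (dl/da3) h2^T
grad_b3 = grad_a3                        # dl/db3 = dl/da3
grad_h2 = w3.T @ grad_a3                 # dl/dh2 = w3^T (dl/da3)
grad_a2 = grad_h2 * sigmoid_prime(a2)    # dl/da2 = dl/dh2 ⊙ sigma'(a2)
```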
In the same way, we also have
$$\begin{aligned} dl & = tr\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(a_2)\right] \\ & = tr\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(w_2)\,h_1\right] + tr\left[\left(\frac{\partial l}{\partial a_2}\right)^T w_2\, d(h_1)\right] + tr\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(b_2)\right] \\ & = tr\left[h_1\left(\frac{\partial l}{\partial a_2}\right)^T d(w_2)\right] + tr\left[\left(\frac{\partial l}{\partial a_2}\right)^T w_2\, d(h_1)\right] + tr\left[\left(\frac{\partial l}{\partial a_2}\right)^T d(b_2)\right] \end{aligned}$$
From the first term we get
$$\frac{\partial l}{\partial w_2} = \frac{\partial l}{\partial a_2}h_1^T$$
From the second term we get
$$\frac{\partial l}{\partial h_1} = w_2^T \frac{\partial l}{\partial a_2}$$

and since $\sigma(x)$ is an element-wise function,

$$\frac{\partial l}{\partial a_1} = \frac{\partial l}{\partial h_1} \odot \sigma'(a_1)$$
From the third term we get
$$\frac{\partial l}{\partial b_2} = \frac{\partial l}{\partial a_2}$$
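The corresponding NumPy lines, continuing the same sketch:

```python
# Backward step for the second hidden layer.
grad_w2 = grad_a2 @ h1.T                 # dl/dw2 = (dl/da2) h1^T
grad_b2 = grad_a2                        # dl/db2 = dl/da2
grad_h1 = w2.T @ grad_a2                 # dl/dh1 = w2^T (dl/da2)
grad_a1 = grad_h1 * sigmoid_prime(a1)    # dl/da1 = dl/dh1 ⊙ sigma'(a1)
```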
In the same way, we also have
$$\begin{aligned} dl & = tr\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(a_1)\right] \\ & = tr\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(w_1)\,x\right] + tr\left[\left(\frac{\partial l}{\partial a_1}\right)^T w_1\, d(x)\right] + tr\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(b_1)\right] \\ & = tr\left[x\left(\frac{\partial l}{\partial a_1}\right)^T d(w_1)\right] + tr\left[\left(\frac{\partial l}{\partial a_1}\right)^T w_1\, d(x)\right] + tr\left[\left(\frac{\partial l}{\partial a_1}\right)^T d(b_1)\right] \end{aligned}$$
From the first term we get
$$\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial a_1}x^T$$
For the second term, since $x$ is a constant, $d(x) = 0$ and the term vanishes.
From the third term we get
$$\frac{\partial l}{\partial b_1} = \frac{\partial l}{\partial a_1}$$
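And the first layer's gradients in the same NumPy sketch, where `x` is the input column vector:

```python
# Backward step for the first hidden layer.
grad_w1 = grad_a1 @ x.T                  # dl/dw1 = (dl/da1) x^T
grad_b1 = grad_a1                        # dl/db1 = dl/da1
```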
At this point the gradient of the loss $l$ with respect to every parameter $w_i$ and $b_i$ can be expressed in terms of known matrices, so gradient descent simply updates the parameters as follows:
$$w_i = w_i - lr \cdot \frac{\partial l}{\partial w_i}$$

$$b_i = b_i - lr \cdot \frac{\partial l}{\partial b_i}$$
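A per-sample update step in the same NumPy sketch (the learning-rate value is an arbitrary choice for illustration):

```python
lr = 0.1   # learning rate; value chosen only for illustration

# Vanilla gradient-descent update of all parameters.
w1 -= lr * grad_w1;  b1 -= lr * grad_b1
w2 -= lr * grad_w2;  b2 -= lr * grad_b2
w3 -= lr * grad_w3;  b3 -= lr * grad_b3
```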
Code:
The code is rather messy, sorry QAQ
传送门 (link)
References:
矩阵求导术