Preface
I previously wrote about single-layer feedforward networks, but that derivation was specific to the sigmoid function. In this post I derive the backpropagation algorithm using matrix-vector differentiation.
Notation

Symbol | Meaning |
---|---|
$S_{in}^i$ | input of the layer-$i$ neurons; if the layer has $n$ neurons, $S_{in}^i$ is an $n \times 1$ vector |
$S_{out}^i$ | output of the layer-$i$ neurons; if the layer has $n$ neurons, $S_{out}^i$ is an $n \times 1$ vector |
$W^i$ | weight matrix of layer $i$; if layer $i-1$ has $m$ neurons and layer $i$ has $n$ neurons, $W^i$ is an $n \times m$ matrix |
$B^i$ | bias vector of layer $i$; if the layer has $n$ neurons, $B^i$ is an $n \times 1$ vector |
$cost$ | value of the loss function |

If $x$ denotes $\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}$, then the layer-$i$ activation vector $f^i(x)$ denotes $\begin{bmatrix} f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n) \end{bmatrix}$, where $f(x)$ is the activation function, and $(f^i(x))'$ denotes $\begin{bmatrix} \frac{\partial{f(x_1)}}{{\partial x_1}}\\ \frac{\partial{f(x_2)}}{{\partial x_2}}\\ \vdots\\ \frac{\partial{f(x_n)}}{{\partial x_n}} \end{bmatrix}$.
Under these conventions, for layer $i$ we have

$$\begin{aligned} S_{out}^{i-1}&=f^{i-1}(S_{in}^{i-1})\\ S_{in}^i&=W^iS_{out}^{i-1}+B^i \end{aligned}$$
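The forward pass defined by these two equations can be sketched in NumPy. This is a minimal illustration under the shape conventions above; the function name `forward` and the use of a single sigmoid activation for every layer are my own assumptions, not part of the derivation.

```python
import numpy as np

def forward(x, weights, biases, f):
    """Forward pass: S_in^i = W^i S_out^{i-1} + B^i, S_out^i = f^i(S_in^i).

    x       : input column vector, shape (m, 1), treated as S_out^0
    weights : list of W^i matrices, W^i has shape (n_i, n_{i-1})
    biases  : list of B^i column vectors, each of shape (n_i, 1)
    f       : elementwise activation function
    Returns the lists of S_in^i and S_out^i for every layer.
    """
    s_in, s_out = [], [x]              # S_out^0 is the network input
    for W, B in zip(weights, biases):
        z = W @ s_out[-1] + B          # S_in^i = W^i S_out^{i-1} + B^i
        s_in.append(z)
        s_out.append(f(z))             # S_out^i = f^i(S_in^i), elementwise
    return s_in, s_out

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
```

Note that `W @ s_out[-1]` keeps every intermediate value a column vector, matching the $n \times 1$ convention in the notation table.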
The chain rule for the derivative of a scalar with respect to a vector
For an $n$-layer feedforward network, we have

$$cost\leftarrow S_{in}^n\leftarrow S_{in}^{n-1}\leftarrow\cdots\leftarrow S_{in}^1$$
Each left arrow denotes a mapping; for a feedforward network the mapping is

$$S_{in}^{i+1}=W^{i+1}f^{i}(S_{in}^{i})+B^{i+1}$$
The mapping from the last layer to the loss depends on the type of loss function (e.g. mean squared error or cross-entropy). On top of these mappings, the chain rule for the derivative of a scalar with respect to a vector gives
$$\begin{aligned} \frac{\partial cost}{\partial S_{in}^i}&=\left(\frac{\partial S_{in}^n}{\partial S_{in}^{n-1}}*\frac{\partial S_{in}^{n-1}}{\partial S_{in}^{n-2}}*\cdots*\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T*\frac{\partial cost}{\partial S_{in}^n}\\ &=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T*\cdots*\left(\frac{\partial S_{in}^{n-1}}{\partial S_{in}^{n-2}}\right)^T*\left(\frac{\partial S_{in}^n}{\partial S_{in}^{n-1}}\right)^T*\frac{\partial cost}{\partial S_{in}^n}\\ &=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T*\cdots*\left(\frac{\partial S_{in}^{n-1}}{\partial S_{in}^{n-2}}\right)^T*\frac{\partial cost}{\partial S_{in}^{n-1}}\\ &=\cdots\\ &=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T\frac{\partial cost}{\partial S_{in}^{i+1}} \end{aligned}$$
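The transpose form of this chain rule can be checked numerically. The sketch below builds a hypothetical single mapping $S_{in}^{i+1}=Wf(S_{in}^{i})+B$ with a sigmoid activation and a squared-error cost (both my own choices for illustration), forms the numerator-layout Jacobian explicitly, and compares $J^T\,\partial cost/\partial S_{in}^{i+1}$ against central finite differences:

```python
import numpy as np

# hypothetical layer: S_in^{i+1} = W f(S_in^i) + B, cost = 0.5*||S_in^{i+1} - t||^2
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4)); B = rng.standard_normal((3, 1))
s = rng.standard_normal((4, 1)); t = rng.standard_normal((3, 1))

sig = lambda u: 1.0 / (1.0 + np.exp(-u))
cost = lambda s: 0.5 * np.sum((W @ sig(s) + B - t) ** 2)

# numerator-layout Jacobian: dS_in^{i+1}/dS_in^i = W . diag(f'(S_in^i))
J = W @ np.diag((sig(s) * (1 - sig(s))).ravel())
# chain rule above: dcost/dS_in^i = J^T . dcost/dS_in^{i+1}
grad = J.T @ (W @ sig(s) + B - t)

# independent check via central finite differences
eps = 1e-6
num = np.zeros_like(s)
for k in range(s.size):
    e = np.zeros_like(s); e[k] = eps
    num[k] = (cost(s + e) - cost(s - e)) / (2 * eps)
```

The transpose appears because the numerator-layout Jacobian maps perturbations forward, while the gradient is pulled backward.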
A common vector-by-vector derivative formula
If $Y=AX+B$, where $Y$, $X$, and $B$ are vectors and $A$ is a matrix, then in numerator layout $\frac{\partial{Y}}{\partial{X}}=A$.
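This formula is easy to verify empirically: perturbing each component of $X$ and stacking the resulting changes in $Y$ column by column rebuilds the numerator-layout Jacobian, which should equal $A$ exactly since the map is linear. The shapes below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)); B = rng.standard_normal((3, 1))
X = rng.standard_normal((4, 1))

# build the numerator-layout Jacobian of Y = AX + B column by column
eps = 1e-6
J = np.zeros((3, 4))
for k in range(4):
    e = np.zeros((4, 1)); e[k] = eps
    J[:, k] = ((A @ (X + e) - A @ X) / eps).ravel()
```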
Deriving the backpropagation algorithm
Suppose we have an $n$-layer feedforward network. The gradient at layer $i$ is
$$\begin{aligned} \frac{\partial cost}{\partial S_{in}^i}&=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T*\frac{\partial cost}{\partial S_{in}^{i+1}}\\ &=(W^{i+1})^T*\frac{\partial cost}{\partial S_{in}^{i+1}} \odot (f^{i}(S_{in}^i))' \end{aligned}\tag{1}$$
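Eq. (1) is the backward recurrence that gives backpropagation its name. A minimal NumPy sketch, assuming sigmoid activations (the function name `backprop_delta` is illustrative):

```python
import numpy as np

def backprop_delta(delta_next, W_next, s_in_i, f_prime):
    """Eq. (1): dcost/dS_in^i = (W^{i+1})^T dcost/dS_in^{i+1} (Hadamard) (f^i(S_in^i))'.

    delta_next : dcost/dS_in^{i+1}, column vector
    W_next     : W^{i+1}
    s_in_i     : S_in^i, column vector
    f_prime    : derivative of the layer-i activation, applied elementwise
    """
    # NumPy's * on same-shaped arrays is exactly the Hadamard product
    return (W_next.T @ delta_next) * f_prime(s_in_i)

sig = lambda u: 1.0 / (1.0 + np.exp(-u))
sig_prime = lambda u: sig(u) * (1 - sig(u))
```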
Here $\odot$ denotes the Hadamard product, i.e. elementwise multiplication of matrices or vectors of the same shape: entries in the same position are multiplied together. To see why the last step holds, suppose layer $i$ has $n$ neurons:
$$\begin{aligned} (W^{i+1})^T*\frac{\partial cost}{\partial S_{in}^{i+1}}\odot (f^{i}(S_{in}^i))'=&\left(\frac{\partial S_{in}^{i+1}}{\partial f(S_{in}^{i})}\right)^T*\frac{\partial cost}{\partial S_{in}^{i+1}}\odot (f^{i}(S_{in}^i))'\\ =&\frac{\partial cost}{\partial f(S_{in}^{i})}\odot (f^{i}(S_{in}^i))'\\ =& \begin{bmatrix} \frac{\partial cost}{\partial f((S_{in}^{i})_1)}\\ \frac{\partial cost}{\partial f((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial cost}{\partial f((S_{in}^{i})_n)} \end{bmatrix}\odot (f^{i}(S_{in}^i))'\\ =& \begin{bmatrix} \frac{\partial cost}{\partial f((S_{in}^{i})_1)}\\ \frac{\partial cost}{\partial f((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial cost}{\partial f((S_{in}^{i})_n)} \end{bmatrix}\odot \begin{bmatrix} \frac{\partial {f((S_{in}^{i})_1)}}{\partial((S_{in}^{i})_1)}\\ \frac{\partial {f((S_{in}^{i})_2)}}{\partial((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial {f((S_{in}^{i})_n)}}{\partial((S_{in}^{i})_n)} \end{bmatrix}\\ =&\begin{bmatrix} \frac{\partial cost}{\partial f((S_{in}^{i})_1)}*\frac{\partial {f((S_{in}^{i})_1)}}{\partial((S_{in}^{i})_1)}\\ \frac{\partial cost}{\partial f((S_{in}^{i})_2)}*\frac{\partial {f((S_{in}^{i})_2)}}{\partial((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial cost}{\partial f((S_{in}^{i})_n)}*\frac{\partial {f((S_{in}^{i})_n)}}{\partial((S_{in}^{i})_n)} \end{bmatrix}\\ =&\begin{bmatrix} \frac{\partial cost}{\partial((S_{in}^{i})_1)}\\ \frac{\partial cost}{\partial((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial cost}{\partial((S_{in}^{i})_n)} \end{bmatrix}\\ =&\frac{\partial cost}{\partial S_{in}^i} \end{aligned}$$
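The step from the Hadamard product to the componentwise products is just a restatement of what elementwise multiplication means. A trivial check, with `g` and `fp` standing in for the two column vectors in the derivation (arbitrary values, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
g = rng.standard_normal((4, 1))    # stands in for dcost/df(S_in^i)
fp = rng.standard_normal((4, 1))   # stands in for (f^i(S_in^i))'
had = g * fp                       # Hadamard product: NumPy's * is elementwise

# the same thing written component by component, as in the matrices above
loop = np.array([[g[k, 0] * fp[k, 0]] for k in range(4)])
```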
Next come the gradients used to update the parameters. Once the gradient at layer $i$ is known, the gradients with respect to the weights and biases follow from the definition of the matrix derivative:
$$\begin{aligned} \frac{\partial cost}{\partial W^i}&=\frac{\partial cost}{\partial S_{in}^i}*(S_{out}^{i-1})^T \end{aligned}\tag{2}$$
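Since $\partial cost/\partial S_{in}^i$ is $n \times 1$ and $S_{out}^{i-1}$ is $m \times 1$, Eq. (2) is an outer product whose $n \times m$ shape matches $W^i$. A one-line sketch (the function name is illustrative):

```python
import numpy as np

def weight_grad(delta_i, s_out_prev):
    """Eq. (2): dcost/dW^i = dcost/dS_in^i . (S_out^{i-1})^T, an outer product.

    Entry (j, k) is delta_i[j] * s_out_prev[k]: how much weight W^i_{jk}
    contributed to the input of neuron j via the output of neuron k.
    """
    return delta_i @ s_out_prev.T
```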
$$\begin{aligned} \frac{\partial cost}{\partial B^i}&=\frac{\partial cost}{\partial S_{in}^i} \end{aligned}\tag{3}$$
$\frac{\partial cost}{\partial S_{in}^n}$ has to be worked out from the definition of the matrix derivative for the chosen loss. Once it is known, Eqs. (1), (2), and (3) give the gradients of all parameters. For the definition-based approach to matrix differentiation, see my earlier post (quick, click me, I can't wait).
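Putting Eqs. (1)–(3) together, here is a minimal end-to-end sketch. I assume sigmoid activations at every layer and a mean-squared-error loss $cost = \frac{1}{2}\lVert S_{out}^n - t\rVert^2$, for which the definition-based starting gradient is $\partial cost/\partial S_{in}^n = (S_{out}^n - t)\odot(f^n(S_{in}^n))'$; all function names are illustrative. The result can be validated against finite differences.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
sig_p = lambda z: sig(z) * (1 - sig(z))

def forward(x, Ws, Bs):
    """S_in^i = W^i S_out^{i-1} + B^i, S_out^i = sigmoid(S_in^i)."""
    s_in, s_out = [], [x]
    for W, B in zip(Ws, Bs):
        s_in.append(W @ s_out[-1] + B)
        s_out.append(sig(s_in[-1]))
    return s_in, s_out

def backward(x, t, Ws, Bs):
    """Full backprop for cost = 0.5*||S_out^n - t||^2, via Eqs. (1)-(3)."""
    s_in, s_out = forward(x, Ws, Bs)
    n = len(Ws)
    # starting gradient dcost/dS_in^n (derived by definition for MSE + sigmoid)
    delta = (s_out[-1] - t) * sig_p(s_in[-1])
    dWs, dBs = [None] * n, [None] * n
    for i in reversed(range(n)):
        dWs[i] = delta @ s_out[i].T               # Eq. (2)
        dBs[i] = delta                            # Eq. (3)
        if i > 0:
            delta = (Ws[i].T @ delta) * sig_p(s_in[i - 1])  # Eq. (1)
    return dWs, dBs
```

Gradient descent then updates each parameter as `W -= lr * dW`, `B -= lr * dB`.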