Backpropagation is the foundation of neural networks, yet many people struggle with it when learning, or find the pages of derivation formulas in textbooks intimidating. If you look closely, it is nothing more than the chain rule applied over and over. This post walks through backpropagation in as much detail as possible and ends with a simple code implementation; no rush, let me take it slowly.
1. Forward Propagation
First, let's look at a diagram of a simple two-layer neural network (drawn by myself, so it's a bit rough).
Let me first explain the parameters in the figure. $x_{0}$ and $y_{0}$ are the bias inputs of the input layer and the hidden layer (you can ignore the biases for now); the derivation below includes the biases by default, except that the bias $b$ is written as $w_{0k}y_{0}$ and $v_{0j}x_{0}$, where $y_{0}, x_{0}$ play the role of $b$ and $w_{0k}, v_{0j}$ default to 1. $x_{1}, x_{2}, \ldots, x_{i}, \ldots, x_{n}$ are the $n$ input features; $y_{1}, y_{2}, \ldots, y_{j}, \ldots, y_{m}$ are the $m$ outputs of the hidden layer after the activation function; $o_{1}, \ldots, o_{k}, \ldots, o_{l}$ are the $l$ predictions of the output layer after the activation function. $v_{ij}$ denotes the weight from the $i$-th neuron of the input layer to the $j$-th neuron of the hidden layer, and $w_{jk}$ denotes the weight from the $j$-th neuron of the hidden layer to the $k$-th neuron of the output layer.
Now that we understand these parameters, let's quickly go over forward propagation first; this part should be easy.
(1) Input layer ----> hidden layer:
$$net_{j}=\sum_{i=0}^{n}v_{ij}x_{i} \qquad j=1,2,3,\ldots,m$$
$$net_{j}=v_{0j}x_{0}\ (\text{this is }b)+v_{1j}x_{1}+v_{2j}x_{2}+v_{3j}x_{3}+\cdots+v_{nj}x_{n}$$
$$y_{j}=f(net_{j}) \qquad j=1,2,3,\ldots,m$$
(2) Hidden layer ----> output layer:
$$net_{k}=\sum_{j=0}^{m}w_{jk}y_{j} \qquad k=1,2,3,\ldots,l$$
$$net_{k}=w_{0k}y_{0}\ (\text{this is }b)+w_{1k}y_{1}+w_{2k}y_{2}+w_{3k}y_{3}+\cdots+w_{mk}y_{m}$$
$$o_{k}=f(net_{k}) \qquad k=1,2,3,\ldots,l$$
In the formulas above, $net_{j}$ and $net_{k}$ are analogous to the output of linear regression ($Y=WX+B$), and $f(x)$ is the activation function. Here we use the sigmoid activation, i.e. $f(x)=\frac{1}{1+e^{-x}}$. $y_{j}$ and $o_{k}$ are the results of the hidden layer and the output layer after the activation function, and the final $o_{k}$ is the prediction produced by forward propagation.
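To make the forward pass above concrete, here is a minimal NumPy sketch for a single sample. The sizes, values, and variable names (`V`, `W`, `b_v`, `b_w`) are my own illustrative assumptions, not something taken from the figure:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

# assumed sizes: n = 3 input features, m = 4 hidden units, l = 2 outputs
x = np.array([0.5, 0.1, 0.9])      # input features x_1 .. x_n
V = np.random.rand(3, 4)           # v_ij: input -> hidden weights
b_v = np.random.rand(4)            # hidden-layer bias, i.e. v_0j * x_0
W = np.random.rand(4, 2)           # w_jk: hidden -> output weights
b_w = np.random.rand(2)            # output-layer bias, i.e. w_0k * y_0

net_j = x @ V + b_v                # net_j = sum_i v_ij * x_i  (bias included)
y = sigmoid(net_j)                 # y_j = f(net_j)
net_k = y @ W + b_w                # net_k = sum_j w_jk * y_j  (bias included)
o = sigmoid(net_k)                 # o_k = f(net_k): the forward-pass predictions
print(o)
```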
2. Backpropagation
Next we define a loss function. The simplest and most familiar one is the mean squared error, $\frac{1}{2}\sum (d - o)^2$, where $d$ is the true value (the label), $o$ is the prediction obtained from the forward pass above, and the factor $\frac{1}{2}$ is only there to make differentiation convenient; it does not change what the loss measures. So here
$$loss=\frac{1}{2}\sum_{k=1}^{l} (d_{k} - o_{k})^2$$
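As a quick sanity check of this definition (the label and prediction values below are made up for illustration), the loss is a one-liner in NumPy:

```python
import numpy as np

d = np.array([1.0, 0.0])             # labels d_k (illustrative values)
o = np.array([0.7, 0.2])             # predictions o_k from the forward pass

loss = 0.5 * np.sum((d - o) ** 2)    # loss = 1/2 * sum_k (d_k - o_k)^2
print(loss)                          # 0.065
```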
And since:
$$y_{j}=f(net_{j})=f\left(\sum_{i=0}^{n}v_{ij}x_{i}\right) \qquad (1.1)$$
$$o_{k}=f(net_{k})=f\left(\sum_{j=0}^{m}w_{jk}y_{j}\right) \qquad (1.2)$$
the final $loss$ is therefore equivalent to:
$$loss=\frac{1}{2}\sum_{k=1}^{l} (d_{k} - o_{k})^2 \qquad (1.3)$$
$$=\frac{1}{2}\sum_{k=1}^{l} \left[d_{k} - f(net_{k})\right]^2$$
$$=\frac{1}{2}\sum_{k=1}^{l} \left[d_{k} - f\left(\sum_{j=0}^{m}w_{jk}y_{j}\right)\right]^2$$
$$=\frac{1}{2}\sum_{k=1}^{l} \left[d_{k} - f\left(\sum_{j=0}^{m}w_{jk}f(net_{j})\right)\right]^2$$
$$=\frac{1}{2}\sum_{k=1}^{l} \left[d_{k} - f\left(\sum_{j=0}^{m}w_{jk}f\left(\sum_{i=0}^{n}v_{ij}x_{i}\right)\right)\right]^2$$
Now that we have the full expression for $loss$, our goal is obviously to make $loss$ as small as possible, which means optimizing $w, v$. The most widely used method is of course gradient descent: keep iterating and updating $w, v$ until we reach the values that minimize $loss$. From the gradient-descent parameter update rule we get:
$$w = w + \Delta w_{jk} \qquad (1.4)$$
$$v = v + \Delta v_{ij} \qquad (1.5)$$
$$b_{w} = b_{w} + \Delta b_{0k} \qquad (1.6)$$
$$b_{v} = b_{v} + \Delta b_{0j} \qquad (1.7)$$
(here $\Delta$ denotes the increment added to each parameter; it will carry a minus sign times the gradient, so adding it moves the parameter downhill)
Here $\eta$ is the learning rate, which you set yourself. Then:
$$\Delta w_{jk}=-\eta \frac{\partial loss}{\partial w_{jk}}=-\eta\frac{\partial loss}{\partial net_{k}}\times\frac{\partial net_{k}}{\partial w_{jk}} \qquad (1.8) \qquad \text{with}\quad \frac{\partial net_{k}}{\partial w_{jk}}=y_{j}$$
$$\Delta v_{ij}=-\eta \frac{\partial loss}{\partial v_{ij}}=-\eta\frac{\partial loss}{\partial net_{j}}\times\frac{\partial net_{j}}{\partial v_{ij}} \qquad (1.9) \qquad \text{with}\quad \frac{\partial net_{j}}{\partial v_{ij}}=x_{i}$$
Let us define:
$$\delta_{k}^{o}=-\frac{\partial loss}{\partial net_{k}} \qquad (2.0)$$
$$\delta_{j}^{y}=-\frac{\partial loss}{\partial net_{j}} \qquad (2.1)$$
We call $\delta_{k}^{o}$ and $\delta_{j}^{y}$ the learning signals. Substituting (2.0) and (2.1) into (1.8) and (1.9):
$$\Delta w_{jk}=\eta\delta_{k}^{o}y_{j}$$
$$\Delta v_{ij}=\eta\delta_{j}^{y}x_{i}$$
Now let us expand (2.0) and (2.1):
$$\delta_{k}^{o}=-\frac{\partial loss}{\partial net_{k}}=-\frac{\partial loss}{\partial o_{k}}\times\frac{\partial o_{k}}{\partial net_{k}}=-\frac{\partial loss}{\partial o_{k}}\times f'(net_{k})$$
$$\delta_{j}^{y}=-\frac{\partial loss}{\partial net_{j}}=-\frac{\partial loss}{\partial y_{j}}\times\frac{\partial y_{j}}{\partial net_{j}}=-\frac{\partial loss}{\partial y_{j}}\times f'(net_{j})$$
$$\frac{\partial loss}{\partial o_{k}}=-(d_{k}-o_{k})$$
$$\frac{\partial loss}{\partial y_{j}}=\sum_{k=1}^{l}\frac{\partial loss}{\partial o_{k}}\times\frac{\partial o_{k}}{\partial net_{k}}\times\frac{\partial net_{k}}{\partial y_{j}}=-\sum_{k=1}^{l}(d_{k}-o_{k})\times f'(net_{k})\times w_{jk}$$
(the sum over $k$ appears here because $y_{j}$ feeds into every output neuron)
Substituting these two partial derivatives back into the expanded forms of (2.0) and (2.1), and noting that for the sigmoid
$$f(x)= \frac{1}{1+e^{-x}}$$
$$f'(x)= f(x)\times (1 - f(x))$$
we get:
$$\delta_{k}^{o}=(d_{k}-o_{k})\times f'(net_{k})=(d_{k}-o_{k})\times o_{k}\times (1 - o_{k}) \qquad (2.2)$$
(look closely and you will see this is just the derivative of the loss multiplied by the derivative of the activation)
$$\delta_{j}^{y}=\sum_{k=1}^{l}(d_{k}-o_{k})\times f'(net_{k})\times w_{jk}\times f'(net_{j})$$
$$=\left(\sum_{k=1}^{l}\delta_{k}^{o}\times w_{jk}\right)\times y_{j}\times (1 - y_{j}) \qquad (2.3)$$
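Both learning signals rely on the sigmoid derivative identity $f'(x)=f(x)(1-f(x))$ used above. If you want to convince yourself of it numerically, here is a tiny check I added (purely illustrative), comparing it against a finite-difference derivative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite-difference f'(x)
analytic = sigmoid(x) * (1 - sigmoid(x))                     # f(x) * (1 - f(x))
print(np.max(np.abs(numeric - analytic)))                    # close to zero: the two agree
```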
Now that we have the final expressions for the learning signals, we can substitute them back into the gradient-descent parameter update rules:
$$\Delta w_{jk}=\eta\delta_{k}^{o}y_{j}=\eta(d_{k}-o_{k})\times o_{k} \times(1 - o_{k})\times y_{j} \qquad (2.4)$$
$$\Delta v_{ij}=\eta\delta _{j}^{y}x_{i}=\eta\left(\sum_{k=1}^{l}\delta_{k}^{o}\times w_{jk}\right)\times y_{j}\times (1 - y_{j})\times x_{i} \qquad (2.5)$$
$$w = w + \Delta w_{jk}$$
$$v = v + \Delta v_{ij}$$
Next we compute $\Delta b_{0k}$ and $\Delta b_{0j}$, which is even simpler, because we know the weight attached to each bias is 1, so:
$$\Delta b_{0k}=-\eta \frac{\partial loss}{\partial b_{0k}}=-\eta\frac{\partial loss}{\partial o_{k}}\times\frac{\partial o_{k}}{\partial net_{k}}\times\frac{\partial net_{k}}{\partial b_{0k}}=\eta(d_{k}-o_{k})\times f'(net_{k})\times1=\eta\delta_{k}^{o} \qquad (2.6)$$
$$\Delta b_{0j}=-\eta \frac{\partial loss}{\partial b_{0j}}=-\eta\sum_{k=1}^{l}\frac{\partial loss}{\partial o_{k}}\times\frac{\partial o_{k}}{\partial net_{k}}\times\frac{\partial net_{k}}{\partial y_{j}}\times\frac{\partial y_{j}}{\partial net_{j}}\times\frac{\partial net_{j}}{\partial b_{0j}}=\eta\sum_{k=1}^{l}(d_{k}-o_{k})\times f'(net_{k})\times w_{jk}\times f'(net_{j})\times1=\eta\delta_{j}^{y} \qquad (2.7)$$
Then update the biases:
$$b_{w} = b_{w} + \Delta b_{0k}$$
$$b_{v} = b_{v} + \Delta b_{0j}$$
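To see formulas (2.2) through (2.7) in action on the smallest possible case, here is a toy single-step update with one input, one hidden unit, and one output; all the numbers are made up purely for illustration:

```python
import numpy as np

def f(x):
    # sigmoid activation
    return 1 / (1 + np.exp(-x))

# a made-up toy network: 1 input, 1 hidden unit, 1 output
x1, d1 = 0.5, 1.0        # input feature and its label
v11, bv = 0.8, 0.1       # input -> hidden weight and hidden-layer bias
w11, bw = -0.4, 0.2      # hidden -> output weight and output-layer bias
eta = 0.1                # learning rate

# forward pass
net_j = v11 * x1 + bv
y1 = f(net_j)
net_k = w11 * y1 + bw
o1 = f(net_k)

# learning signals, formulas (2.2) and (2.3)
delta_o = (d1 - o1) * o1 * (1 - o1)
delta_y = delta_o * w11 * y1 * (1 - y1)

# increments and updates, formulas (2.4)-(2.7): parameter = parameter + increment
w11 += eta * delta_o * y1
bw  += eta * delta_o
v11 += eta * delta_y * x1
bv  += eta * delta_y
print(w11, bw, v11, bv)
```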
At this point the weights and biases of both layers have been updated once. If you want to use a different activation function or loss function, you only need to swap in the corresponding activation derivative and loss derivative at the matching places. You can also see that the speed of the parameter updates (the convergence speed) depends on the learning rate $\eta$, so it is best to start lr from a small value. The derivation above is written out in great detail; it may be confusing on a first read, but if you work through it by hand once it becomes very clear. Backpropagation in matrix form is the same idea with essentially the same steps, which you can derive yourself. Below I implement the backpropagation process in Python.
3. Implementing Simple Backpropagation in Code
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-1 * x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (np.ones(s.shape) - s)


def mean_square_loss(s, y):
    return np.sum(np.square(s - y) / 2)


def d_mean_square_loss(s, y):
    return s - y


def forward(W1, W2, b1, b2, X, y):
    # input layer -> hidden layer
    y1 = np.matmul(X, W1) + b1   # [2, 3]
    z1 = sigmoid(y1)             # [2, 3]
    # hidden layer -> output layer
    y2 = np.matmul(z1, W2) + b2  # [2, 2]
    z2 = sigmoid(y2)             # [2, 2]
    # mean squared error loss
    loss = mean_square_loss(z2, y)
    return y1, z1, y2, z2, loss


def backward_update(epochs, lr=0.01):
    # made-up data, weights, and biases; you can also initialize them with numpy's random()
    X = np.array([[0.6, 0.1], [0.3, 0.6]])
    y = np.array([0, 1])  # broadcast against z2, so both samples share the target [0, 1]
    W1 = np.array([[0.4, 0.3, 0.6], [0.3, 0.4, 0.2]])
    b1 = np.array([0.4, 0.1, 0.2])
    W2 = np.array([[0.2, 0.3], [0.3, 0.4], [0.5, 0.3]])
    b2 = np.array([0.1, 0.2])
    # run one forward pass first
    y1, z1, y2, z2, loss = forward(W1, W2, b1, b2, X, y)
    for i in range(epochs):
        # learning signal of the output layer (loss derivative times activation derivative), cf. (2.2);
        # this is the negative of delta, i.e. the plain gradient, so below we subtract lr * gradient,
        # which is the same as adding the increment derived above
        ds2 = d_mean_square_loss(z2, y) * d_sigmoid(y2)
        # per formula (2.4) above (ignoring the learning rate): signal times the hidden output z1;
        # note the transpose to make the shapes match
        dW2 = np.matmul(z1.T, ds2)
        # output-layer bias gradient (formula 2.6), summed over the batch dimension (axis=0)
        db2 = np.sum(ds2, axis=0)
        # first part of formula (2.5): propagate the signal back through W2
        dx = np.matmul(ds2, W2.T)
        # compare with formula (2.3)
        ds1 = d_sigmoid(y1) * dx
        # formula (2.5)
        dW1 = np.matmul(X.T, ds1)
        # hidden-layer bias gradient (formula 2.7), summed over the batch dimension (axis=0)
        db1 = np.sum(ds1, axis=0)
        # parameter updates
        W1 = W1 - lr * dW1
        b1 = b1 - lr * db1
        W2 = W2 - lr * dW2
        b2 = b2 - lr * db2
        y1, z1, y2, z2, loss = forward(W1, W2, b1, b2, X, y)
        # print the loss every 100 iterations
        if i % 100 == 0:
            print('batch %d' % (i / 100))
            print('current loss: {:.4f}'.format(loss))
    print(z2)
    # with the sigmoid activation, outputs above 0.5 are classified as positive, below 0.5 as negative
    z2[z2 > 0.5] = 1
    z2[z2 < 0.5] = 0
    print(z2)


if __name__ == '__main__':
    backward_update(epochs=50001, lr=0.01)
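A nice way to double-check the hand-derived gradients is a finite-difference gradient check. The sketch below is something I added for verification rather than part of the original write-up; it assumes it is appended to the script above so it can reuse `forward`, `d_sigmoid`, `d_mean_square_loss`, and the same toy data:

```python
def gradient_check(eps=1e-6):
    # same toy data and initial parameters as in backward_update
    X = np.array([[0.6, 0.1], [0.3, 0.6]])
    y = np.array([0, 1])
    W1 = np.array([[0.4, 0.3, 0.6], [0.3, 0.4, 0.2]])
    b1 = np.array([0.4, 0.1, 0.2])
    W2 = np.array([[0.2, 0.3], [0.3, 0.4], [0.5, 0.3]])
    b2 = np.array([0.1, 0.2])

    # analytic gradient of the loss w.r.t. W2, exactly as computed in backward_update
    y1, z1, y2, z2, _ = forward(W1, W2, b1, b2, X, y)
    ds2 = d_mean_square_loss(z2, y) * d_sigmoid(y2)
    dW2 = np.matmul(z1.T, ds2)

    # numerical gradient: perturb each entry of W2 and watch how the loss changes
    num_dW2 = np.zeros_like(W2)
    for r in range(W2.shape[0]):
        for c in range(W2.shape[1]):
            W2_plus, W2_minus = W2.copy(), W2.copy()
            W2_plus[r, c] += eps
            W2_minus[r, c] -= eps
            loss_plus = forward(W1, W2_plus, b1, b2, X, y)[-1]
            loss_minus = forward(W1, W2_minus, b1, b2, X, y)[-1]
            num_dW2[r, c] = (loss_plus - loss_minus) / (2 * eps)

    # the two should agree to many decimal places
    print(np.max(np.abs(dW2 - num_dW2)))
```

Calling `gradient_check()` should print a number very close to zero, which confirms that the analytic gradient in the code matches the derivation.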
After about 50,000 training iterations the loss has essentially converged and is close to 0. That is all there is to the backpropagation code. Go through the derivation above a few more times and read the code side by side with the derived formulas, and it will all become clear. If anything is still unclear, feel free to leave a comment and I will reply as soon as I can (* ̄︶ ̄)