Note: When reposting, please include a link to the original article: https://xiongyiming.blog.csdn.net/article/details/99113544
1 Introduction to BP Neural Networks
The BP (back propagation) neural network is a concept proposed in 1986 by a group of scientists led by Rumelhart and McClelland: a multi-layer feed-forward neural network trained by the error back-propagation algorithm, and currently the most widely used type of neural network.
The BP algorithm (Back Propagation algorithm) is a learning algorithm suited to multi-layer neural networks and is built on gradient descent. The input-output relationship of a BP network is essentially a mapping: a BP network with $n$ inputs and $m$ outputs realizes a continuous mapping from $n$-dimensional Euclidean space to a finite domain in $m$-dimensional Euclidean space, and this mapping is highly nonlinear. Its information-processing capability comes from the repeated composition of simple nonlinear functions, which gives it a strong ability to reproduce functions. This is the foundation on which the BP algorithm is applied.
(The above is from Baidu Baike.)
2 Structure and Principle of the BP Neural Network
A typical neural network structure is shown in the figure below:
First, define the notation:
(1) $n_l$: the number of network layers; for the network in the figure above, $n_l = 4$, with $n_1$ denoting the input layer, $n_4$ the output layer, and the rest hidden layers.
(2) $w_{i,j}^{(l)}$: the connection weight between unit $i$ of layer $l+1$ and unit $j$ of layer $l$ (viewed from back to front).
(3) $b_i^{(l)}$: the bias (activation threshold) of unit $i$ in layer $l$.
(4) $z_i^{(l)}$: the accumulated weighted input of unit $i$ in layer $l$.
(5) $a_i^{(l)}$: the activation (output value) of unit $i$ in layer $l$.
(6) $h_{w,b}(X)$: the output of the final output layer.
(7) $S_l$: the number of neurons in layer $l$.
(8) $m$: the number of samples; $n$: the number of features.
(9) $f$: the activation function of the neurons.
With the definitions above, and referring to the network structure diagram:
The first layer ($n_1$):
$$a_i^{(1)} = x_i \tag{1}$$
The second layer ($n_2$), with 4 neurons:
$$z_1^{(2)} = \left( \sum_{j=1}^{3} w_{1,j}^{(1)} a_j^{(1)} \right) + b_1^{(1)} \tag{2}$$
$$a_1^{(2)} = f\left( z_1^{(2)} \right) = f\left( w_{1,1}^{(1)} a_1^{(1)} + w_{1,2}^{(1)} a_2^{(1)} + w_{1,3}^{(1)} a_3^{(1)} + b_1^{(1)} \right) \tag{3}$$
$$a_2^{(2)} = f\left( z_2^{(2)} \right) = f\left( w_{2,1}^{(1)} a_1^{(1)} + w_{2,2}^{(1)} a_2^{(1)} + w_{2,3}^{(1)} a_3^{(1)} + b_2^{(1)} \right) \tag{4}$$
$$a_3^{(2)} = f\left( z_3^{(2)} \right) = f\left( w_{3,1}^{(1)} a_1^{(1)} + w_{3,2}^{(1)} a_2^{(1)} + w_{3,3}^{(1)} a_3^{(1)} + b_3^{(1)} \right) \tag{5}$$
$$a_4^{(2)} = f\left( z_4^{(2)} \right) = f\left( w_{4,1}^{(1)} a_1^{(1)} + w_{4,2}^{(1)} a_2^{(1)} + w_{4,3}^{(1)} a_3^{(1)} + b_4^{(1)} \right) \tag{6}$$
The third layer, with 4 neurons:
$$z_1^{(3)} = \left( \sum_{j=1}^{4} w_{1,j}^{(2)} a_j^{(2)} \right) + b_1^{(2)} \tag{7}$$
$$a_1^{(3)} = f\left( z_1^{(3)} \right) = f\left( w_{1,1}^{(2)} a_1^{(2)} + w_{1,2}^{(2)} a_2^{(2)} + w_{1,3}^{(2)} a_3^{(2)} + w_{1,4}^{(2)} a_4^{(2)} + b_1^{(2)} \right) \tag{8}$$
$$a_2^{(3)} = f\left( z_2^{(3)} \right) = f\left( w_{2,1}^{(2)} a_1^{(2)} + w_{2,2}^{(2)} a_2^{(2)} + w_{2,3}^{(2)} a_3^{(2)} + w_{2,4}^{(2)} a_4^{(2)} + b_2^{(2)} \right) \tag{9}$$
$$a_3^{(3)} = f\left( z_3^{(3)} \right) = f\left( w_{3,1}^{(2)} a_1^{(2)} + w_{3,2}^{(2)} a_2^{(2)} + w_{3,3}^{(2)} a_3^{(2)} + w_{3,4}^{(2)} a_4^{(2)} + b_3^{(2)} \right) \tag{10}$$
$$a_4^{(3)} = f\left( z_4^{(3)} \right) = f\left( w_{4,1}^{(2)} a_1^{(2)} + w_{4,2}^{(2)} a_2^{(2)} + w_{4,3}^{(2)} a_3^{(2)} + w_{4,4}^{(2)} a_4^{(2)} + b_4^{(2)} \right) \tag{11}$$
The fourth layer, with 2 neurons:
$$z_1^{(4)} = \left( \sum_{j=1}^{4} w_{1,j}^{(3)} a_j^{(3)} \right) + b_1^{(3)} \tag{12}$$
$$a_1^{(4)} = f\left( z_1^{(4)} \right) = f\left( w_{1,1}^{(3)} a_1^{(3)} + w_{1,2}^{(3)} a_2^{(3)} + w_{1,3}^{(3)} a_3^{(3)} + w_{1,4}^{(3)} a_4^{(3)} + b_1^{(3)} \right) \tag{13}$$
$$a_2^{(4)} = f\left( z_2^{(4)} \right) = f\left( w_{2,1}^{(3)} a_1^{(3)} + w_{2,2}^{(3)} a_2^{(3)} + w_{2,3}^{(3)} a_3^{(3)} + w_{2,4}^{(3)} a_4^{(3)} + b_2^{(3)} \right) \tag{14}$$
$$h_{w,b}(X) = \left( a_1^{(4)}, a_2^{(4)} \right)^{\mathrm{T}} \tag{15}$$
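To make the forward pass concrete, here is a minimal NumPy sketch of equations (1) through (15) for this 3-4-4-2 network. The layer sizes come from the figure; the choice of sigmoid for $f$ and the random initialization are illustrative assumptions, not part of the original derivation:

import numpy as np

def sigmoid(z):
    # an assumed choice for the activation function f
    return 1.0 / (1.0 + np.exp(-z))

sizes = [3, 4, 4, 2]  # the 3-4-4-2 network from the figure
np.random.seed(0)
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]

def forward(x):
    """Apply z = W a + b and a = f(z) layer by layer: equations (2)-(14) in vector form."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(np.dot(W, a) + b)
    return a  # h_{w,b}(x), the 2-by-1 output of equation (15)

x = np.random.rand(3, 1)  # one sample with n = 3 features
print(forward(x))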
As in other machine-learning problems, for a single sample the mean squared error can be used as the loss (cost) function:
$$J(w,b;x,y) = \frac{1}{2}\left\| h_{w,b}(x) - y \right\|^2 \tag{16}$$
For all samples, the loss function is
$$J(w,b) = \left[ \frac{1}{m}\sum_{k=1}^{m} J(w,b;x^{(k)},y^{(k)}) \right] + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}} \left( w_{j,i}^{(l)} \right)^2 \tag{17}$$
so that
$$J(w,b) = \left[ \frac{1}{m}\sum_{k=1}^{m} \frac{1}{2}\left\| h_{w,b}(x^{(k)}) - y^{(k)} \right\|^2 \right] + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}} \left( w_{j,i}^{(l)} \right)^2 \tag{18}$$
Here the first term is the mean-squared-error term and the second is the regularization (penalty) term.
As can be seen, the loss function is a function of all the weights $w_{i,j}^{(l)}$ and biases $b_i^{(l)}$. Following the same approach as in ordinary machine learning, the best weights $w_{i,j}^{(l)}$ and biases $b_i^{(l)}$ are obtained by minimizing the loss function, which can be done with gradient descent. The resulting updates for the weights and biases are:
$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \frac{\partial}{\partial w_{i,j}^{(l)}} J\left( w,b \right) \tag{19}$$
$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J\left( w,b \right) \tag{20}$$
Although these update formulas look simple, the partial derivatives in them, unlike those in typical machine-learning problems, are very hard to compute directly.
So the question is: how do we actually obtain the best weights and biases?
This is exactly the problem the back-propagation (BP) algorithm solves: it is a convenient method for computing these partial derivatives, and it can be understood as working from the back of the network towards the front while exploiting a recurring pattern. The derivation follows below.
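As an aside, any gradient that back propagation produces can be verified with a numerical finite-difference check. The sketch below is a generic check, not part of the original article; the helper numerical_gradient and the toy loss are illustrative:

import numpy as np

def numerical_gradient(J, w, eps=1e-5):
    """Estimate dJ/dw element-wise via (J(w + eps) - J(w - eps)) / (2 * eps)."""
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = w[idx]
        w[idx] = orig + eps
        j_plus = J(w)
        w[idx] = orig - eps
        j_minus = J(w)
        w[idx] = orig  # restore the original entry
        grad[idx] = (j_plus - j_minus) / (2 * eps)
        it.iternext()
    return grad

# Toy check: J(w) = 1/2 ||w x - y||^2 has analytic gradient (w x - y) x^T.
x = np.array([[1.0], [2.0]])
y = np.array([[0.5]])
w = np.random.randn(1, 2)
J = lambda w: 0.5 * float(np.sum((np.dot(w, x) - y) ** 2))
print(numerical_gradient(J, w))        # numerical gradient
print(np.dot(np.dot(w, x) - y, x.T))   # analytic gradient; the two should match closely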
3 Derivation of the BP Neural Network
Since the algorithm propagates backwards, the derivation starts from the last layer and works forwards.
For the parameters of layer $l$, the partial derivatives of the loss with respect to the weights $w_{i,j}^{(l)}$ and the biases $b_i^{(l)}$ are:
$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(w,b) = \left[ \frac{1}{m}\sum_{k=1}^{m} \frac{\partial}{\partial w_{i,j}^{(l)}} J(w,b;x^{(k)},y^{(k)}) \right] + \lambda w_{i,j}^{(l)} \tag{21}$$
$$\frac{\partial}{\partial b_i^{(l)}} J(w,b) = \left[ \frac{1}{m}\sum_{k=1}^{m} \frac{\partial}{\partial b_i^{(l)}} J(w,b;x^{(k)},y^{(k)}) \right] \tag{22}$$
The problem now reduces to computing, for each individual sample, the partial derivatives of the loss with respect to the weights $w_{i,j}^{(l)}$ and biases $b_i^{(l)}$. For any single sample,
$$J(w,b;x,y) = \frac{1}{2}\left\| h_{w,b}(x) - y \right\|^2 \tag{23}$$
Since $h_{w,b}(x)$ is a function of the preceding layers' $w$ and $b$, we start computing from the last layer and look for a pattern:
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} J(w,b) = \frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} \frac{1}{2}\left\| \mathbf{a}^{(n_l)} - \mathbf{y} \right\|^2 \tag{24}$$
where, for this network, $\mathbf{a}^{(n_l)} = [a_1^{(n_l)}; a_2^{(n_l)}]$ and $\mathbf{y} = [y_1; y_2]$.
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} J(w,b) = \frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} \frac{1}{2}\sum_{k=1}^{S_{n_l}} \left( a_k^{(n_l)} - y_k \right)^2 \tag{25}$$
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} J(w,b) = \frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} \frac{1}{2}\sum_{k=1}^{S_{n_l}} \left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 \tag{26}$$
By equation (27),
$$z_k^{(n_l)} = \left( \sum_{p=1}^{S_{n_l - 1}} w_{k,p}^{(n_l - 1)} a_p^{(n_l - 1)} \right) + b_k^{(n_l - 1)} \tag{27}$$
the weight $w_{i,j}^{(n_l - 1)}$ affects only $z_i^{(n_l)}$, so only the $k = i$ term of the sum contributes; the chain rule then gives
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} J(w,b) = \frac{\partial}{\partial z_i^{(n_l)}}\left[ \frac{1}{2}\sum_{k=1}^{S_{n_l}} \left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 \right] \cdot \frac{\partial z_i^{(n_l)}}{\partial w_{i,j}^{(n_l - 1)}} \tag{28}$$
Carrying out the differentiation gives
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} J(w,b) = \left[ f\left( z_i^{(n_l)} \right) - y_i \right] \cdot f'\left( z_i^{(n_l)} \right) \cdot \frac{\partial z_i^{(n_l)}}{\partial w_{i,j}^{(n_l - 1)}} \tag{29}$$
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 1)}} J(w,b) = \left[ f\left( z_i^{(n_l)} \right) - y_i \right] \cdot f'\left( z_i^{(n_l)} \right) \cdot a_j^{(n_l - 1)} \tag{30}$$
Now define
$$\delta_i^{(n_l)} = \left[ f\left( z_i^{(n_l)} \right) - y_i \right] \cdot f'\left( z_i^{(n_l)} \right) \tag{31}$$
For the last layer, once the forward pass is done both $y_i$ and $z_i^{(n_l)}$ are known, so $\delta_i^{(n_l)}$ is uniquely determined; this quantity is called the error (residual) of output unit $i$.
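For example, if $f$ is the sigmoid function (the activation used in the code of Section 4 below), then $f'(z) = f(z)\left( 1 - f(z) \right)$, so the output-layer error can be computed purely from forward-pass quantities:
$$\delta_i^{(n_l)} = \left( a_i^{(n_l)} - y_i \right) \cdot a_i^{(n_l)} \left( 1 - a_i^{(n_l)} \right)$$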
Next, we derive the second-to-last layer:
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 2)}} J(w,b) = \frac{\partial}{\partial w_{i,j}^{(n_l - 2)}} \frac{1}{2}\sum_{k=1}^{S_{n_l}} \left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 \tag{32}$$
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 2)}} J(w,b) = \frac{\partial}{\partial z_i^{(n_l - 1)}}\left[ \frac{1}{2}\sum_{k=1}^{S_{n_l}} \left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 \right] \cdot \frac{\partial z_i^{(n_l - 1)}}{\partial w_{i,j}^{(n_l - 2)}} \tag{33}$$
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 2)}} J(w,b) = \sum_{k=1}^{S_{n_l}} \frac{1}{2}\frac{\partial}{\partial z_i^{(n_l - 1)}} \left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 \cdot a_j^{(n_l - 2)} \tag{34}$$
(Unlike the output layer, every $z_k^{(n_l)}$ depends on $z_i^{(n_l - 1)}$, so the whole sum over $k$ remains.)
Now we only need to evaluate $\frac{1}{2}\frac{\partial}{\partial z_i^{(n_l - 1)}}\left( f\left( z_k^{(n_l)} \right) - y_k \right)^2$. Applying the chain rule again:
$$\frac{1}{2}\frac{\partial}{\partial z_i^{(n_l - 1)}}\left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 = \frac{1}{2}\frac{\partial}{\partial f\left( z_k^{(n_l)} \right)}\left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_i^{(n_l - 1)}} \tag{35}$$
$$\frac{1}{2}\frac{\partial}{\partial z_i^{(n_l - 1)}}\left( f\left( z_k^{(n_l)} \right) - y_k \right)^2 = \left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_i^{(n_l - 1)}} \tag{36}$$
Here $k$ indexes the neurons of the current layer, and the $k$-th neuron's output $f\left( z_k^{(n_l)} \right)$ is affected by every unit $i$ of the previous layer, so
$$\left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_i^{(n_l - 1)}} = \left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_k^{(n_l)}} \cdot \frac{\partial z_k^{(n_l)}}{\partial z_i^{(n_l - 1)}} \tag{37}$$
$$\left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_i^{(n_l - 1)}} = \left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot f'\left( z_k^{(n_l)} \right) \cdot \frac{\partial z_k^{(n_l)}}{\partial z_i^{(n_l - 1)}} \tag{38}$$
$$\left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_i^{(n_l - 1)}} = \delta_k^{(n_l)} \cdot \frac{\partial}{\partial z_i^{(n_l - 1)}}\left[ \left( \sum_{j=1}^{S_{n_l - 1}} w_{k,j}^{(n_l - 1)} f\left( z_j^{(n_l - 1)} \right) \right) + b_k^{(n_l - 1)} \right] \tag{39}$$
$$\left( f\left( z_k^{(n_l)} \right) - y_k \right) \cdot \frac{\partial f\left( z_k^{(n_l)} \right)}{\partial z_i^{(n_l - 1)}} = \delta_k^{(n_l)} \cdot w_{k,i}^{(n_l - 1)} \cdot f'\left( z_i^{(n_l - 1)} \right) \tag{40}$$
Note: in the weight $w_{k,i}^{(n_l - 1)}$ of equation (40), $k$ indexes the $k$-th neuron of layer $n_l$ and $i$ the $i$-th neuron of layer $n_l - 1$.
Substituting equation (40) into the sum over $k$ in equation (34) gives
$$\frac{\partial}{\partial w_{i,j}^{(n_l - 2)}} J(w,b) = \left[ \sum_{k=1}^{S_{n_l}} \left( \delta_k^{(n_l)} \cdot w_{k,i}^{(n_l - 1)} \cdot f'\left( z_i^{(n_l - 1)} \right) \right) \right] \cdot a_j^{(n_l - 2)} \tag{41}$$
Generalizing further (the hidden layers all follow the same pattern, so this can be stated by induction):
$$\delta_i^{(l)} = \left[ \sum_{k=1}^{S_{l+1}} \left( \delta_k^{(l+1)} \cdot w_{k,i}^{(l)} \right) \right] f'\left( z_i^{(l)} \right) \tag{42}$$
Hence, for the hidden layers,
$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(w,b;x,y) = a_j^{(l)} \delta_i^{(l+1)} \tag{43}$$
$$\frac{\partial}{\partial b_i^{(l)}} J(w,b;x,y) = \delta_i^{(l+1)} \tag{44}$$
The final gradient-descent update equations are:
$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \cdot a_j^{(l)} \delta_i^{(l+1)} \tag{45}$$
$$b_i^{(l)} = b_i^{(l)} - \alpha \cdot \delta_i^{(l+1)} \tag{46}$$
The derivation above traces the BP computation step by step and may feel tedious. Putting all of it together, the BP algorithm proceeds as follows:
(1) Run a forward pass through the network to obtain the activation of every layer;
(2) For the last, output layer ($n_l$), compute the error $\delta_i^{(n_l)}$:
$$\delta_i^{(n_l)} = -\left( y_i - a_i^{(n_l)} \right) \cdot f'\left( z_i^{(n_l)} \right) \tag{47}$$
Note: the last (output) layer differs from the hidden layers, which is why it is written out separately.
(3) For the hidden layers $l = n_l - 1, n_l - 2, \ldots, 2$, compute the error $\delta_i^{(l)}$:
$$\delta_i^{(l)} = \left[ \sum_{k=1}^{S_{l+1}} \left( \delta_k^{(l+1)} \cdot w_{k,i}^{(l)} \right) \right] f'\left( z_i^{(l)} \right) \tag{48}$$
(4) Update the weights $w_{i,j}^{(l)}$ and biases $b_i^{(l)}$:
$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \cdot a_j^{(l)} \delta_i^{(l+1)} \tag{49}$$
$$b_i^{(l)} = b_i^{(l)} - \alpha \cdot \delta_i^{(l+1)} \tag{50}$$
If regularization is taken into account, the weight update becomes
$$w_{i,j}^{(l)} = w_{i,j}^{(l)}(1 - \alpha\lambda) - \alpha \cdot a_j^{(l)} \delta_i^{(l+1)} \tag{51}$$
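In code, equation (51) amounts to a weight-decay factor applied alongside the ordinary gradient step. A minimal sketch, assuming per-layer parameter lists W, b and gradients dW, db (all names and values here are illustrative):

import numpy as np

alpha, lam = 0.01, 0.01  # assumed learning rate and regularization strength
W = [np.random.randn(4, 3), np.random.randn(2, 4)]
b = [np.random.randn(4, 1), np.random.randn(2, 1)]
dW = [np.zeros_like(w) for w in W]  # gradients a_j^{(l)} * delta_i^{(l+1)} from BP; zeros as placeholders
db = [np.zeros_like(v) for v in b]

for l in range(len(W)):
    W[l] = W[l] * (1 - alpha * lam) - alpha * dW[l]  # equation (51): weight decay plus gradient step
    b[l] = b[l] - alpha * db[l]                      # equation (50): biases are not regularized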
4 Experiments
Experiment 1: Implementing a simple BP neural network
Code flow:
Input: the input vector fed to the input layer
Feed forward
Output error: compute the error at the output layer
Back-propagate the error: compute the hidden-layer errors
Output: the partial derivatives of the loss function
Code example:
import numpy as np
import pprint

pp = pprint.PrettyPrinter(indent=4)

# Define the network architecture [input, hidden, output]
network_sizes = [3, 4, 2]

# Initialize the network parameters
sizes = network_sizes
num_layers = len(sizes)
biases = [np.random.randn(h, 1) for h in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

def loss_der(network_y, real_y):
    """
    Return the derivative of the loss function; the loss is the MSE
    L = 1/2 (network_y - real_y)^2
    delta_L = network_y - real_y
    """
    return (network_y - real_y)

def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_der(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(x, y):
    """
    Return a tuple "(nabla_w, nabla_b)" representing the gradient of the
    cost function C_x, as layer-by-layer lists of numpy arrays with the
    same shapes as "weights" and "biases".
    """
    # Initialize the gradients of the weights w and biases b
    delta_w = [np.zeros(w.shape) for w in weights]
    delta_b = [np.zeros(b.shape) for b in biases]

    # Feed forward
    activation = x   # the input data acts as the first activation
    activations = [x]  # store the activations of every layer
    zs = []            # store the weighted inputs (z = wx + b)
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        activation = sigmoid(z)
        activations.append(activation)
        zs.append(z)

    # Back propagation
    # BP1: compute the output-layer error
    delta_L = loss_der(activations[-1], y) * sigmoid_der(zs[-1])
    # BP3: partial derivative of the loss w.r.t. the output-layer biases
    delta_b[-1] = delta_L
    # BP4: partial derivative of the loss w.r.t. the output-layer weights
    delta_w[-1] = np.dot(delta_L, activations[-2].transpose())

    delta_l = delta_L
    for l in range(2, num_layers):
        # BP2: compute the error at layer l
        z = zs[-l]
        sp = sigmoid_der(z)
        delta_l = np.dot(weights[-l + 1].transpose(), delta_l) * sp
        # BP3: partial derivative of the loss w.r.t. the biases of layer l
        delta_b[-l] = delta_l
        # BP4: partial derivative of the loss w.r.t. the weights of layer l
        delta_w[-l] = np.dot(delta_l, activations[-l - 1].transpose())

    return (delta_w, delta_b)
##### Generate data and run the training step
# Input: the input vector fed to the input layer
# Feed forward
# Output error: compute the error at the output layer
# Back-propagate the error: compute the hidden-layer errors
# Output: the partial derivatives of the loss function
training_x = np.random.rand(3).reshape(3, 1)
training_y = np.array([0, 1]).reshape(2, 1)
print("training data x:\n{},\n training data y:\n{}".format(training_x, training_y))
delta_w, delta_b = backprop(training_x, training_y)
print("delta_w:\n{},\n delta_b:\n{}".format(delta_w, delta_b))
Running results:
Experiment 2: Medical-data diagnosis
Typically, after a patient has blood drawn and undergoes tests such as cell-abnormality screening at a hospital, the laboratory issues a medical diagnostic report like the table below, listing parameters such as white blood cells, streptococcus, and platelets. By examining the values of these parameters and how they change, a doctor can predict and judge whether the patient carries a certain pathogen.
Below we use similar medical data as the setting: the medical test results are first given a mathematical interpretation; a 3-layer artificial neural network (ANN) model is then built and trained on previously collected medical data to obtain a classification model for that data; finally, new medical data is fed in to predict its pathological class.
Note: in practice, cross-entropy is used as the loss function here.
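For reference, with Softmax output probabilities $\hat{y}^{(i)}$ for sample $i$ and $c_i$ its correct class, the averaged cross-entropy loss used by the code below is
$$J = -\frac{1}{m}\sum_{i=1}^{m} \log \hat{y}_{c_i}^{(i)}$$
which is exactly what the calculate_loss function computes, plus the regularization term.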
(1) Creating the data
Code example:
from sklearn import linear_model
from sklearn import datasets
import sklearn
import numpy as np
import matplotlib.pyplot as plt

# Create the data
def generate_data():
    np.random.seed(0)
    X, y = datasets.make_moons(200, noise=0.20)  # 200 data points with noise 0.20
    return X, y

# Read the data and display it
data, labels = generate_data()  # read the data
plt.scatter(data[:, 0], data[:, 1], s=50, c=labels, cmap=plt.cm.Spectral, edgecolors="#313131")
plt.title("Medical data")
plt.show()
Running results:
(2) Building the network model
We use a 3-layer neural network with 3 neurons in the hidden layer; the basic model is shown in the figure below.
The hidden layer needs an activation function, which converts one layer's output into the next layer's input; nonlinear activation functions let the network handle nonlinear problems. Common nonlinear activation functions include the Tanh, Sigmoid, and ReLU functions. We choose Tanh here (you can also swap Tanh for the other functions and compare the outputs), and finally a Softmax layer converts the activations into probabilities.
Note: in neural networks, Softmax usually acts on the output layer, converting the network's output vector into a probability distribution.
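Concretely, Softmax maps a score vector $z$ to probabilities via $\operatorname{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. A standalone sketch (the function name softmax is illustrative; the code below performs the same computation inline as exp_scores / np.sum(exp_scores, ...)):

import numpy as np

def softmax(z):
    """Row-wise softmax: exponentiate the scores, then normalize each row to sum to 1."""
    exp_scores = np.exp(z)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

print(softmax(np.array([[1.0, 2.0]])))  # approximately [[0.269, 0.731]]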
The model structure with the Softmax layer added is shown in the figure below.
The code for building and testing the network follows:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from sklearn import linear_model
from sklearn import datasets
import sklearn
import numpy as np
import matplotlib.pyplot as plt

class Config:
    input_dim = 2      # dimensionality of the input
    output_dim = 2     # number of output classes
    epsilon = 0.01     # gradient-descent learning rate
    reg_lambda = 0.01  # regularization strength

def generate_data():
    np.random.seed(0)
    X, y = datasets.make_moons(200, noise=0.20)  # 200 data points with noise 0.20
    return X, y

def display_model(model):
    print("W1 {}: \n{}\n".format(model['W1'].shape, model['W1']))
    print("b1 {}: \n{}\n".format(model['b1'].shape, model['b1']))
    print("W2 {}: \n{}\n".format(model['W2'].shape, model['W2']))
    print("b2 {}: \n{}\n".format(model['b2'].shape, model['b2']))

def plot_decision_boundary(pred_func, data, labels):
    '''Plot the classification decision boundary.'''
    # Set the min and max values and add 0.5 of padding
    x_min, x_max = data[:, 0].min() - 0.5, data[:, 0].max() + 0.5
    y_min, y_max = data[:, 1].min() - 0.5, data[:, 1].max() + 0.5
    h = 0.01
    # Generate a grid of points with spacing h
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict the function value over the whole grid
    z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    z = z.reshape(xx.shape)
    # Plot the contour and the training samples
    plt.contourf(xx, yy, z, cmap=plt.cm.Spectral, alpha=0.2)  # transparency alpha=0.2
    plt.scatter(data[:, 0], data[:, 1], s=40, c=labels, cmap=plt.cm.Spectral)

def calculate_loss(model, X, y):
    '''Loss function.'''
    num_examples = len(X)  # size of the training set
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to compute the predictions
    z1 = X.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Compute the cross-entropy loss
    corect_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(corect_logprobs)
    # Add the regularization term to the loss (optional)
    data_loss += Config.reg_lambda / 2 * \
        (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1. / num_examples * data_loss

def predict(model, x):
    '''Prediction function.'''
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)

def ANN_model(X, y, nn_hdim, num_passes=20000, print_loss=False):
    '''
    Train the network and return the model.
    - nn_hdim: number of neurons in the hidden layer
    - num_passes: number of gradient-descent iterations
    - print_loss: whether to print the loss during training
    '''
    num_examples = len(X)  # size of the training set
    model = {}             # dictionary holding the model parameters
    # Randomly initialize the parameters
    np.random.seed(0)
    W1 = np.random.randn(Config.input_dim, nn_hdim) / np.sqrt(Config.input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, Config.output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, Config.output_dim))
    # display_model({'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2})
    # Batch gradient descent
    for i in range(0, num_passes + 1):
        # Forward propagation
        z1 = X.dot(W1) + b1   # M_200x2 . M_2x3 --> M_200x3
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2  # M_200x3 . M_3x2 --> M_200x2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Backward propagation
        delta3 = probs                       # the predicted probabilities
        delta3[range(num_examples), y] -= 1  # prediction minus ground truth
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW2 = (a1.T).dot(delta3)                     # gradient of W2
        db2 = np.sum(delta3, axis=0, keepdims=True)  # gradient of b2
        dW1 = np.dot(X.T, delta2)                    # gradient of W1
        db1 = np.sum(delta2, axis=0)                 # gradient of b1
        # Add the regularization terms
        dW1 += Config.reg_lambda * W1
        dW2 += Config.reg_lambda * W2
        # Update the parameters with gradient descent
        W1 += -Config.epsilon * dW1
        b1 += -Config.epsilon * db1
        W2 += -Config.epsilon * dW2
        b2 += -Config.epsilon * db2
        # Store the new parameters in the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" %
                  (i, calculate_loss(model, X, y)))
    return model

### Create the data and train the network
data, labels = generate_data()
model = ANN_model(data, labels, 3, print_loss=True)  # hidden layer with three neurons
# print(display_model(model))
plot_decision_boundary(lambda x: predict(model, x), data, labels)
plt.title("Hidden Layer size 3")
plt.show()
Running results:
Now set different numbers of hidden-layer neurons and see how the model is affected.
The hidden-layer neuron count is set to 1, 2, 3, 4, 30, and 100 in turn; the resulting classifications are shown in the figure below.
The figures show that as the number of hidden neurons increases, overfitting becomes more likely, for example with 100 hidden neurons. So more hidden neurons is not always better; a sketch for reproducing the comparison follows.
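A sketch of how this comparison can be reproduced with the functions defined above (the subplot layout and figure size are illustrative choices):

hidden_sizes = [1, 2, 3, 4, 30, 100]
plt.figure(figsize=(12, 8))
for idx, nn_hdim in enumerate(hidden_sizes):
    plt.subplot(2, 3, idx + 1)
    plt.title("Hidden Layer size %d" % nn_hdim)
    model = ANN_model(data, labels, nn_hdim)  # retrain with this hidden-layer size
    plot_decision_boundary(lambda x: predict(model, x), data, labels)
plt.show()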
5 Summary
The BP neural network works from back to front; the core of the BP algorithm comes down to remembering the following four steps and expressions:
(1) Run a forward pass through the network to obtain the activation of every layer;
(2) For the last, output layer ($n_l$), compute the error $\delta_i^{(n_l)}$:
$$\delta_i^{(n_l)} = -\left( y_i - a_i^{(n_l)} \right) \cdot f'\left( z_i^{(n_l)} \right)$$
(3) For the hidden layers $l = n_l - 1, n_l - 2, \ldots, 2$, compute the error $\delta_i^{(l)}$:
$$\delta_i^{(l)} = \left[ \sum_{k=1}^{S_{l+1}} \left( \delta_k^{(l+1)} \cdot w_{k,i}^{(l)} \right) \right] f'\left( z_i^{(l)} \right)$$
(4) Update the weights $w_{i,j}^{(l)}$ and biases $b_i^{(l)}$:
$$w_{i,j}^{(l)} = w_{i,j}^{(l)} - \alpha \cdot a_j^{(l)} \delta_i^{(l+1)}$$
$$b_i^{(l)} = b_i^{(l)} - \alpha \cdot \delta_i^{(l+1)}$$