Goal of this series:
Build a neural-network handwritten-digit recognizer from scratch, and strengthen the understanding of the basics as the series is upgraded step by step.
1. Dataset
For convenience, we use TensorFlow's built-in dataset; of course, you can also download the MNIST dataset yourself.
import tensorflow as tf
import matplotlib.pyplot as plt
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
plt.imshow(x_train[0])
print('The training set has {} samples, each of size {} pixels'.format(len(x_train), x_train[0].shape))
Result:
The training set has 60000 samples, each of size (28, 28) pixels
2. Neural Network Forward Propagation
A simple neural network:
This network has three layers. The first layer can be viewed as the three attributes of one sample, denoted $a_1^1, a_2^1, a_3^1$, where the superscript indicates the layer and the subscript indicates which neuron within the layer; $a$ denotes the result of the neuron, i.e. its activation value.
The first neuron of the second layer is then

$$a_1^2=\sigma(a_1^1w_{11}^2+a_2^1w_{21}^2+a_3^1w_{31}^2+b_1^2)$$

Here $w_{ij}^2$ denotes the weight connecting the $i$-th neuron of layer 1 to the $j$-th neuron of layer 2, $b_1^2$ denotes the bias used to compute the first neuron of layer 2, and

$$\sigma(x) = \frac{1}{1+e^{-x}}\tag{1}$$

We can therefore give the general expression for $a_j^{l+1}$:

$$a_j^{l+1}=\sigma(z^{l+1}_j),\quad\text{where }z^{l+1}_j=\sum_{k}w_{kj}^{l+1}a_k^l+b_j^{l+1}$$
Written in matrix form:
$$a^{l+1}=\sigma(z^{l+1}),\quad\text{where }z^{l+1}=w^{l+1}a^l+b^{l+1}$$
Taking the figure above as an example: suppose one sample has dimension $1 \times 3$. Multiplying it by a $3 \times 5$ weight matrix turns it into a $1 \times 5$ vector, and a $5 \times 1$ matrix then turns that into a $1 \times 1$ output.
Now suppose we have $n$ samples, so the input is $n \times 3$. After the first weight matrix we obtain the hidden layer for all $n$ samples, i.e. an $n \times 5$ matrix, and the second weight matrix finally gives an $n \times 1$ result matrix.
In forward propagation the input is not fixed, but the dimensions of the weight matrices are. It is not hard to see that these dimensions are tied to the numbers of neurons:
The number of rows of the first weight matrix equals the number of neurons in the first layer, and its number of columns equals the number of neurons in the second layer.
The number of rows of the second weight matrix equals the number of neurons in the second layer, and its number of columns equals the number of neurons in the third layer.
…
The number of rows of the (n-1)-th weight matrix equals the number of neurons in the (n-1)-th layer, and its number of columns equals the number of neurons in the n-th layer.
Code:
import numpy as np

class NetWork():
    def __init__(self, layers):
        self.layers = layers
        # weight matrix between layer l and layer l+1 has shape (layers[l], layers[l+1])
        self.weights = [np.random.randn(x, y) for x, y in zip(layers[:-1], layers[1:])]
        self.bias = [np.random.randn(1, y) for y in layers[1:]]

    def __call__(self, x):
        # forward pass through each layer (no activation function yet)
        for w, b in zip(self.weights, self.bias):
            x = np.dot(x, w) + b
        return x
Now let's feed 10 samples through the network from the figure above:
x = np.random.randn(10,3)
model = NetWork([3,5,1])
model(x).shape
As you can see, the output shape is (10, 1).
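As a quick sanity check on the dimension rule above, a short sketch that reuses the `model` just created prints the shapes of its randomly initialized weights and biases:

```python
# each weight matrix has shape (neurons in layer l, neurons in layer l+1)
print([w.shape for w in model.weights])   # [(3, 5), (5, 1)]
print([b.shape for b in model.bias])      # [(1, 5), (1, 1)]
```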
3. Neural Network Backpropagation
3.1 The Negative Gradient Direction
The main purpose of backpropagation is to obtain the gradients needed to update these weights with gradient descent. Below we explain why the negative gradient direction can be used.
Suppose the objective function is $C(v)$, the weight parameters are $v=(v_1,v_2)$, and the update is $\Delta v$. We want $\Delta C = C(v+\Delta v)-C(v)<0$. By a first-order Taylor expansion,

$$\Delta C \approx C'(v_{10})\Delta v_1+ C'(v_{20})\Delta v_2=\nabla C(v_0)\Delta v$$
The subscript $0$ here indicates a fixed value (the current point), not a variable. Clearly, when $\Delta v =-\nabla C(v_0)$, i.e. when the update moves along the negative gradient direction, we get $\Delta C\approx-\|\nabla C(v_0)\|^2<0$, which guarantees that $C$ decreases. We then update the value of $v$:

$$v=v+\eta\Delta v=v-\eta\nabla C(v_0) \tag{2}$$

Here $\eta$ is the learning rate. So during backpropagation we need to know the value of the gradient.
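To make (2) concrete, here is a minimal toy sketch that runs a few gradient-descent steps on $C(v)=v_1^2+v_2^2$ (a made-up objective chosen only for illustration) and verifies that every step decreases $C$:

```python
import numpy as np

def C(v):
    # toy objective: C(v) = v1^2 + v2^2, minimum at (0, 0)
    return np.sum(v ** 2)

def grad_C(v):
    # analytic gradient of the toy objective
    return 2 * v

v = np.array([3.0, -2.0])
eta = 0.1
for step in range(5):
    delta_v = -grad_C(v)                 # negative gradient direction
    new_v = v + eta * delta_v            # update rule (2)
    print(step, C(v), '->', C(new_v))    # C decreases at every step
    v = new_v
```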
3.2 Computing the Derivatives
The derivatives are computed in reverse: the derivative of the last layer is propagated backwards, layer by layer. The process can be seen below; the simplest derivatives are those with respect to the variables of the last layer.
3.2.1 Scalar Form
Suppose the network has $L$ layers. The output of the last layer is

$$a^L_j=\sigma(z^{L}_j),\quad z^{L}_j=\sum_k w_{kj}^L a_k^{L-1}+b_j^L\tag{3}$$
We use the plainest possible objective function

$$\min\ C=\frac{1}{2n}\sum_{j}(y^{true}_j-a_j^L)^2\tag{4}$$
Its vector form is

$$\min\ C=\frac{1}{2n}\|y^{true}-a^L\|^2\tag{5}$$
In 3.1,

$$\nabla C=\left(\frac{\partial C}{\partial w_{11}^L},\frac{\partial C}{\partial w_{12}^L},...,\frac{\partial C}{\partial w_{11}^{L-1}},...\right)\tag{6}$$
For convenience, we denote the derivative of the objective with respect to the $j$-th neuron of layer $L$ by

$$\delta^L_{j}=\frac{\partial C}{\partial z_j^L}=\sum_k\frac{\partial C}{\partial a_k^L}\frac{\partial a_k^L}{\partial z_j^L}=\frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)\tag{7}$$
For the last layer, $C=C(a_1^L,a_2^L,...,a_n^L)$ and $a^L_j=\sigma(z^{L}_j)$, so ${\partial a_k^L}/{\partial z_j^L}$ is nonzero only when $k=j$. According to equation (3), the derivative with respect to a last-layer weight $w_{ij}^L$ is:
$$\begin{aligned} \frac{\partial C}{\partial w_{ij}^L}&=\frac{\partial C}{\partial z^L_j}\frac{\partial z^L_j}{\partial w_{ij}^L} \\&=\delta_{j}^L a_i^{L-1} \end{aligned}\tag{8}$$

That is, we need to keep the previous layer's output in order to compute the derivatives with respect to the last layer's $w$.
Similarly, from (3) we get

$$\begin{aligned} \frac{\partial C}{\partial b_{j}^L}&=\frac{\partial C}{\partial z^L_j}\frac{\partial z^L_j}{\partial b_{j}^L} \\&=\delta_{j}^L \end{aligned}\tag{9}$$
Now we only need to handle the derivatives for the second-to-last layer. Since every neuron in layer $L$ is connected to every neuron in layer $L-1$, for the weight feeding any neuron of the second-to-last layer we have
$$\begin{aligned} \frac{\partial C}{\partial w_{ij}^{L-1}}&=\frac{\partial C}{\partial z^L_1}\frac{\partial z^L_1}{\partial a_{j}^{L-1}}\frac{\partial a^{L-1}_j}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial w_{ij}^{L-1}}+\frac{\partial C}{\partial z^L_2}\frac{\partial z^L_2}{\partial a_{j}^{L-1}}\frac{\partial a^{L-1}_j}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial w_{ij}^{L-1}}+...+\frac{\partial C}{\partial z^L_k}\frac{\partial z^L_k}{\partial a_{j}^{L-1}}\frac{\partial a^{L-1}_j}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial w_{ij}^{L-1}} \\&=\sum_k\delta_{k}^L w_{jk}^{L}\,\sigma'(z^{L-1}_j)\,a^{L-2}_i \end{aligned}\tag{10}$$
Here

$$z_j^L=w_{1j}^La_1^{L-1}+w_{2j}^La_2^{L-1}+...+w_{j-1,j}^La_{j-1}^{L-1}+w_{jj}^La_{j}^{L-1}+w_{j+1,j}^La_{j+1}^{L-1}+...+w_{kj}^La_{k}^{L-1}+b_j^L$$

$$a_j^{L-1}=\sigma(z_j^{L-1})$$

$$z_j^{L-1}=w_{1j}^{L-1}a_1^{L-2}+w_{2j}^{L-1}a_2^{L-2}+...+w_{i-1,j}^{L-1}a_{i-1}^{L-2}+w_{ij}^{L-1}a_{i}^{L-2}+w_{i+1,j}^{L-1}a_{i+1}^{L-2}+...+w_{kj}^{L-1}a_{k}^{L-2}+b_j^{L-1}$$
The value of $k$ above is determined by the number of neurons in the previous layer. We can also rewrite (10) as
$$\begin{aligned} \frac{\partial C}{\partial w_{ij}^{L-1}}&=\frac{\partial C}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial w_{ij}^{L-1}}\\&=\delta^{L-1}_j a_i^{L-2} \end{aligned}$$

This gives us a recursion for the derivative at every layer:
$$\delta^{l-1}_j=\sum_k\delta^{l}_k w_{jk}^l\,\sigma'(z_j^{l-1})\tag{11}$$
as well as the derivative with respect to $b_j^{L-1}$:
$$\begin{aligned} \frac{\partial C}{\partial b_{j}^{L-1}}&=\frac{\partial C}{\partial z^L_1}\frac{\partial z^L_1}{\partial a_{j}^{L-1}}\frac{\partial a^{L-1}_j}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial b_{j}^{L-1}}+\frac{\partial C}{\partial z^L_2}\frac{\partial z^L_2}{\partial a_{j}^{L-1}}\frac{\partial a^{L-1}_j}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial b_{j}^{L-1}}+...+\frac{\partial C}{\partial z^L_k}\frac{\partial z^L_k}{\partial a_{j}^{L-1}}\frac{\partial a^{L-1}_j}{\partial z_j^{L-1}}\frac{\partial z_j^{L-1}}{\partial b_{j}^{L-1}} \\&=\sum_k\delta_{k}^L w_{jk}^{L}\,\sigma'(z^{L-1}_j) \\&=\delta_{j}^{L-1} \end{aligned}\tag{12}$$

Looking at (8)-(12), we can see that when computing the derivatives the key quantity to solve for is $\delta$, together with the stored output $a$ of each layer.
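Before moving on to the vector form, equations (7) and (8) can be sanity-checked numerically. The sketch below uses a made-up last layer with 3 inputs and 2 neurons and compares the analytic gradient $\partial C/\partial w_{ij}^L=\delta_j^L a_i^{L-1}$ against a central finite-difference estimate:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

np.random.seed(0)
a_prev = np.random.randn(1, 3)       # a^{L-1}: activations of the previous layer
w = np.random.randn(3, 2)            # w^L
b = np.random.randn(1, 2)            # b^L
y_true = np.array([[0.0, 1.0]])

def cost(w):
    a = sigmoid(np.dot(a_prev, w) + b)        # forward pass, eq. (3)
    return 0.5 * np.sum((y_true - a) ** 2)    # eq. (4) for a single sample

# analytic gradient: delta^L = (a^L - y) * sigma'(z^L), dC/dw_ij = delta_j * a_i  (eqs. (7), (8))
z = np.dot(a_prev, w) + b
delta = (sigmoid(z) - y_true) * sigmoid_prime(z)
analytic = np.dot(a_prev.T, delta)

# numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, j] += eps
        w_minus[i, j] -= eps
        numeric[i, j] = (cost(w_plus) - cost(w_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # should be on the order of 1e-9 or smaller
```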
3.2.2 Vector Form
Replacing the scalar computations with vector computations speeds things up considerably. Next we write the scalar equations of 3.2.1 in vector form.
The vector form of the objective function was already given in (5). For the derivative defined in (7), assuming layer $L$ has $n$ neurons, we can write
$$\begin{aligned} \delta^L&=(\delta_1^L,\delta_2^L,...,\delta_{n-1}^L,\delta_n^L)\\ &=\left(\frac{\partial C}{\partial a_1^L}\sigma'(z_1^L),\frac{\partial C}{\partial a_2^L}\sigma'(z_2^L),...,\frac{\partial C}{\partial a_{n-1}^L}\sigma'(z_{n-1}^L),\frac{\partial C}{\partial a_n^L}\sigma'(z_n^L)\right)\\ &=\nabla_{a^L}C\odot\sigma'(z^L) \end{aligned}\tag{12}$$

Here $\odot$ denotes element-wise multiplication, and $\nabla_{a^L}C$ denotes the gradient with respect to the last layer's activations.
Next we write down the vector form of the derivatives with respect to the variables $w$ and $b$.
From equation (8), assuming this layer has $n$ neurons and the previous layer has $m$ neurons,
$$\begin{aligned} \frac{\partial C}{\partial w^L}&=\begin{pmatrix}\frac{\partial C}{\partial w_{11}^L} & \frac{\partial C}{\partial w_{12}^L} & \cdots & \frac{\partial C}{\partial w_{1n}^L}\\ \frac{\partial C}{\partial w_{21}^L} & \frac{\partial C}{\partial w_{22}^L} & \cdots & \frac{\partial C}{\partial w_{2n}^L}\\ \vdots & & & \vdots \\ \frac{\partial C}{\partial w_{m1}^L} & \frac{\partial C}{\partial w_{m2}^L} & \cdots & \frac{\partial C}{\partial w_{mn}^L}\end{pmatrix}\\ &=(a^{L-1}_1,a^{L-1}_2,...,a^{L-1}_m)^T(\delta_1^L,\delta_2^L,...,\delta_{n-1}^L,\delta_n^L)\\ &=(a^{L-1})^T\delta^L \end{aligned}\tag{13}$$
From equation (9), we have
$$\begin{aligned} \frac{\partial C}{\partial b^L}&=\left(\frac{\partial C}{\partial b_1^L},\frac{\partial C}{\partial b_2^L},...,\frac{\partial C}{\partial b^L_{n-1}},\frac{\partial C}{\partial b_n^L}\right)\\ &=(\delta_1^L,\delta_2^L,...,\delta_{n-1}^L,\delta_n^L)\\ &=\delta^L \end{aligned}\tag{14}$$
For the vector form of equation (11), we can look at it term by term. First, for any $l=1,...,L-1$,

$$\delta^l=(\delta_1^l,\delta_2^l,...,\delta_{n-1}^l,\delta_n^l)$$
Look at the first component:

$$\begin{aligned}\delta_1^l&=(\delta_1^{l+1}w_{11}^{l+1}+\delta_2^{l+1}w_{12}^{l+1}+...+\delta_k^{l+1}w_{1k}^{l+1})\sigma'(z_1^l) \\&=\begin{pmatrix} \delta_1^{l+1},\delta_2^{l+1},..., \delta_k^{l+1}\end{pmatrix}\begin{pmatrix} w_{11}^{l+1}\\w_{12}^{l+1}\\\vdots\\w_{1k}^{l+1} \end{pmatrix} \sigma'(z_1^l) \end{aligned}$$
Look at the second component:

$$\begin{aligned}\delta_2^l&=(\delta_1^{l+1}w_{21}^{l+1}+\delta_2^{l+1}w_{22}^{l+1}+...+\delta_k^{l+1}w_{2k}^{l+1})\sigma'(z_2^l) \\&=\begin{pmatrix} \delta_1^{l+1},\delta_2^{l+1},..., \delta_k^{l+1}\end{pmatrix}\begin{pmatrix} w_{21}^{l+1}\\w_{22}^{l+1}\\\vdots\\w_{2k}^{l+1} \end{pmatrix} \sigma'(z_2^l) \end{aligned}$$
…
Since the $w_{ij}$ we defined earlier represents the weight from the $i$-th neuron of the previous layer to the $j$-th neuron of the current layer, the column vectors appearing above are actually rows of $w^{l+1}$ (their length equals the number of neurons in layer $l+1$), so we need a transpose. $\delta^l$ can then be written compactly as:

$$\delta^{l}=\delta^{l+1}(w^{l+1})^T\odot\sigma'(z^l)\tag{15}$$
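As a sanity check on the shapes in (12)-(15), the sketch below pushes one made-up sample through a 3-5-1 network with NumPy, using the same row-vector convention as above, and prints the shapes of the resulting gradients:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

np.random.seed(1)
layers = [3, 5, 1]
weights = [np.random.randn(m, n) for m, n in zip(layers[:-1], layers[1:])]
bias = [np.random.randn(1, n) for n in layers[1:]]

# forward pass for one sample (a row vector), recording z and a for each layer
a = np.random.randn(1, 3)
activations, zs = [a], []
for w, b in zip(weights, bias):
    z = np.dot(a, w) + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

y_true = np.array([[1.0]])

# eq. (12): delta^L = grad_{a^L} C, elementwise-multiplied by sigma'(z^L)
delta_L = (activations[-1] - y_true) * sigmoid_prime(zs[-1])
# eqs. (13) and (14) for the last layer
dC_dw2 = np.dot(activations[-2].T, delta_L)   # (5, 1), same shape as weights[1]
dC_db2 = delta_L                              # (1, 1), same shape as bias[1]
# eq. (15): propagate delta back one layer
delta_prev = np.dot(delta_L, weights[-1].T) * sigmoid_prime(zs[-2])
dC_dw1 = np.dot(activations[-3].T, delta_prev)   # (3, 5), same shape as weights[0]

print(dC_dw2.shape, dC_db2.shape, delta_prev.shape, dC_dw1.shape)
# (5, 1) (1, 1) (1, 5) (3, 5)
```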
3.3 The Neural Network Propagation Procedure
- Input: this is simply the activation of the first layer
- Forward propagation: for each $l=2,3,...,L$ compute $z^l=a^{l-1}w^l+b^l$ and $a^l=\sigma(z^l)$
- Compute each layer's $\delta^l$, and from it $C_{w^l}$ and $C_{b^l}$
- Update the weights: $w^l=w^l-\eta C_{w^l}$, $b^l=b^l-\eta C_{b^l}$
4. The Simplest Handwritten-Digit Code
Result: after 10 epochs of training, the accuracy peaks at roughly 86%.
Rough procedure:
For every single image we run the procedure of 3.3 and update the weights, so one epoch performs 60,000 updates and 10 epochs perform 600,000 updates.
class NetWork():
    def __init__(self, layers):
        self.layers = layers
        self.weights = [np.random.randn(x, y) for x, y in zip(layers[:-1], layers[1:])]
        self.bias = [np.random.randn(1, y) for y in layers[1:]]
        self.eta = 0.1
        self.z_record = []
        self.delta = []
        self.activates = []
        self.cws = []
        self.cbs = []
        print('weights[0].shape={},weights[1].shape={}'.format(self.weights[0].shape, self.weights[1].shape))
        print('bias[0].shape={},bias[1].shape={}'.format(self.bias[0].shape, self.bias[1].shape))

    # forward propagation
    def __call__(self, x):
        for w, b in zip(self.weights, self.bias):
            x = self.sigmoid(np.dot(x, w) + b)
        return x

    ## sigmoid function, eq. (1)
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    ## derivative of the sigmoid function
    def sigmoid_prime(self, x):
        return self.sigmoid(x) * (1 - self.sigmoid(x))

    def train_helper(self, x_train, y_train, epochs, x_test=None, y_test=None):
        # convert the labels to one-hot encoding
        y_train = pd.get_dummies(y_train)
        # loop over epochs
        for epoch in range(epochs):
            for i in range(len(x_train)):
                self.train(i, x_train[i], y_train.loc[i].tolist())
            print("epoch:{}, acc:{}".format(epoch, self.evaluate(x_test, y_test)))
        print("training completed")

    def evaluate(self, x_test, y_test):
        # returns the number of correctly classified test samples
        pre = [np.argmax(self.__call__(i)) for i in x_test]
        return sum(pre == y_test)

    def train(self, index, x_train, y_train):
        x = x_train[np.newaxis, :]
        z_record = [x]  # start with the input, needed to update the first w and b
        activates = []
        c_ws = []
        c_bs = []
        ## forward pass, recording the intermediate values
        for w, b in zip(self.weights, self.bias):
            z = np.dot(x, w) + b
            z_record.append(z)
            x = self.sigmoid(z)
            activates.append(x)
        # delta of the last layer, corresponding to eq. (12)
        delta_record = [self.cost_derivative(y_train, x) * self.sigmoid_prime(z_record[-1])]
        if index % 1000 == 0:
            print('index:{}, MSE loss:{},'.format(index, np.mean((y_train - x) ** 2)))
        # compute (13) and (14)
        # note: eq. (13) uses the previous layer's activation a^{L-1} (activates[-2]);
        # here the pre-activation z_record[-2] is used in its place
        c_w = np.dot(z_record[-2].T, delta_record[0])
        c_b = delta_record[0]
        c_ws.append(c_w)
        c_bs.append(c_b)
        # update the last layer's weights
        self.weights[-1] = self.weights[-1] - self.eta * c_w
        self.bias[-1] = self.bias[-1] - self.eta * c_b
        # record the deltas of the remaining layers, compute their gradients, and update
        for ly in range(2, len(self.layers)):
            # delta of this layer, corresponding to eq. (15)
            # (self.weights[-ly+1] has already been updated earlier in this call)
            this_delta = np.dot(delta_record[0], self.weights[-ly+1].T) * self.sigmoid_prime(z_record[-ly])
            delta_record.insert(0, this_delta)
            # gradients with respect to this layer's w and b, then update
            c_w = np.dot(z_record[-(ly+1)].T, delta_record[0])  # (13)
            c_b = delta_record[0]  # (14)
            # update the weights
            self.weights[-ly] = self.weights[-ly] - self.eta * c_w
            self.bias[-ly] = self.bias[-ly] - self.eta * c_b
            c_ws.append(c_w)
            c_bs.append(c_b)
        self.z_record = z_record
        self.delta = delta_record
        self.activates = activates
        self.cws = c_ws
        self.cbs = c_bs

    ## derivative of the objective function with respect to the network output
    def cost_derivative(self, y_true, y_pre):
        return y_pre - y_true
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
## plt.imshow(x_train[0])
# flatten each 28x28 image into a 784-dimensional vector
x_train = x_train.reshape(x_train.shape[0], -1)
x_test = x_test.reshape(x_test.shape[0], -1)
# x_test = (np.max(x_test) - x_test) / np.max(x_test)
print('The training set has {} samples, each of size {} pixels'.format(len(x_train), x_train[0].shape))

model = NetWork([784, 13, 10])
# the raw pixel values are large, so the weighted sums can blow up and the
# sigmoid easily saturates (vanishing gradients); therefore rescale the
# pixel values into [0, 1] (this also inverts the image: (max - x) / max)
x_train = (np.max(x_train) - x_train) / np.max(x_train)
# start training
model.train_helper(x_train, y_train, 10, x_test, y_test)
Result:
weights[0].shape=(784, 13),weights[1].shape=(13, 10)
bias[0].shape=(1, 13),bias[1].shape=(1, 10)
epoch:0, acc:8078
epoch:1, acc:8266
epoch:2, acc:8611
epoch:3, acc:8034
epoch:4, acc:8413
epoch:5, acc:8069
epoch:6, acc:8573
epoch:7, acc:8682
epoch:8, acc:8258
epoch:9, acc:8283
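Since `evaluate` returns the raw count of correctly classified test images, dividing by the size of the test set turns it into a fractional accuracy, for example:

```python
acc = model.evaluate(x_test, y_test) / len(y_test)   # e.g. 8682 / 10000 = 0.8682
print('test accuracy: {:.2%}'.format(acc))
```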