Building a Neural Network from Scratch to Solve the XOR Problem
PS: This article does not call any machine-learning library such as TensorFlow or PyTorch.
- By reading this article, you may gain the following:
  - an understanding of the structure of a neural network
  - an understanding of the essence of forward and backward propagation
  - an understanding of what activation functions are for and the role they play in backpropagation
Dataset
Very simple: just four data points, in the form (x, y) label
- (0,0) 0
- (1,0) 1
- (0,1) 1
- (1,1) 0
Network Structure
Since a single-layer neural network cannot solve the two-dimensional XOR problem, we design a network with one hidden layer; a sketch of the topology is shown below, and equations (1)–(6) define it precisely.
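Reconstructed from equations (1)–(6), the topology reads as a list of weighted edges (the biases $b_1$, $b_2$, $b_3$ feed $n_1$, $n_3$, and $n_5$ respectively):

```
I1 ──w11──▶ n1 ──tanh──▶ n2 ──w3──┐
I1 ──w12──▶ n3                    ├──▶ n5 ──sigmoid──▶ n6
I2 ──w21──▶ n1                    │
I2 ──w22──▶ n3 ──tanh──▶ n4 ──w4──┘
```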
$$n_1 = w_{11}*I_1 + w_{21}*I_2 + b_1 \tag{1}$$

$$n_3 = w_{12}*I_1 + w_{22}*I_2 + b_2 \tag{2}$$

$$n_2 = \tanh(n_1) \tag{3}$$

$$n_4 = \tanh(n_3) \tag{4}$$

$$n_5 = w_3*n_2 + w_4*n_4 + b_3 \tag{5}$$

$$n_6 = \mathrm{sigmoid}(n_5) \tag{6}$$
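To make equations (1)–(6) concrete, here is a minimal forward pass written directly from them (a sketch only: the parameter values are arbitrary placeholders, not trained values):

```python
import math

# Arbitrary placeholder parameters, for illustration only
w11, w21, w12, w22 = 0.5, -0.3, 0.8, 0.1
w3, w4 = 1.2, -0.7
b1, b2, b3 = 2, 2, 4

def forward_pass(I1, I2):
    n1 = w11 * I1 + w21 * I2 + b1   # equation (1)
    n3 = w12 * I1 + w22 * I2 + b2   # equation (2)
    n2 = math.tanh(n1)              # equation (3)
    n4 = math.tanh(n3)              # equation (4)
    n5 = w3 * n2 + w4 * n4 + b3     # equation (5)
    return 1 / (1 + math.exp(-n5))  # equation (6): sigmoid

print(forward_pass(0, 1))  # an (untrained) output, somewhere in (0, 1)
```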
- Here are the derivative properties of the two activation functions (we will need them during backpropagation; a quick numerical check follows below):

$$\left\{\begin{matrix}f(x)=\mathrm{sigmoid}(x)\\f'(x)=f(x)(1-f(x))\end{matrix}\right. \tag{7}$$

$$\left\{\begin{matrix}g(x)=\tanh(x)\\g'(x)=1-g(x)^2\end{matrix}\right. \tag{8}$$
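Properties (7) and (8) are easy to verify numerically with a central finite difference; this check is an addition for illustration, not part of the original article:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

x, h = 0.7, 1e-6
# Property (7): sigmoid'(x) == sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numeric - sigmoid(x) * (1 - sigmoid(x))))   # close to 0

# Property (8): tanh'(x) == 1 - tanh(x)^2
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)
print(abs(numeric - (1 - math.tanh(x) ** 2)))         # close to 0
```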
Forward and Backward Propagation
- Forward propagation is easy to understand: given the inputs and the network, it is the process of computing the output. Here, that means computing $n_6$ from $I_1$ and $I_2$, exactly as laid out in equations (1)–(6) above.
- Backpropagation is the focus of what follows.
- First, the definition of the loss function:

$$loss = (n_6 - label)^2 \tag{9}$$

- Before backpropagating, run forward propagation to obtain the output value $n_6$, and from it the value of $loss$.
- Then use the value of $loss$ to update all the weight parameters $w$ in the network (note that this does not include the bias parameters). The update rule is:

$$w = w - learning\_rate * \frac{\partial loss}{\partial w}$$

- Backpropagation is precisely the process of computing $\frac{\partial loss}{\partial w}$, and this is where the chain rule comes in.
Here is the derivation of $\frac{\partial loss}{\partial w_3}$:

$$\frac{\partial loss}{\partial w_3}=\frac{\partial loss}{\partial n_6}*\frac{\partial n_6}{\partial n_5}*\frac{\partial n_5}{\partial w_3} \tag{10}$$

From equation (9),

$$\frac{\partial loss}{\partial n_6}=2(n_6-label) \tag{11}$$

From equations (6) and (7),

$$\frac{\partial n_6}{\partial n_5}=n_6(1-n_6) \tag{12}$$

From equation (5),

$$\frac{\partial n_5}{\partial w_3}=n_2 \tag{13}$$

So, combining equations (10), (11), (12), and (13),

$$\frac{\partial loss}{\partial w_3}=2(n_6-label)n_6(1-n_6)n_2 \tag{14}$$

By the same reasoning,

$$\frac{\partial loss}{\partial w_4}=2(n_6-label)n_6(1-n_6)n_4 \tag{15}$$

Next, here is the derivation of $\frac{\partial loss}{\partial w_{11}}$:

$$\frac{\partial loss}{\partial w_{11}}=\frac{\partial loss}{\partial n_6}*\frac{\partial n_6}{\partial n_5}*\frac{\partial n_5}{\partial n_2}*\frac{\partial n_2}{\partial n_1}*\frac{\partial n_1}{\partial w_{11}} \tag{16}$$

From equation (5),

$$\frac{\partial n_5}{\partial n_2}=w_3 \tag{17}$$

From equations (3) and (8),

$$\frac{\partial n_2}{\partial n_1}=1-n_2^2 \tag{18}$$

From equation (1),

$$\frac{\partial n_1}{\partial w_{11}}=I_1 \tag{19}$$

So, combining equations (11), (12), (17), (18), and (19),

$$\frac{\partial loss}{\partial w_{11}}=2(n_6-label)n_6(1-n_6)w_3(1-n_2^2)I_1 \tag{20}$$

By the same reasoning,

$$\frac{\partial loss}{\partial w_{12}}=2(n_6-label)n_6(1-n_6)w_4(1-n_4^2)I_1 \tag{21}$$

$$\frac{\partial loss}{\partial w_{21}}=2(n_6-label)n_6(1-n_6)w_3(1-n_2^2)I_2 \tag{22}$$

$$\frac{\partial loss}{\partial w_{22}}=2(n_6-label)n_6(1-n_6)w_4(1-n_4^2)I_2 \tag{23}$$
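These hand-derived gradients can be checked numerically. The sketch below compares equations (14) and (20) against central finite differences for a single input; all parameter values are arbitrary placeholders:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Arbitrary placeholder parameters and one training example
w11, w21, w12, w22, w3, w4 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7
b1, b2, b3 = 2, 2, 4
I1, I2, label = 1, 0, 1

def loss(w11_, w3_):
    n2 = math.tanh(w11_ * I1 + w21 * I2 + b1)
    n4 = math.tanh(w12 * I1 + w22 * I2 + b2)
    n6 = sigmoid(w3_ * n2 + w4 * n4 + b3)
    return (n6 - label) ** 2

# One forward pass to get the node values used by the analytic formulas
n2 = math.tanh(w11 * I1 + w21 * I2 + b1)
n4 = math.tanh(w12 * I1 + w22 * I2 + b2)
n6 = sigmoid(w3 * n2 + w4 * n4 + b3)

h = 1e-6
d_w3_analytic = 2 * (n6 - label) * n6 * (1 - n6) * n2                        # equation (14)
d_w3_numeric = (loss(w11, w3 + h) - loss(w11, w3 - h)) / (2 * h)
d_w11_analytic = 2 * (n6 - label) * n6 * (1 - n6) * w3 * (1 - n2 ** 2) * I1  # equation (20)
d_w11_numeric = (loss(w11 + h, w3) - loss(w11 - h, w3)) / (2 * h)
print(abs(d_w3_analytic - d_w3_numeric))    # close to 0
print(abs(d_w11_analytic - d_w11_numeric))  # close to 0
```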
Other Details
- Epochs: 1000
- Learning rate: initialized to 1 and decayed as the epochs go on, according to

$$learning\_rate = learning\_rate\_init*(1 - \frac{cur\_epoch}{epochs})$$

where $learning\_rate\_init$ is the initial learning-rate value (here 1), $cur\_epoch$ is the index of the current epoch (counting from 0), and $epochs$ is the total number of epochs.
- Biases initialized to $b_1=2$, $b_2=2$, $b_3=4$
- Weights initialized randomly
- The dataset is shuffled at every epoch
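One detail worth noting: the closed-form schedule above is equivalent to subtracting $learning\_rate\_init/epochs$ once per epoch, which is the incremental form the code below uses. A minimal check of this equivalence (my addition):

```python
learning_rate_init, epochs = 1, 1000
lr = learning_rate_init
for cur_epoch in range(epochs):
    closed_form = learning_rate_init * (1 - cur_epoch / epochs)
    assert abs(lr - closed_form) < 1e-9  # the two formulations agree
    lr -= learning_rate_init / epochs    # incremental form used in the code
```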
Code
```python
import random
import math

random.seed()  # seed from system entropy; seeding with datetime.now() raises TypeError on Python 3.11+

learn_rate_init = 1
learn_rate = 1
epochs = 1000
data_len = 4
data = [[0,0,0],[1,0,1],[0,1,1],[1,1,0]]
epoch_print = 10

def sigmoid(x):
    return 1/(1 + math.exp(-x))

def tanh(x):
    return math.tanh(x)
# bias indices:   0→b1  1→b2  2→b3
bias = []
# node indices:   0→I1  1→I2  2→n1  3→n2  4→n3  5→n4  6→n5  7→n6
node = None
# weight indices: 0→w11 1→w12 2→w21 3→w22 4→w3  5→w4
w = None
'''
input: I1 I2
n1 = w11 * I1 + w21 * I2 + b1
n2 = tanh(n1)
n3 = w12 * I1 + w22 * I2 + b2
n4 = tanh(n3)
n5 = w3 * n2 + w4 * n4 + b3
n6 = sigmoid(n5)
output: n6
'''
def forward(I1, I2, label, epoch):
    global node, w
    node = []
    # node[0] = I1
    node.append(I1)
    # node[1] = I2
    node.append(I2)
    # node[2] = n1
    node.append(w[0]*I1 + w[2]*I2 + bias[0])
    # node[3] = n2
    node.append(tanh(node[2]))
    # node[4] = n3
    node.append(w[1]*I1 + w[3]*I2 + bias[1])
    # node[5] = n4
    node.append(tanh(node[4]))
    # node[6] = n5
    node.append(w[4]*node[3] + w[5]*node[5] + bias[2])
    # node[7] = n6
    node.append(sigmoid(node[6]))
    # compute the squared error
    err = (label - node[7])**2
    if (epoch+1) % epoch_print == 0:
        # print the loss
        print("loss for input(", str(I1), ",", str(I2), "):", str(err))
    return err
def backward(label):
    global node, w
    # Gradients from equations (14), (15), and (20)-(23)
    d_w3 = 2*(node[7]-label)*node[7]*(1-node[7])*node[3]
    d_w4 = 2*(node[7]-label)*node[7]*(1-node[7])*node[5]
    d_w11 = 2*(node[7]-label)*node[7]*(1-node[7])*w[4]*(1-node[3]**2)*node[0]
    d_w21 = 2*(node[7]-label)*node[7]*(1-node[7])*w[4]*(1-node[3]**2)*node[1]
    d_w12 = 2*(node[7]-label)*node[7]*(1-node[7])*w[5]*(1-node[5]**2)*node[0]
    d_w22 = 2*(node[7]-label)*node[7]*(1-node[7])*w[5]*(1-node[5]**2)*node[1]
    # Gradient-descent update for every weight
    w[0] = w[0] - learn_rate * d_w11
    w[1] = w[1] - learn_rate * d_w12
    w[2] = w[2] - learn_rate * d_w21
    w[3] = w[3] - learn_rate * d_w22
    w[4] = w[4] - learn_rate * d_w3
    w[5] = w[5] - learn_rate * d_w4
def init():
    '''
    Across multiple runs, the biases of this model turned out to have
    a large effect on how well the final result converges.
    A bias setting that works well is [2, 2, 4].
    You can try initializing the biases randomly instead and observe
    how well the model ends up converging.
    '''
    global w, bias, learn_rate  # learn_rate must be declared global, or the reset below has no effect
    w = []
    bias = []
    learn_rate = learn_rate_init
    for i in range(6):
        w.append(random.random())
    # for i in range(3):
    #     bias.append(1 + random.random())
    bias = [2, 2, 4]  # set the biases
    print("initial weights")
    print(w)
    print("initial biases")
    print(bias)
init()
for epoch in range(epochs):
    if (epoch+1) % epoch_print == 0:
        print("epoch:" + str(epoch+1))
    err_t = 0
    order = [0, 1, 2, 3]
    random.shuffle(order)  # shuffle the dataset every epoch
    for i in order:
        err_t += forward(data[i][0], data[i][1], data[i][2], epoch)
        backward(data[i][2])
    if (epoch+1) % epoch_print == 0:
        print("total error:", str(err_t))
    learn_rate -= learn_rate_init/epochs  # decay the learning rate as the epochs go on
print("--------------------------")
print("final training results")
# test input (0,0)
forward(0, 0, 0, 1)
print("output(0,0):", str(node[7]))
# test input (0,1)
forward(0, 1, 1, 1)
print("output(0,1):", str(node[7]))
# test input (1,0)
forward(1, 0, 1, 1)
print("output(1,0):", str(node[7]))
# test input (1,1)
forward(1, 1, 0, 1)
print("output(1,1):", str(node[7]))
print("learned weights:")
print(w)
```
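To turn the sigmoid output into a hard 0/1 XOR prediction, you can threshold it at 0.5. A small helper along these lines (my addition, not part of the original code):

```python
def predict(I1, I2):
    forward(I1, I2, 0, 0)  # label and epoch are unused for inference
    return 1 if node[7] > 0.5 else 0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), "->", predict(a, b))
```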