The Error Backpropagation (BP) Algorithm
Output layer
For a training example $(\boldsymbol{x}_{k}, \boldsymbol{y}_{k})$, suppose the output of the neural network is $\hat{\boldsymbol{y}}_{k}=\left(\hat{y}_{1}^{k}, \hat{y}_{2}^{k}, \ldots, \hat{y}_{l}^{k}\right)$, that is,

$$\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right),$$
Then the mean squared error of the network on $(\boldsymbol{x}_{k}, \boldsymbol{y}_{k})$ is

$$E_{k}=\frac{1}{2} \sum_{j=1}^{l}\left(\hat{y}_{j}^{k}-y_{j}^{k}\right)^{2} .$$
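As a quick numerical check of this squared error, a minimal sketch in numpy; the output and target vectors are made-up values for illustration:

import numpy as np

y_hat_k = np.array([0.8, 0.1, 0.1])  # hypothetical network outputs for example k
y_k = np.array([1.0, 0.0, 0.0])      # hypothetical target vector
E_k = 0.5 * np.sum((y_hat_k - y_k) ** 2)
print(E_k)  # ≈ 0.03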
The BP algorithm is based on the gradient descent strategy: parameters are adjusted in the direction of the negative gradient of the objective. For the error $E_{k}$ and a given learning rate $\eta$, we have

$$\Delta w_{h j}=-\eta \frac{\partial E_{k}}{\partial w_{h j}} .$$
Note that $w_{hj}$ first affects the input $\beta_{j}$ of the $j$-th output-layer neuron, then its output $\hat{y}_{j}^{k}$, and finally $E_{k}$. By the chain rule,
$$\frac{\partial E_{k}}{\partial w_{h j}}=\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial w_{h j}} .$$
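As a sanity check on this chain rule, the sketch below compares the product of the three factors with a finite-difference estimate for one weight of a tiny one-hidden-layer network; all layer sizes and values are arbitrary assumptions, and a sigmoid activation f is assumed:

import numpy as np

def f(x):  # sigmoid activation
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
q, l = 3, 2                     # hidden and output layer sizes (arbitrary)
b = rng.random(q)               # hidden-layer outputs b_h
W = rng.random((q, l))          # weights w_hj
theta = rng.random(l)           # output thresholds θ_j
y = rng.random(l)               # target values y_j

def E(W):
    y_hat = f(b @ W - theta)
    return 0.5 * np.sum((y_hat - y) ** 2)

h, j = 1, 0                     # pick one weight w_hj
y_hat = f(b @ W - theta)
dE_dyhat = y_hat[j] - y[j]               # ∂E_k/∂ŷ_j
dyhat_dbeta = y_hat[j] * (1 - y_hat[j])  # ∂ŷ_j/∂β_j for the sigmoid
dbeta_dw = b[h]                          # ∂β_j/∂w_hj
analytic = dE_dyhat * dyhat_dbeta * dbeta_dw

eps = 1e-6
W_eps = W.copy()
W_eps[h, j] += eps
numeric = (E(W_eps) - E(W)) / eps
print(analytic, numeric)        # the two should agree closely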
Since

$$\beta_{j}= \sum_{h=1}^{q} w_{hj}b_{h},$$

we can regard $\beta_{j}$, viewed as a function of $w_{hj}$ alone, as a straight line with slope $b_{h}$, so naturally

$$\frac{\partial \beta_{j}}{\partial w_{h j}}=b_{h} .$$
Define the output-layer gradient term $g_{j}$. Using the sigmoid property $f^{\prime}(x)=f(x)(1-f(x))$, we get

$$\begin{aligned} g_{j} &=-\frac{\partial E_{k}}{\partial \hat{y}_{j}^{k}} \cdot \frac{\partial \hat{y}_{j}^{k}}{\partial \beta_{j}} \\ &=-\left(\hat{y}_{j}^{k}-y_{j}^{k}\right) f^{\prime}\left(\beta_{j}-\theta_{j}\right) \\ &=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right) . \end{aligned}$$
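The same $g_{j}$ in vectorized numpy, with hypothetical outputs and targets:

import numpy as np

y_hat = np.array([0.8, 0.1])           # hypothetical outputs ŷ_j^k
y = np.array([1.0, 0.0])               # hypothetical targets y_j^k
g = y_hat * (1 - y_hat) * (y - y_hat)  # g_j = ŷ_j(1−ŷ_j)(y_j−ŷ_j)
print(g)                               # [ 0.032 -0.009]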
Combining the expressions above gives $\Delta w_{h j}$:

$$\Delta w_{h j}=\eta g_{j} b_{h} .$$
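For a single example, this update covers all weights at once as an outer product; a sketch reusing the hypothetical g above, with a made-up learning rate and hidden outputs:

import numpy as np

eta = 0.1                        # learning rate (assumed)
b = np.array([0.5, 0.3, 0.9])    # hypothetical hidden-layer outputs b_h
g = np.array([0.032, -0.009])    # g_j from the previous sketch
Delta_W = eta * np.outer(b, g)   # Δw_hj = η g_j b_h, shape (q, l)
print(Delta_W)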
Similarly, for the output-layer threshold we obtain

$$\Delta \theta_{j} =-\eta g_{j} .$$
Hidden layer
By the same reasoning, the hidden-layer gradient term is

$$\begin{aligned} e_{h} &= -\frac{\partial E_{k}}{\partial b_{h}} \cdot \frac{\partial b_{h}}{\partial \alpha_{h}} \\ &= -\sum_{j=1}^{l} \frac{\partial E_{k}}{\partial \beta_{j}} \cdot \frac{\partial \beta_{j}}{\partial b_{h}} f^{\prime}\left(\alpha_{h}-\gamma_{h}\right)\\ &= f^{\prime}\left(\alpha_{h}-\gamma_{h}\right)\sum_{j=1}^{l}w_{hj}g_{j}, \end{aligned}$$

where the last step uses $\partial \beta_{j} / \partial b_{h}=w_{hj}$ and $g_{j}=-\partial E_{k} / \partial \beta_{j}$.
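Again in numpy, with hypothetical weights and hidden outputs; for the sigmoid, $f^{\prime}(\alpha_{h}-\gamma_{h})=b_{h}(1-b_{h})$:

import numpy as np

b = np.array([0.5, 0.3, 0.9])   # hypothetical hidden outputs b_h = f(α_h − γ_h)
W = np.array([[0.2, -0.1],
              [0.4, 0.7],
              [-0.3, 0.5]])     # hypothetical weights w_hj, shape (q, l)
g = np.array([0.032, -0.009])   # g_j from the output-layer sketch
e = b * (1 - b) * (W @ g)       # e_h = f'(α_h − γ_h) Σ_j w_hj g_j
print(e)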
The hidden-layer updates are then

$$\Delta v_{i h} =\eta e_{h} x_{i},$$

$$\Delta \gamma_{h} =-\eta e_{h} .$$
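Putting the four update rules together, one gradient-descent step on a single example might look like the following sketch; the layer sizes, initial values, and sigmoid activation are assumptions for illustration, with V, gamma, W, theta standing for $v_{ih}, \gamma_{h}, w_{hj}, \theta_{j}$ above:

import numpy as np

def f(x):  # sigmoid
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
d, q, l = 4, 5, 3                              # input, hidden, output sizes (arbitrary)
V, gamma = rng.random((d, q)), rng.random(q)   # v_ih and γ_h
W, theta = rng.random((q, l)), rng.random(l)   # w_hj and θ_j
x, y = rng.random(d), rng.random(l)            # one training example (made up)
eta = 0.1

# forward pass
alpha = x @ V                # α_h
b = f(alpha - gamma)         # hidden outputs b_h
beta = b @ W                 # β_j
y_hat = f(beta - theta)      # outputs ŷ_j

# gradient terms
g = y_hat * (1 - y_hat) * (y - y_hat)  # output layer
e = b * (1 - b) * (W @ g)              # hidden layer

# parameter updates
W += eta * np.outer(b, g)    # Δw_hj = η g_j b_h
theta -= eta * g             # Δθ_j = −η g_j
V += eta * np.outer(x, e)    # Δv_ih = η e_h x_i
gamma -= eta * e             # Δγ_h = −η e_h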
Code implementation
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
data = iris.data
target = iris.target

class NeuralNetwork:
    def __init__(self, in_size, o_size, h_size):
        # layer sizes
        self.in_size = in_size
        self.o_size = o_size
        self.h_size = h_size
        self.W1 = np.random.randn(in_size, h_size)  # n x b matrix
        self.W2 = np.random.randn(h_size, o_size)   # b x k matrix

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    # mapping function: turn a continuous value into a discrete class label
    def ref(self, x):
        if x <= (1 / 3):
            return 0
        elif x <= (2 / 3):
            return 1
        else:
            return 2

    # input X is an m x n matrix
    def forward(self, X):
        self.z2 = np.dot(X, self.W1)          # m x b
        self.act2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.act2, self.W2)  # m x k
        # keep the output continuous here; discretizing it before
        # backward() would break the gradient computation
        self.y_hat = self.sigmoid(self.z3)
        return self.y_hat

    # y is an m x k matrix
    def backward(self, X, y, y_hat, learning_rate):
        # output-layer gradient term (g_j above)
        Grd_1 = (y - y_hat) * self.sigmoid(self.z3) * (1 - self.sigmoid(self.z3))  # m x k
        # output-layer delta
        Delta_W2 = np.dot(self.act2.T, Grd_1)  # b x k
        # hidden-layer gradient term (e_h above)
        Grd_2 = np.dot(Grd_1, self.W2.T) * self.sigmoid(self.z2) * (1 - self.sigmoid(self.z2))  # m x b
        # hidden-layer delta
        Delta_W1 = np.dot(X.T, Grd_2)  # n x b
        # update the weights
        self.W1 += learning_rate * Delta_W1
        self.W2 += learning_rate * Delta_W2

    def train(self, X, y, learning_rate, num_epochs):
        # shape check
        if X.shape[0] != y.shape[0]:
            return -1
        for i in range(1, num_epochs + 1):
            y_hat = self.forward(X)
            self.backward(X, y, y_hat, learning_rate)
            # report the mean squared error
            loss = np.mean((y - y_hat) ** 2)
            print(f"loss = {loss}, epochs/num_epochs: {i}/{num_epochs}")

    def predict(self, X):
        # map the continuous outputs to discrete class labels
        vec_rule = np.vectorize(self.ref)
        return vec_rule(self.forward(X))
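A possible way to exercise the class on the iris data loaded above. Training against target / 2, so the labels {0, 1, 2} land on {0, 0.5, 1} within the sigmoid's range and line up with the 1/3 and 2/3 thresholds in ref, is an assumption about the intended encoding, as are the layer sizes and hyperparameters:

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=0)

net = NeuralNetwork(in_size=4, o_size=1, h_size=8)
# scale labels {0, 1, 2} to {0, 0.5, 1} to fit the sigmoid range (assumed encoding)
net.train(X_train, (y_train / 2).reshape(-1, 1), learning_rate=0.01, num_epochs=1000)

y_pred = net.predict(X_test)  # discrete labels via ref()
print("test accuracy:", np.mean(y_pred.ravel() == y_test))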
Note: some of the formulas come from Zhou Zhihua's Machine Learning (the "watermelon book").