Backpropagation (BP) Algorithm

The backpropagation algorithm passes information backward through the neural network toward the input side and computes the gradient of the loss with respect to each parameter. It is error-driven: following the direction indicated by the gradient of the loss function, it keeps adjusting the parameters, steadily reducing the error until it reaches a local optimum.

This post walks through the iterative computation of the BP algorithm in detail. It does not prove the underlying formulas and theorems; the goal is to show how the algorithm runs, i.e. the "how" rather than the "why".

Setup

First, a quick list of the ANN concepts we will use: input layer / input neurons, hidden layer / hidden neurons, output layer / output neurons, weights, biases, activation function, and hyperparameters (parameters not learned during training, such as the number of layers and the learning rate).

Next, here are the sample data and the network structure for this example run of the algorithm.

Sample data:

X1      X2      Y
0.5     2.5     1
2       1       0
3.14    2.11    0

Network structure: two input neurons (X1, X2), a single hidden layer with two neurons, and one output neuron. Each input connects to both hidden neurons (weights w1-w4), both hidden neurons connect to the output neuron (weights w5, w6), the hidden layer shares bias b1, and the output neuron uses bias b2.

Initial parameters:

Weights: w1=w2=w3=w4=w5=w6=0.5   Biases: b1=b2=0.5

Activation function: S(x)=\frac{1}{1+e^{-x}} (the sigmoid function)

Learning rate: \eta =0.5

Number of iterations: 100000

 

Normalizing the sample data:

(X1,X2)/10=([0.05,0.2,0.314],[0.25,0.1,0.211])
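For reference, a minimal NumPy sketch of this setup (the variable names are my own, not part of the original post):

import numpy as np

# raw samples: three (X1, X2) pairs and their targets Y
X1_raw = np.array([0.5, 2.0, 3.14])
X2_raw = np.array([2.5, 1.0, 2.11])
Y = np.array([1, 0, 0])

# normalize the inputs by dividing by 10
X1 = X1_raw / 10        # [0.05, 0.2, 0.314]
X2 = X2_raw / 10        # [0.25, 0.1, 0.211]

# initial parameters: all weights and biases start at 0.5
weights = np.full(6, 0.5)   # w1..w6
biases = np.full(2, 0.5)    # b1, b2
eta = 0.5                   # learning rate
num_iter = 100000           # number of iterations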

Forward Propagation

net1=w1*X1+w2*X2+b1=\bigl(\begin{smallmatrix} 0.5*0.05\\ 0.5*0.2\\ 0.5*0.314 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5*0.25\\ 0.5*0.1\\ 0.5*0.211 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5\\ 0.5\\ 0.5 \end{smallmatrix}\bigr)=\bigl(\begin{smallmatrix} 0.65\\ 0.65\\ 0.7625 \end{smallmatrix}\bigr)

net2=w3*X1+w4*X2+b1=\bigl(\begin{smallmatrix} 0.5*0.05\\ 0.5*0.2\\ 0.5*0.314 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5*0.25\\ 0.5*0.1\\ 0.5*0.211 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5\\ 0.5\\ 0.5 \end{smallmatrix}\bigr)=\bigl(\begin{smallmatrix} 0.65\\ 0.65\\ 0.7625 \end{smallmatrix}\bigr)

 out1=sigmoid(net1)=\bigl(\begin{smallmatrix} 0.65701046\\ 0.65701046\\ 0.68189626\end{smallmatrix}\bigr)

out2=sigmoid(net2)=\bigl(\begin{smallmatrix} 0.65701046\\ 0.65701046\\ 0.68189626\end{smallmatrix}\bigr)

\begin{align*} r1&=w5*out1+w6*out2+b2 \\ \qquad&=\bigl(\begin{smallmatrix} 0.5*0.65701046\\ 0.5*0.65701046\\ 0.5*0.68189626 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5*0.65701046\\ 0.5*0.65701046\\ 0.5*0.68189626 \end{smallmatrix}\bigr)+\bigl(\begin{smallmatrix} 0.5\\ 0.5\\ 0.5 \end{smallmatrix}\bigr)=\bigl(\begin{smallmatrix} 1.15701046\\1.15701046\\ 1.18189626 \end{smallmatrix}\bigr) \end{align*}

 

 t1=sigmoid(r1)=\bigl(\begin{smallmatrix} 0.76078908\\ 0.76078908\\ 0.76528859\end{smallmatrix}\bigr)

That completes the forward propagation through the network.
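These values are easy to reproduce; here is a minimal, self-contained sketch of the forward pass (using the same weight labelling as the full code at the end of the post):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X1 = np.array([0.05, 0.2, 0.314])
X2 = np.array([0.25, 0.1, 0.211])
w1 = w2 = w3 = w4 = w5 = w6 = 0.5
b1 = b2 = 0.5

net1 = w1 * X1 + w2 * X2 + b1      # [0.65, 0.65, 0.7625]
net2 = w3 * X1 + w4 * X2 + b1      # identical, since all weights start equal
out1, out2 = sigmoid(net1), sigmoid(net2)   # [0.65701046, 0.65701046, 0.68189626]
r1 = w5 * out1 + w6 * out2 + b2    # [1.15701046, 1.15701046, 1.18189626]
t1 = sigmoid(r1)                   # [0.76078908, 0.76078908, 0.76528859]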

Backward Pass: Taking Partial Derivatives

Theoretical Basis

From the forward pass we obtain an output, and clearly there is an error between this actual output and the output we expect. For a fixed input we can therefore construct a function C with C(w,b)=(f(w,b)-Y)^{2}, where f is the forward-pass computation of the network. Our goal can then be stated as: for the input X, find w and b such that C(w,b) attains a (local) minimum.

How do we minimize C(w,b)? We need two ingredients: the gradient and the chain rule.
In calculus, taking the partial derivatives of a multivariate function with respect to each of its parameters and writing them as a vector gives the gradient. Geometrically, the gradient points in the direction in which the function increases fastest, so by moving in the opposite direction we make the function value decrease quickly. Thus, after every forward pass we can take the partial derivatives of C(w,b) and adjust the variables w and b in the direction opposite to the gradient.
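To see "move against the gradient" in isolation, here is a toy one-variable example of gradient descent (my own illustration, unrelated to the network itself):

# toy example: minimize C(w) = (w - 3)^2, whose gradient is dC/dw = 2*(w - 3)
w = 0.0
eta = 0.1                    # learning rate (step size)
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad       # step opposite to the gradient
print(w)                     # approaches the minimum at w = 3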
How, then, do we compute the partial derivatives of C(w,b) with respect to w and b? We need the chain rule:

\frac{\partial e}{\partial a}=\frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial a} \qquad \& \qquad \frac{\partial e}{\partial b}=\frac{\partial e}{\partial c}\cdot \frac{\partial c}{\partial b} +\frac{\partial e}{\partial d}\cdot \frac{\partial d}{\partial b}

(In the formulas above, because both c and d depend on b, when differentiating e with respect to b we must differentiate through c and d separately and then add the results.)
Using the chain rule, we can compute the gradient of C(w,b) with respect to w and b.
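For a concrete feel of the two formulas above, take for instance c = a + b, d = b + 1 and e = c*d (arbitrary choices of my own); the chain-rule derivatives then agree with a numerical finite-difference check:

a, b = 2.0, 3.0
c, d = a + b, b + 1
e = c * d

# chain rule: de/da = de/dc * dc/da = d,  de/db = de/dc * dc/db + de/dd * dd/db = d + c
de_da = d
de_db = d + c

# numerical check with finite differences
h = 1e-6
e_a = ((a + h) + b) * (b + 1)           # e with a perturbed
e_b = (a + (b + h)) * ((b + h) + 1)     # e with b perturbed
print(de_da, (e_a - e) / h)             # both ~ 4
print(de_db, (e_b - e) / h)             # both ~ 9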

If the network is complex, with a large number of w's and b's, computing the partial derivatives of C(w,b) with respect to all of them naively is enormously expensive. The BP algorithm instead starts at the output layer and works backward, reusing each layer's results so that the amount of computation drops sharply. Concretely, BP first computes the partial derivatives for the output layer's w and b. It then moves one layer back to the hidden layer: using the hidden-to-output computation and the partial results already obtained for the output layer, the chain rule yields the hidden layer's partial derivatives. To go back yet another layer, we reuse the results of the layer just computed. In this way every layer's error derivatives are reused by the layer behind it, so BP proceeds layer by layer backward and saves a great deal of differentiation work. The schematic sketch below illustrates this layer-by-layer reuse.
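A generic sketch of this reuse (sizes, data and variable names are arbitrary choices of my own; fully connected layers with sigmoid activations are assumed):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

sizes = [2, 3, 1]                      # layer widths
rng = np.random.default_rng(0)
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

x = np.array([0.05, 0.25])
y = np.array([1.0])

# forward pass, keeping every layer's activation
acts = [x]
for Wl, bl in zip(W, b):
    acts.append(sigmoid(Wl @ acts[-1] + bl))

# backward pass: compute the output layer's delta once, then reuse it
delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])     # output layer
grads_W, grads_b = [], []
for l in range(len(W) - 1, -1, -1):
    grads_W.insert(0, np.outer(delta, acts[l]))        # dE/dW of layer l
    grads_b.insert(0, delta)                           # dE/db of layer l
    if l > 0:
        # the previous layer's delta is built from the current one (the reuse)
        delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])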

A Quick Derivation of the Formulas

(Note: even a simple network involves many parameters, so the derivation below does not rigorously attach sub- and superscripts to every symbol to identify which parameter it refers to. The derivation is not the focus of this post, and while full indexing would be more rigorous, it would also be much harder to read, so the indices are omitted.)

Output layer:

By the chain rule, \frac{\partial E}{\partial w}=\frac{\partial E}{\partial t1}\cdot \frac{\partial t1}{\partial r1} \cdot \frac{\partial r1}{\partial w} \qquad \& \qquad \frac{\partial E}{\partial b}=\frac{\partial E}{\partial t1}\cdot \frac{\partial t1}{\partial r1} \cdot \frac{\partial r1}{\partial b}, where we define the error function E=\frac{1}{2}(Y-t1)^{2}

Then:

\frac{\partial E}{\partial t1}=-(Y-t1)\quad \& \quad \frac{\partial t1}{\partial r1}=t1(1-t1)\quad \& \quad \frac{\partial r1}{\partial w}=out1\quad \& \quad \frac{\partial r1}{\partial b}=1

Hidden layer:

(Note: if a hidden-layer node connects to several nodes on its right, the error at each of those connected nodes is affected by this hidden node. In that case every affected error term must be differentiated with respect to this hidden node, i.e.:

If E1, E2, ..., E(i) are all affected by w, then:

\begin{align*} \frac{\partial E}{\partial w}&=\frac{\partial E1}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w} +\frac{\partial E2}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}\\ \qquad&=(\frac{\partial E1}{\partial out}+\frac{\partial E2}{\partial out}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}\\ \qquad&=(\frac{\partial E1}{\partial t1}\cdot \frac{\partial t1}{\partial r1}\cdot \frac{\partial r1}{\partial out} +\cdot \cdot \cdot +\frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot \frac{\partial r(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w} \end{align*}
Similarly:

\begin{align*} \frac{\partial E}{\partial b}&=\frac{\partial E1}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b} +\frac{\partial E2}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out}\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}\\ \qquad&=(\frac{\partial E1}{\partial out}+\frac{\partial E2}{\partial out}+\cdot \cdot \cdot +\frac{\partial E(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b}\\ \qquad&=(\frac{\partial E1}{\partial t1}\cdot \frac{\partial t1}{\partial r1}\cdot \frac{\partial r1}{\partial out} +\cdot \cdot \cdot +\frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot \frac{\partial r(i)}{\partial out})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial b} \end{align*}

Looking at these equations, we can see that the output-layer and hidden-layer derivative expressions share a common structure, which leads to the following observation:

\frac{\partial t1}{\partial r1}=\frac{\partial out}{\partial net}, both being the derivative of the activation function.

Hence:

Output layer:

∂E/∂w = -(target output - actual output) * derivative of the activation function * out

∂E/∂b = -(target output - actual output) * derivative of the activation function

Hidden layer:

From the derivation above,

\frac{\partial E}{\partial w}=(\sum \frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot w_{i})\cdot \frac{\partial out}{\partial net}\cdot \frac{\partial net}{\partial w}

\frac{\partial E}{\partial b}=(\sum \frac{\partial E(i)}{\partial t(i)}\cdot \frac{\partial t(i)}{\partial r(i)}\cdot w_{i})\cdot \frac{\partial out}{\partial net}\cdot\frac{\partial net}{\partial b}

where \frac{\partial net}{\partial b}=1.

So define δ = -(target output - actual output) * derivative of the activation function (that is, \delta =\frac{\partial E}{\partial t}\cdot \frac{\partial t}{\partial r}). Then:

Output layer, derivatives of the error with respect to the parameters:

∂E/∂w = (output of the hidden-layer node that w connects from) * δ

∂E/∂b = δ

Hidden layer, derivatives of the error with respect to the parameters:

∂E/∂w = (output of the previous-layer node that w connects from, here the input layer) * (weighted sum of the δ's of the nodes in the next layer) * derivative of the activation function

∂E/∂b = (weighted sum of the δ's of the nodes in the next layer) * derivative of the activation function

With these formulas for the partial derivatives, and the gradient argument from earlier, we can write down the update rule for the parameters w and b:

w^{'}=w-\eta \frac{\partial E}{\partial w} \quad \& \quad b^{'}=b-\eta \frac{\partial E}{\partial b}
where η is the learning rate, which sets the step size of the gradient descent performed during backpropagation.
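Putting the δ shorthand and the update rule together, a single parameter-update step for one output unit can be sketched as follows (the numbers and names are arbitrary, chosen only for illustration):

def sigmoid_derivative(y):
    # derivative of the sigmoid written in terms of its output y = sigmoid(x)
    return y * (1 - y)

def gradient_step(param, grad, eta):
    # param' = param - eta * dE/dparam  (same form for weights and biases)
    return param - eta * grad

# arbitrary illustrative numbers for one output unit
target, actual = 1.0, 0.76      # desired and actual output
hidden_out = 0.66               # output of the hidden node feeding this unit
w, b, eta = 0.5, 0.5, 0.5

delta = -(target - actual) * sigmoid_derivative(actual)
w_new = gradient_step(w, hidden_out * delta, eta)   # dE/dw = hidden_out * delta
b_new = gradient_step(b, delta, eta)                # dE/db = delta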

Running the Algorithm

Next, let's work through the backward pass once with the actual numbers:

 \begin{align*} \frac{\partial E}{\partial w5}&=(-(Y-t1)*t1*(1-t1)*out1) \\ \qquad&=\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)*0.65701046\\ -(0-0.76078908)*0.76078908*(1-0.76078908)*0.65701046\\ -(0-0.76528859)*0.76528859*(1-0.76528859)*0.68189626 \end{pmatrix}\\ \qquad&=\begin{pmatrix} -0.02860214\\ 0.09096657\\ 0.09373526 \end{pmatrix} \end{align*}
w5^{'}=\begin{pmatrix} 0.5-0.5*(-0.02860214)\\ 0.5-0.5*0.09096657\\ 0.5-0.5*0.09373526 \end{pmatrix}=\begin{pmatrix} 0.51430107\\ 0.45451671\\ 0.45313237\end{pmatrix}
\begin{align*} \frac{\partial E}{\partial b2}&=(-(Y-t1)*t1*(1-t1)) \\ \qquad&=\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)\\ -(0-0.76078908)*0.76078908*(1-0.76078908)\\ -(0-0.76528859)*0.76528859*(1-0.76528859) \end{pmatrix}\\ \qquad&=\begin{pmatrix} -0.04353377\\ 0.13845529\\ 0.13746264 \end{pmatrix} \end{align*}
b2^{'}=\begin{pmatrix} 0.5-0.5*(-0.04353377)\\ 0.5-0.5*0.13845529\\ 0.5-0.5*0.13746264\end{pmatrix}=\begin{pmatrix} 0.52176689\\ 0.43077236\\ 0.43126868\end{pmatrix}
Similarly:

w6^{'}=\begin{pmatrix} 0.5-0.5*(-0.02860214)\\ 0.5-0.5*0.09096657\\ 0.5-0.5*0.09373526 \end{pmatrix}=\begin{pmatrix} 0.51430107\\ 0.45451671\\ 0.45313237\end{pmatrix}
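As a quick check, these output-layer numbers can be reproduced in a few lines of NumPy (a verification sketch that reuses the forward-pass quantities):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Y = np.array([1, 0, 0])
X1 = np.array([0.05, 0.2, 0.314])
X2 = np.array([0.25, 0.1, 0.211])

out1 = out2 = sigmoid(0.5 * X1 + 0.5 * X2 + 0.5)   # hidden outputs
t1 = sigmoid(0.5 * out1 + 0.5 * out2 + 0.5)        # network output

dE_dw5 = -(Y - t1) * t1 * (1 - t1) * out1
dE_db2 = -(Y - t1) * t1 * (1 - t1)
print(0.5 - 0.5 * dE_dw5)   # w5' ~ [0.51430107, 0.45451671, 0.45313237]
print(0.5 - 0.5 * dE_db2)   # b2' ~ [0.52176689, 0.43077236, 0.43126868]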

Now continue one layer further back:

\begin{align*} \frac{\partial E}{\partial w1}&=X1*(-(Y-t1)*t1*(1-t1)*w5)*out1*(1-out1) \\ \qquad&=\begin{pmatrix}0.05*(-(1-0.76078908)*0.76078908*(1-0.76078908)*0.5)*0.65701046*(1-0.65701046)\\0.2*(-(0-0.76078908)*0.76078908*(1-0.76078908)*0.5)*0.65701046*(1-0.65701046)\\ 0.314*(-(0-0.76528859)*0.76528859*(1-0.76528859)*0.5)*0.68189626*(1-0.68189626) \end{pmatrix}\\ \qquad&=\begin{pmatrix} -0.00024526\\ 0.00312006\\ 0.00468135 \end{pmatrix} \end{align*}

w1^{'}=\begin{pmatrix} 0.5-0.5*(-0.00024526)\\ 0.5-0.5*0.00312006\\ 0.5-0.5*0.00468135 \end{pmatrix}=\begin{pmatrix} 0.50012263\\ 0.49843997\\ 0.49765932\end{pmatrix}

\begin{align*} \frac{\partial E}{\partial b1}&=(-(Y-t1)*t1*(1-t1)*w5)*out1*(1-out1) \\ \qquad&+(-(Y-t1)*t1*(1-t1)*w6)*out2*(1-out2) \\ \qquad&=\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\ -(0-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\-(0-0.76528859)*0.76528859*(1-0.76528859)*0.5*0.68189626*(1-0.68189626) \end{pmatrix}\\ \qquad&+\begin{pmatrix} -(1-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\ -(0-0.76078908)*0.76078908*(1-0.76078908)*0.5*0.65701046*(1-0.65701046)\\-(0-0.76528859)*0.76528859*(1-0.76528859)*0.5*0.68189626*(1-0.68189626) \end{pmatrix}\\\qquad&=\begin{pmatrix} -0.00981024\\ 0.03120058\\ 0.02981754\end{pmatrix} \end{align*}
b1^{'}=\begin{pmatrix} 0.5-0.5*(-0.00981024)\\ 0.5-0.5*0.03120058\\ 0.5-0.5*0.02981754 \end{pmatrix}=\begin{pmatrix} 0.50490512\\ 0.48439971\\ 0.48509123\end{pmatrix}

Similarly, we can compute:

w2^{'}=\begin{pmatrix} 0.50061314\\ 0.49921999\\ 0.49842712\end{pmatrix} \quad \& \quad w3^{'}=\begin{pmatrix} 0.50012263\\ 0.49843997\\ 0.49765932\end{pmatrix} \quad \& \quad w4^{'}=\begin{pmatrix} 0.50061314\\ 0.49921999\\ 0.49842712\end{pmatrix} \quad
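Likewise, a short sketch reproduces the hidden-layer updates above (w5 = w6 = 0.5 here, as in the first iteration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Y = np.array([1, 0, 0])
X1 = np.array([0.05, 0.2, 0.314])
X2 = np.array([0.25, 0.1, 0.211])
w5 = w6 = 0.5

out1 = out2 = sigmoid(0.5 * X1 + 0.5 * X2 + 0.5)
t1 = sigmoid(w5 * out1 + w6 * out2 + 0.5)

delta_out = -(Y - t1) * t1 * (1 - t1)
# hidden node 1: weighted output-layer delta times the activation derivative
delta_h1 = delta_out * w5 * out1 * (1 - out1)

print(0.5 - 0.5 * delta_h1 * X1)   # w1' ~ [0.50012263, 0.49843997, 0.49765932]
print(0.5 - 0.5 * delta_h1 * X2)   # w2' ~ [0.50061314, 0.49921999, 0.49842712]
print(0.5 - 0.5 * 2 * delta_h1)    # b1' ~ [0.50490512, 0.48439971, 0.48509123]; b1 is shared by both hidden nodes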

This completes one full iteration of the BP algorithm.

From here, using the updated w' and b', we repeat the forward pass and the BP backward pass until the error falls within an acceptable range.

According to my code, after 100000 iterations the final results are:

b1=[1.4829065 , 1.36039658, 1.312224]

b2=[2.67796477, -3.03037886, -2.94488393]

w1=[0.52457266, 0.58603966, 0.62751917]

w2=[0.62286331, 0.54301983, 0.58568963]

w3=[0.52457266, 0.58603966, 0.62751917]

w4=[0.62286331, 0.54301983, 0.58568963]

w5=[ 2.12102269, -1.94604351, -1.96760757]

w6=[ 2.12102269, -1.94604351, -1.96760757]

Actual output = [0.99806371 0.00196405 0.00195205]
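Finally, here is the complete implementation used to produce these results: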

import numpy as np


# "pd" 偏导
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoidDerivationx(y):
    return y * (1 - y)


if __name__ == "__main__":
    bias = [0.5, 0.5]
    weight = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
    # inputs
    X1 = np.array([0.05, 0.2, 0.314])
    X2 = np.array([0.25, 0.1, 0.211])
    # target outputs
    target1 = np.array([1, 0, 0])
    alpha = 0.5  # learning rate
    numIter = 100000  # number of iterations (100000, as stated above)
    for i in range(numIter):
        # forward pass
        net1 = X1 * weight[1 - 1] + X2 * weight[2 - 1] + bias[0]
        net2 = X1 * weight[3 - 1] + X2 * weight[4 - 1] + bias[0]
        out1 = sigmoid(net1)
        out2 = sigmoid(net2)
        r1 = out1 * weight[5 - 1] + out2 * weight[6 - 1] + bias[1]
        t1 = sigmoid(r1)

        print(str(i) + ", target1 : " + str(target1 - t1))
        if i == numIter - 1:
            print("lastst result : " + str(t1))
        # backward pass
        # gradients for w5 and w6 (output-layer weights)
        pdEt1 = - (target1 - t1)
        pdt1r1 = sigmoidDerivationx(t1)
        pdr1W5 = out1
        pdEW5 = pdEt1 * pdt1r1 * pdr1W5
        pdr1W6 = out2
        pdEW6 = pdEt1 * pdt1r1 * pdr1W6

        # gradient for b2 (output-layer bias)
        pdEB2 = pdEt1 * pdt1r1

        # gradients for w1-w4 (hidden-layer weights)
        pdEt1 = - (target1 - t1)
        pdt1r1 = sigmoidDerivationx(t1)
        pdr1out1 = weight[5 - 1]
        pdEout1 = pdEt1 * pdt1r1 * pdr1out1
        pdout1net1 = sigmoidDerivationx(out1)
        pdnet1W1 = X1
        pdnet1W2 = X2
        pdEW1 = pdEout1 * pdout1net1 * pdnet1W1
        pdEW2 = pdEout1 * pdout1net1 * pdnet1W2
        pdr1out2 = weight[6 - 1]
        pdout2net2 = sigmoidDerivationx(out2)
        pdnet2W3 = X1
        pdnet2W4 = X2
        pdEout2 = pdEt1 * pdt1r1 * pdr1out2
        pdEW3 = pdEout2 * pdout2net2 * pdnet2W3
        pdEW4 = pdEout2 * pdout2net2 * pdnet2W4

        # gradient for b1 (hidden-layer bias, shared by both hidden neurons)
        pdEB1 = pdEout1 * pdout1net1 + pdEout2 * pdout2net2

        # parameter update: p' = p - alpha * dE/dp
        weight[1 - 1] = weight[1 - 1] - alpha * pdEW1
        weight[2 - 1] = weight[2 - 1] - alpha * pdEW2
        weight[3 - 1] = weight[3 - 1] - alpha * pdEW3
        weight[4 - 1] = weight[4 - 1] - alpha * pdEW4
        weight[5 - 1] = weight[5 - 1] - alpha * pdEW5
        weight[6 - 1] = weight[6 - 1] - alpha * pdEW6

        bias[1 - 1] = bias[1 - 1] - alpha * pdEB1
        bias[2 - 1] = bias[2 - 1] - alpha * pdEB2

