Mathematical Derivation of Backpropagation in Neural Networks


title: Mathematical Process of Backpropagation Derivatives in Neural Networks
date: 2020-04-06 12:25:09
tags: Machine Learning
mathjax: true


It took me several days to figure out the process of backpropagation in neural networks, and now that I have worked through it, I am reviewing and recording the derivation here.

Generally Speaking

As we all know, before doing any mathematics on a neural network we need to define some notation first.

We define the Training Set as
$$\big\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\dots,(x^{(m)},y^{(m)})\big\}$$
which means we have $m$ samples in the training set in total.

The neural network we consider has $n_l$ layers in total; the $1^{st}$ layer is the input layer and the last layer, i.e. the $n_l$-th layer, is the output layer. For layer $i$, the number of neurons is $S_i$. So the dimension of $y^{(i)}$ is $S_{n_l}$, i.e. $y^{(i)}=[y^{(i)}_1,y^{(i)}_2,y^{(i)}_3,\dots,y^{(i)}_{S_{n_l}}]^T$, and the dimension of $x^{(i)}$ is $S_1$, i.e. $x^{(i)}=[x^{(i)}_1,x^{(i)}_2,x^{(i)}_3,\dots,x^{(i)}_{S_1}]^T$. The connection weight from node $i$ in layer $l-1$ to node $j$ in layer $l$ is defined as $w^{(l)}_{ji}$, and the bias of node $i$ in layer $l$ is defined as $b^{(l)}_i$, in which $l\in\{2,3,\dots,n_l\}$. That means the parameters we have to learn in this neural network are the weights $w^{(l)}_{ji}$ and the biases $b^{(l)}_i$ for $l\in\{2,3,\dots,n_l\}$.
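
To make the notation concrete, here is a minimal NumPy sketch of these definitions. The layer sizes `S = [3, 4, 2]`, the variable names, and the dict-of-arrays layout are illustrative assumptions, not something fixed by the derivation:

```python
import numpy as np

# Illustrative layer sizes S_1, S_2, S_3 (so n_l = 3); any sizes would work.
S = [3, 4, 2]
n_l = len(S)

rng = np.random.default_rng(0)
# w[l] has shape (S_l, S_{l-1}) and connects layer l-1 to layer l, for l = 2..n_l.
w = {l: 0.1 * rng.standard_normal((S[l - 1], S[l - 2])) for l in range(2, n_l + 1)}
# b[l] has one entry per neuron of layer l.
b = {l: np.zeros(S[l - 1]) for l in range(2, n_l + 1)}
```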

The cost function for a particular $(x,y)$ is defined in formula $(1)$:

$$J(w,b;x,y)=\frac{1}{2}\big\|h_{w,b}(x)-y\big\|^2\tag{1}$$

We use the mean squared error as the error criterion. For a training set $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\dots,(x^{(m)},y^{(m)})\}$, the cost function is defined as:
$$J(w,b)=\Big[\sum^{m}_{i=1}J(w,b;x^{(i)},y^{(i)})\Big]+\frac{\lambda}{2}\sum_{l=2}^{n_l}\sum_{i=1}^{S_{l-1}}\sum_{j=1}^{S_l}\big(w^{(l)}_{ji}\big)^2\tag{2}$$
The second term on the right-hand side, $\frac{\lambda}{2}\sum_{l=2}^{n_l}\sum_{i=1}^{S_{l-1}}\sum_{j=1}^{S_l}\big(w^{(l)}_{ji}\big)^2$, is actually an additional regularization term added to avoid the "overfitting problem", and its derivative with respect to any parameter is simple. So in the following sections let's just ignore it.
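
As a quick illustration of formula $(1)$, and continuing the toy NumPy sketch above (the helper name `cost_single` is made up for this post):

```python
# Formula (1): per-sample cost, where h is the network output h_{w,b}(x).
# Formula (2) without the weight-decay term is just the sum of this over all samples.
def cost_single(h, y):
    return 0.5 * np.sum((h - y) ** 2)
```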
What we want to solve for is $\frac{\partial J(w,b)}{\partial w^{(l)}_{ji}}$ and $\frac{\partial J(w,b)}{\partial b^{(l)}_{i}}$, so that we can optimize those parameters by some method (like gradient descent). It is actually pretty hard to calculate $\frac{\partial J(w,b)}{\partial w^{(l)}_{ji}}$ and $\frac{\partial J(w,b)}{\partial b^{(l)}_{i}}$ directly, since that would require a huge amount of matrix-derivative work, so let's concentrate on an easier way.

Error Term $\delta$

According to *Neural Networks and Deep Learning*, and based on the chain rule, formulas $(3)$ and $(4)$ are defined:
$$\frac{\partial J(w,b)}{\partial w^{(l)}_{ji}} = \frac{\partial J(w,b)}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}\tag{3}$$

$$\frac{\partial J(w,b)}{\partial b^{(l)}_{j}} = \frac{\partial J(w,b)}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}\tag{4}$$

We can use an error term $\delta$ to calculate the above partial derivatives more easily:
$$\delta^{(l)}_{j}=\frac{\partial J(w,b)}{\partial z^{(l)}_{j}}\tag{5}$$
in which,
$$z^{(l)}=w^{(l)}a^{(l-1)}+b^{(l)}$$

$$a^{(l)}=f(z^{(l)})$$

Here we define $f$ as the activation function (such as sigmoid, tanh, etc.).
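
Continuing the toy NumPy sketch, a forward pass under these definitions might look like the following. The sigmoid is an assumed choice of $f$; any differentiable activation would do:

```python
# Sigmoid activation f and its derivative f'.
def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

# Forward pass: a^{(1)} = x, z^{(l)} = w^{(l)} a^{(l-1)} + b^{(l)}, a^{(l)} = f(z^{(l)}).
def forward(x, w, b, n_l):
    a = {1: x}
    z = {}
    for l in range(2, n_l + 1):
        z[l] = w[l] @ a[l - 1] + b[l]
        a[l] = f(z[l])
    return z, a
```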

To calculate $\delta$, we can start from the output layer and work back toward the input layer step by step.

Calculation of the Output Layer $\delta^{(n_l)}_i$

Calculating the output layer $\delta^{(n_l)}_i$ is relatively easy:
$$\delta^{(n_l)}_i=-\big(y_i-a_i^{(n_l)}\big)f'\big(z_i^{(n_l)}\big)\tag{6}$$

Proof

In the proof, we denote the training sample as $(x,y)$.
$$\begin{aligned} \delta^{(n_l)}_i&=\frac{\partial}{\partial z^{(n_l)}_i}J(w,b) \\ &=\frac{\partial}{\partial z^{(n_l)}_i}J(w,b;x,y) \\ &=\frac{\partial}{\partial z^{(n_l)}_i}\frac{1}{2}\big\|y-h_{w,b}(x)\big\|^2 \\ &=\frac{\partial}{\partial z^{(n_l)}_i}\frac{1}{2}\sum_{j=1}^{S_{n_l}}\big(y_j-f(z_j^{(n_l)})\big)^2 \\ &=-\big(y_i-f(z_i^{(n_l)})\big)f'\big(z_i^{(n_l)}\big) \\ &=-\big(y_i-a_i^{(n_l)}\big)f'\big(z^{(n_l)}_i\big) \end{aligned}$$
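
In the toy NumPy sketch, formula $(6)$ for the whole output layer is a single vectorized line (an element-wise product over the output neurons):

```python
# Formula (6): delta^{(n_l)} = -(y - a^{(n_l)}) * f'(z^{(n_l)}), for all output neurons at once.
def output_delta(y, z, a, n_l):
    return -(y - a[n_l]) * f_prime(z[n_l])
```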

Calculation of the $n_l-1$ Layer $\delta^{(n_l-1)}_i$

For $l\in\{n_l-1, n_l-2, \dots, 2\}$ (we cannot let $l=1$, since the weights between the input layer and the second layer are defined as $w^{(2)}$ and there is no $w^{(1)}$ parameter), the $\delta$ can be written as:
$$\delta_i^{(l)}=\Big(\sum_{j=1}^{S_{l+1}}w_{ji}^{(l+1)}\delta_j^{(l+1)}\Big)f'\big(z_i^{(l)}\big)\tag{7}$$
To prove that, we can first prove the case of $\delta_i^{(n_l-1)}$ in the $n_l-1$ layer:
$$\delta_i^{(n_l-1)}=\Big(\sum_{j=1}^{S_{n_l}}w_{ji}^{(n_l)}\delta_j^{(n_l)}\Big)f'\big(z_i^{(n_l-1)}\big)\tag{8}$$

Proof

$$\begin{aligned} \delta_i^{(n_l-1)}&=\frac{\partial}{\partial z_i^{(n_l-1)}}J(w,b;x,y) \\ &=\sum_{j=1}^{S_{n_l}}\frac{\partial J(w,b;x,y)}{\partial z_j^{(n_l)}}\,\frac{\partial z_j^{(n_l)}}{\partial z_i^{(n_l-1)}} \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}\,\frac{\partial z_j^{(n_l)}}{\partial z_i^{(n_l-1)}} \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}\,\frac{\partial}{\partial z_i^{(n_l-1)}}\Big(\sum_{k=1}^{S_{n_l-1}}w_{jk}^{(n_l)}f\big(z_k^{(n_l-1)}\big)+b_j^{(n_l)}\Big) \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}\,w_{ji}^{(n_l)}f'\big(z_i^{(n_l-1)}\big) \\ &=\Big(\sum_{j=1}^{S_{n_l}}w_{ji}^{(n_l)}\delta_j^{(n_l)}\Big)f'\big(z_i^{(n_l-1)}\big) \end{aligned}$$

Based on the above derivation, we can conclude formula $(9)$:
$$\delta_i^{(n_l-1)}=\Big(\sum_{j=1}^{S_{n_l}}\delta^{(n_l)}_jw_{ji}^{(n_l)}\Big)f'\big(z_i^{(n_l-1)}\big)\tag{9}$$

Calculation of the Other Layers' $\delta^{(l)}_{i}$

Using formula $(9)$, we can generalize from the $n_l-1$ layer to any layer $l$:
$$\delta_i^{(l)}=\Big(\sum_{j=1}^{S_{l+1}}\delta^{(l+1)}_jw_{ji}^{(l+1)}\Big)f'\big(z_i^{(l)}\big)\tag{10}$$
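
Continuing the toy NumPy sketch, formula $(10)$ turns into a backward loop over the layers, where the sum over $j$ becomes a matrix-transpose product:

```python
# Formula (10): delta^{(l)} = (w^{(l+1)}.T @ delta^{(l+1)}) * f'(z^{(l)}), for l = n_l-1, ..., 2.
def backward_deltas(y, z, a, w, n_l):
    delta = {n_l: output_delta(y, z, a, n_l)}
    for l in range(n_l - 1, 1, -1):
        delta[l] = (w[l + 1].T @ delta[l + 1]) * f_prime(z[l])
    return delta
```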

Calculation of the Derivatives with Respect to $w^{(l)}_{ji}$ and $b^{(l)}_{j}$

According to formulas $(3)$ and $(4)$, we need to calculate $\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}$:
$$\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}=\frac{\partial \Big(\sum_{k=1}^{S_{l-1}}w_{jk}^{(l)}a_k^{(l-1)}+b^{(l)}_j\Big)}{\partial w^{(l)}_{ji}}=a_i^{(l-1)}\tag{11}$$

$$\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}=\frac{\partial \Big(\sum_{k=1}^{S_{l-1}}w_{jk}^{(l)}a_k^{(l-1)}+b^{(l)}_j\Big)}{\partial b^{(l)}_{j}}=1\tag{12}$$

Using formulas $(11)$ and $(12)$ together with $(3)$, $(4)$ and $(5)$, we can solve $\frac{\partial J(w,b)}{\partial w^{(l)}_{ji}}$ and $\frac{\partial J(w,b)}{\partial b^{(l)}_{j}}$:
$$\frac{\partial J(w,b)}{\partial w^{(l)}_{ji}} = \frac{\partial J(w,b)}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}=\delta_j^{(l)}a_i^{(l-1)}\tag{13}$$

$$\frac{\partial J(w,b)}{\partial b^{(l)}_{j}} = \frac{\partial J(w,b)}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}=\delta_j^{(l)}\cdot1\tag{14}$$
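
Putting the pieces of the toy NumPy sketch together, formulas $(13)$ and $(14)$ give the full single-sample gradient, and a quick finite-difference check is an easy way to convince yourself the derivation is right (the step size and the checked index below are arbitrary choices):

```python
# Formulas (13) and (14): dJ/dw^{(l)} = outer(delta^{(l)}, a^{(l-1)}), dJ/db^{(l)} = delta^{(l)}.
def gradients(x, y, w, b, n_l):
    z, a = forward(x, w, b, n_l)
    delta = backward_deltas(y, z, a, w, n_l)
    dw = {l: np.outer(delta[l], a[l - 1]) for l in range(2, n_l + 1)}
    db = {l: delta[l] for l in range(2, n_l + 1)}
    return dw, db

# Finite-difference check of one weight derivative against formula (13).
def J(x, y, w, b, n_l):
    _, a = forward(x, w, b, n_l)
    return cost_single(a[n_l], y)

x = rng.standard_normal(S[0])
y = rng.standard_normal(S[-1])
dw, db = gradients(x, y, w, b, n_l)

eps = 1e-6
w_plus = {l: w[l].copy() for l in w}
w_plus[2][0, 0] += eps
numeric = (J(x, y, w_plus, b, n_l) - J(x, y, w, b, n_l)) / eps
print(numeric, dw[2][0, 0])   # these two numbers should agree closely
```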
