title: Mathematical Process of Back Propagation Derivatives in a Neural Network
date: 2020-04-06 12:25:09
tags: Machine Learning
mathjax: true
I took several days to figure out the process of Back Propagation in a Neural Network, and after I finally worked through it, I review and record the process here.
Generally Speaking
As we all know, in order to do some mathematics on a Neural Network, we need to define some notation first.
We define the Training Set as

$$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\dots,(x^{(m)},y^{(m)})\}$$

which means we have $m$ samples in the training set in total.
The neural network we consider has $n_l$ layers in total; the $1^{st}$ layer is the Input Layer and the last layer, i.e. the $n_l^{th}$ layer, is the Output Layer. The number of neurons in layer $i$ is $S_i$. So the dimension of $y^{(i)}$ is $S_{n_l}$, i.e. $y^{(i)}=[y^{(i)}_1,y^{(i)}_2,y^{(i)}_3,\dots,y^{(i)}_{S_{n_l}}]^T$, and the dimension of $x^{(i)}$ is $S_1$, i.e. $x^{(i)}=[x^{(i)}_1,x^{(i)}_2,x^{(i)}_3,\dots,x^{(i)}_{S_1}]^T$. The connection weight from node $i$ in layer $l-1$ to node $j$ in layer $l$ is defined as $w^{(l)}_{ji}$, and the bias of layer $l$ is defined as $b^{(l)}$, in which $l\in\{n_l, n_l-1, \dots, 2\}$. That means in this neural network we have parameters $w^{(l)}$ and $b^{(l)}$ for every layer except the input layer.
The Cost Function for a particular sample $(x,y)$, where $h_{w,b}(x)$ denotes the output of the network, is defined in formula $(1)$:
$$J(w,b;x,y)=\frac{1}{2}||h_{w,b}(x)-y||^2\tag{1}$$
We use the Mean Squared Error as the error criterion. For the whole set $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),(x^{(3)},y^{(3)}),\dots,(x^{(m)},y^{(m)})\}$, the Cost Function is defined as:
$$J(w,b)=\Big[\sum^{m}_{i=1}J(w,b;x^{(i)},y^{(i)})\Big]+\frac{\lambda}{2}\sum_{l=2}^{n_l}\sum_{i=1}^{S_{l-1}}\sum_{j=1}^{S_l}\big(w^{(l)}_{ji}\big)^2\tag{2}$$
The second term on the right of the equation, $\frac{\lambda}{2}\sum_{l=2}^{n_l}\sum_{i=1}^{S_{l-1}}\sum_{j=1}^{S_l}(w^{(l)}_{ji})^2$, is actually a regularization term added in order to avoid the "overfitting problem", and its derivative with respect to any parameter is simple. So in the following sections let's just ignore it.
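Formula $(2)$ maps directly to code. Here is a minimal sketch (the function name `cost` and the argument layout are illustrative, not from the original text), assuming the network outputs for all $m$ samples have already been computed and are stored one column per sample:

```python
import numpy as np

def cost(weights, h, y_batch, lam):
    """Sketch of formula (2): the squared-error sum over m samples plus
    the regularization term (lambda/2) * sum of all squared weights.
    `weights` is a list of weight matrices, `h` holds the network outputs
    h_{w,b}(x^{(i)}) and `y_batch` the targets, one column per sample."""
    data_term = 0.5 * np.sum((h - y_batch) ** 2)                 # sum_i J(w,b; x^(i), y^(i))
    reg_term = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)  # (lambda/2) * sum of w^2
    return data_term + reg_term
```

As in the text, setting `lam = 0` drops the regularization term and leaves only the data term used in the rest of the derivation.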
The target we want to solve is $\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial{J(w,b)}}{\partial b^{(l)}_{i}}$, in order to optimize those parameters with some method (like Gradient Descent). It is actually pretty hard to calculate $\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial{J(w,b)}}{\partial b^{(l)}_{i}}$ directly, since that requires a huge amount of matrix-derivative work. So let's concentrate on an easier way.
Error Term $\delta$
According to "Neural Networks and Deep Learning", based on the chain rule, formulas $(3)$ and $(4)$ are defined:
$$\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}\tag{3}$$
$$\frac{\partial{J(w,b)}}{\partial b^{(l)}_{j}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}\tag{4}$$
We can use an Error Term $\delta$ to calculate the above partial derivatives more easily:

$$\delta^{(l)}_{j}=\frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\tag{5}$$
in which

$$z^{(l)}=w^{(l)}a^{(l-1)}+b^{(l)}$$

$$a^{(l)}=f(z^{(l)})$$
Here we define the function $f$ as the Activation Function (such as sigmoid, tanh, etc.).
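The two recurrences above are exactly the forward pass. A minimal sketch, assuming the sigmoid as $f$ and weights/biases stored in dicts keyed by layer index (the names `forward`, `w`, `b` are illustrative):

```python
import numpy as np

def f(z):
    """Activation function; the sigmoid is used here as one common choice."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """Compute z^(l) = w^(l) a^(l-1) + b^(l) and a^(l) = f(z^(l))
    for l = 2 .. n_l; layer 1 is the input layer, so a^(1) = x.
    `w[l]` has shape (S_l, S_{l-1}) and `b[l]` shape (S_l,)."""
    a, z = {1: x}, {}
    n_l = max(w)  # index of the output layer
    for l in range(2, n_l + 1):
        z[l] = w[l] @ a[l - 1] + b[l]
        a[l] = f(z[l])
    return z, a
```

Note that, matching the text, there is no `w[1]`: the first weight matrix is `w[2]`, which maps the input layer to the second layer.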
To calculate $\delta$, we start from the output layer and work back toward the input layer step by step.
Calculation of Output Layer $\delta^{(n_l)}_i$
Calculating the output layer $\delta^{(n_l)}_i$ is relatively easy:

$$\delta^{(n_l)}_i=-(y_i-a_i^{(n_l)})f'(z_i^{(n_l)})\tag{6}$$
Proof
In the proof, we denote the Training Set as a single sample $(x,y)$.
$$\begin{aligned} \delta^{(n_l)}_i&=\frac{\partial}{\partial z^{(n_l)}_i}J(w,b) \\ &=\frac{\partial}{\partial z^{(n_l)}_i}J(w,b;x,y) \\ &=\frac{\partial}{\partial z^{(n_l)}_i}\frac{1}{2}||y-h_{w,b}(x)||^2 \\ &=\frac{\partial}{\partial z^{(n_l)}_i}\frac{1}{2}\sum_{j=1}^{S_{n_l}}\big(y_j-f(z_j^{(n_l)})\big)^2 \\ &=-\big(y_i-f(z_i^{(n_l)})\big)f'(z_i^{(n_l)}) \\ &=-(y_i-a_i^{(n_l)})f'(z^{(n_l)}_i) \end{aligned}$$
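Formula $(6)$ can be computed for the whole output layer at once. A sketch, again assuming the sigmoid as $f$ (so that $f'(z)=f(z)(1-f(z))$; the function names are illustrative):

```python
import numpy as np

def f(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    """Derivative of the sigmoid: f'(z) = f(z) * (1 - f(z))."""
    s = f(z)
    return s * (1.0 - s)

def output_delta(y, z_out):
    """Formula (6): delta_i = -(y_i - a_i) * f'(z_i), element-wise,
    where a_i = f(z_out_i) is the output-layer activation."""
    return -(y - f(z_out)) * f_prime(z_out)
```

For example, with $z^{(n_l)}=0$ the activation is $0.5$ and $f'(0)=0.25$, so a target of $1$ gives $\delta = -(1-0.5)\cdot 0.25 = -0.125$.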
Calculation of the $n_l-1$ Layer $\delta^{(n_l-1)}_{i}$
For $l\in\{n_l-1, n_l-2, \dots, 2\}$ (we cannot let $l=1$, since the parameter $w$ between the input layer and the second layer is defined as $w^{(2)}$ and there is no $w^{(1)}$ parameter), the $\delta$ can be computed as:
$$\delta_i^{(l)}=\Big(\sum_{j=1}^{S_{l+1}}w_{ji}^{(l+1)}\delta_j^{(l+1)}\Big)f'(z_i^{(l)})\tag{7}$$
To prove that, we first prove the case of $\delta_i^{(n_l-1)}$ in the $n_l-1$ layer:
$$\delta_i^{(n_l-1)}=\Big(\sum_{j=1}^{S_{n_l}}w_{ji}^{(n_l)}\delta_j^{(n_l)}\Big)f'(z_i^{(n_l-1)})\tag{8}$$
Proof
$$\begin{aligned} \delta_i^{(n_l-1)}&=\frac{\partial}{\partial z_i^{(n_l-1)}}J(w,b;x,y) \\ &=\sum_{j=1}^{S_{n_l}}\frac{\partial J(w,b;x,y)}{\partial z_j^{(n_l)}}\frac{\partial z_j^{(n_l)}}{\partial z_i^{(n_l-1)}} \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}\frac{\partial}{\partial z_i^{(n_l-1)}}\Big(\sum_{k=1}^{S_{n_l-1}}w_{jk}^{(n_l)}f(z_k^{(n_l-1)})+b_j^{(n_l)}\Big) \\ &=\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}w_{ji}^{(n_l)}f'(z_i^{(n_l-1)}) \\ &=\Big(\sum_{j=1}^{S_{n_l}}\delta_j^{(n_l)}w_{ji}^{(n_l)}\Big)f'(z_i^{(n_l-1)}) \end{aligned}$$
Based on the above mathematical process, we can conclude formula $(9)$:
$$\delta_i^{(n_l-1)}=\Big(\sum_{j=1}^{S_{n_l}}\delta^{(n_l)}_jw_{ji}^{(n_l)}\Big)f'(z_i^{(n_l-1)})\tag{9}$$
Calculation of Other Layers $\delta^{(l)}_{i}$
Using formula $(9)$, the same argument applies layer by layer, so we can replace $n_l-1$ with any layer $l$:
$$\delta_i^{(l)}=\Big(\sum_{j=1}^{S_{l+1}}\delta^{(l+1)}_jw_{ji}^{(l+1)}\Big)f'(z_i^{(l)})\tag{10}$$
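Formula $(10)$ is the backward recursion that gives backpropagation its name. The sum over $j$ is exactly a multiplication by the transpose of $w^{(l+1)}$, so one layer's step can be sketched as (the function name and argument order are illustrative):

```python
import numpy as np

def hidden_delta(w_next, delta_next, z, f_prime):
    """Formula (10): delta_i^(l) = (sum_j w_ji^(l+1) delta_j^(l+1)) * f'(z_i^(l)).
    `w_next` is w^(l+1) with shape (S_{l+1}, S_l), `delta_next` is delta^(l+1),
    `z` is z^(l), and `f_prime` is the derivative of the activation function."""
    return (w_next.T @ delta_next) * f_prime(z)
```

Applying this repeatedly from layer $n_l-1$ down to layer $2$ yields every $\delta^{(l)}$ from the single output-layer $\delta^{(n_l)}$ of formula $(6)$.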
Calculation of the Derivatives with Respect to $w^{(l)}_{ji}$ and $b^{(l)}$
According to formulas $(3)$ and $(4)$, we need to calculate $\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}$:
$$\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}=\frac{\partial \Big(\sum_{k=1}^{S_{l-1}}w_{jk}^{(l)}a_k^{(l-1)}+b_j^{(l)}\Big)}{\partial w^{(l)}_{ji}}=a_i^{(l-1)}\tag{11}$$
$$\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}=\frac{\partial \Big(\sum_{k=1}^{S_{l-1}}w_{jk}^{(l)}a_k^{(l-1)}+b_j^{(l)}\Big)}{\partial b^{(l)}_{j}}=1\tag{12}$$
Substituting formulas $(11)$ and $(12)$ into $(3)$ and $(4)$, together with the definition $(5)$, we can solve $\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}}$ and $\frac{\partial{J(w,b)}}{\partial b^{(l)}_{j}}$:
$$\frac{\partial{J(w,b)}}{\partial w^{(l)}_{ji}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}=\delta_j^{(l)}a_i^{(l-1)}\tag{13}$$
$$\frac{\partial{J(w,b)}}{\partial b^{(l)}_{j}} = \frac{\partial{J(w,b)}}{\partial z^{(l)}_{j}}\frac{\partial z^{(l)}_{j}}{\partial b^{(l)}_{j}}=\delta_j^{(l)}\cdot1\tag{14}$$
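Putting the output-layer error term, the backward recursion, and the two derivative formulas together gives the whole algorithm for one sample. A sketch under the same assumptions as the earlier snippets (sigmoid activation; `w[l]`, `b[l]` stored in dicts keyed by layer index; names illustrative):

```python
import numpy as np

def f(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, w, b):
    """One full forward + backward pass for a single sample (x, y).
    Returns dJ/dw[l] and dJ/db[l] for every layer l = 2 .. n_l."""
    # Forward pass: z^(l) = w^(l) a^(l-1) + b^(l), a^(l) = f(z^(l)).
    a, z = {1: x}, {}
    n_l = max(w)
    for l in range(2, n_l + 1):
        z[l] = w[l] @ a[l - 1] + b[l]
        a[l] = f(z[l])
    fp = {l: a[l] * (1.0 - a[l]) for l in z}  # sigmoid: f'(z) = f(z)(1 - f(z))
    # Backward pass.
    delta = {n_l: -(y - a[n_l]) * fp[n_l]}                       # output error term
    for l in range(n_l - 1, 1, -1):
        delta[l] = (w[l + 1].T @ delta[l + 1]) * fp[l]           # backward recursion
    grad_w = {l: np.outer(delta[l], a[l - 1]) for l in delta}    # dJ/dw_ji = delta_j * a_i
    grad_b = {l: delta[l] for l in delta}                        # dJ/db_j = delta_j
    return grad_w, grad_b
```

A useful sanity check is to compare one of the returned gradients against a central finite difference of the cost $\frac{1}{2}||h_{w,b}(x)-y||^2$; the two should agree to many decimal places.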