1. Cost function for neural networks
$$
\begin{aligned}
J(\Theta) = &- \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] \\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2
\end{aligned}
$$
- $m$ = number of training samples
- $K$ = number of output units
- $(h_\Theta (x^{(i)}))_k$ = the $k^{th}$ output of the hypothesis for the $i^{th}$ sample
- $\lambda$ = regularization parameter
- $L$ = total number of layers in the network
- $s_l$ = number of units (excluding the bias unit) in layer $l$
Cost function for logistic regression:
$$J(\theta) = -\frac1m\sum\limits_{i=1}^m \left[y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\right]+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$$
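Note that for a single output unit ($K = 1$) and $\lambda = 0$, the neural-network cost reduces to exactly the unregularized logistic-regression cost, since the sum over $k$ has only one term:
$$J(\Theta) = -\frac1m\sum_{i=1}^m \left[y^{(i)} \log h_\Theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\Theta(x^{(i)}))\right]$$
So the first formula is simply this cross-entropy cost summed over the $K$ output units, with the regularization taken over all non-bias weights.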
2. Understanding the backpropagation
In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually [1]. In other words, backpropagation gives us an efficient way to compute the gradients needed to minimize the neural-network cost function.
For the given training set $\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace$, the backpropagation algorithm is implemented with the following steps (a minimal Octave/MATLAB sketch of one pass over the training set is given after the list):
- Obtaining the output values (activations) of the output layer. If the network has $L$ layers, we can calculate $a^{(L)}$ with the forward-propagation algorithm.
- Computing the error term for the output layer: $\delta^{(L)}_k = a^{(L)}_k - y_k$
- Computing the error terms for the hidden layers $\delta^{(L-1)}, \delta^{(L-2)}, \dots,\delta^{(2)}$:
  $$
  \begin{aligned}
  \delta^{(l)} &=(\Theta^{(l)})^T\delta^{(l+1)} .*g'(z^{(l)})\\
  &=(\Theta^{(l)})^T\delta^{(l+1)} .*g(z^{(l)}).*(1-g(z^{(l)}))\\
  &=(\Theta^{(l)})^T\delta^{(l+1)} .*a^{(l)}.*(1-a^{(l)})
  \end{aligned}
  $$
  If $g(z) = \frac{1}{1+e^{-z}}$, what is $g'(z)$?
  $$
  \begin{aligned}
  g'(z)&=-\left(\frac{1}{1+e^{-z}}\right)^2\cdot e^{-z} \cdot(-1)\\
  &=\frac{1+e^{-z}-1}{(1+e^{-z})(1+e^{-z})}\\
  &= \frac{1+e^{-z}}{(1+e^{-z})(1+e^{-z})}-\frac{1}{(1+e^{-z})(1+e^{-z})}\\
  &= \frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})(1+e^{-z})}\\
  &= \frac{1}{1+e^{-z}} \cdot \left(1-\frac{1}{1+e^{-z}} \right)\\
  &= g(z) \cdot(1-g(z))
  \end{aligned}
  $$
- Accumulating the gradient using $\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$
  * Here the weights corresponding to the bias unit should be removed.
  * The formula can also be rewritten as $\Delta^{(l)} = \sum\limits_{i=1}^m \left(\delta^{(l+1)}\right)^{i}\left((a^{(l)})^T\right)^{i}$
- Obtaining the gradient of the neural network:
  $$
  \begin{aligned}
  \frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} &\text{for } j=0\\
  \frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij}+\frac{\lambda}{m}\Theta^{(l)}_{ij} &\text{for } j\ge1
  \end{aligned}
  $$
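The following is a minimal Octave/MATLAB sketch of one pass of these steps, written as an explicit loop over the $m$ training examples (the vectorized version used in section 4 is equivalent). The variable names, layer sizes and the sigmoid/sigmoidGradient helpers are assumptions for illustration, not part of any particular exercise.

% assumed helpers: sigmoid(z) = 1./(1+exp(-z)), sigmoidGradient(z) = sigmoid(z).*(1-sigmoid(z))
% X is m x n (no bias column), y_label is the m x K one-hot label matrix
Delta1 = zeros(size(Theta1));                 % gradient accumulator for Theta1
Delta2 = zeros(size(Theta2));                 % gradient accumulator for Theta2
for i = 1:m
    a1 = [1; X(i,:)'];                        % input activations with bias unit
    z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)]; % hidden-layer activations
    z3 = Theta2 * a2;  a3 = sigmoid(z3);      % output activations a^(L)
    delta3 = a3 - y_label(i,:)';              % output-layer error term
    delta2 = (Theta2(:,2:end)' * delta3) .* sigmoidGradient(z2);  % bias weights removed
    Delta2 = Delta2 + delta3 * a2';           % accumulate the gradients
    Delta1 = Delta1 + delta2 * a1';
end
Theta1_grad = Delta1 / m;   % add (lambda/m)*Theta(:,2:end) to the non-bias columns
Theta2_grad = Delta2 / m;   % to obtain the regularized gradient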
Now, a simple neural network [2] is used to illustrate the backpropagation algorithm.
(1) First, we build a 3-layer neural network with two inputs, two hidden neurons and two output neurons, and initialize the weights: the bias weights into the hidden and output layers are 0.35 and 0.60, the input-to-hidden weights are $w_1,\dots,w_4$, and the hidden-to-output weights are $w_5,\dots,w_8$.
For the hidden layer,
$$z^{(2)}_1 = 0.35*a^{(1)}_0+w_1*a^{(1)}_1+w_2*a^{(1)}_2$$
so the activation of neuron $a^{(2)}_1$ is
$$a^{(2)}_1 =\frac{1}{1+e^{-z^{(2)}_1}}$$
Carrying out the same process, we get $a^{(2)}_2$.
Then, we repeat the process for the output layer neurons, using the output $a^{(2)}$ as inputs.
$$z^{(3)}_1 = 0.60*a^{(2)}_0+w_5*a^{(2)}_1+w_6*a^{(2)}_2$$
$$a^{(3)}_1 =\frac{1}{1+e^{-z^{(3)}_1}}$$
Here, we define the total error as the sum of the squared errors over the output neurons:
$$E_{total} = \sum\limits_{k=1}^K \frac{1}{2}(y_k - a^{(3)}_k)^{2}$$
For the first output neuron, the error is
$$E_{a^{(3)}_1}= \frac{1}{2}(y_1-a^{(3)}_1)^2$$
and for the second output neuron,
$$E_{a^{(3)}_2}= \frac{1}{2}(y_2-a^{(3)}_2)^2$$
(2) The backwards pass
By applying the chain rule we know that:
$$\frac{\partial E_{total}}{\partial w_{5}} = \underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1}}_{\delta^{(3)}_1} * \frac{\partial z^{(3)}_1}{\partial w_{5}}$$
Now, we need to figure out these partial derivatives.
- (1) $E_{total} =E_{a^{(3)}_1}+E_{a^{(3)}_2}=\frac{1}{2}(y_1-a^{(3)}_1)^2+\frac{1}{2}(y_2-a^{(3)}_2)^2$
  $$\frac{\partial E_{total}}{\partial a^{(3)}_1} = 2*\frac{1}{2}(y_1-a^{(3)}_1)*(-1)+0=-(y_1-a^{(3)}_1)$$
- (2) $a^{(3)}_1 =\frac{1}{1+e^{-z^{(3)}_1}}=g(z^{(3)}_1)$
  $$\frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} =g'(z^{(3)}_1) =g(z^{(3)}_1)*(1-g(z^{(3)}_1)) = a^{(3)}_1*(1-a^{(3)}_1)$$
- (3) $z^{(3)}_1 = 0.60*a^{(2)}_0+w_5*a^{(2)}_1+w_6*a^{(2)}_2$
  $$\frac{\partial z^{(3)}_1}{\partial w_{5}} = a^{(2)}_1$$

Putting these together,
$$\frac{\partial E_{total}}{\partial w_{5}}=\underbrace{-(y_1-a^{(3)}_1)*a^{(3)}_1*(1-a^{(3)}_1)}_{\delta^{(3)}_1}* a^{(2)}_1$$
The above formula can also be written as $\Delta^{(3)}_1 = \delta^{(3)}_1* a^{(2)}_1$.
So we get the gradient with respect to $w_5$; the gradients (and hence the updated weights) for $w_6$, $w_7$ and $w_8$ follow by repeating the above process.
Next, we’ll continue the backwards pass by calculating new values for $w_1$, $w_2$, $w_3$ and $w_4$.
$$
\begin{aligned}
\frac{\partial E_{total}}{\partial w_{1}} = &\underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} * \frac{\partial z^{(3)}_1}{\partial a^{(2)}_1}*\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1}*\frac{\partial z^{(2)}_1}{\partial w_{1}}}_{\Delta node1}\\
&+\underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_2} * \frac{\partial a^{(3)}_2}{\partial z^{(3)}_2} * \frac{\partial z^{(3)}_2}{\partial a^{(2)}_1}*\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1}*\frac{\partial z^{(2)}_1}{\partial w_{1}}}_{\Delta node2}
\end{aligned}
$$
From $\frac{\partial E_{total}}{\partial w_{5}}$ we already have $\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1}=\delta^{(3)}_1$.
- (1) $z^{(3)}_1 = 0.60*a^{(2)}_0+w_5*a^{(2)}_1+w_6*a^{(2)}_2$
  $$\frac{\partial z^{(3)}_1}{\partial a^{(2)}_{1}} = w_5$$
- (2) $a^{(2)}_1 =\frac{1}{1+e^{-z^{(2)}_1}}=g(z^{(2)}_1)$
  $$\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1} =g'(z^{(2)}_1) =g(z^{(2)}_1)*(1-g(z^{(2)}_1)) = a^{(2)}_1*(1-a^{(2)}_1)$$
- (3) $z^{(2)}_1 = 0.35*a^{(1)}_0+w_1*a^{(1)}_1+w_2*a^{(1)}_2$
  $$\frac{\partial z^{(2)}_1}{\partial w_{1}} = a^{(1)}_1$$

So, $\Delta node1 = \underbrace{\delta^{(3)}_1* w_5*g'(z^{(2)}_1)}_{\delta^{(2)}_{11}}*a^{(1)}_1$ and $\Delta node2 = \underbrace{\delta^{(3)}_2* w_7*g'(z^{(2)}_1)}_{\delta^{(2)}_{12}}*a^{(1)}_1$, giving
$$\frac{\partial E_{total}}{\partial w_{1}}=\Delta node1 +\Delta node2$$
We can rewrite this gradient as $\Delta^{(2)}_1=\Delta node1 +\Delta node2=\delta^{(2)}_{11}*a^{(1)}_1+\delta^{(2)}_{12}*a^{(1)}_1=\delta^{(2)}_{1}*a^{(1)}_1$, where $\delta^{(2)}_{1}=\delta^{(2)}_{11}+\delta^{(2)}_{12}$.
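As a concrete check of these formulas, the small Octave/MATLAB sketch below evaluates $\partial E_{total}/\partial w_5$ and $\partial E_{total}/\partial w_1$ for this 2-2-2 network. The input, target and non-bias weight values are made-up placeholders (the figure with the original example's values is not reproduced here), so only the structure of the computation matters.

% hypothetical values for illustration only; the 0.35 and 0.60 biases are taken from the text
x  = [0.1; 0.2];                          % inputs a^(1)_1, a^(1)_2
y  = [0; 1];                              % targets y_1, y_2
w  = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8];   % placeholder values for w_1 .. w_8
b2 = 0.35;  b3 = 0.60;                    % bias weights into layers 2 and 3
g  = @(z) 1./(1+exp(-z));                 % sigmoid
% forward pass
z2 = [b2 + w(1)*x(1) + w(2)*x(2);  b2 + w(3)*x(1) + w(4)*x(2)];
a2 = g(z2);
z3 = [b3 + w(5)*a2(1) + w(6)*a2(2);  b3 + w(7)*a2(1) + w(8)*a2(2)];
a3 = g(z3);
% backwards pass
delta3   = -(y - a3) .* a3 .* (1 - a3);                          % delta^(3)_1, delta^(3)_2
dE_dw5   = delta3(1) * a2(1);                                    % dE_total/dw_5
delta2_1 = (delta3(1)*w(5) + delta3(2)*w(7)) * a2(1)*(1-a2(1));  % delta^(2)_1
dE_dw1   = delta2_1 * x(1);                                      % dE_total/dw_1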
References
[1] Backpropagation
[2] A Step by Step Backpropagation Example (EN) - Matt Mazur
[3] A Step by Step Backpropagation Example (CN)
3. Gradient checking
Gradient checking helps confirm that backpropagation works correctly. We can approximate the derivative of our cost function with:
$$\frac{\partial}{\partial \Theta}J(\Theta) \approx\frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}$$
Before performing gradient checking, we unroll the parameters into a long vector $\theta$. Generally, we set $\epsilon = 10^{-4}$: small enough for a good approximation, yet large enough to avoid floating-point round-off problems.
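As a quick sanity check of the formula (a toy example, not from the course material): for $J(\theta)=\theta^3$ at $\theta = 1$ the true derivative is $3\theta^2 = 3$, and with $\epsilon = 10^{-4}$ the approximation gives
$$\frac{(1.0001)^3-(0.9999)^3}{2\times 10^{-4}} \approx 3.00000001,$$
which agrees with the exact value to about $10^{-8}$.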
4. Application of neural network to classification task
(1) Weight initialization
Initializing all the weights to zero does not work with neural networks, because every hidden unit would then compute the same function. Hence, we initialize each weight to a random value in $[-\epsilon_{init},\epsilon_{init}]$ with $\epsilon_{init} = 0.12$, using the following method.
W = zeros(L_out, 1 + L_in);   % weight matrix of size L_out x (1 + L_in), incl. bias column
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;   % uniform in [-epsilon_init, epsilon_init]
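If the snippet above is wrapped in a helper function, say randInitializeWeights(L_in, L_out) (the name is an assumption here, following the L_out x (1 + L_in) shape used above), the initial parameters for a one-hidden-layer network can be built and unrolled as follows:

% randomly initialize the two weight matrices and unroll them into one vector
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];   % later passed to fmincg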
(2) Feedforward the neural network
% performing the forward propagation
X = [ones(m,1),X]; % 5000*401
h_out = sigmoid(X * Theta1'); % 5000*25
h_out = [ones(m,1),h_out]; % 5000*26
hypo = sigmoid(h_out * Theta2'); % 5000*10
% generating the label matrix
y_label = zeros(m, num_labels);
for i = 1:num_labels
loc = find(y == i);
y_label(loc,i) = ones(size(loc,1),1);
end
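As an aside, when the labels y take values 1, ..., num_labels, the loop above can be replaced by a single broadcasted comparison; this relies on implicit expansion, available in Octave and in MATLAB R2016b or later:

y_label = double(y == 1:num_labels);   % m x num_labels one-hot matrix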
(3) Cost function computation
$$
\begin{aligned}
J(\Theta) = &- \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] \\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2
\end{aligned}
$$
% cost function (no regularization)
J= (y_label.*log(hypo)) + ((ones(m,num_labels)-y_label).*log(1-hypo));
J = sum(sum(J));
J = (-1/m) * J;
% regularization term of the cost function (bias columns excluded)
theta_sum = sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2));
J = J + lambda / (2 * m) * theta_sum;
(4) Backpropagation (Gradient calculation)
$$
\begin{aligned}
\delta^{(L)}_k &= a^{(L)}_k - y_k\\
\delta^{(l)} &=(\Theta^{(l)})^T\delta^{(l+1)} .*a^{(l)}.*(1-a^{(l)})
\end{aligned}
$$
delta_3 = (hypo - y_label)';                                              % 10*5000
delta_2 = Theta2(:,2:end)' * delta_3.*sigmoidGradient((X * Theta1')');    % 25*5000
$$\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$$
* Vectorizing the above equation over all $m$ examples:
$$\Delta^{(l)} = \delta^{(l+1)}a^{(l)}$$
delta_sum_1 = zeros(hidden_layer_size,input_layer_size+1); % 25*401
delta_sum_2 = zeros(num_labels,hidden_layer_size+1); % 10*26
delta_sum_1 = delta_2 * X;
delta_sum_2 = delta_3 * h_out;
$$
\begin{aligned}
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} &\text{for } j=0\\
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij}+\frac{\lambda}{m}\Theta^{(l)}_{ij} &\text{for } j\ge1
\end{aligned}
$$
Theta1_grad = (1/m) .* delta_sum_1;
Theta2_grad = (1/m) .* delta_sum_2;
regular_1 = Theta1 * (lambda/m);
regular_1(:,1) = 0;
Theta1_grad = Theta1_grad + regular_1;
regular_2 = Theta2 * (lambda/m);
regular_2(:,1) = 0;
Theta2_grad = Theta2_grad + regular_2;
grad = [Theta1_grad(:); Theta2_grad(:)];
(5) Gradient checking (optional)
$$\frac{\partial}{\partial \Theta_j}J(\Theta) \approx\frac{J(\Theta_1,\cdots,\Theta_j+\epsilon,\cdots,\Theta_n)-J(\Theta_1,\cdots,\Theta_j-\epsilon,\cdots,\Theta_n)}{2\epsilon}$$
- $\epsilon = 10^{-4}$
% here J is assumed to be a function handle to the cost function, e.g. J = @(p) nnCostFunction(p, ...)
theta = [Theta1(:); Theta2(:)];
numgrad = zeros(size(theta));
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
% Set perturbation vector
perturb(p) = e;
loss1 = J(theta - perturb);
loss2 = J(theta + perturb);
% Compute Numerical Gradient
numgrad(p) = (loss2 - loss1) / (2*e);
perturb(p) = 0;
end
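Once numgrad has been computed, it can be compared against the analytical gradient grad from backpropagation. A common way to do this is the relative difference below; if backpropagation is implemented correctly, it is typically on the order of $10^{-9}$ or smaller.

% relative difference between numerical and analytical gradients
diff = norm(numgrad - grad) / norm(numgrad + grad);
fprintf('Relative difference: %g\n', diff);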
* After confirming that the backpropagation gradient matches the numerical gradient, we should turn off gradient checking before running the learning algorithm, since the numerical computation is very slow.
(6) Minimizing the cost function J ( Θ ) J(\Theta) J(Θ)
options = optimset('MaxIter', 200);
lambda = 0.2;
% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
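After fmincg returns, the trained parameters can be reshaped back into the two weight matrices and used for prediction. The sketch below assumes the same one-hidden-layer architecture and that X here is the original m x input_layer_size matrix without the bias column; prediction simply repeats the feedforward pass and picks the class with the largest output.

% reshape the unrolled parameter vector back into the weight matrices
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));
% forward pass on the training set and report accuracy
h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[~, pred] = max(h2, [], 2);            % predicted class = index of the largest output
fprintf('Training accuracy: %f\n', mean(double(pred == y)) * 100);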