ML Notes: Week 5 - Neural Networks: Learning

1. Cost function for neural networks

$$
\begin{aligned}
J(\Theta) = &-\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ y^{(i)}_k \log\left((h_\Theta(x^{(i)}))_k\right) + (1 - y^{(i)}_k)\log\left(1 - (h_\Theta(x^{(i)}))_k\right) \right] \\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{j,i}^{(l)}\right)^2
\end{aligned}
$$

  • $m$ = number of training samples
  • $K$ = number of output units
  • $(h_\Theta (x^{(i)}))_k$ = the $k^{th}$ output of the hypothesis for the $i^{th}$ sample
  • $\lambda$ = regularization parameter
  • $L$ = total number of layers in the network
  • $s_l$ = number of units (excluding the bias unit) in layer $l$

Cost function for logistic regression:
$$
J(\theta) = -\frac{1}{m}\sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log \left(1-h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2
$$
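The neural-network cost is essentially this regularized logistic regression cost summed over all $K$ output units, with the penalty taken over every weight that does not multiply a bias unit. For comparison, here is a minimal Octave sketch of the regularized logistic regression cost; the variable names (`X` with a leading bias column, labels `y` in $\{0,1\}$, parameters `theta`) are assumptions for this sketch, not code from the course.

% Regularized logistic regression cost (sketch for comparison with J(Theta) above).
% X: m x (n+1) with a leading bias column, y: m x 1 in {0,1}, theta: (n+1) x 1.
h = 1 ./ (1 + exp(-X * theta));                      % h_theta(x) for every sample
J = (-1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) ...
    + (lambda / (2*m)) * sum(theta(2:end).^2);       % do not regularize theta_0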


2. Understanding backpropagation

In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually [1]. In other words, backpropagation helps us minimize the cost function of a neural network.

For a given training set $\lbrace (x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)}) \rbrace$, the backpropagation algorithm is implemented with the following steps:

  • Obtaining the output values (activations) of the output layer. If the network has $L$ layers, we can calculate $a^{(L)}$ by the forward propagation algorithm.

  • Computing the error term for the output layer: $\delta^{(L)}_k = a^{(L)}_k - y_k$

  • Computing the error terms for the hidden layers $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$:

$$
\begin{aligned}
\delta^{(l)} &= (\Theta^{(l)})^T\delta^{(l+1)} .* g'(z^{(l)}) \\
&= (\Theta^{(l)})^T\delta^{(l+1)} .* g(z^{(l)}) .* (1-g(z^{(l)})) \\
&= (\Theta^{(l)})^T\delta^{(l+1)} .* a^{(l)} .* (1-a^{(l)})
\end{aligned}
$$

If $g(z) = \frac{1}{1+e^{-z}}$, what is $g'(z)$?

$$
\begin{aligned}
g'(z) &= -\left(\frac{1}{1+e^{-z}}\right)^2 \cdot e^{-z} \cdot (-1) \\
&= \frac{1+e^{-z}-1}{(1+e^{-z})(1+e^{-z})} \\
&= \frac{1+e^{-z}}{(1+e^{-z})(1+e^{-z})} - \frac{1}{(1+e^{-z})(1+e^{-z})} \\
&= \frac{1}{1+e^{-z}} - \frac{1}{(1+e^{-z})(1+e^{-z})} \\
&= \frac{1}{1+e^{-z}} \cdot \left(1 - \frac{1}{1+e^{-z}}\right) \\
&= g(z) \cdot (1-g(z))
\end{aligned}
$$

  • Accumulating the gradient using $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$
    * Here the weights for the bias unit should be removed.
    * The formula can also be rewritten as $\Delta^{(l)} = \sum\limits_{i=1}^m \left(\delta^{(l+1)}\right)^{(i)}\left((a^{(l)})^T\right)^{(i)}$

  • Obtaining the gradient of the neural network:

$$
\begin{aligned}
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} &\text{for } j=0 \\
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij} &\text{for } j\ge 1
\end{aligned}
$$
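Putting these steps together, the following is a minimal Octave sketch of one full backpropagation pass for a 3-layer network, accumulating the gradient one example at a time. The variable names are assumptions for this sketch (weight matrices `Theta1` and `Theta2`, inputs `X` as an m x s1 matrix without the bias column, labels `Y` as an m x K 0/1 matrix, example count `m`, regularization strength `lambda`); the fully vectorized version used in the exercise appears in section 4.

% Per-example backpropagation for a 3-layer network (sketch).
sigmoid = @(z) 1 ./ (1 + exp(-z));
Delta1 = zeros(size(Theta1));              % s2 x (s1+1) accumulator
Delta2 = zeros(size(Theta2));              % K  x (s2+1) accumulator
for t = 1:m
    a1 = [1; X(t, :)'];                    % input activation with bias unit
    z2 = Theta1 * a1;   a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;   a3 = sigmoid(z3);  % forward propagation to the output
    delta3 = a3 - Y(t, :)';                % output-layer error term
    delta2 = (Theta2(:, 2:end)' * delta3) .* (a2(2:end) .* (1 - a2(2:end)));
    Delta2 = Delta2 + delta3 * a2';        % accumulate the gradients
    Delta1 = Delta1 + delta2 * a1';
end
D1 = Delta1 / m;  D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);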

Now, a simple neural network [2] is used to illustrate the backpropagation algorithm.


(1) First, we build a 3-layer neural network with two inputs, two hidden neurons, and two output neurons, and initialize the weights as follows.
[Figure: the example 2-2-2 network, with weights $w_1, \dots, w_8$, a bias weight of 0.35 into the hidden layer, and a bias weight of 0.60 into the output layer.]
For the hidden layer,
$z^{(2)}_1 = 0.35 * a^{(1)}_0 + w_1 * a^{(1)}_1 + w_2 * a^{(1)}_2$
so the activation of neuron $a^{(2)}_1$ is
$a^{(2)}_1 = \frac{1}{1+e^{-z^{(2)}_1}}$
Carrying out the same process, we get $a^{(2)}_2$.
Then, we repeat the process for the output-layer neurons, using the outputs $a^{(2)}$ as inputs.
$z^{(3)}_1 = 0.60 * a^{(2)}_0 + w_5 * a^{(2)}_1 + w_6 * a^{(2)}_2$
$a^{(3)}_1 = \frac{1}{1+e^{-z^{(3)}_1}}$
Here, we define the total error as the sum of the squared errors over the output neurons:
$E_{total} = \sum\limits_{k=1}^K \frac{1}{2}(y_k - a^{(3)}_k)^{2}$
For the first output neuron, the error is
$E_{a^{(3)}_1} = \frac{1}{2}(y_1-a^{(3)}_1)^2$
and the error of the second output neuron is
$E_{a^{(3)}_2} = \frac{1}{2}(y_2-a^{(3)}_2)^2$
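As a concrete check of this forward pass, here is a small Octave sketch for the 2-2-2 network. The input, target, and weight values are illustrative assumptions (the notes only fix the two bias weights, 0.35 and 0.60):

% Forward pass of the 2-2-2 example network (illustrative numbers).
sigmoid = @(z) 1 ./ (1 + exp(-z));
x = [0.05; 0.10];            % inputs a1_1, a1_2            (assumed values)
y = [0.01; 0.99];            % target outputs y1, y2        (assumed values)
W = [0.15 0.20; 0.25 0.30];  % hidden weights [w1 w2; w3 w4] (assumed values)
V = [0.40 0.45; 0.50 0.55];  % output weights [w5 w6; w7 w8] (assumed values)
z2 = 0.35 + W * x;           % z^(2), bias weight 0.35
a2 = sigmoid(z2);            % hidden activations a^(2)
z3 = 0.60 + V * a2;          % z^(3), bias weight 0.60
a3 = sigmoid(z3);            % output activations a^(3)
E_total = sum(0.5 * (y - a3).^2)   % total squared error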


(2) The backwards pass
By applying the chain rule we know that:
[Figure: the path from the total error back to the output-layer weight $w_5$.]
$$
\frac{\partial E_{total}}{\partial w_{5}} = \underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1}}_{\delta^{(3)}_1} * \frac{\partial z^{(3)}_1}{\partial w_{5}}
$$
Now, we need to figure out these partial derivatives.

  • (1) $E_{total} = E_{a^{(3)}_1} + E_{a^{(3)}_2} = \frac{1}{2}(y_1-a^{(3)}_1)^2 + \frac{1}{2}(y_2-a^{(3)}_2)^2$
    $\frac{\partial E_{total}}{\partial a^{(3)}_1} = 2 * \frac{1}{2}(y_1-a^{(3)}_1) * (-1) + 0 = -(y_1-a^{(3)}_1)$
  • (2) $a^{(3)}_1 = \frac{1}{1+e^{-z^{(3)}_1}} = g(z^{(3)}_1)$
    $\frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} = g'(z^{(3)}_1) = g(z^{(3)}_1)*(1-g(z^{(3)}_1)) = a^{(3)}_1*(1-a^{(3)}_1)$
  • (3) $z^{(3)}_1 = 0.60 * a^{(2)}_0 + w_5 * a^{(2)}_1 + w_6 * a^{(2)}_2$
    $\frac{\partial z^{(3)}_1}{\partial w_{5}} = a^{(2)}_1$

Combining the three terms:
$$
\frac{\partial E_{total}}{\partial w_{5}} = \underbrace{-(y_1-a^{(3)}_1) * a^{(3)}_1 * (1-a^{(3)}_1)}_{\delta^{(3)}_1} * a^{(2)}_1
$$

The above formula can also be written as $\Delta^{(3)}_1 = \delta^{(3)}_1 * a^{(2)}_1$.
So we have the gradient with respect to $w_5$; repeating the same process gives the gradients (and hence the new weights) for $w_6$, $w_7$, and $w_8$.
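To turn these gradients into updated weights, a gradient descent step is applied. A one-line sketch, where the learning rate `alpha` and the name `dEtotal_dw5` (the partial derivative computed above) are assumptions for illustration:

alpha = 0.5;                     % assumed learning rate, for illustration only
w5 = w5 - alpha * dEtotal_dw5;   % gradient descent update for w5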


Next, we'll continue the backwards pass by calculating new values for $w_1$, $w_2$, $w_3$, and $w_4$.
[Figure: the two paths from the total error back to the hidden-layer weight $w_1$, one through each output neuron.]
$$
\begin{aligned}
\frac{\partial E_{total}}{\partial w_{1}} = &\underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} * \frac{\partial z^{(3)}_1}{\partial a^{(2)}_1} * \frac{\partial a^{(2)}_1}{\partial z^{(2)}_1} * \frac{\partial z^{(2)}_1}{\partial w_{1}}}_{\Delta node1} \\
+&\underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_2} * \frac{\partial a^{(3)}_2}{\partial z^{(3)}_2} * \frac{\partial z^{(3)}_2}{\partial a^{(2)}_1} * \frac{\partial a^{(2)}_1}{\partial z^{(2)}_1} * \frac{\partial z^{(2)}_1}{\partial w_{1}}}_{\Delta node2}
\end{aligned}
$$

From $\frac{\partial E_{total}}{\partial w_{5}}$ we already have $\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} = \delta^{(3)}_1$.

  • (1) $z^{(3)}_1 = 0.60 * a^{(2)}_0 + w_5 * a^{(2)}_1 + w_6 * a^{(2)}_2$
    $\frac{\partial z^{(3)}_1}{\partial a^{(2)}_{1}} = w_5$
  • (2) $a^{(2)}_1 = \frac{1}{1+e^{-z^{(2)}_1}} = g(z^{(2)}_1)$
    $\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1} = g'(z^{(2)}_1) = g(z^{(2)}_1)*(1-g(z^{(2)}_1)) = a^{(2)}_1*(1-a^{(2)}_1)$
  • (3) $z^{(2)}_1 = 0.35 * a^{(1)}_0 + w_1 * a^{(1)}_1 + w_2 * a^{(1)}_2$
    $\frac{\partial z^{(2)}_1}{\partial w_{1}} = a^{(1)}_1$

So $\Delta node1 = \underbrace{\delta^{(3)}_1 * w_5 * g'(z^{(2)}_1)}_{\delta^{(2)}_{11}} * a^{(1)}_1$, and $\Delta node2 = \underbrace{\delta^{(3)}_2 * w_7 * g'(z^{(2)}_1)}_{\delta^{(2)}_{12}} * a^{(1)}_1$
$$\frac{\partial E_{total}}{\partial w_{1}} = \Delta node1 + \Delta node2$$
We can rewrite this gradient as $\Delta^{(2)}_1 = \Delta node1 + \Delta node2 = \delta^{(2)}_{11}*a^{(1)}_1 + \delta^{(2)}_{12}*a^{(1)}_1 = \delta^{(2)}_{1}*a^{(1)}_1$
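Continuing the Octave sketch of the forward pass above (same assumed variables `x`, `y`, `W`, `V`, `a2`, `a3`), the whole backward pass for this tiny network is:

% Backward pass for the 2-2-2 example (squared error, no regularization).
delta3 = -(y - a3) .* a3 .* (1 - a3);      % delta^(3) for both output neurons
dE_dV  = delta3 * a2';                     % gradients w.r.t. w5..w8 (2x2)
delta2 = (V' * delta3) .* a2 .* (1 - a2);  % delta^(2) for both hidden neurons
dE_dW  = delta2 * x';                      % gradients w.r.t. w1..w4 (2x2)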

References

[1] Backpropagation

[2] A Step by Step Backpropagation Example (EN) - Matt Mazur
[3] A Step by Step Backpropagation Example (CN)


3. Gradient checking

Gradient checking helps confirm that backpropagation works correctly. We can approximate the derivative of our cost function with:
$$
\frac{\partial}{\partial \Theta}J(\Theta) \approx \frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}
$$
Before performing gradient checking, we unroll the parameter matrices into a long vector $\theta$. Generally, we set $\epsilon = 10^{-4}$ so that the approximation is accurate without running into numerical round-off problems.
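As a tiny sanity check of the two-sided formula itself (not part of the course code), for $J(\theta)=\theta^2$ at $\theta=3$ the approximation should return roughly 6:

% Two-sided numerical derivative of J(theta) = theta^2 at theta = 3.
J = @(theta) theta.^2;
epsilon = 1e-4;
approx = (J(3 + epsilon) - J(3 - epsilon)) / (2 * epsilon)   % ~ 6.0000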


4. Application of neural networks to a classification task

(1) Weight initialization

Initializing all theta weights to zero does not work with neural networks: every hidden unit would then compute the same function and receive the same update, so the symmetry is never broken. Hence, we initialize each theta to a random value in $[-\epsilon, \epsilon]$ using the following method.

  • $\epsilon_{initial} = 0.12$
% Randomly initialize the weights of a layer with L_in incoming and L_out
% outgoing connections; the +1 accounts for the bias column.
W = zeros(L_out, 1 + L_in);
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;   % uniform in [-eps, eps]
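This snippet is called once per weight matrix. A sketch of the calling code, assuming the layer sizes used later in this section and a helper `randInitializeWeights` that wraps the lines above:

input_layer_size  = 400;   % 400 input features per example
hidden_layer_size = 25;
num_labels        = 10;

% randInitializeWeights(L_in, L_out) is assumed to wrap the snippet above.
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);

% Unroll both matrices into the single vector passed to the optimizer later.
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];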
(2) Feedforward pass of the neural network

[Figure: the network used for the classification task — 400 input units, 25 hidden units, and 10 output units.]

% performing the forward propagation
	X = [ones(m,1),X];                % 5000*401
	h_out = sigmoid(X * Theta1');     % 5000*25
	h_out = [ones(m,1),h_out];        % 5000*26
	hypo = sigmoid(h_out * Theta2');  % 5000*10

% generating the label matrix
	y_label = zeros(m, num_labels);  
	for i = 1:num_labels
	    loc = find(y == i);
	    y_label(loc,i) = ones(size(loc,1),1);
	end
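The label-matrix loop above can also be replaced by a one-shot indexing trick that builds the same m-by-num_labels 0/1 matrix (a drop-in alternative, not an addition):

% Vectorized alternative: row i of y_label is the one-hot encoding of y(i).
I = eye(num_labels);
y_label = I(y, :);          % 5000*10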
(3) Cost function computation

$$
\begin{aligned}
J(\Theta) = &-\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ y^{(i)}_k \log\left((h_\Theta(x^{(i)}))_k\right) + (1 - y^{(i)}_k)\log\left(1 - (h_\Theta(x^{(i)}))_k\right) \right] \\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{j,i}^{(l)}\right)^2
\end{aligned}
$$

% cost function (no regularization)
    J= (y_label.*log(hypo)) + ((ones(m,num_labels)-y_label).*log(1-hypo));
    J = sum(sum(J));
    J = (-1/m) * J;
    
% regularization term of the cost function
	theta_sum = 0;
	Theta = [];
	for i = 1:2
	    Theta = eval(['Theta', num2str(i)]);                   % pick Theta1, then Theta2
	    theta_sum = theta_sum + sum(sum(Theta(:,2:end).^2));   % skip the bias column
	    clear Theta
	end
J = J + lambda / (2 * m) * theta_sum;
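The eval-based loop works but is easy to misread; an equivalent, more direct way to compute the same regularization term (a drop-in replacement for the loop above, not something to run in addition to it) is:

% Equivalent regularization term, written out explicitly.
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));
J = J + reg;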
(4) Backpropagation (Gradient calculation)

$$
\begin{aligned}
\delta^{(L)}_k &= a^{(L)}_k - y_k \\
\delta^{(l)} &= (\Theta^{(l)})^T\delta^{(l+1)} .* a^{(l)} .* (1-a^{(l)})
\end{aligned}
$$

delta_3 = (hypo - y_label)';                                              % 10*5000
delta_2 = Theta2(:,2:end)' * delta_3.*sigmoidGradient((X * Theta1')');    % 25*5000

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$$
* Vectorizing the above equation over all training examples:
$$\Delta^{(l)} = \delta^{(l+1)} a^{(l)}$$

delta_sum_1 = zeros(hidden_layer_size,input_layer_size+1);  % 25*401
delta_sum_2 = zeros(num_labels,hidden_layer_size+1);        % 10*26

delta_sum_1 = delta_2 * X;       % (25*5000)*(5000*401) -> 25*401
delta_sum_2 = delta_3 * h_out;   % (10*5000)*(5000*26)  -> 10*26

$$
\begin{aligned}
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} &\text{for } j=0 \\
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij} &\text{for } j\ge 1
\end{aligned}
$$

Theta1_grad = (1/m) .* delta_sum_1;
Theta2_grad = (1/m) .* delta_sum_2;

regular_1 = Theta1 * (lambda/m);   % regularization term for Theta1
regular_1(:,1) = 0;                % do not regularize the bias column
Theta1_grad = Theta1_grad + regular_1;

regular_2 = Theta2 * (lambda/m);   % regularization term for Theta2
regular_2(:,1) = 0;                % do not regularize the bias column
Theta2_grad = Theta2_grad + regular_2;

grad = [Theta1_grad(:); Theta2_grad(:)];
(5) Gradient checking (optional)

$$
\frac{\partial}{\partial \Theta_j}J(\Theta) \approx \frac{J(\Theta_1,\cdots,\Theta_j+\epsilon,\cdots,\Theta_n) - J(\Theta_1,\cdots,\Theta_j-\epsilon,\cdots,\Theta_n)}{2\epsilon}
$$

  • $\epsilon = 10^{-4}$
% J below is a function handle mapping the unrolled parameter vector to the cost,
% e.g. J = @(t) nnCostFunction(t, input_layer_size, hidden_layer_size, num_labels, X, y, lambda);
theta = [Theta1(:); Theta2(:)];
numgrad = zeros(size(theta));
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
    % Set perturbation vector
    perturb(p) = e;
    loss1 = J(theta - perturb);
    loss2 = J(theta + perturb);
    % Compute Numerical Gradient
    numgrad(p) = (loss2 - loss1) / (2*e);
    perturb(p) = 0;
end
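Once `numgrad` has been computed, it can be compared against the analytic gradient `grad` built in step (4); a relative difference on the order of $10^{-9}$ or smaller indicates that backpropagation is implemented correctly:

% Relative difference between the numerical and analytic gradients.
diff = norm(numgrad - grad) / norm(numgrad + grad);
fprintf('Relative difference: %g\n', diff);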

* After confirming that the analytic gradients are correct, we should turn gradient checking off before training the network, because it is computationally expensive.

(6) Minimizing the cost function $J(\Theta)$
options = optimset('MaxIter', 200);
lambda = 0.2;

% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, X, y, lambda);

[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
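After training, the optimized vector `nn_params` is reshaped back into the two weight matrices and used for prediction. A sketch of that final step, assuming `X` is the original 5000 x 400 feature matrix (without the bias column) and the layer sizes given above:

% Reshape the unrolled parameters back into Theta1 (25*401) and Theta2 (10*26).
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, (hidden_layer_size + 1));

% Forward propagate and take the most probable label for each example.
h1 = 1 ./ (1 + exp(-[ones(m, 1), X] * Theta1'));
h2 = 1 ./ (1 + exp(-[ones(m, 1), h1] * Theta2'));
[~, pred] = max(h2, [], 2);
fprintf('Training set accuracy: %f\n', mean(double(pred == y)) * 100);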