1. Cost function for neural networks
$$
\begin{aligned}
J(\Theta) = &- \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] \\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2
\end{aligned}
$$
- $m$ = number of training samples
- $K$ = number of output units
- $(h_\Theta (x^{(i)}))_k$ = the $k^{th}$ output of the hypothesis for the $i^{th}$ sample
- $\lambda$ = regularization parameter
- $L$ = total number of layers in the network
- $s_l$ = number of units (excluding the bias unit) in layer $l$
Cost function for logistic regression:
$$J(\theta) = -\frac1m\sum\limits_{i=1}^m \left[y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\right]+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$$
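Note that for a single output unit ($K = 1$) and $\lambda = 0$, the neural-network cost reduces to exactly the unregularized logistic-regression cost, since the sum over $k$ has only one term:
$$J(\Theta) = -\frac1m\sum_{i=1}^m \left[y^{(i)} \log h_\Theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\Theta(x^{(i)}))\right]$$
So the first formula is simply this cross-entropy cost summed over the $K$ output units, with the regularization taken over all non-bias weights.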
2. Understanding the backpropagation
In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually [1]. In other words, backpropagation gives us an efficient way to compute the gradients needed to minimize the neural-network cost function.
For the given training set $\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace$, the backpropagation algorithm is implemented with the following steps (a minimal Octave/MATLAB sketch of one pass over the training set is given after the list):
- Obtaining the output values (activations) of the output layer. If the network has $L$ layers, we can calculate $a^{(L)}$ with the forward-propagation algorithm.
- Computing the error term for the output layer: $\delta^{(L)}_k = a^{(L)}_k - y_k$
- Computing the error terms for the hidden layers $\delta^{(L-1)}, \delta^{(L-2)}, \dots,\delta^{(2)}$:
  $$
  \begin{aligned}
  \delta^{(l)} &=(\Theta^{(l)})^T\delta^{(l+1)} .*g'(z^{(l)})\\
  &=(\Theta^{(l)})^T\delta^{(l+1)} .*g(z^{(l)}).*(1-g(z^{(l)}))\\
  &=(\Theta^{(l)})^T\delta^{(l+1)} .*a^{(l)}.*(1-a^{(l)})
  \end{aligned}
  $$
  If $g(z) = \frac{1}{1+e^{-z}}$, what is $g'(z)$?
  $$
  \begin{aligned}
  g'(z)&=-\left(\frac{1}{1+e^{-z}}\right)^2\cdot e^{-z} \cdot(-1)\\
  &=\frac{1+e^{-z}-1}{(1+e^{-z})(1+e^{-z})}\\
  &= \frac{1+e^{-z}}{(1+e^{-z})(1+e^{-z})}-\frac{1}{(1+e^{-z})(1+e^{-z})}\\
  &= \frac{1}{1+e^{-z}}-\frac{1}{(1+e^{-z})(1+e^{-z})}\\
  &= \frac{1}{1+e^{-z}} \cdot \left(1-\frac{1}{1+e^{-z}} \right)\\
  &= g(z) \cdot(1-g(z))
  \end{aligned}
  $$
- Accumulating the gradient using $\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$
  * Here the weights corresponding to the bias unit should be removed.
  * The formula can also be rewritten as $\Delta^{(l)} = \sum\limits_{i=1}^m \left(\delta^{(l+1)}\right)^{i}\left((a^{(l)})^T\right)^{i}$
- Obtaining the gradient of the neural network:
  $$
  \begin{aligned}
  \frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} &\text{for } j=0\\
  \frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij}+\frac{\lambda}{m}\Theta^{(l)}_{ij} &\text{for } j\ge1
  \end{aligned}
  $$
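The following is a minimal Octave/MATLAB sketch of one pass of these steps, written as an explicit loop over the $m$ training examples (the vectorized version used in section 4 is equivalent). The variable names, layer sizes and the sigmoid/sigmoidGradient helpers are assumptions for illustration, not part of any particular exercise.

% assumed helpers: sigmoid(z) = 1./(1+exp(-z)), sigmoidGradient(z) = sigmoid(z).*(1-sigmoid(z))
% X is m x n (no bias column), y_label is the m x K one-hot label matrix
Delta1 = zeros(size(Theta1));                 % gradient accumulator for Theta1
Delta2 = zeros(size(Theta2));                 % gradient accumulator for Theta2
for i = 1:m
    a1 = [1; X(i,:)'];                        % input activations with bias unit
    z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)]; % hidden-layer activations
    z3 = Theta2 * a2;  a3 = sigmoid(z3);      % output activations a^(L)
    delta3 = a3 - y_label(i,:)';              % output-layer error term
    delta2 = (Theta2(:,2:end)' * delta3) .* sigmoidGradient(z2);  % bias weights removed
    Delta2 = Delta2 + delta3 * a2';           % accumulate the gradients
    Delta1 = Delta1 + delta2 * a1';
end
Theta1_grad = Delta1 / m;   % add (lambda/m)*Theta(:,2:end) to the non-bias columns
Theta2_grad = Delta2 / m;   % to obtain the regularized gradient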
Now, a simple neural network [2] is used to illustrate the backpropagation algorithm.
(1) First, we build a 3-layer neural network with two inputs, two hidden neurons and two output neurons, and initialize the weights: the bias weights into the hidden and output layers are 0.35 and 0.60, the input-to-hidden weights are $w_1,\dots,w_4$, and the hidden-to-output weights are $w_5,\dots,w_8$.
For the hidden layer,
$$z^{(2)}_1 = 0.35*a^{(1)}_0+w_1*a^{(1)}_1+w_2*a^{(1)}_2$$
so the activation of neuron $a^{(2)}_1$ is
$$a^{(2)}_1 =\frac{1}{1+e^{-z^{(2)}_1}}$$
Carrying out the same process, we get $a^{(2)}_2$.
Then, we repeat the process for the output layer neurons, using the output $a^{(2)}$ as inputs.
$$z^{(3)}_1 = 0.60*a^{(2)}_0+w_5*a^{(2)}_1+w_6*a^{(2)}_2$$
$$a^{(3)}_1 =\frac{1}{1+e^{-z^{(3)}_1}}$$
Here, we define the total error as the sum of the squared errors over the output neurons:
$$E_{total} = \sum\limits_{k=1}^K \frac{1}{2}(y_k - a^{(3)}_k)^{2}$$
For the first output neuron, the error is
$$E_{a^{(3)}_1}= \frac{1}{2}(y_1-a^{(3)}_1)^2$$
and for the second output neuron,
$$E_{a^{(3)}_2}= \frac{1}{2}(y_2-a^{(3)}_2)^2$$
(2) The backwards pass
By applying the chain rule we know that:
$$\frac{\partial E_{total}}{\partial w_{5}} = \underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1}}_{\delta^{(3)}_1} * \frac{\partial z^{(3)}_1}{\partial w_{5}}$$
Now, we need to figure out these partial derivatives.
- (1) $E_{total} =E_{a^{(3)}_1}+E_{a^{(3)}_2}=\frac{1}{2}(y_1-a^{(3)}_1)^2+\frac{1}{2}(y_2-a^{(3)}_2)^2$
  $$\frac{\partial E_{total}}{\partial a^{(3)}_1} = 2*\frac{1}{2}(y_1-a^{(3)}_1)*(-1)+0=-(y_1-a^{(3)}_1)$$
- (2) $a^{(3)}_1 =\frac{1}{1+e^{-z^{(3)}_1}}=g(z^{(3)}_1)$
  $$\frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} =g'(z^{(3)}_1) =g(z^{(3)}_1)*(1-g(z^{(3)}_1)) = a^{(3)}_1*(1-a^{(3)}_1)$$
- (3) $z^{(3)}_1 = 0.60*a^{(2)}_0+w_5*a^{(2)}_1+w_6*a^{(2)}_2$
  $$\frac{\partial z^{(3)}_1}{\partial w_{5}} = a^{(2)}_1$$

Putting these together,
$$\frac{\partial E_{total}}{\partial w_{5}}=\underbrace{-(y_1-a^{(3)}_1)*a^{(3)}_1*(1-a^{(3)}_1)}_{\delta^{(3)}_1}* a^{(2)}_1$$
The above formula can also be written as $\Delta^{(3)}_1 = \delta^{(3)}_1* a^{(2)}_1$.
So we get the gradient with respect to $w_5$; the gradients (and hence the updated weights) for $w_6$, $w_7$ and $w_8$ follow by repeating the above process.
Next, we’ll continue the backwards pass by calculating new values for $w_1$, $w_2$, $w_3$ and $w_4$.
$$
\begin{aligned}
\frac{\partial E_{total}}{\partial w_{1}} = &\underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1} * \frac{\partial z^{(3)}_1}{\partial a^{(2)}_1}*\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1}*\frac{\partial z^{(2)}_1}{\partial w_{1}}}_{\Delta node1}\\
&+\underbrace{\frac{\partial E_{total}}{\partial a^{(3)}_2} * \frac{\partial a^{(3)}_2}{\partial z^{(3)}_2} * \frac{\partial z^{(3)}_2}{\partial a^{(2)}_1}*\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1}*\frac{\partial z^{(2)}_1}{\partial w_{1}}}_{\Delta node2}
\end{aligned}
$$
From $\frac{\partial E_{total}}{\partial w_{5}}$ we already have $\frac{\partial E_{total}}{\partial a^{(3)}_1} * \frac{\partial a^{(3)}_1}{\partial z^{(3)}_1}=\delta^{(3)}_1$.
- (1) $z^{(3)}_1 = 0.60*a^{(2)}_0+w_5*a^{(2)}_1+w_6*a^{(2)}_2$
  $$\frac{\partial z^{(3)}_1}{\partial a^{(2)}_{1}} = w_5$$
- (2) $a^{(2)}_1 =\frac{1}{1+e^{-z^{(2)}_1}}=g(z^{(2)}_1)$
  $$\frac{\partial a^{(2)}_1}{\partial z^{(2)}_1} =g'(z^{(2)}_1) =g(z^{(2)}_1)*(1-g(z^{(2)}_1)) = a^{(2)}_1*(1-a^{(2)}_1)$$
- (3) $z^{(2)}_1 = 0.35*a^{(1)}_0+w_1*a^{(1)}_1+w_2*a^{(1)}_2$
  $$\frac{\partial z^{(2)}_1}{\partial w_{1}} = a^{(1)}_1$$

So, $\Delta node1 = \underbrace{\delta^{(3)}_1* w_5*g'(z^{(2)}_1)}_{\delta^{(2)}_{11}}*a^{(1)}_1$ and $\Delta node2 = \underbrace{\delta^{(3)}_2* w_7*g'(z^{(2)}_1)}_{\delta^{(2)}_{12}}*a^{(1)}_1$, giving
$$\frac{\partial E_{total}}{\partial w_{1}}=\Delta node1 +\Delta node2$$
We can rewrite this gradient as $\Delta^{(2)}_1=\Delta node1 +\Delta node2=\delta^{(2)}_{11}*a^{(1)}_1+\delta^{(2)}_{12}*a^{(1)}_1=\delta^{(2)}_{1}*a^{(1)}_1$, where $\delta^{(2)}_{1}=\delta^{(2)}_{11}+\delta^{(2)}_{12}$.
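As a concrete check of these formulas, the small Octave/MATLAB sketch below evaluates $\partial E_{total}/\partial w_5$ and $\partial E_{total}/\partial w_1$ for this 2-2-2 network. The input, target and non-bias weight values are made-up placeholders (the figure with the original example's values is not reproduced here), so only the structure of the computation matters.

% hypothetical values for illustration only; the 0.35 and 0.60 biases are taken from the text
x  = [0.1; 0.2];                          % inputs a^(1)_1, a^(1)_2
y  = [0; 1];                              % targets y_1, y_2
w  = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8];   % placeholder values for w_1 .. w_8
b2 = 0.35;  b3 = 0.60;                    % bias weights into layers 2 and 3
g  = @(z) 1./(1+exp(-z));                 % sigmoid
% forward pass
z2 = [b2 + w(1)*x(1) + w(2)*x(2);  b2 + w(3)*x(1) + w(4)*x(2)];
a2 = g(z2);
z3 = [b3 + w(5)*a2(1) + w(6)*a2(2);  b3 + w(7)*a2(1) + w(8)*a2(2)];
a3 = g(z3);
% backwards pass
delta3   = -(y - a3) .* a3 .* (1 - a3);                          % delta^(3)_1, delta^(3)_2
dE_dw5   = delta3(1) * a2(1);                                    % dE_total/dw_5
delta2_1 = (delta3(1)*w(5) + delta3(2)*w(7)) * a2(1)*(1-a2(1));  % delta^(2)_1
dE_dw1   = delta2_1 * x(1);                                      % dE_total/dw_1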
References
[1] Backpropagation
[2] A Step by Step Backpropagation Example (EN) - Matt Mazur
[3] A Step by Step Backpropagation Example (CN)
3. Gradient checking
Gradient checking helps confirm that backpropagation works correctly. We can approximate the derivative of our cost function with:
$$\frac{\partial}{\partial \Theta}J(\Theta) \approx\frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}$$
Before performing gradient checking, we unroll the parameters into a long vector $\theta$. Generally, we set $\epsilon = 10^{-4}$: small enough for a good approximation, yet large enough to avoid floating-point round-off problems.
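As a quick sanity check of the formula (a toy example, not from the course material): for $J(\theta)=\theta^3$ at $\theta = 1$ the true derivative is $3\theta^2 = 3$, and with $\epsilon = 10^{-4}$ the approximation gives
$$\frac{(1.0001)^3-(0.9999)^3}{2\times 10^{-4}} \approx 3.00000001,$$
which agrees with the exact value to about $10^{-8}$.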
4. Application of neural network to classification task
(1) Weight initialization
Initializing all the weights to zero does not work with neural networks, because every hidden unit would then compute the same function. Hence, we initialize each weight to a random value in $[-\epsilon_{init},\epsilon_{init}]$ with $\epsilon_{init} = 0.12$, using the following method.
W = zeros(L_out, 1 + L_in);   % weight matrix of size L_out x (1 + L_in), incl. bias column
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;   % uniform in [-epsilon_init, epsilon_init]
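If the snippet above is wrapped in a helper function, say randInitializeWeights(L_in, L_out) (the name is an assumption here, following the L_out x (1 + L_in) shape used above), the initial parameters for a one-hidden-layer network can be built and unrolled as follows:

% randomly initialize the two weight matrices and unroll them into one vector
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];   % later passed to fmincg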
(2) Feedforward the neural network
% performing the forward propagation
X = [ones(m,1),X]; % 5000*401
h_out = sigmoid(X * Theta1'); % 5000*25
h_out = [ones(m,1),h_out]; % 5000*26
hypo = sigmoid(h_out * Theta2'); % 5000*10
% generating the label matrix
y_label = zeros(m, num_labels);
for i = 1:num_labels
loc = find(y == i);
y_label(loc,i) = ones(size(loc,1),1);
end
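As an aside, when the labels y take values 1, ..., num_labels, the loop above can be replaced by a single broadcasted comparison; this relies on implicit expansion, available in Octave and in MATLAB R2016b or later:

y_label = double(y == 1:num_labels);   % m x num_labels one-hot matrix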
(3) Cost function computation
$$
\begin{aligned}
J(\Theta) = &- \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] \\
&+ \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2
\end{aligned}
$$
% cost function (no regularization)
J= (y_label.*log(hypo)) + ((ones(m,num_labels)-y_label).*log(1-hypo));
J = sum(sum(J));
J = (-1/m) * J;
% regularization term of the cost function (bias columns excluded)
theta_sum = sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2));
J = J + lambda / (2 * m) * theta_sum;
(4) Backpropagation (Gradient calculation)
$$
\begin{aligned}
\delta^{(L)}_k &= a^{(L)}_k - y_k\\
\delta^{(l)} &=(\Theta^{(l)})^T\delta^{(l+1)} .*a^{(l)}.*(1-a^{(l)})
\end{aligned}
$$
delta_3 = (hypo - y_label)';                                              % 10*5000
delta_2 = Theta2(:,2:end)' * delta_3.*sigmoidGradient((X * Theta1')');    % 25*5000
$$\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T$$
* Vectorizing the above equation over all $m$ examples:
$$\Delta^{(l)} = \delta^{(l+1)}a^{(l)}$$
delta_sum_1 = zeros(hidden_layer_size,input_layer_size+1); % 25*401
delta_sum_2 = zeros(num_labels,hidden_layer_size+1); % 10*26
delta_sum_1 = delta_2 * X;
delta_sum_2 = delta_3 * h_out;
$$
\begin{aligned}
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} &\text{for } j=0\\
\frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) &= D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij}+\frac{\lambda}{m}\Theta^{(l)}_{ij} &\text{for } j\ge1
\end{aligned}
$$
Theta1_grad = (1/m) .* delta_sum_1;
Theta2_grad = (1/m) .* delta_sum_2;
regular_1 = Theta1 * (lambda/m);
regular_1(:,1) = 0;
Theta1_grad = Theta1_grad + regular_1;
regular_2 = Theta2 * (lambda/m);
regular_2(:,1) = 0;
Theta2_grad = Theta2_grad + regular_2;
grad = [Theta1_grad(:); Theta2_grad(:)];
(5) Gradient checking (optional)
$$\frac{\partial}{\partial \Theta_j}J(\Theta) \approx\frac{J(\Theta_1,\cdots,\Theta_j+\epsilon,\cdots,\Theta_n)-J(\Theta_1,\cdots,\Theta_j-\epsilon,\cdots,\Theta_n)}{2\epsilon}$$
- $\epsilon = 10^{-4}$
% here J is assumed to be a function handle to the cost function, e.g. J = @(p) nnCostFunction(p, ...)
theta = [Theta1(:); Theta2(:)];
numgrad = zeros(size(theta));
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
% Set perturbation vector
perturb(p) = e;
loss1 = J(theta - perturb);
loss2 = J(theta + perturb);
% Compute Numerical Gradient
numgrad(p) = (loss2 - loss1) / (2*e);
perturb(p) = 0;
end
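Once numgrad has been computed, it can be compared against the analytical gradient grad from backpropagation. A common way to do this is the relative difference below; if backpropagation is implemented correctly, it is typically on the order of $10^{-9}$ or smaller.

% relative difference between numerical and analytical gradients
diff = norm(numgrad - grad) / norm(numgrad + grad);
fprintf('Relative difference: %g\n', diff);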
* After confirming that the backpropagation gradient matches the numerical gradient, we should turn off gradient checking before running the learning algorithm, since the numerical computation is very slow.
(6) Minimizing the cost function J ( Θ ) J(\Theta) J(Θ)
options = optimset('MaxIter', 200);
lambda = 0.2;
% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, X, y, lambda);
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
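After fmincg returns, the trained parameters can be reshaped back into the two weight matrices and used for prediction. The sketch below assumes the same one-hidden-layer architecture and that X here is the original m x input_layer_size matrix without the bias column; prediction simply repeats the feedforward pass and picks the class with the largest output.

% reshape the unrolled parameter vector back into the weight matrices
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));
% forward pass on the training set and report accuracy
h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[~, pred] = max(h2, [], 2);            % predicted class = index of the largest output
fprintf('Training accuracy: %f\n', mean(double(pred == y)) * 100);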