Machine Learning Study Notes (7): Training a Neural Network with the Backpropagation Algorithm

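Cost function definition

Notation:

L: the total number of layers in the network.

s_{l}: the number of units in layer l (not counting the bias unit).

K: the number of output units, i.e. the number of classes.

The network has K outputs. As in the earlier analysis of multiclass logistic regression, the K outputs correspond to K hypothesis functions, so we write h_{\Theta}(x)_{k} for the hypothesis of the k^{th} output.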

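The regularized logistic regression cost function is:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}

Applied to a neural network, the cost function sums over all K outputs and regularizes every non-bias weight:

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y^{(i)}_{k}\log(h_{\Theta}(x^{(i)})_{k}) + (1-y^{(i)}_{k})\log(1-h_{\Theta}(x^{(i)})_{k}) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l}}\sum_{j=1}^{s_{l+1}}(\Theta^{(l)}_{j,i})^{2}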
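As a small worked example of the neural-network cost, the sketch below evaluates the double sum and the regularization term on made-up numbers. The hypothesis outputs H are simply assumed here rather than produced by forward propagation, and every value is hypothetical.

```octave
% Tiny worked example of the neural-network cost.
% All numbers are made up for illustration; H is assumed, not computed.
m = 2;                          % number of training examples
num_labels = 2;                 % K, the number of output units
H = [0.5 0.5;                   % h_Theta(x^(1)), one column per class
     0.9 0.1];                  % h_Theta(x^(2))
y = [1; 1];                     % class labels in 1..K
Y = zeros(m, num_labels);       % one-hot encoding of y
for i = 1:m
    Y(i, y(i)) = 1;
end

% Double sum over examples i and outputs k, divided by m.
J_unreg = sum(sum(-Y .* log(H) - (1 - Y) .* log(1 - H))) / m;

% Regularization skips the bias column (first column) of each Theta.
lambda = 1;
Theta1 = [0.5 1 -1; 0.5 2 0];   % 2 x 3, first column = bias weights
Theta2 = [0.1 1 1; 0.2 -1 0];   % 2 x 3
reg = lambda / (2 * m) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                          sum(sum(Theta2(:, 2:end) .^ 2)));
J = J_unreg + reg;
fprintf('J_unreg = %.4f, reg = %.4f, J = %.4f\n', J_unreg, reg, J);
```

The non-bias weights have squared sum 6 + 3 = 9, so the regularization term is 9 * lambda / (2m) = 2.25 regardless of the bias columns.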

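Backpropagation algorithm: the derivative part

"Backpropagation" is neural-network terminology for the procedure used to minimize the cost function. Our goal is to compute:

\min_{\Theta} J(\Theta)

We start with the derivative part, i.e. computing:

\frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta)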

 

Given the training set: \left \{ (x^{(1)}, y^{(1)})...(x^{(m)}, y^{(m)}) \right \}

Set \Delta ^{(l)}_{i,j}:=0 for all (l,i,j), (hence you end up having a matrix full of zeros)

For training example t =1 to m:

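1. Set a^{(1)}:=x^{(t)}

2. Perform forward propagation to compute a^{(l)} for l=2,3,…,L

3. Using y^{(t)}, compute \delta^{(L)}=a^{(L)}-y^{(t)}

Note: \delta holds the "error values" of the units in a layer.

4. Compute \delta ^{(L-1)},\delta^{(L-2)},...,\delta^{(2)} using the equation:

\delta^{(l)} = ((\Theta^{(l)})^{T}\delta^{(l+1)}) .* g'(z^{(l)})

Note: \delta^{(1)} does not need to be computed, because layer 1 is the input layer and has no error term; z^{(1)} is therefore not needed either.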

The delta values of layer l are calculated by multiplying the transpose of the theta matrix of layer l with the delta values of the next layer. We then element-wise multiply the result with g', or g-prime, which is the derivative of the activation function g evaluated at the input values given by z^{(l)}.

The g-prime derivative terms can also be written out as:

g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})
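The identity for the sigmoid derivative can be checked numerically. The sketch below assumes the usual sigmoid definition for g. The notes call `sigmoid` and `sigmoidGradient` later without showing their bodies, so the two anonymous functions here are assumed definitions, not the author's code.

```octave
% Assumed helper definitions (their bodies are not shown in these notes).
sigmoid = @(z) 1 ./ (1 + exp(-z));
sigmoidGradient = @(z) sigmoid(z) .* (1 - sigmoid(z));

z = [-2; 0; 1.5];                       % arbitrary test inputs
a = sigmoid(z);
analytic = a .* (1 - a);                % g'(z) = a .* (1 - a)

% Two-sided finite difference as an independent estimate of g'(z).
h = 1e-5;
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h);

fprintf('max |analytic - numeric| = %g\n', max(abs(analytic - numeric)));
fprintf('sigmoidGradient(0) = %g\n', sigmoidGradient(0));
```

The slope is largest at z = 0, where g'(0) = 0.5 * 0.5 = 0.25.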

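If we consider simple non-multiclass classification (K = 1) and disregard regularization, the cost for training example t is computed with:

cost(t) = -y^{(t)}\log(h_{\Theta}(x^{(t)})) - (1-y^{(t)})\log(1-h_{\Theta}(x^{(t)}))

Intuitively, \delta^{(l)}_{j} is the "error" for a^{(l)}_{j} (unit j in layer l). More formally, the delta values are the derivative of the cost function with respect to z:

\delta^{(l)}_{j} = \frac{\partial}{\partial z^{(l)}_{j}} cost(t)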

Derivation of the recurrence for \delta (bias terms omitted):

\delta^{(l-1)}\equiv \frac{\partial J}{\partial z^{(l-1)}} =\left ( \frac{\partial z^{(l)}}{\partial z^{(l-1)}} \right )^{T}\frac{\partial J}{\partial z^{(l)}} =\left ( \frac{\partial z^{(l)}}{\partial z^{(l-1)}} \right )^{T}\delta^{(l)}

Since z^{(l)}=\Theta^{(l-1)}g(z^{(l-1)}), we have \frac{\partial z^{(l)}}{\partial z^{(l-1)}}=\Theta^{(l-1)}\,\mathrm{diag}(g'(z^{(l-1)})), and therefore

\delta^{(l-1)}=((\Theta^{(l-1)})^{T}\delta^{(l)}) .* g'(z^{(l-1)})

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. 

 

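5. Hence we update our new \Delta matrix:

\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a^{(l)}_{j}\delta^{(l+1)}_{i}

or with vectorization:

\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^{T}

Note: each \Delta^{(l)} corresponds to one \Theta^{(l)}, and each is accumulated separately.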

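Thus we get:

D^{(l)}_{i,j} := \frac{1}{m}\left ( \Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j} \right ), if j \neq 0

D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}, if j = 0

The capital-delta matrix \Delta is used as an "accumulator" to add up our values as we go along, and D is the resulting partial derivative: \frac{\partial }{\partial \Theta ^{(l)}_{ij}}J(\Theta) = D^{(l)}_{ij}.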
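A minimal numeric sketch of turning the accumulator into the gradient follows. The values of `Delta`, `Theta`, `m` and `lambda` are hypothetical; the only point is that the bias column (j = 0 in the math, column 1 in Octave) is left unregularized.

```octave
% Hypothetical values chosen so the arithmetic is easy to follow.
m = 2;                  % number of training examples
lambda = 1;             % regularization strength
Delta = [2 4; 6 8];     % accumulator summed over the m examples
Theta = [1 2; 3 4];     % first column holds the bias weights (j = 0)

% Average the accumulated values, then regularize every column
% except the bias column.
D = Delta / m;
D(:, 2:end) = D(:, 2:end) + (lambda / m) * Theta(:, 2:end);
disp(D);
```

Here the first column of D is just Delta / m, while the second column gains (lambda / m) * Theta, giving D = [1 3; 3 6].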

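Gradient checking:

Gradient checking lets us confirm that our backpropagation implementation behaves as intended. We approximate each partial derivative with a two-sided difference:

\frac{\partial }{\partial \Theta_{j}}J(\Theta) \approx \frac{J(\Theta_{1},...,\Theta_{j}+\epsilon,...,\Theta_{n}) - J(\Theta_{1},...,\Theta_{j}-\epsilon,...,\Theta_{n})}{2\epsilon}

A small value for ϵ (epsilon) such as ϵ=10^{-4} keeps the approximation accurate. If ϵ is made much smaller, floating-point round-off can corrupt the result.

Hence, we are only adding or subtracting epsilon to the single parameter \Theta _{j}. In Octave we can do it as follows: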

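epsilon = 1e-4;
n = numel(theta);                 % number of parameters in the unrolled vector
gradApprox = zeros(size(theta));
for i = 1:n,
  % Perturb only the i-th parameter, once in each direction.
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  % The two-sided difference approximates the i-th partial derivative.
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;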

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.

Once you have verified that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.
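To make the comparison concrete, here is a self-contained sketch that gradient-checks backpropagation on a tiny 2-2-1 network without regularization. The data, the fixed weights, and the names `hyp` and `relDiff` are assumptions made for this illustration; `deltaVector` and `gradApprox` follow the naming used above.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data and fixed weights, so the run is reproducible.
X = [0 0; 0 1; 1 0; 1 1];
y = [0; 1; 1; 0];
m = size(X, 1);
Theta1 = [0.1 -0.2 0.3; -0.4 0.5 0.6];   % 2 hidden units x (2 inputs + bias)
Theta2 = [0.7 -0.8 0.9];                 % 1 output x (2 hidden + bias)
theta = [Theta1(:); Theta2(:)];          % unrolled parameter vector

% Cost as a function of the unrolled vector p (no regularization).
hyp = @(p) sigmoid([ones(m, 1), sigmoid([ones(m, 1), X] * ...
           reshape(p(1:6), 2, 3)')] * reshape(p(7:9), 1, 3)');
J = @(p) mean(-y .* log(hyp(p)) - (1 - y) .* log(1 - hyp(p)));

% Backpropagation, vectorized over all m examples.
A1 = [ones(m, 1), X];
A2 = [ones(m, 1), sigmoid(A1 * Theta1')];
A3 = sigmoid(A2 * Theta2');
d3 = A3 - y;                                              % m x 1
d2 = (d3 * Theta2(:, 2:end)) .* A2(:, 2:end) .* (1 - A2(:, 2:end));
Theta1_grad = d2' * A1 / m;
Theta2_grad = d3' * A2 / m;
deltaVector = [Theta1_grad(:); Theta2_grad(:)];

% Numerical gradient by two-sided differences.
epsilon = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);
for i = 1:n
    thetaPlus = theta;   thetaPlus(i) = thetaPlus(i) + epsilon;
    thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - epsilon;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

relDiff = norm(gradApprox - deltaVector) / norm(gradApprox + deltaVector);
fprintf('relative difference: %g\n', relDiff);
```

A very small relative difference suggests the two gradients agree. A large one usually points to a bug in the backpropagation code rather than in the numerical estimate.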

 

Random initialization:

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our Θ matrices using the following method:

\Theta^{(l)} = 2\epsilon \cdot \mathrm{rand}(s_{l+1}, s_{l}+1) - \epsilon

Hence, we initialize each \Theta ^{(l)}_{ij} to a random value in [−ϵ,ϵ]. Using the above formula guarantees that we get the desired bound. The same procedure applies to all the Θ's. Below is some working code you could use to experiment.

Suppose Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:

INIT_EPSILON = 0.12;  % example value only (an assumption); pick your own
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

rand(x,y) is just a function in Octave that returns an x-by-y matrix of random real numbers between 0 and 1.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
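The symmetry problem can be seen on a tiny 2-2-1 network. The sketch below runs a few gradient-descent steps from all-zero weights and from random weights, then measures how far apart the two hidden units' weight rows are. The data, the step size, the value of `INIT_EPSILON`, and the helper name `randInit` are assumptions for this illustration.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];          % hypothetical toy inputs
y = [0; 0; 0; 1];                  % logical AND of the two inputs
m = size(X, 1);
A1 = [ones(m, 1), X];

INIT_EPSILON = 0.12;               % example value (assumption)
randInit = @(rows, cols) rand(rows, cols) * (2 * INIT_EPSILON) - INIT_EPSILON;

rowGap = zeros(1, 2);              % gap between the two hidden units' weights
for trial = 1:2
    if trial == 1
        Theta1 = zeros(2, 3);      Theta2 = zeros(1, 3);      % zero init
    else
        Theta1 = randInit(2, 3);   Theta2 = randInit(1, 3);   % random init
    end
    for iter = 1:50
        % Forward propagation.
        A2 = [ones(m, 1), sigmoid(A1 * Theta1')];
        A3 = sigmoid(A2 * Theta2');
        % Backpropagation (no regularization).
        d3 = A3 - y;
        d2 = (d3 * Theta2(:, 2:end)) .* A2(:, 2:end) .* (1 - A2(:, 2:end));
        % Gradient-descent step with learning rate 1.
        Theta1 = Theta1 - (d2' * A1) / m;
        Theta2 = Theta2 - (d3' * A2) / m;
    end
    rowGap(trial) = max(abs(Theta1(1, :) - Theta1(2, :)));
end
fprintf('row gap with zero init:   %g\n', rowGap(1));
fprintf('row gap with random init: %g\n', rowGap(2));
```

With zero initialization both hidden units receive identical updates at every step, so they stay copies of each other; random initialization starts them apart so they can learn different features.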

 

How to choose the epsilon for random initialization?

There are many acceptable methods. Another is discussed in this link:

http://stats.stackexchange.com/questions/47590/what-are-good-initial-weights-in-a-neural-network

The goal is to initialize the Theta values so they are in the range where the sigmoid() function gives an active response, once the weights are applied and the summations occur via (X * Theta). sigmoid() has a pretty useful slope between about -3 and +3, so that's where you want the initial weighted sums to end up. Outside of that range, the slope of sigmoid() is very flat, and learning will occur very slowly.
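To see how quickly the slope flattens, the short sketch below evaluates the sigmoid derivative at a few inputs. The sample points are arbitrary, and the sigmoid definition is the usual one, assumed here.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = [0 3 6 10];                            % arbitrary sample points
slope = sigmoid(z) .* (1 - sigmoid(z));    % g'(z) = g(z) .* (1 - g(z))

% fprintf cycles through the matrix column by column: z(1), slope(1), ...
fprintf('z = %4.1f   slope = %.5f\n', [z; slope]);
```

The slope is 0.25 at z = 0, falls below 0.05 by z = 3, and is close to zero beyond that, which is why large initial weighted sums slow learning down.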

 

Summary of the steps for running backpropagation

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

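In other words, start from a reasonable network model: the number of layers and the number of units in each hidden layer. As a rule, apart from the input and output layers, all hidden layers are given the same number of units, which keeps the computation simple. (How to choose these values is discussed later.)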

  • Number of input units = dimension of features x^{(i)}
  • Number of output units = number of classes
  • Number of hidden units per layer = usually the more the better (must balance with the cost of computation, which increases with more hidden units).
  • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network:

  1. Randomly initialize the weights \Theta.
  2. Implement forward propagation to get h_{\Theta}(x^{(i)}) for any x^{(i)}.
  3. Implement the cost function.
  4. Implement backpropagation to compute the partial derivatives \frac{\partial }{\partial \Theta ^{(l)}_{ij}}J(\Theta).
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i), y(i))
   % (get activations a(l) and delta terms d(l) for l = 2,...,L)
end;

Ideally, you want h_{\Theta}(x^{(i)})\approx y^{(i)}. This will minimize our cost function. However, keep in mind that J(Θ) is not convex and thus we can end up in a local minimum instead.
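The training steps can be sketched end to end on a tiny 2-2-1 network, using plain batch gradient descent with the per-example loop described above. The data, the fixed starting weights (standing in for random initialization so the run is reproducible), the learning rate `alpha`, and the iteration count are assumptions for this illustration. Regularization and gradient checking are left out to keep it short.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];                 % hypothetical toy inputs
y = [0; 0; 0; 1];                         % logical AND of the two inputs
m = size(X, 1);

% Small fixed weights used in place of a random draw (assumption).
Theta1 = [0.10 -0.05 0.08; -0.07 0.11 -0.02];
Theta2 = [0.03 -0.09 0.06];

alpha = 1.0;                              % learning rate (assumption)
numIters = 500;
Jhist = zeros(numIters, 1);               % cost before each update

for iter = 1:numIters
    Delta1 = zeros(size(Theta1));
    Delta2 = zeros(size(Theta2));
    J = 0;
    for i = 1:m
        % Forward propagation for example i.
        a1 = [1; X(i, :)'];
        z2 = Theta1 * a1;
        a2 = [1; sigmoid(z2)];
        a3 = sigmoid(Theta2 * a2);
        J = J - y(i) * log(a3) - (1 - y(i)) * log(1 - a3);
        % Backpropagation for example i.
        d3 = a3 - y(i);
        d2 = (Theta2(:, 2:end)' * d3) .* a2(2:end) .* (1 - a2(2:end));
        Delta1 = Delta1 + d2 * a1';
        Delta2 = Delta2 + d3 * a2';
    end
    Jhist(iter) = J / m;
    % Gradient-descent update using the averaged gradients.
    Theta1 = Theta1 - alpha * Delta1 / m;
    Theta2 = Theta2 - alpha * Delta2 / m;
end
fprintf('cost at first iteration: %.4f\n', Jhist(1));
fprintf('cost at last iteration:  %.4f\n', Jhist(end));
```

The recorded cost should fall as training proceeds. Because J(Θ) is not convex, a different starting point may settle in a different minimum.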

 

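Code example (computing the cost function and gradient of a neural network with only three layers):

% Part 1: forward propagation and the unregularized cost.
% The commented-out block is an earlier version that loops over examples.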

% X = [ones(m, 1) X];
% A2 = sigmoid(X * Theta1');
% A2 = [ones(size(A2, 1), 1) A2];
% H = sigmoid(A2 * Theta2');
% 
% for i = 1 : m,
%     h = H(i, :);
%     r = zeros(num_labels, 1);
%     r(y(i)) = 1;
%     J += -log(h) * r - log(1 - h) * (1 - r);
% end;
% 
% J /= m;

X = [ones(m, 1) X];
A2 = sigmoid(X * Theta1');
A2 = [ones(size(A2, 1), 1) A2];
H = sigmoid(A2 * Theta2');

Y = zeros(m, num_labels);
for i = 1 : m,
    Y(i, y(i)) = 1;
end;
J = sum(sum(-Y .* log(H) - (1 - Y) .* log(1 - H))) / m;

% Part 2: add the regularization term (bias columns excluded).

theta1 = Theta1(:, 2:size(Theta1, 2))(:);
theta2 = Theta2(:, 2:size(Theta2, 2))(:);
J += lambda / 2 / m * (sum(theta1 .^ 2) + sum(theta2 .^ 2));

% Part 3: backpropagation, accumulating Delta1 and Delta2 over all examples.
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1 : m,
    a1 = X(t, :)';
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];
    z3 = Theta2 * a2;
    a3 = sigmoid(z3);
    delta3 = a3 - Y(t, :)';
    delta2 = Theta2(:, 2:size(Theta2, 2))' * delta3 .* sigmoidGradient(z2);
    Delta1 = Delta1 + delta2 * a1';
    Delta2 = Delta2 + delta3 * a2';
end;

% Theta1_grad = Delta1 ./ m;
% Theta2_grad = Delta2 ./ m;

% Regularized Theta1_grad and Theta2_grad (bias column left unregularized)
Theta1_grad = Delta1 ./ m;
Theta2_grad = Delta2 ./ m;
Theta1_grad += lambda / m .* ...
    [zeros(size(Theta1, 1), 1) Theta1(:, 2:size(Theta1, 2))];
Theta2_grad += lambda / m .* ...
    [zeros(size(Theta2, 1), 1) Theta2(:, 2:size(Theta2, 2))];
