[Coursera Machine Learning] Personal Notes 4

STARLITSKY23822

Last modified 2022-06-23 16:00:44

Tags: machine learning

First published 2022-05-03 16:55:54

Article link: https://blog.csdn.net/STARLITSKY23822/article/details/124555505

Contents

Preface

I. Progress

II. Main Content

1. Cost Function for Neural Network

2. Backpropagation Algorithm

3. Gradient Checking

4. Homework

Summary

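Preface

nothing

"Learning without thinking is labor lost" (学而不思则罔): I damn well finally understand what that saying means.

I. Progress

Week 5 (64%)

II. Main Content

1. Cost Function for Neural Network

The cost function we were given before looks like this: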
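For reference, this is presumably the regularized logistic regression cost from the earlier weeks. A LaTeX sketch, assuming the usual course notation (m examples, n features, hypothesis h_θ):

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)})
          + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big]
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}
```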

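The new problem is that we now consider multiple outputs, so y is no longer a single number but a vector. For example, [1,0,0,0] means the first class, [0,0,1,0] means the third class, etc.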
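A minimal sketch of that label-to-vector mapping, written in plain Python rather than the Octave used in the homework. The helper name `one_hot` is my own, and labels are assumed to be 1-based as in the course data:

```python
def one_hot(label, num_labels):
    """Return a 0/1 list with a single 1 at position label-1 (labels are 1-based)."""
    return [1 if k == label - 1 else 0 for k in range(num_labels)]

print(one_hot(1, 4))  # first class
print(one_hot(3, 4))  # third class
```

The homework code further down does the same thing for all examples at once with `eye(num_labels)(y,:)`.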

So, first we have the definition:

(Such an important formula, and because it is so long it can only be shown this small...)
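Written out in LaTeX, the neural network cost function from the course is, as far as I can reconstruct it, the following. The notation is assumed from the lectures: K output units, L layers, s_l units in layer l (not counting the bias unit).

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}
  \Big[\, y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
        + (1-y_k^{(i)}) \log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big) \Big]
  + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}
    \big(\Theta_{j,i}^{(l)}\big)^{2}
```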

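The first part of this formula is the basic cost and the second part is the regularization term, so I will explain them separately.

For the first part:

1. First, to understand the expression inside the square brackets you only need to look at one half, because the two summands have almost the same form; they are just the two branches of a piecewise function (one active when y = 1, the other when y = 0);

2. The summation takes the i-th example, sums over all k, and finally adds up over all i. So while discussing k we can ignore i for the moment. For one particular i, the sum over k takes the following form:

First, y is the vector of true labels and hθ(x) is the vector of predictions; they have the same dimension, so multiplying them component by component and summing over k works like a dot product of two vectors.

So the result is just a single number.

So every example i collapses to a single number; adding up these numbers over all m examples and taking the average gives the first part;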
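The double sum can be sketched in a few lines of plain Python. This is only an illustration under my own naming (`data_cost`, `Y`, `H`), with `Y` holding the one-hot label vectors and `H` the network outputs for each example:

```python
import math

def data_cost(Y, H):
    """Unregularized cost; Y and H are lists of per-example vectors of equal shape."""
    m = len(Y)
    total = 0.0
    for y_vec, h_vec in zip(Y, H):          # outer sum over examples i
        for y_k, h_k in zip(y_vec, h_vec):  # inner sum over output units k
            total += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    return -total / m

# Two examples, two classes, and an undecided network that outputs 0.5 everywhere
Y = [[1, 0], [0, 1]]
H = [[0.5, 0.5], [0.5, 0.5]]
print(data_cost(Y, H))
```

For this toy input every bracket contributes log(0.5), so the cost works out to 2·ln 2.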

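3. The second part:

Again working from the inside out: j indexes the units of the next layer, so it runs up to s(l+1); i indexes the units of the previous layer, so it runs up to s(l). The sums at one particular l therefore add up the squares of all (non-bias) elements of the θ matrix at that position. Adding the outermost sum means adding up the squared elements of all L-1 θ matrices. This matches our earlier thinking about regularization: we do not know which parameter is the culprit behind overfitting, so we can only regularize all of them. As for the λ factor in front, it is just a coefficient :(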
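A small Python sketch of that regularization term. The function name and the nested-list layout are my own assumptions: each Θ is a list of rows, and the first column of each row multiplies the bias unit, so it is skipped:

```python
def regularization_term(thetas, lam, m):
    """(lambda / 2m) * sum of squared weights over all Theta matrices, bias column excluded."""
    total = 0.0
    for theta in thetas:          # sum over the L-1 weight matrices
        for row in theta:         # sum over units j of the next layer
            total += sum(w * w for w in row[1:])  # sum over i >= 1, skipping the bias weight
    return lam / (2 * m) * total

theta1 = [[10.0, 1.0, 2.0]]   # 1 x 3: bias weight 10.0 is not regularized
theta2 = [[5.0, 3.0]]         # 1 x 2: bias weight 5.0 is not regularized
print(regularization_term([theta1, theta2], lam=2, m=1))
```

This corresponds to the `Theta1(:,2:size(Theta1,2)).^2` and `Theta2(:,2:size(Theta2,2)).^2` sums in the homework code below.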

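2. Backpropagation Algorithm

Let us get down to fundamentals: the point of defining a cost function is to operate on it so that, afterwards, the cost goes down. Here, BP is the method used to compute the partial derivatives:

We take the partial derivative with respect to every θ, but as usual we do not have to write out the derivation ourselves; Andrew has already done it:

The analysis is as follows:

1. First, we need to know that the purpose of BP is to help compute the partial derivatives. So, with θ already given, we introduce a variable Δ and set all L-1 Δ matrices to 0. Note that Δ and θ are not the same thing!;

2. The operation is carried out for each of the m examples in turn, so Δ is 0 only at the very beginning;

3. Going from left to right as we did before (forward propagation), fill in the values of all the hidden layers, including the predictions of the final output layer;

4. Next a new letter appears: δ. First compute the last layer, δ(L) = a(L) - y(i); in other words, this layer's δ is the most direct error vector. Andrew defines δ as: "error" of node j in layer l. It is worth noting that Andrew also makes clear that δ is actually the partial derivative of the cost function with respect to z: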
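A LaTeX sketch of that statement, assuming a single output unit and λ = 0 as in the lecture's simplified setting. I have written cost(i) with the leading minus sign so that it agrees with δ(L) = a(L) - y(i) above:

```latex
\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\,\operatorname{cost}(i),
\qquad
\operatorname{cost}(i) = -\Big[\, y^{(i)} \log h_\Theta(x^{(i)})
      + (1-y^{(i)}) \log\big(1-h_\Theta(x^{(i)})\big) \Big]

% Check at the output layer, with a = g(z) and g'(z) = a(1-a):
\frac{\partial\,\operatorname{cost}(i)}{\partial z^{(L)}}
  = \Big[-\frac{y}{a} + \frac{1-y}{1-a}\Big]\, a(1-a)
  = -y(1-a) + (1-y)\,a
  = a - y
```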

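5. Then compute all the earlier deltas in turn, stopping at layer 2 (the input layer has no δ). The process is as follows:

There is a recursion here: the last layer's δ comes first, and only then the δ of the layer before it. Taking the example network in the figure (5 units in each of layers 2 and 3, 4 output units), δ(4) is obviously a 4×1 vector. Next consider δ(3). Earlier we said θ is a j×i matrix, where j is the number of units in the next layer and i is the number of units in the previous layer + 1; here things seem slightly different. θ should still be a j×i matrix, except that i does not need to include the previous layer's bias unit. In that case δ(3) is (4×5)ᵀ × (4×1), which gives 5×1. Set the .* aside for now; it does not change the dimension of δ, so nothing goes wrong. Then δ(2) is (5×5)ᵀ × (5×1), which is again 5×1, exactly matching "error" of node j in layer l.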
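To make the dimensions concrete, here is a hedged Python sketch of one backward step, δ(l) = (Θ(l))ᵀ δ(l+1) .* g'(z(l)), with the bias column of Θ dropped as described above. The function name and the list layout are my own assumptions, and g'(z) is written as a .* (1 - a):

```python
def hidden_delta(theta_next, delta_next, a_hidden):
    """One backward step.

    theta_next: s_next x (s_hidden + 1) weights, bias column first
    delta_next: list of s_next errors from the layer after
    a_hidden:   list of s_hidden activations of this layer, without the bias unit
    """
    delta = []
    for i in range(len(a_hidden)):
        # column i+1 of theta_next (column 0 is the bias weight) dotted with delta_next
        back = sum(theta_next[j][i + 1] * delta_next[j] for j in range(len(delta_next)))
        # element-wise product with g'(z) = a * (1 - a)
        delta.append(back * a_hidden[i] * (1 - a_hidden[i]))
    return delta

theta3 = [[1.0] * 6 for _ in range(4)]   # 4 output units, 5 hidden units + bias
delta4 = [1.0, 1.0, 1.0, 1.0]            # 4 x 1
a3 = [0.5] * 5                           # 5 x 1
delta3 = hidden_delta(theta3, delta4, a3)
print(len(delta3), delta3)
```

The result has one entry per hidden unit, which is the 5×1 shape worked out in the text.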

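As for g'(z): the original function is g(z) = 1/(1+e^(-z)); taking its derivative (which I cannot do myself) eventually gives this form:

I do not know how it is derived, but in the end it is a vector of the same dimension as the one before it, namely the number of units in that layer, so the .* is valid.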
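The derivation turns out to be short. A LaTeX sketch using only the chain rule, where a = g(z) is the activation:

```latex
g(z) = \big(1+e^{-z}\big)^{-1}

g'(z) = \frac{e^{-z}}{\big(1+e^{-z}\big)^{2}}
      = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}
      = g(z)\,\big(1-g(z)\big),
\qquad\text{since } 1-g(z) = \frac{e^{-z}}{1+e^{-z}}
```

So in vector form g'(z(l)) = a(l) .* (1 - a(l)), which is what `sigmoidGradient` in the homework computes.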

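6. Next consider that lovely Δ. First, the meaning of Δ: it is what we need when taking the partial derivative with respect to each θ, and later the triangle turns into D. Compared with the way the slides write it, I prefer the vectorized form from the reading material:

Here every element of layer l's Δ has the corresponding element of that trailing product added to it. Also note that δ and a are not from the same layer (δ comes from layer l+1, a from layer l). I noticed this detail while doing the homework and found it a little odd.

After working back to the front in this way, we have all the Δ values from the first BP pass.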
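The vectorized update Δ(l) := Δ(l) + δ(l+1) (a(l))ᵀ is just an outer product added into an accumulator. A Python sketch, with hypothetical names, assuming a(l) already has its bias unit prepended as in the homework code:

```python
def accumulate(Delta, delta_next, a_with_bias):
    """Add the outer product delta_next * a_with_bias^T into Delta, in place."""
    for j, d in enumerate(delta_next):        # rows: units of layer l+1
        for i, a in enumerate(a_with_bias):   # columns: units of layer l, bias first
            Delta[j][i] += d * a
    return Delta

Delta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # starts at zero, same shape as Theta
accumulate(Delta, [1.0, 2.0], [1.0, 0.5, 0.25])   # contribution of example 1
accumulate(Delta, [1.0, 2.0], [1.0, 0.5, 0.25])   # contribution of example 2
print(Delta)
```

Each training example adds its own outer product, which is why Δ is zero only before the first example.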

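7. After everything has been worked through in this way for all m examples, we get the final Δ values, something like a set of matrices laid out side by side. At this point we introduce D:

Again there are two cases. For the bias column (j = 0) of each Δ matrix, D is simply Δ divided by m; for all other columns (j ≠ 0), we also add a regularization term (λ/m)·θ. The resulting D is a three-dimensional thing (indexed by l, i and j), and it represents: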
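In LaTeX, and assuming the λ/m scaling that the homework code below uses, the two cases and what D stands for can be written as:

```latex
D_{ij}^{(l)} =
\begin{cases}
\dfrac{1}{m}\,\Delta_{ij}^{(l)} + \dfrac{\lambda}{m}\,\Theta_{ij}^{(l)} & j \neq 0 \\[6pt]
\dfrac{1}{m}\,\Delta_{ij}^{(l)} & j = 0
\end{cases}
\qquad
\frac{\partial}{\partial \Theta_{ij}^{(l)}}\, J(\Theta) = D_{ij}^{(l)}
```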

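8. The reason we want the partial derivatives was covered earlier (though I still do not remember it very well :D):

In other words, every θ value in the three-dimensional collection gets updated, and the big pile of work we just did was all about finding the partial derivatives.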
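As a reminder, the gradient descent update from the earlier weeks would look like this in the notation above. This is a sketch: α is the learning rate, and the homework itself hands the gradient to an optimizer instead of looping by hand.

```latex
\Theta_{ij}^{(l)} := \Theta_{ij}^{(l)} - \alpha\, D_{ij}^{(l)}
\qquad \text{for all } l,\, i,\, j \text{ simultaneously}
```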

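9. For one particular δ, we can look at it like this:

BP for a single δ can be viewed as:

It actually feels a bit philosophical: the error obtained from values computed with a flawed θ is sent back and multiplied by θ, which in turn improves θ. Of course I cannot understand the exact principle for now, so let us keep going and see.

10. Exact dimensions

These dimensions are very easy to get wrong when programming, so I think it is worth writing them out.

Take the ugly neural network in the figure below:

The dimensions of every θ were settled earlier: to accommodate the bias unit, they can only be (units in the next layer) × (units in the previous layer + 1). As for δ, going by its definition it expresses the "error", so its dimension is the number of units in the current layer. There is a relationship here, namely that Δ is computed according to this formula:

So the Δ matrix is (number of units in the next layer, from δ(l+1), which has the same size as a(l+1) without its bias unit) × (number of units in the current layer, from a(l)). When a(l) includes its bias unit, as it does in the homework code, this is exactly the dimension of θ(l). Then, using D to hold the partial derivative for every θ, the dimensions line up perfectly!

11. Initializing θ

I think this point is also quite important. θ is generally initialized randomly, but the values must be small:

L_in is the number of units on the left (the incoming layer) and L_out is the number of units on the right (the outgoing layer).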
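A Python sketch of such an initialization. It assumes the rule of thumb that, as far as I recall, the homework handout suggests, ε_init = √6 / √(L_in + L_out); the function name and the optional `rng` parameter are my own:

```python
import math
import random

def rand_initialize_weights(l_in, l_out, rng=random):
    """Return an l_out x (l_in + 1) matrix with entries uniform in [-eps, eps]."""
    eps = math.sqrt(6) / math.sqrt(l_in + l_out)   # assumed rule of thumb
    return [[rng.uniform(-eps, eps) for _ in range(l_in + 1)]  # +1 for the bias column
            for _ in range(l_out)]

W = rand_initialize_weights(400, 25)
print(len(W), len(W[0]))
print(math.sqrt(6) / math.sqrt(400 + 25))
```

For the homework's first layer (400 inputs, 25 hidden units) this gives roughly 0.12, which matches the `epsilon_init = 0.12` used in `randInitializeWeights` below. Random rather than all-zero values are what break the symmetry between hidden units.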

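3. Gradient Checking

Gradient checking is used to make sure that the partial derivatives computed on the first pass are the correct ones. The principle is to brute-force an approximate value of the derivative and check it against our BP result. The process is simple, basic calculus:

Take a neighborhood around a point and estimate the derivative from it. Here Andrew recommends using the neighborhood on both sides of the point (a two-sided difference) rather than the one-sided version we usually think of.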
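A minimal Python sketch of the two-sided estimate, (J(θ+ε) - J(θ-ε)) / (2ε), applied to one parameter at a time. The function name is hypothetical, and the default ε = 1e-4 is the value I remember from the lecture, so treat it as an assumption:

```python
def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dtheta_i for every i with a two-sided difference."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps     # nudge only parameter i upward
        minus[i] -= eps    # and downward
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad

# Toy cost with a known gradient: J = sum(t^2), so dJ/dt_i = 2 * t_i
J = lambda t: sum(x * x for x in t)
theta = [1.0, -2.0, 3.0]
approx = numerical_gradient(J, theta)
exact = [2 * x for x in theta]
print(max(abs(a - e) for a, e in zip(approx, exact)))
```

In the homework, J would be the network cost on the unrolled parameter vector and `exact` would be the gradient returned by backpropagation. Each parameter needs two full cost evaluations, which is why the check is so slow.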

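As for how to choose epsilon, as above, this formula can be used:

The steps for using gradient checking are as follows:

It only needs to be used once, to verify that the partial derivatives computed by BP are close to the brute-force values. Because it is very slow to compute, one check is enough; if BP works, gradient checking is no longer used and we go straight to the following steps:

4. Homework

It took a long time, but there is not much to say: follow the pdf step by step, look things up when stuck, and it can always be done.

The one thing is that my own computation of J was very inefficient, while the J code found online is very efficient. Perhaps that is because my version loops over all 5000 examples, and each pass recomputes log(h) for the whole matrix?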

function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTION(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the code by working through the
%               following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m
%
% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a
%               binary vector of 1's and 0's to be used with the neural network
%               cost function.
%
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the
%               first time.
%
% Part 3: Implement regularization with the cost function and gradients.
%
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%

y_matrix = eye(num_labels)(y,:);

a1 = [ones(m,1) X];
z2 = a1*Theta1';
a2 = sigmoid(z2);
a2 = [ones(size(a2,1),1) a2];
z3 = a2*Theta2';
a3 = sigmoid(z3);
h = a3;

%for i = 1:size(h,1)
%  J = J + -y_matrix(i,:)*log(h)(i,:)'-(1-y_matrix(i,:))*log(1-h)(i,:)';
%end
%J=J/m;

J = (-1/m)*sum(sum((y_matrix.*log(h)) + ((1-y_matrix).*log(1-h))));
J = J+(lambda/(2*m))*(sum(sum(Theta1(:,2:size(Theta1,2)).^2))+sum(sum(Theta2(:,2:size(Theta2,2)).^2)));

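% Gradient accumulators, same shapes as Theta1 and Theta2
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for t = 1:m
  % Output-layer error for example t (1 x num_labels)
  delta3 = h(t,:)-y_matrix(t,:);
  % Hidden-layer error: drop Theta2's bias column and reuse the precomputed z2
  % instead of recomputing a1*Theta1' on every iteration
  delta2 = delta3*Theta2(:,2:end).*sigmoidGradient(z2(t,:));
  % Accumulate the outer products delta(l+1)' * a(l)
  Delta1 = Delta1 + delta2'*a1(t,:);
  Delta2 = Delta2 + delta3'*a2(t,:);
end

% Average over the m examples, then regularize every column except the bias column
Theta1_grad = Delta1/m;
Theta2_grad = Delta2/m;
Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + lambda/m*Theta1(:,2:end);
Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + lambda/m*Theta2(:,2:end);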
% -------------------------------------------------------------

% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];


end

function W = randInitializeWeights(L_in, L_out)
%RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
%incoming connections and L_out outgoing connections
%   W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights
%   of a layer with L_in incoming connections and L_out outgoing
%   connections.
%
%   Note that W should be set to a matrix of size(L_out, 1 + L_in) as
%   the first column of W handles the "bias" terms
%

% You need to return the following variables correctly
W = zeros(L_out, 1 + L_in);

% ====================== YOUR CODE HERE ======================
% Instructions: Initialize W randomly so that we break the symmetry while
%               training the neural network.
%
% Note: The first column of W corresponds to the parameters for the bias unit
%

epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init-epsilon_init;
% =========================================================================

end

function g = sigmoidGradient(z)
%SIGMOIDGRADIENT returns the gradient of the sigmoid function
%evaluated at z
%   g = SIGMOIDGRADIENT(z) computes the gradient of the sigmoid function
%   evaluated at z. This should work regardless if z is a matrix or a
%   vector. In particular, if z is a vector or matrix, you should return
%   the gradient for each element.

g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the gradient of the sigmoid function evaluated at
%               each value of z (z can be a matrix, vector or scalar).

g=sigmoid(z).*(1-sigmoid(z));
% =============================================================


end

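Summary

I really got a lot out of this round of study. I basically understand the material, but many details, including quite a few places where my own understanding was wrong, only surface once you actually write the code. Details also matter while programming: one wrongly defined variable early on can leave you unable to find the problem later, no matter how hard you look. For example, the intended flow is z = theta*a followed by sigmoid(z), but to save effort I wrote z = sigmoid(theta*a) directly and then later used z where the value without the sigmoid was expected. The result was painful.

But let us think about something positive too:

1. It has been a long time since I ran into a difficulty and yet found myself actively encouraging myself to solve it. Before, when I could not track down a bug, I simply wanted to die. Perhaps studying something you like matters a lot, and the positive cues a good teacher gives you matter just as much;

2. At the end of the course Andrew showed a video of a car learning to drive itself. I had never before had this feeling of excitement, satisfaction and pride at watching something magical whose underlying principle I have already glimpsed a little. In any case, what I learned from this Machine Learning course is far more than Machine Learning itself.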