Coursera Machine Learning: Programming Assignment for Chapter 3
Multi-class Classification and Neural Networks
lrCostFunction
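The exercise provides two data sets: one containing X and y, and one containing theta. Each row of X is one training example, i.e. the bitmap of one handwritten digit. Every image is 20*20 pixels, so X has 400 columns, each holding the grayscale value of one pixel.
The first step is to write the cost function in vectorized form: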
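Let X have dimensions (m, n). Inside lrCostFunction, theta is a single column vector of dimensions (n, 1). Following the hints in the comments, placing the parameter vectors of all k classifiers side by side gives a matrix of dimensions (n, k), where k is the number of classes, here the 10 digits 0 to 9.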
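X * theta then has dimensions (m, k). Because this is a classification problem, we pass the result through the sigmoid function to squash it into the range 0 to 1. The value in column i of the result is the estimated probability that the example belongs to class i: the closer it is to 1, the more likely the example is class i, which in this exercise means the digit i (class 10 stands for the digit 0).
This is the idea from the earlier lectures: reduce a multi-class problem to several binary problems. For each digit i, we treat class i as one class and all other classes together as the other. After looping 10 times, we have, for each example, a prediction p (0 <= p <= 1) for every class.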
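For classification we use the logistic regression cost function, because it has a convenient property: when y is 0, the closer the prediction h is to 0, the closer J is to 0; conversely, when y is 1, the closer h is to 1, the closer J is to 0. In the opposite cases J grows toward infinity.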
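To make this behavior concrete, here is a minimal sketch that evaluates the per-example cost -y*log(h) - (1-y)*log(1-h) at a few points. The helper name example_cost and the sample values are illustrative assumptions, not part of the assignment.

```octave
% Per-example logistic cost (hypothetical helper, for illustration only)
example_cost = @(h, y) -y .* log(h) - (1 - y) .* log(1 - h);

good_when_y1 = example_cost(0.99, 1);  % prediction close to the label: small cost
bad_when_y1  = example_cost(0.01, 1);  % prediction far from the label: large cost
good_when_y0 = example_cost(0.01, 0);  % small cost again, by symmetry

fprintf('%.4f %.4f %.4f\n', good_when_y1, bad_when_y1, good_when_y0);
```

The cost for a confident wrong prediction is several hundred times larger than for a confident correct one, which is what pushes the optimizer toward correct predictions.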
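After writing down J, we regularize it by adding the term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^2$. Note that j starts at 1, so $\theta_0$ (which is theta(1) in Octave's 1-based indexing) is not penalized, and after differentiation the regularization part of the gradient likewise has no $\theta_0$ term. To avoid treating that case separately, we work with a copy of theta whose first entry is set to 0: temp = [0; theta(2:end)];. We then use temp in place of theta in the regularization terms only; the hypothesis h is still computed from the full theta.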
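Since temp is a column vector, the sum of its squared entries, sum(temp.^2), is simply temp' * temp.
When computing grad, pay close attention to the dimensions of each variable: X is (m, n), h is (m, 1), and y is (m, 1), so X' * (h - y) is (n, 1), the same shape as theta.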
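Written out, the regularized cost and gradient described above take the following vectorized form. Here $h = g(X\theta)$ with $g$ the sigmoid, and $\tilde{\theta}$ stands for the temp vector (theta with its first entry set to 0); the symbol $\tilde{\theta}$ is notation introduced here for illustration only.

```latex
\begin{align}
J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h^{(i)} - \left(1-y^{(i)}\right)\log\left(1-h^{(i)}\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^2 \\
\nabla_{\theta} J &= \frac{1}{m}X^{T}(h - y) + \frac{\lambda}{m}\tilde{\theta}
\end{align}
```

The second term of the gradient is the derivative of the penalty: differentiating $\frac{\lambda}{2m}\theta_j^2$ gives $\frac{\lambda}{m}\theta_j$ for $j \ge 1$ and nothing for $\theta_0$.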
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with
%regularization
% J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
% You should set J to the cost.
% Compute the partial derivatives and set grad to the partial
% derivatives of the cost w.r.t. each parameter in theta
%
% Hint: The computation of the cost function and gradients can be
% efficiently vectorized. For example, consider the computation
%
% sigmoid(X * theta)
%
% Each row of the resulting matrix will contain the value of the
% prediction for that example. You can make use of this to vectorize
% the cost function and gradient computations.
%
% Hint: When computing the gradient of the regularized cost function,
% there're many possible vectorized solutions, but one solution
% looks like:
% grad = (unregularized gradient for logistic regression)
% temp = theta;
% temp(1) = 0; % because we don't add anything for j = 0
% grad = grad + YOUR_CODE_HERE (using the temp variable)
%
h = sigmoid(X * theta);
% unregularized cost for logistic regression
J = (1.0/m) * sum(-y.*log(h) - (1-y).*log(1-h));
% regularized cost
temp = [0; theta(2:end)];
J = J + (lambda/(2.0*m)) * temp' * temp;
% unregularized gradient for logistic regression
grad = (1.0/m) * X' * (h - y);
% regularized gradient
grad = grad + (1.0/m) * lambda * temp;
% =============================================================
grad = grad(:);
end
oneVsAll
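Note that for a binary classification problem y must be 0 or 1, indicating which of the two classes an example belongs to. So we loop over every class c: training examples that belong to c get y = 1, all others get y = 0, and the trained theta is stored in row c of all_theta. Multiplying that row by an example therefore tells us whether the example belongs to class c. Consequently, row c of all_theta * X', after applying the sigmoid, gives the probability that each example belongs to class c.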
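The relabeling step can be sketched with a few made-up labels. The values of y and c below are illustrative assumptions, not data from the exercise.

```octave
y = [1; 3; 2; 3; 10];   % made-up labels; 10 stands for the digit 0
c = 3;                  % the class currently being trained

y_binary = (y == c);    % logical column: 1 where the label is c, 0 elsewhere
disp(double(y_binary)');
```

The logical vector can be passed to lrCostFunction directly, which is what the expression (y == c) in the fmincg call does.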
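We also note that the function passed to fmincg is an anonymous function with a parameter t. This t is the theta argument of our lrCostFunction. It has to be theta because fmincg is an iterative optimizer (a conjugate gradient method): it repeatedly evaluates the cost at updated parameter values, each derived from the previous one, so theta must be the free argument that fmincg supplies, while X, (y == c) and lambda are captured as fixed values. Setting the 'GradObj' option to 'on' does not select gradient descent; it tells the optimizer that the objective function also returns the gradient as its second output.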
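The role of the anonymous function can be illustrated with a toy two-output cost. The names toy_cost, wrapped and a are hypothetical, and the sketch assumes that an anonymous function forwards the number of requested outputs to the call in its body.

```octave
% Hypothetical cost: returns the value and the gradient of sum((t - a).^2)
toy_cost = @(t, a) deal(sum((t - a).^2), 2 * (t - a));

a = [1; 2];                      % fixed data, captured when the wrapper is created
wrapped = @(t) toy_cost(t, a);   % only t stays free, as an optimizer expects

[J, g] = wrapped([0; 0]);        % an optimizer would call this with each new t
```

Here wrapped plays the part of @(t)(lrCostFunction(t, X, (y == c), lambda)): the optimizer only ever sees a function of t that returns a cost and a gradient.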
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta
%corresponds to the classifier for label i
% [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
% logistic regression classifiers and returns each of these classifiers
% in a matrix all_theta, where the i-th row of all_theta corresponds
% to the classifier for label i
% Some useful variables
m = size(X, 1);
n = size(X, 2);
% You need to return the following variables correctly
all_theta = zeros(num_labels, n + 1);
% Add ones to the X data matrix
X = [ones(m, 1) X];
% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the following code to train num_labels
% logistic regression classifiers with regularization
% parameter lambda.
%
% Hint: theta(:) will return a column vector.
%
% Hint: You can use y == c to obtain a vector of 1's and 0's that tell you
% whether the ground truth is true/false for this class.
%
% Note: For this assignment, we recommend using fmincg to optimize the cost
% function. It is okay to use a for-loop (for c = 1:num_labels) to
% loop over the different classes.
%
% fmincg works similarly to fminunc, but is more efficient when we
% are dealing with large number of parameters.
%
% Example Code for fmincg:
%
% % Set Initial theta
% initial_theta = zeros(n + 1, 1);
%
% % Set options for fminunc
% options = optimset('GradObj', 'on', 'MaxIter', 50);
%
% % Run fmincg to obtain the optimal theta
% % This function will return theta and the cost
% [theta] = ...
% fmincg (@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
% initial_theta, options);
%
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'On', 'MaxIter', 50);
for c = 1: num_labels
[theta] = ...
fmincg(@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
initial_theta, options);
all_theta(c, :) = theta';
end
% =========================================================================
end
predictOneVsAll
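From the analysis above, X (after the column of ones is added) has dimensions (m, n + 1) and all_theta has dimensions (class, n + 1), so each row of all_theta is the theta for one class. Multiplying X by that theta and applying the sigmoid gives the probability that each example belongs to that class; multiplying X by the whole of all_theta gives the probabilities for every class at once.
sigmoid(X * all_theta') has dimensions (m, class). Column c holds the probability that the example in that row belongs to class c. We find the largest probability in each row and take its column index as the prediction.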
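The row-wise argmax can be sketched on a small made-up probability matrix; the numbers below are illustrative assumptions.

```octave
% Made-up probabilities: 2 examples (rows), 3 classes (columns)
A = [0.1 0.7  0.2;
     0.9 0.05 0.05];

[best, p] = max(A, [], 2);   % best: row-wise maximum, p: its column index
```

Because the sigmoid is strictly increasing, taking the argmax of X * all_theta' directly would select the same column; applying the sigmoid only matters if the probabilities themselves are needed.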
function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels
%are in the range 1..K, where K = size(all_theta, 1).
% p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
% for each example in the matrix X. Note that X contains the examples in
% rows. all_theta is a matrix where the i-th row is a trained logistic
% regression theta vector for the i-th class. You should set p to a vector
% of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
% for 4 examples)
m = size(X, 1);
num_labels = size(all_theta, 1);
% You need to return the following variables correctly
p = zeros(size(X, 1), 1);
% Add ones to the X data matrix
X = [ones(m, 1) X];
% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
% your learned logistic regression parameters (one-vs-all).
% You should set p to a vector of predictions (from 1 to
% num_labels).
%
% Hint: This code can be done all vectorized using the max function.
% In particular, the max function can also return the index of the
% max element, for more information see 'help max'. If your examples
% are in rows, then, you can use max(A, [], 2) to obtain the max
% for each row.
%
% X (with the bias column) is (m, n + 1) and all_theta is (class, n + 1),
% where n is the number of pixels.
% The result is (m, class): the entry in row i, column j is
% the probability that example i belongs to class j
[~, p] = max(sigmoid(X * all_theta'), [], 2);
% =========================================================================
end
predict
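Finally, let us complete a simple neural network. The input layer has 400 units, one per pixel; there is a single hidden layer with 25 units; and the output layer has 10 units. Evaluated over the whole data set, each output unit yields a vector holding, for every example, the probability that it belongs to that class.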
As covered in the lectures, the theta matrix that maps layer j to layer j+1 has dimensions $(R_{j+1}, R_{j}+1)$. Using the formulas
$$z=\theta a \newline a=g(z)$$
we can write down the computation directly.
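As a quick shape check of these dimensions, the sketch below runs the forward pass on placeholder data. The all-zero weights and inputs and the name sigmoid_fn are assumptions for illustration, not the trained parameters from the exercise. Since the examples are stored as rows, the product is written as a * Theta' rather than Theta * a.

```octave
sigmoid_fn = @(z) 1 ./ (1 + exp(-z));   % local stand-in for the assignment's sigmoid.m

m = 5;
X = zeros(m, 400);         % placeholder inputs, not real digit images
Theta1 = zeros(25, 401);   % (units in layer 2, units in layer 1 + 1)
Theta2 = zeros(10, 26);    % (units in layer 3, units in layer 2 + 1)

a1 = [ones(m, 1) X];                         % (5, 401): inputs plus bias column
a2 = [ones(m, 1) sigmoid_fn(a1 * Theta1')];  % (5, 26): hidden layer plus bias column
a3 = sigmoid_fn(a2 * Theta2');               % (5, 10): one column per class

disp(size(a3));
```

With all-zero weights every output is sigmoid(0) = 0.5, so the sketch only confirms that the matrix shapes line up.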
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
% p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
% trained weights of a neural network (Theta1, Theta2)
% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);
% You need to return the following variables correctly
p = zeros(size(X, 1), 1);
% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
% your learned neural network. You should set p to a
% vector containing labels between 1 to num_labels.
%
% Hint: The max function might come in useful. In particular, the max
% function can also return the index of the max element, for more
% information see 'help max'. If your examples are in rows, then, you
% can use max(A, [], 2) to obtain the max for each row.
%
A = [ones(size(X, 1), 1) X];
% Theta1 is (units in layer j+1, units in layer j + 1); A is (m, units in layer j + 1)
z = A * Theta1';
A = sigmoid(z);
A = [ones(size(A, 1), 1) A];
z = A * Theta2';
A = sigmoid(z);
[~, p] = max(A, [], 2);
% =========================================================================
end