利用梯度下降法实现简单的线性回归

最新推荐文章于 2024-08-01 21:06:19 发布

carlwu

最新推荐文章于 2024-08-01 21:06:19 发布

阅读量8.6k

点赞数 5

分类专栏：数据挖掘机器学习 coursera机器学习笔记文章标签：机器学习数据挖掘 coursera机器学习笔记

本文链接：https://blog.csdn.net/carlwu/article/details/74942634

版权

机器学习同时被 3 个专栏收录

8 篇文章 4 订阅

订阅专栏

数据挖掘

5 篇文章 0 订阅

订阅专栏

coursera机器学习笔记

1 篇文章 0 订阅

订阅专栏

最近做了好多个数据挖掘的小项目，使用并比较了N多算法，了解了很多机器学习的工具，如R语言、Spark机器学习库、Python、Tensorflow和RapidMiner等等。但是我感觉到自己没能深入下去，充其量也只是把别人的工具拿来玩玩而已。对算法本身的优劣及适用范围不甚了了，更谈不上改进优化算法了。

本着甘当小学生的精神，我最近在网上参加了机器学习牛人Andrew Ng在Coursera上主讲的《机器学习》课程（https://www.coursera.org/learn/machine-learning/home/welcome）。这门课程是我见到的最好的入门课程，由浅人深，循序渐进，不需要很深的数学基础，符合自然的学习规律。刚刚学习了三周，感觉受益颇多。
下面以最简单的一元线性回归为例，简单介绍一下机器学习的思想及其在Matlab中的实现。

1. 基本概念
假设函数
假设模型 (也称为假设函数，Hypothesis function)，就是根据特征变量（feature或variable）拟合出目标变量(target variable)的公式或函数。例如在下面的表格中，根据假设函数，即公式 h θ (x) = θ0+ θ 1*x ，我们可以从房屋的面积估算出房屋的总价，这个公式即称之为假设函数，其中的θ0和θ1称为参数（parameter）。

选择不同的参数，一元线性回归的假设模型也会不同。下图展示了三个比较简单的一元线性回归模型。
这里写图片描述

代价函数
代价函数或损失函数（Cost function或Loss function）是用来评价假设模型拟合的精确度。在训练数据集上，模型的拟合越好，代价函数就越小。在机器学习中，训练的目的就是要选择合适的参数（如上图中的θ0和θ1），让代价函数达到最小。
例如，在线性回归中，代价函数J(θ0, θ1)的一般定义如下图所示，可理解为样本真实输出值和假设函数估算值之差平方和的平均值。
梯度下降
梯度下降（Gradient descent）法是使得代价函数达到最小的经典方法之一。以前我老是用确定性的最小二乘法来求假设模型h θ (x) = θ0+ θ 1*x中的参数θ0和θ1，让训练集上的代价函数值最小。其实，代价函数J(θ0, θ1)可以看成两个变量θ0和θ1组成的函数，将这个函数可视化的结果如下图所示。我们学习的目的就是在图中找到合适的θ0和θ1，让J(θ0, θ1)接近并达到局部或全局最低点（下图中的红色圈中的点）。

对上图中梯度下降的一个直观的解释是：“比如我们在一座大山上的某处位置，由于我们不知道怎么下山，于是决定走一步算一步，也就是在每走到一个位置的时候，求解当前位置的梯度，沿着梯度的负方向，也就是当前最陡峭的位置向下走一步，然后继续求解当前位置梯度，向这一步所在位置沿着最陡峭最易下山的位置走一步。这样一步步的走下去，一直走到觉得我们已经到了山脚。当然这样走下去，有可能我们不能走到山脚，而是到了某一个局部的山峰低处。”(http://www.cnblogs.com/pinard/p/5970503.html). 梯度下降法不但适合线性回归，还可以用在逻辑回归以及其它算法上。

从几何上来说，在代价函数J(θ0, θ1)的曲面上，任意给定一个起始点（θ0, θ1），我们可沿着梯度下降的方向，即对代价函数J(θ0, θ1) 中的θ0和θ1分别求偏导数的方向，按照一定的步长（a, learning rate，学习效率）进行不断迭代，直到θ0, θ1收敛，也就是达到上图中的局部或全局最低点。对一个简单的线性回归，其算法描述如下：
这里写图片描述

2. 简单实现
在Matlab或Octave中，我们可以通过损失函数和梯度下降的原理来简单实现一元线性回归。

假设原始数据如下, x表示城市的人口，y为城市的利润。我们可以通过线性回归，用某个城市的人口来预测该城市的利润。详细代码如下：

%% 在Matlab或者在Octave中实现梯度下降法，进行一元线性回归
%% 初始化
clear ; close all; clc;
%%加载数据
% x表示城市的人口数量
x = [6.1101; 5.5277; 8.5186; 7.0032; 5.8598; 8.3829; 7.4764; 8.5781; 6.4862; 5.0546; 5.7107; 14.1640; 5.7340; 8.4084; 5.6407; 5.3794; 6.3654; 5.1301; 6.4296; 7.0708; 6.1891; 20.2700; 5.4901; 6.3261; 5.5649; 18.9450; 12.8280; 10.9570; 13.1760; 22.2030; 5.2524; 6.5894; 9.2482; 5.8918; 8.2111; 7.9334; 8.0959; 5.6063; 12.8360; 6.3534; 5.4069; 6.8825; 11.7080; 5.7737; 7.8247; 7.0931; 5.0702; 5.8014; 11.7000; 5.5416; 7.5402; 5.3077; 7.4239; 7.6031; 6.3328; 6.3589; 6.2742; 5.6397; 9.3102; 9.4536; 8.8254; 5.1793; 21.2790; 14.9080; 18.9590; 7.2182; 8.2951; 10.2360; 5.4994; 20.3410; 10.1360; 7.3345; 6.0062; 7.2259; 5.0269; 6.5479; 7.5386; 5.0365; 10.2740; 5.1077; 5.7292; 5.1884; 6.3557; 9.7687; 6.5159; 8.5172; 9.1802; 6.0020; 5.5204; 5.0594; 5.7077; 7.6366; 5.8707; 5.3054; 8.2934; 13.3940; 5.4369];
% y代表城市的利润
y = [ 17.5920;9.1302;13.6620;11.8540;6.8233;11.8860;4.3483;12.0000;6.5987;3.8166;3.2522;15.5050;3.1551;7.2258;0.7162;3.5129;5.3048;0.5608;3.6518;5.3893;3.1386;21.7670;4.2630;5.1875;3.0825;22.6380;13.5010;7.0467;14.6920;24.1470;-.2200;5.9966;12.1340;1.8495;6.5426;4.5623;4.1164;3.3928;10.1170;5.4974;0.5566;3.9115;5.3854;2.4406;6.7318;1.0463;5.1337;1.8440;8.0043;1.0179;6.7504;1.8396;4.2885;4.9981;1.4233;-1.4211;2.4756;4.6042;3.9624;5.4141;5.1694;-0.7428;17.9290;12.0540;17.0540;4.8852;5.7442;7.7754;1.0173;20.9920;6.6799;4.0259;1.2784;3.3411;-.6807;0.2968;3.8845;5.7014;6.7526;2.0576;0.4795;0.2042;0.6786;7.5435;5.3436;4.2415;6.7981;0.9270;0.1520;2.8214;1.8451;4.2959;7.2029;1.9869;0.1445;9.0551;0.6170];
% m是训练样本的个数
m = length(y);
%% 对x, y进行作图
% 数据点以一个十字交叉的符号表示，大小设置为10
plot(x, y, 'rx', 'MarkerSize', 10);
% 设置X轴
xlabel('人口');
% 设置Y轴
ylabel('利润');

然后，我们可在当前工作目录下新建一个computeCost.m文件，其中只包含一个computeCost函数，来计算代价函数的值，代码如下：

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;

hypothesis = X * theta;
J = sum((hypothesis - y) .^ 2);
J = J/(2*m);

% =========================================================================
end

同样的，我们可以可在当前工作目录下新建另外一个gradientDescent.m文件，其中只包含一个gradientDescent函数，按照迭代次数和设定的alpha（学习效率）来进行梯度下降迭代。最后，我们还可视化了线性回归的结果。具体代码如下：

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by 
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

temp = zeros(2, 1);

for iter = 1:num_iters

    hypothesis = X * theta -  y;
    n = length(theta);
    for j = 1:n
        theta(j) = theta(j) - alpha/m * (hypothesis' * X(:,j));
    end

    % ============================================================
    % Save the cost J in every iteration    
    J_history(iter) = computeCost(X, y, theta);

    %fprintf('the cost is %f\n', J_history(iter));

end

end

写完这两个函数后，我们就可以计算代价函数的值，并用梯度下降迭代法进行线性回归了。具体见代码：

%% =================== 计算代价函数并进行梯度下降迭代 ===================
% 在x向量前面加上一列由m个1组成的向量，构成一个X矩阵
X = [ones(m, 1), x]; % Add a column of ones to x
% 初始化theta参数
theta = zeros(2, 1); % initialize fitting parameters

% 梯度下降的迭代次数为3000次，学习率步长alpha为0.01；如果不收敛，可减少步长。
iterations = 3000;
alpha = 0.01;
fprintf('\n测试代价函数 ...\n')
% 计算初始代价函数的值
J = computeCost(X, y, theta);
fprintf('With theta = [0 ; 0]\nCost computed = %f\n', J);
fprintf('Expected cost value (approx) 32.07\n');

fprintf('\n开始梯度下降迭代 ...\n')
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);

% print theta to screen
fprintf('Theta found by gradient descent:\n');
fprintf('%f\n', theta);
fprintf('Expected theta values (approx)\n');
fprintf(' -3.6303\n  1.1664\n\n');

% Plot the linear fit
hold on; % keep previous plot visible
plot(X(:,2), X*theta, '-')
legend('Training data', 'Linear regression')
hold off % don't overlay any more plots on this figure

% 根据拟合的线性回归曲线，对人口为35,000和70,000的两个城市的利润进行预测。
predict1 = [1, 3.5] *theta;
fprintf('For population = 35,000, we predict a profit of %f\n',...
    predict1*10000);
predict2 = [1, 7] * theta;
fprintf('For population = 70,000, we predict a profit of %f\n',...
    predict2*10000);

fprintf('Program paused. Press enter to continue.\n');


%% ============= 可视化 J(theta_0, theta_1) =============
fprintf('Visualizing J(theta_0, theta_1) ...\n')

% Grid over which we will calculate J
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);

% initialize J_vals to a matrix of 0's
J_vals = zeros(length(theta0_vals), length(theta1_vals));

% Fill out J_vals
for i = 1:length(theta0_vals)
    for j = 1:length(theta1_vals)
      t = [theta0_vals(i); theta1_vals(j)];
      J_vals(i,j) = computeCost(X, y, t);
    end
end


% Because of the way meshgrids work in the surf command, we need to
% transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals';
% Surface plot
figure;
surf(theta0_vals, theta1_vals, J_vals)
xlabel('\theta_0'); ylabel('\theta_1');

% Contour plot
figure;
% Plot J_vals as 15 contours spaced logarithmically between 0.01 and 100
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
xlabel('\theta_0'); ylabel('\theta_1');
hold on;
plot(theta(1), theta(2), 'rx', 'MarkerSize', 10, 'LineWidth', 2);

可视化的结果显示如下：
这里写图片描述

这里写图片描述

Matlab控制台的输出结果为：
测试代价函数 …
With theta = [0 ; 0]
Cost computed = 32.030658
Expected cost value (approx) 32.07

开始梯度下降迭代 …
Theta found by gradient descent:
-3.795429
1.184909
Expected theta values (approx)
-3.6303
1.1664

迭代3000次后的代价函数值=4.366376
For population = 35,000, we predict a profit of 3517.530646
For population = 70,000, we predict a profit of 44989.347534
Program paused. Press enter to continue.
Visualizing J(theta_0, theta_1) …

根据控制台结果，我们可以知道，在迭代3000次后，θ0和θ1的值大约为-3.6303和1.1664。用最小二乘法，我们会得到θ0和θ1的精确结果为3.8128和1.1867，可见要得到更精确的结果，我们还需要更多次的迭代。