Machine Learning | Algorithm Notes: Linear Regression

Z_begger

Last modified 2022-11-03 09:00:01


First published 2022-11-02 21:12:59

Article link: https://blog.csdn.net/Z_begger/article/details/127656819


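I. Understanding Linear Regression

Problem: suppose we have a set of points, all of them existing data. We want to find the line that fits these points and then use it to predict the points that come next. How do we find that line?

$y=w*x+b$

Put another way: how do we solve for w and b?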

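II. Cost Function

What is the cost function?

Start by drawing two arbitrary lines to fit the points, as in the figures. The line in figure 2 (P2) clearly fits better; that is, it is closer to the ideal line we are looking for.

Compare the lines in P1 and P2 closely. The most obvious difference is that in P1 the points lie farther from the line, measured along the y-axis, while in P2 they lie closer to it.

The cost function is the average, over all points, of the squared error between each point and the line, measured along the y-axis:

$J=\frac{1}{n}\sum_{i=1}^{n}\left ( pred_{i}-y_{i} \right )^{2}$

Here $pred_{i}$ is the y value of the line at the i-th point and $y_{i}$ is the actual y value of the i-th point. Squaring each error keeps positive and negative errors from cancelling each other out.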
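As a small worked example of the formula above, the following MATLAB/Octave sketch evaluates J for one candidate line. The three data points and the values of w and b are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```matlab
% Toy data (hypothetical values, for illustration only).
x = [1; 2; 3];
y = [2; 4; 6];

% Candidate line y = w*x + b (arbitrary guess).
w = 1;
b = 0;

pred = w * x + b;            % y values of the line at each x: [1; 2; 3]
J = mean((pred - y) .^ 2);   % (1 + 4 + 9) / 3
fprintf('J = %f\n', J);      % prints J = 4.666667
```

A line that passes closer to the points gives a smaller J; with w = 2 and b = 0 the same data gives J = 0.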

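What is the cost function used for?

The cost function lets us find the best possible values of w and b. As noted above, it is the mean squared distance, along the y-axis, from each point to the line, and our goal is to minimize it. For linear regression with this squared-error cost, J is a convex function of w and b, so every local minimum is also a global minimum:

$\min_{w,b}\ \frac{1}{n}\sum_{i=1}^{n}\left ( pred_{i}-y_{i} \right )^{2}$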
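One way to see the bowl shape is to evaluate J over a range of w values while holding b fixed. The sketch below does this in MATLAB/Octave; the data, the fixed b = 0, and the grid of w values are all assumptions made for illustration.

```matlab
% Toy data (hypothetical values, for illustration only).
x = [1; 2; 3];
y = [2; 4; 6];
b = 0;                        % hold the bias fixed and vary only w

ws = 0:0.5:4;                 % candidate slopes to try
Js = zeros(size(ws));
for k = 1:numel(ws)
    Js(k) = mean((ws(k) * x + b - y) .^ 2);
end

% J falls as w approaches 2 and rises again beyond it.
[Jmin, idx] = min(Js);
fprintf('smallest J = %f at w = %f\n', Jmin, ws(idx));
```

On this data the smallest value on the grid is J = 0 at w = 2, and J grows on either side of it.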

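III. Gradient Descent

Goal: find the w and b that bring the fitted line closest to all the points (that is, minimize the cost function).

How do we minimize the cost function? With gradient descent.

Gradient descent is a method that lowers the cost function by iteratively updating w and b.

Start from the cost function:

$J=\frac{1}{n}\sum_{i=1}^{n}\left ( pred_{i}-y_{i} \right )^{2}$

Substituting $pred_{i}=w*x_{i}+b$ gives:

$J=\frac{1}{n}\sum_{i=1}^{n}\left ( w*x_{i}+b-y_{i} \right )^{2}$

Taking the partial derivatives with respect to w and b:

$\frac{\partial J}{\partial w}=\frac{2}{n}\sum_{i=1}^{n}(w*x_{i}+b-y_{i})*x_{i}\Rightarrow \frac{\partial J}{\partial w}=\frac{2}{n}\sum_{i=1}^{n}(pred_{i}-y_{i})*x_{i}$

$\frac{\partial J}{\partial b}=\frac{2}{n}\sum_{i=1}^{n}(w*x_{i}+b-y_{i})*1\Rightarrow \frac{\partial J}{\partial b}=\frac{2}{n}\sum_{i=1}^{n}(pred_{i}-y_{i})*1$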
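A quick way to check these two derivatives is to compare them with central finite differences of J. The MATLAB/Octave sketch below does that; the data, the evaluation point (w, b), and the step h are assumptions picked for illustration.

```matlab
% Toy data and an arbitrary evaluation point (hypothetical values).
x = [1; 2; 3];
y = [2; 4; 6];
w = 1;
b = 0.5;
n = numel(y);

% Cost as a function of (w, b), matching the formula for J above.
J = @(w, b) mean((w * x + b - y) .^ 2);

% Analytic partial derivatives from the formulas above.
pred = w * x + b;
dJdw = (2 / n) * sum((pred - y) .* x);
dJdb = (2 / n) * sum(pred - y);

% Central finite differences with a small step h.
h = 1e-6;
dJdw_num = (J(w + h, b) - J(w - h, b)) / (2 * h);
dJdb_num = (J(w, b + h) - J(w, b - h)) / (2 * h);

fprintf('dJ/dw: analytic %f, numeric %f\n', dJdw, dJdw_num);
fprintf('dJ/db: analytic %f, numeric %f\n', dJdb, dJdb_num);
```

The analytic and numeric values should agree to several decimal places; a large gap usually points to a sign or indexing mistake in the derivative.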

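Update rule:

$w=w-\alpha *\frac{\partial J}{\partial w}$

$b=b-\alpha *\frac{\partial J}{\partial b}$

Here $\alpha$ is the learning rate, which sets the step size of each update. Both derivatives are evaluated at the current w and b before either value is changed. If $\alpha$ is too small, gradient descent converges slowly; if $\alpha$ is too large, the updates may overshoot the minimum and oscillate back and forth, or even diverge.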
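The effect of the learning rate can be sketched by running the same update rule with three different values of $\alpha$. The data, the three rates, and the iteration count below are assumptions chosen for illustration; on other data the thresholds between "slow", "good", and "diverging" will differ.

```matlab
% Toy data (hypothetical values, for illustration only).
x = [1; 2; 3];
y = [2; 4; 6];
n = numel(y);

alphas = [0.001, 0.1, 0.5];   % too small, reasonable, too large (for this data)
num_iters = 50;
finalJ = zeros(size(alphas));

for k = 1:numel(alphas)
    alpha = alphas(k);
    w = 0;
    b = 0;
    for it = 1:num_iters
        pred = w * x + b;
        % Evaluate both derivatives before changing w or b.
        dJdw = (2 / n) * sum((pred - y) .* x);
        dJdb = (2 / n) * sum(pred - y);
        w = w - alpha * dJdw;
        b = b - alpha * dJdb;
    end
    finalJ(k) = mean((w * x + b - y) .^ 2);
    fprintf('alpha = %.3f -> J after %d iterations = %g\n', alpha, num_iters, finalJ(k));
end
```

On this data the smallest rate leaves J well above zero after 50 iterations, the middle rate brings it close to zero, and the largest rate makes J grow instead of shrink.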

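IV. Multiple Features

Each sample has n features, and each feature has its own weight. The model sums the products of the features and their weights and then adds a bias. That is the linear regression model.

Everything above was computed in two-dimensional space with a single feature. Real problems usually have several features, and the formula becomes:

$y=w_{1}*x_{1}+w_{2}*x_{2}+\cdots +w_{n}*x_{n}+b$

To make the matrix form easier to write, let $w_{0}=b$ and $x_{0}=1$, so that:

$y=w_{0}*x_{0}+w_{1}*x_{1}+w_{2}*x_{2}+\cdots +w_{n}*x_{n}$

Now suppose there are m samples. In matrix form:

$X=\begin{pmatrix} 1 & x_{1}^{1}& x_{1}^{2}&\cdots & x_{1}^{n}\\ 1 & x_{2}^{1}& x_{2}^{2}&\cdots & x_{2}^{n} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{m}^{1}& x_{m}^{2}&\cdots & x_{m}^{n} \end{pmatrix}$ $Y=\begin{pmatrix} y_{1}\\ y_{2}\\ \vdots \\ y_{m} \end{pmatrix}$

The weights w can also be written as a matrix:

$W=\begin{pmatrix} w_{0} &w_{1} &w_{2} & \cdots & w_{n} \end{pmatrix}$

The original formula in matrix form:

$Y=XW^{T}$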
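The matrix form maps directly onto MATLAB/Octave. The sketch below builds X by prepending a column of ones to a small feature table and evaluates $XW^{T}$; the feature values and weights are hypothetical.

```matlab
% Hypothetical raw features: m = 3 samples (rows), n = 2 features (columns).
features = [1 2;
            3 4;
            5 6];
m = size(features, 1);

X = [ones(m, 1), features];   % m x (n+1); the first column is x_0 = 1
W = [1, 2, 3];                % 1 x (n+1) row vector [w_0 w_1 w_2], with w_0 = b

Y = X * W';                   % m x 1 column of predictions
disp(Y');                     % 9 19 29
```

Each entry of Y is $w_{0}+w_{1}*x^{1}+w_{2}*x^{2}$ for one sample; for the first row that is 1 + 2*1 + 3*2 = 9.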

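Cost function

The cost function measures, to some extent, how good the model is.

$J=\frac{1}{m}\sum_{i=1}^{m}\left ( pred_{i}-y_{i} \right )^{2}$

$J=\frac{1}{m}\sum_{i=1}^{m}(w_{0}*x_{i}^{0}+w_{1}*x_{i}^{1}+w_{2}*x_{i}^{2}+\cdots +w_{n}*x_{i}^{n}-y_{i})^{2}$

Taking the partial derivative with respect to each weight $w_{j}$:

$\frac{\partial J}{\partial w_{j}}=\frac{2}{m}\sum_{i=1}^{m}(w_{0}*x_{i}^{0}+w_{1}*x_{i}^{1}+w_{2}*x_{i}^{2}+\cdots +w_{n}*x_{i}^{n}-y_{i})*x_{i}^{j}$

There are n+1 parameters in total (including the bias b), and they are all updated simultaneously, as in the single-feature case. In $x_{i}^{j}$, i denotes the i-th sample and j denotes the j-th feature of that sample.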
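With X, Y and the row vector W as defined in the matrix form, the n+1 partial derivatives can be collected into a single row vector. The symbol $\nabla_{W}J$ is notation introduced here for convenience; this is only a restatement of the per-weight formula above, whose j-th entry is $\frac{\partial J}{\partial w_{j}}$:

$\nabla_{W}J=\frac{2}{m}\left ( XW^{T}-Y \right )^{T}X$

The simultaneous update of all weights then reads:

$W=W-\alpha *\nabla_{W}J$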

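V. Algorithm Code (MATLAB)

1. computecost.m

function j = computecost(x, Y, Theta)
% Compute the cost for design matrix x, targets Y and parameters Theta.
% Note: the cost here is scaled by 1/(2n) rather than the 1/n used in the
% formulas above. The minimizing Theta is the same, and the factor of 2 from
% differentiation cancels, which is why gradientdescent.m uses 1/n.

n = length(Y); % Number of training examples.

j = (1 / (2 * n)) * sum(((x * Theta) - Y).^2);

end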

2. gradientdescent.m

% This function demonstrates gradient descent for linear regression with one variable.
% It receives the design matrix x (a column of ones followed by the feature column),
% the vector of actual target values Y, Theta containing the initial values of
% theta_0 and theta_1, the learning rate Alpha, and the number of iterations noi.
% It returns Theta, a two-element column vector, after updating it.

function Theta = gradientdescent(x, Y, Theta, Alpha, noi)

    n = length(Y); % Number of training examples.

    for i = 1:noi

        % Compute both new values from the current Theta before assigning either,
        % so that the two parameters are updated simultaneously.
        % (1 / n) * sum(((x * Theta) - Y) .* x(:, k)) is the partial derivative of
        % the cost in computecost.m with respect to Theta(k); deriving it is left
        % as an exercise.
        % MATLAB/Octave indexing starts at 1, so theta_1 holds the new theta_0
        % and theta_2 holds the new theta_1.
        theta_1 = Theta(1) - Alpha * (1 / n) * sum(((x * Theta) - Y) .* x(:, 1));
        theta_2 = Theta(2) - Alpha * (1 / n) * sum(((x * Theta) - Y) .* x(:, 2));

        Theta(1) = theta_1;
        Theta(2) = theta_2;

    end

end
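The two per-parameter lines in gradientdescent.m can also be written as one matrix expression, which carries over unchanged to more than two parameters. The sketch below is an alternative formulation, not the code from this post; it performs a single update step both ways on hypothetical data and compares the results.

```matlab
% Toy design matrix and targets (hypothetical values, for illustration only).
x = [ones(3, 1), [1; 2; 3]];
Y = [2; 4; 6];
n = length(Y);
Alpha = 0.1;
Theta = zeros(2, 1);

err = x * Theta - Y;          % prediction error at the current Theta

% One step written per parameter, as in gradientdescent.m.
theta_1 = Theta(1) - Alpha * (1 / n) * sum(err .* x(:, 1));
theta_2 = Theta(2) - Alpha * (1 / n) * sum(err .* x(:, 2));
Theta_loop = [theta_1; theta_2];

% The same step as a single matrix expression.
Theta_vec = Theta - Alpha * (1 / n) * (x' * err);

fprintf('largest difference = %g\n', max(abs(Theta_loop - Theta_vec)));
```

Because x' * err stacks the two sums into one column vector, the vectorized form updates every parameter simultaneously without temporary variables.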

3. plotdata.m

function plotdata(x,Y)
% Scatter plot of the training data.

figure;

plot(x, Y, 'rx', 'MarkerSize', 10); % 'rx' draws red crosses.

ylabel('Profit in $10,000s');

xlabel('Population of city in 10,000s');

end

4. RunLinearRegression.m

% This file runs univariate linear regression to predict profits of food trucks based on previous

% actual values of profits in $10,000s in various cities with populations in 10,000s respectively. 

clear ; close all; clc ;

fprintf('Plotting data\n');

data = load('data_.txt');
x = data(:, 1); Y = data(:, 2);
n = length(Y); % Number of training examples.

plotdata(x, Y);

fprintf('Program paused, press enter to continue\n');

pause;

x = [ones(n, 1), data(:,1)]; 
Theta = zeros(2, 1);

noi = 1500;   % Number of iterations in gradient descent. 
Alpha = 0.01; % Learning rate.

fprintf('Testing the cost function\n')

j = computecost(x, Y, Theta);
fprintf('With Theta = [0 ; 0]\nCost computed = %f\n', j);
fprintf('Expected cost value (approx) 32.07\n');

j = computecost(x, Y, [-1 ; 2]);
fprintf('With theta = [-1 ; 2]\nCost computed = %f\n', j);
fprintf('Expected cost value (approx) 54.24\n');

fprintf('Program paused, press enter to continue\n');

pause;

fprintf('Running gradient descent\n');

Theta = gradientdescent(x, Y, Theta, Alpha, noi);

fprintf('Theta found by gradient descent\n');
fprintf('%f\n', Theta);
fprintf('Expected Theta vector (approx)\n');
fprintf(' -3.6303\n  1.1664\n\n');

hold on; % To plot hypothesis on data. 

plot(x(:, 2), x * Theta, '-');
legend('Training data', 'Linear regression');

predict1 = [1, 3.5] * Theta;
fprintf('For population = 35,000, we predict a profit of %f\n',...
    predict1*10000);

predict2 = [1, 7] * Theta;
fprintf('For population = 70,000, we predict a profit of %f\n',...
    predict2*10000);

5. Sample data