Linear Regression Algorithms

陈大愚

Last modified: 2023-04-19 15:43:12

Category: Machine Learning. Tags: regression, machine learning, python

First published: 2022-08-08 22:02:34

Article link: https://blog.csdn.net/qq_45190143/article/details/126237178

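Introduction to Linear Regression

Definition: linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (features) and a dependent variable (target).
Types of relationship a linear regression model can express:
- Linear relationship
- Nonlinear relationship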

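Getting Started with the Linear Regression API

$h(w)=w_{1} x_{1}+w_{2} x_{2}+w_{3} x_{3}+\ldots+b=w^{T} x+b$

sklearn.linear_model.LinearRegression(fit_intercept=True)
- Parameters:
  - fit_intercept: whether to fit the bias (intercept) term
- Attributes:
  - LinearRegression.intercept_: the bias (intercept)
  - LinearRegression.coef_: the regression coefficients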

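Steps

Get the dataset
Basic data processing (omitted in this example)
Feature engineering (omitted in this example)
Machine learning (model training)
Model evaluation (omitted in this example)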

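Code

# Import the module
from sklearn.linear_model import LinearRegression

# Build the dataset: each row is one sample with two features
x = [[80, 86],
     [82, 80],
     [85, 78],
     [90, 90],
     [86, 82],
     [82, 90],
     [78, 80],
     [92, 94]]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]

# Machine learning: model training

# Instantiate the estimator
estimator = LinearRegression()
# Train with the fit method
estimator.fit(x, y)
# Print the learned coefficients
print("Linear regression coefficients:\n", estimator.coef_)
# Print the prediction for a new sample
print("Linear regression prediction:\n", estimator.predict([[100, 80]]))

Linear regression coefficients:
 [0.3 0.7]
Linear regression prediction:
 [86.]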

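Loss and Optimization in Linear Regression

Predictions usually differ from the true values by some error, and we need a way to measure that error.

Loss function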

$\begin{aligned} \mathrm{J}(\mathrm{w}) &=\left(\mathrm{h}\left(x_{1}\right)-y_{1}\right)^{2}+\left(\mathrm{h}\left(x_{2}\right)-y_{2}\right)^{2}+\cdots+\left(\mathrm{h}\left(x_{m}\right)-y_{m}\right)^{2} \\ &=\sum_{i=1}^{m}\left(\mathrm{~h}\left(x_{i}\right)-y_{i}\right)^{2} \end{aligned}$

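$y_{i}$ is the true value of the i-th training sample.
$h(x_{i})$ is the prediction function applied to the feature values of the i-th training sample.
This loss is also known as the least squares loss.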
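
As a small illustration of the loss above, the sketch below evaluates J(w) for two candidate weight vectors on the first four samples of the earlier toy dataset. The function name squared_error_loss and the second candidate [0.5, 0.5] are assumptions made for this example only.

```python
import numpy as np

# First four samples of the toy dataset used earlier in this article
X = np.array([[80, 86], [82, 80], [85, 78], [90, 90]], dtype=float)
y = np.array([84.2, 80.6, 80.1, 90.0])


def squared_error_loss(w, b, X, y):
    """Sum of squared residuals: J(w) = sum_i (h(x_i) - y_i)^2."""
    predictions = X @ w + b
    return float(np.sum((predictions - y) ** 2))


# The weights found by LinearRegression fit these samples almost exactly
good = squared_error_loss(np.array([0.3, 0.7]), 0.0, X, y)
# A hypothetical worse guess gives a clearly larger loss
bad = squared_error_loss(np.array([0.5, 0.5]), 0.0, X, y)
print(good, bad)
```

A smaller J(w) means the weights fit the samples better, which is what the optimization methods in the next section try to achieve.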

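Optimization

The goal is to find the weights w that give the smallest loss.

Two optimization algorithms are commonly used for linear regression:
- Normal equation
- Gradient descent

Normal equation

$w=\left(X^{T} X\right)^{-1} X^{T} y$

Interpretation: X is the feature matrix and y is the target vector. The formula gives the optimal weights directly.
Drawback: when there are many features, the computation becomes too slow and may not produce a result.
(The normal equation gets there in one step; gradient descent gets there step by step.)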
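
The normal equation can be tried directly with NumPy. This is a minimal sketch on the toy dataset from earlier, assuming the bias is handled by prepending a column of ones (the name X_b is chosen for illustration); it solves the linear system instead of forming the inverse explicitly.

```python
import numpy as np

X = np.array([[80, 86], [82, 80], [85, 78], [90, 90],
              [86, 82], [82, 90], [78, 80], [92, 94]], dtype=float)
y = np.array([84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4])

# Prepend a column of ones so the bias is learned as the first weight
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve (X^T X) w = X^T y, which is equivalent to w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(w)  # bias followed by the two feature weights
```

The two feature weights should agree with the coef_ values printed by LinearRegression in the earlier example.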

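Gradient descent

$\theta_{i+1}=\theta_{i}-\alpha \frac{\partial}{\partial \theta_{i}} J(\theta)$

For a single-variable function, the gradient is just the derivative: the slope of the tangent at a given point.
For a multivariable function, the gradient is a vector, and its direction is the direction in which the function increases fastest at the given point.
α is called the learning rate or step size.
The minus sign in front of the gradient means we move in the direction of descent.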
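
To make the update rule concrete, here is a minimal single-variable sketch. The objective J(θ) = (θ - 3)^2, the starting point, and the helper name gradient_descent are all assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, alpha=0.1, n_steps=100):
    """Repeat the update theta <- theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta


# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at theta = 3
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_min)
```

With this step size each iteration shrinks the distance to the minimum by a constant factor, so the result ends up very close to 3.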

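Normal equation vs. gradient descent

Gradient descent	Normal equation
Requires choosing a learning rate	No learning rate needed
Requires iterative solving	Solved in a single computation
Works when the number of features is large	Requires solving the equation, with time complexity O(n^3)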

How to choose an algorithm:

Small datasets:
- Normal equation: LinearRegression (cannot address overfitting)
- Ridge regression
Large datasets:
- Gradient descent: SGDRegressor

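Gradient Descent Algorithms

α (step size): determines how far each iteration moves along the negative gradient direction.
x (features): the input part of a sample.
$h_{\theta}(x)=\theta_{0}+\theta_{1} x$ (hypothesis function): in supervised learning, the function used to fit the input samples, written $h_{\theta}(x)$.
Loss function: measures how well the model fits, so that the quality of the fit can be evaluated.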

Variants of gradient descent

Full gradient descent (FG),
Stochastic gradient descent (SG),
Mini-batch gradient descent,
Stochastic average gradient descent (SAG)

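Full gradient descent (FG)

$\theta_{i}=\theta_{i}-\alpha \sum_{j=1}^{m}\left(h_{\theta}\left(x_{0}^{(j)}, x_{1}^{(j)}, \ldots x_{n}^{(j)}\right)-y_{j}\right) x_{i}^{(j)}$

Also called batch gradient descent: every parameter update uses all of the samples.
It computes the error of every sample in the training set, then sums and averages them to form the objective function.
Because the gradient is computed over the whole dataset, each update is slow.
Batch gradient descent also cannot update the model online, that is, new samples cannot be added while it is running.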
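
A minimal NumPy sketch of full (batch) gradient descent follows. The synthetic noise-free data, the step size, and the iteration count are assumptions for illustration; the gradient is averaged over all m samples, as described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]     # synthetic, noise-free targets
X_b = np.hstack([np.ones((100, 1)), X])     # x_0 = 1 carries the bias

theta = np.zeros(3)
alpha = 0.1
for _ in range(500):
    residual = X_b @ theta - y              # h_theta(x^(j)) - y_j for every sample
    gradient = X_b.T @ residual / len(y)    # averaged over all m samples
    theta -= alpha * gradient
print(theta)
```

Every update touches the whole dataset, which is why this variant becomes slow as the number of samples grows.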

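Stochastic gradient descent (SG)

$\theta_{i}=\theta_{i}-\alpha\left(h_{\theta}\left(x_{0}^{(j)}, x_{1}^{(j)}, \ldots x_{n}^{(j)}\right)-y_{j}\right) x_{i}^{(j)}$

The objective computed in each step is no longer the error over all samples but the error of a single sample: each step uses the gradient of one sample's objective to update the weights, then moves on to the next sample, repeating until the loss stops decreasing or falls below a tolerable threshold.
However, because SG uses only one sample per iteration, noisy samples can easily lead it into a local optimum.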
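
The single-sample update can be sketched in a few lines of NumPy. The synthetic noise-free data, the step size, and the fixed number of epochs (instead of a loss-based stopping rule) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]     # synthetic, noise-free targets
X_b = np.hstack([np.ones((100, 1)), X])     # x_0 = 1 carries the bias

theta = np.zeros(3)
alpha = 0.01
for epoch in range(50):
    for j in rng.permutation(len(y)):       # visit the samples in random order
        residual = X_b[j] @ theta - y[j]    # error on one sample only
        theta -= alpha * residual * X_b[j]
print(theta)
```

Each update is cheap because it looks at one sample, at the cost of a noisier path toward the minimum.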

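Mini-batch gradient descent (mini-batch)

$\theta_{i}=\theta_{i}-\alpha \sum_{j=t}^{t+x-1}\left(h_{\theta}\left(x_{0}^{(j)}, x_{1}^{(j)}, \ldots x_{n}^{(j)}\right)-y_{j}\right) x_{i}^{(j)}$

Each step randomly draws a small subset of the training samples and applies an FG update on that subset. In the formula, the x in the upper limit of the sum is the batch size.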
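
A mini-batch version might look like the sketch below. The batch size of 10, the step size, and the synthetic data are assumptions for illustration; the gradient is averaged within each batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]     # synthetic, noise-free targets
X_b = np.hstack([np.ones((100, 1)), X])     # x_0 = 1 carries the bias

theta = np.zeros(3)
alpha = 0.05
batch_size = 10
for epoch in range(200):
    order = rng.permutation(len(y))         # reshuffle the samples every epoch
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        residual = X_b[batch] @ theta - y[batch]
        theta -= alpha * X_b[batch].T @ residual / len(batch)
print(theta)
```

The batch size trades off the stability of FG against the cheap updates of SG.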

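Stochastic average gradient descent (SAG)

$\theta_{i}=\theta_{i}-\frac{\alpha}{n}\left(h_{\theta}\left(x_{0}^{(j)}, x_{1}^{(j)}, \ldots x_{n}^{(j)}\right)-y_{j}\right) x_{i}^{(j)}$

SAG keeps a stored gradient for each sample and uses the average of these stored gradients in later updates.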
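
The sketch below shows one common way to realize the SAG idea: remember the most recent gradient of every sample and step along the average of the remembered gradients. The variable names stored and grad_sum, the step size, and the synthetic data are assumptions; this is not how scikit-learn implements its sag solver.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.normal(size=(m, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]     # synthetic, noise-free targets
X_b = np.hstack([np.ones((m, 1)), X])       # x_0 = 1 carries the bias

theta = np.zeros(3)
stored = np.zeros((m, 3))    # last gradient seen for each sample
grad_sum = np.zeros(3)       # running sum of the stored gradients
alpha = 0.01
for _ in range(20000):
    j = rng.integers(m)                             # pick one sample at random
    new_grad = (X_b[j] @ theta - y[j]) * X_b[j]     # its current gradient
    grad_sum += new_grad - stored[j]                # swap the old gradient for the new one
    stored[j] = new_grad
    theta -= alpha * grad_sum / m                   # step along the average stored gradient
print(theta)
```

Each iteration still computes only one sample's gradient, but the update direction reflects information from all samples seen so far.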

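Using the APIs

Normal equation:
- sklearn.linear_model.LinearRegression(fit_intercept=True)
- Parameters:
  - fit_intercept: whether to fit the bias (intercept) term
- Attributes:
  - LinearRegression.coef_: regression coefficients
  - LinearRegression.intercept_: bias (intercept)
Gradient descent:
- sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)
- Definition: implements stochastic gradient descent learning; it supports different loss functions and regularization penalties for fitting linear regression models
- Parameters:
  - loss: the loss type
    - loss="squared_loss": ordinary least squares (renamed "squared_error" in scikit-learn 1.0; the old name was removed in 1.2)
  - fit_intercept: whether to fit the bias (intercept) term
  - learning_rate: string, optional
    - the learning rate schedule
    - 'constant': eta = eta0
    - 'optimal': eta = 1.0 / (alpha * (t + t0))
    - 'invscaling': eta = eta0 / pow(t, power_t) [default]
    - for a constant learning rate, use learning_rate='constant' and set the rate with eta0
- Attributes:
  - SGDRegressor.coef_: regression coefficients
  - SGDRegressor.intercept_: bias (intercept)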
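
As a quick comparison of the two estimators, the sketch below fits both on synthetic, already standardized data (an assumption for illustration). The loss argument is left at its default so the snippet does not depend on which name a given scikit-learn version uses for the squared loss.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1]     # synthetic, noise-free targets

# Normal-equation style solver
normal_eq = LinearRegression(fit_intercept=True).fit(X, y)

# Stochastic gradient descent with a small constant learning rate
sgd = SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=1000, random_state=0).fit(X, y)

print(normal_eq.coef_, normal_eq.intercept_)
print(sgd.coef_, sgd.intercept_)
```

The two sets of coefficients should be close; SGDRegressor applies a small L2 penalty by default, so an exact match is not expected.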

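Case Study: Boston House Price Prediction

Basic steps

Get the data
Basic data processing: split the data
Feature engineering: standardization
Machine learning: linear regression
Model prediction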
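
The split and standardization steps can be sketched on their own. The data here is synthetic (a stand-in, not the Boston dataset), and the variable names x_train_scaled and x_test_scaled are chosen for illustration. The sketch follows the usual practice of fitting the scaler on the training split only and reusing its statistics on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(22)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic stand-in for real features
y = X @ np.array([1.0, -2.0, 0.5])

# Basic data processing: split the data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

# Feature engineering: standardization
transfer = StandardScaler()
x_train_scaled = transfer.fit_transform(x_train)   # learn mean and std from the training split
x_test_scaled = transfer.transform(x_test)         # reuse those statistics on the test split

print(x_train_scaled.mean(axis=0), x_test_scaled.mean(axis=0))
```

The scaled training features have zero mean by construction, while the scaled test features are only approximately centered because they use the training statistics.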

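Regression performance evaluation

Mean squared error (MSE):

$E=\frac{1}{m} \sum_{i=1}^{m}\left(y^{i}-\bar{y}\right)^{2}$
(Note: $y^{i}$ is the predicted value and $\bar{y}$ is the true value.)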

API call:

sklearn.metrics.mean_squared_error(y_true, y_pred)
- Mean squared error regression loss
- y_true: true values
- y_pred: predicted values
- return: a floating-point result
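
To see what mean_squared_error computes, the sketch below compares it with a manual calculation. The four true and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical true values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

# Average of the squared differences, written out by hand
manual = np.mean((y_pred - y_true) ** 2)
# The same quantity from scikit-learn
library = mean_squared_error(y_true, y_pred)
print(manual, library)
```

Both lines compute (0.25 + 0.25 + 0 + 1) / 4 = 0.375.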

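Normal equation

# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

# Get the data
boston = load_boston()
# print(boston)

# Basic data processing: split the data
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22, test_size=0.2)

# Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Note: fit_transform re-estimates the mean and standard deviation on the test set;
# transfer.transform(x_test) would instead reuse the training-set statistics
x_test = transfer.fit_transform(x_test)

# Machine learning: linear regression
estimator = LinearRegression()
estimator.fit(x_train, y_train)

print("The model's intercept is:\n", estimator.intercept_)
print("The model's coefficients are:\n", estimator.coef_)

# Model evaluation
y_pre = estimator.predict(x_test)
print("Predicted values:\n", y_pre)

ret = mean_squared_error(y_test, y_pre)
print("Mean squared error:\n", ret)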

The model's intercept is:
 22.57970297029704
The model's coefficients are:
 [-0.73088157  1.13214851 -0.14177415  0.86273811 -2.02555721  2.72118285
 -0.1604136  -3.36678479  2.5618082  -1.68047903 -1.67613468  0.91214657
 -3.79458347]
Predicted values:
 [27.79728567 30.90056436 20.70927059 31.59515005 18.71926707 18.46483447
 20.7090385  18.01249201 18.18443754 32.26228416 20.45969144 27.30025768
 15.04218041 19.25382799 36.18076812 18.45209512  7.73077544 17.33936848
 29.40094704 23.32172471 18.43837789 33.31097321 28.38611788 17.43787678
 34.25179785 26.06150404 34.65387545 26.07481562 19.13116067 12.66351087
 30.00302966 14.70773445 36.82392563  9.08197058 15.06703028 16.68218611
  7.99793409 19.41266159 39.15193917 27.42584071 24.24171273 16.93863931
 38.03318373  6.63678428 21.51394405 24.41042009 18.86273557 19.87843319
 15.71796503 26.48901546  8.09589057 26.90160249 29.19481155 16.86472843
  8.47361081 34.87951213 32.41546229 20.50741461 16.27779646 20.32570308
 22.82622646 23.45866662 19.01451735 37.50382701 23.61872796 19.43409925
 12.98316226  6.99153964 40.99988893 20.87265869 16.74869905 20.79222071
 39.90859398 20.20645238 36.15225857 26.80056368 19.20376894 19.60725424
 24.04458577 20.45114082 30.47485108 19.09694834 22.55307626 30.77038574
 26.2119968  20.48073193 28.53910224 20.16485961 25.94461242 19.13440772
 24.98211795 22.84782867 19.18212763 18.88071352 14.49151931 17.78587168
 24.00230395 16.01304321 20.51185516 26.1867442  20.64288449 17.35297955]
Mean squared error:
 20.955979753963458

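Gradient descent

# Reuses the imports from the normal-equation example above

# Get the data
boston = load_boston()
# print(boston)

# Basic data processing: split the data (no random_state, so the split differs between runs)
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)

# Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Note: fit_transform re-estimates the mean and standard deviation on the test set;
# transfer.transform(x_test) would instead reuse the training-set statistics
x_test = transfer.fit_transform(x_test)

# Machine learning: linear regression (gradient descent)
estimator = SGDRegressor(max_iter=1000, learning_rate="constant", eta0=0.1)
# estimator = SGDRegressor(max_iter=1000)
estimator.fit(x_train, y_train)

print("The model's intercept is:\n", estimator.intercept_)
print("The model's coefficients are:\n", estimator.coef_)

# Model evaluation
y_pre = estimator.predict(x_test)
print("Predicted values:\n", y_pre)

ret = mean_squared_error(y_test, y_pre)
print("Mean squared error:\n", ret)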

The model's intercept is:
 [-24556.94860587]
The model's coefficients are:
 [-54338.36041772 -58971.82122396 -14988.8475472   21426.5171881
  44093.50676334  -8239.40911211  58147.3993348  -71356.22105643
  36378.36967637   8651.22614696  34871.06844508 -25702.00089769
  86324.36526027]
Predicted values:
 [ 256259.55717189  295160.08344121  334232.66277692  323111.01316478
 -240379.26932915 -284710.12561722 -177967.38248788 -175407.7082281
  481352.95314898  141948.90355711 -253688.1618531   117058.12693854
   80839.99767867  333823.93933158  300792.36580634 -243443.59037812
 -133678.54082743   17676.99101496   46533.14528816 -283849.37323857
 -641052.4386125   334767.43146276   29259.26820306  440841.27199401
  413223.8467386  -564632.54610825  212298.13008897   90754.35549796
  138007.58843538   70397.51309435  -42114.17600931  203691.00515853
  143112.34285012 -268234.07752995 -517254.3085209    33325.41214605
 -247988.34523958   35075.83621924 -135902.32332184  -58515.89124101
  361432.89097155  -97537.54616215 -188276.35321062 -321979.3128175
  -91952.95293004   21941.90654747 -259113.42556611  475251.97358102
  395244.68645323 -346927.14507805 -197760.98033439 -183478.57707744
  -78445.60287813  247451.06609472  462729.99505565  -30491.38107563
  -61960.87654616 -429074.73959691  193748.93524211   42225.39057277
  363199.19306549  198388.22890255 -343287.19893353 -183841.61570433
  -17549.60112857 -243263.67341275  -72012.76679515 -164827.46357463
  230926.35598974 -172018.70490432 -343687.34819693  431051.00728167
 -334118.65717532 -768347.97337917  263130.18724857  -17313.84687133
 -506895.84287699   17403.61668976 -314254.02137436 -215977.13949945
   89730.81985328  331463.10515049  188827.62470972   71702.8191361
 -170923.86761718 -566931.26283155  286197.36108306  -38044.88067351
  102396.6380818  -233959.01136794  -22803.9314517  -410165.31677214
 -122238.31841663 -477822.57744396  340287.0707774   317764.21160364
  -66413.20297383 -249606.38509935 -106186.44721741 -536394.30061638
  389296.62391045   54558.30111478]
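Mean squared error:
 81040831966.05513

The very large coefficients and error suggest that training diverged in this run, most likely because the constant learning rate eta0=0.1 is too large for this data. A smaller eta0, or the default learning-rate schedule in the commented-out line, is worth trying.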

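Overfitting and Underfitting

Underfitting:
- The model performs poorly on the training set and also poorly on the test set
- Remedies:
  - Add other features
  - Add polynomial features
Overfitting:
- The model performs well on the training set but poorly on the test set
- Remedies:
  - Clean the dataset again
  - Increase the amount of training data
  - Regularization
  - Reduce the feature dimension to avoid the curse of dimensionality
Regularization:
- Definition: prevents overfitting by constraining the coefficients of the higher-order terms
- L1 regularization: sets the coefficients of some higher-order terms exactly to 0
  - Lasso regression
- L2 regularization: shrinks the coefficients of the higher-order terms to very small values
  - Ridge regression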

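Regularized Linear Models

Ridge Regression

$J(\theta)=\operatorname{MSE}(\theta)+\alpha \sum_{i=1}^{n} \theta_{i}^{2}$

Ridge regression is a regularized version of linear regression: a regularization term is added to the original linear regression cost function so that the model fits the data while keeping the weights as small as possible.
In other words, it adds the squared coefficients to the cost, which limits how large the coefficients can get.
The smaller α is, the larger the coefficients; the larger α is, the smaller the coefficients.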
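
The effect of α on the coefficients can be observed directly. The sketch below uses synthetic data and three arbitrary α values (both assumptions for illustration). Note that scikit-learn's Ridge scales the data-fit term as a sum of squared errors and not as a mean, so its alpha is not numerically identical to the α in the formula above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

norms = []
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coef)))   # overall size of the weight vector
    print(alpha, np.round(coef, 3))
```

The norm of the coefficient vector shrinks as alpha grows, matching the statement that a larger α gives smaller coefficients.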

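Lasso Regression

$J(\theta)=\operatorname{MSE}(\theta)+\alpha \sum_{i=1}^{n}\left|\theta_{i}\right|$

The penalty uses the absolute values of the coefficients.
Because the absolute value is not differentiable at its vertex (zero), the optimization drives many coefficients to exactly 0, so the result is a sparse weight vector.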
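
The sparsity produced by the L1 penalty can be seen on a small synthetic problem in which only two of five features influence the target. The data and the value alpha=0.5 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are irrelevant
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))   # coefficients that are exactly zero
print(lasso.coef_, n_zero)
```

The coefficients of the irrelevant features come out as exact zeros, while the two useful coefficients are shrunk toward zero but stay nonzero.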

Elastic Net

$J(\theta)=\operatorname{MSE}(\theta)+r \alpha \sum_{i=1}^{n}\left|\theta_{i}\right|+\frac{1-r}{2} \alpha \sum_{i=1}^{n} \theta_{i}^{2}$

It combines the previous two penalties.
A mixing ratio r is introduced: r=0 gives ridge regression, and r=1 gives Lasso regression.
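
In scikit-learn the mixing ratio r corresponds to the l1_ratio parameter of ElasticNet. The sketch below, on synthetic data chosen for illustration, checks the r=1 case against Lasso and also fits an even mix of the two penalties.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)    # r = 1: pure L1 penalty
enet_mix = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # r = 0.5: half L1, half L2

print(lasso.coef_)
print(enet_l1.coef_)
print(enet_mix.coef_)
```

With l1_ratio=1.0 the Elastic Net coefficients coincide with the Lasso coefficients.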

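Early Stopping

Stop training when the validation error reaches its minimum.
In practice this is done by setting a threshold on the error and stopping once it is met.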
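
One simple way to implement early stopping is to train one epoch at a time and stop once the validation error has not improved for a few epochs. The sketch below does this with SGDRegressor.partial_fit; the synthetic data, the patience of 5 epochs, and the learning rate are assumptions for illustration, and a fuller version would also restore the weights from the best epoch.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.001, random_state=0)
best_error, best_epoch, patience, waited = float("inf"), 0, 5, 0
for epoch in range(200):
    model.partial_fit(x_train, y_train)     # one pass over the training data
    val_error = mean_squared_error(y_val, model.predict(x_val))
    if val_error < best_error:
        best_error, best_epoch, waited = val_error, epoch, 0
    else:
        waited += 1
        if waited >= patience:              # no improvement for `patience` epochs
            break
print(best_epoch, best_error)
```

SGDRegressor also accepts an early_stopping=True option that holds out a validation fraction internally, which may be more convenient than a manual loop.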

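Using the Ridge Regression API

sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)
- Linear regression with L2 regularization
- alpha: regularization strength (typical ranges: 0 to 1, or 1 to 10)
- solver: chooses the optimization method automatically based on the data (sag: stochastic average gradient descent, a good choice when both the dataset and the number of features are large)
- normalize: whether to standardize the data (not needed here because the data was standardized earlier; this parameter was deprecated in scikit-learn 1.0 and removed in 1.2)
- Attributes:
  - Ridge.coef_: regression weights
  - Ridge.intercept_: regression bias
sklearn.linear_model.RidgeCV(_BaseRidgeCV, RegressorMixin)
- Linear regression with L2 regularization and built-in cross-validation
- coef_: regression coefficients
alpha: regularization strength
- The stronger the regularization, the smaller the weight coefficients
- The weaker the regularization, the larger the weight coefficients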

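from sklearn.linear_model import Ridge, RidgeCV

# Machine learning: linear regression (ridge regression)
# Reuses x_train, x_test, y_train, y_test and mean_squared_error from the examples above
# estimator = Ridge(alpha=1)
estimator = RidgeCV(alphas=(0.1, 1, 10))
estimator.fit(x_train, y_train)

# Model evaluation
y_predict = estimator.predict(x_test)
print("Predicted values:\n", y_predict)
# Mean squared error
error = mean_squared_error(y_test, y_predict)
print("Error:\n", error)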

Predicted values:
 [27.70161604 30.69198547 20.83293806 31.37993352 19.06886905 18.46499899
 20.76434753 18.2087364  18.48509569 31.9167732  20.57736653 26.88095143
 15.13391718 19.38341927 35.95783246 18.2489176   8.30254898 17.59236763
 29.48841298 23.38417563 18.41239891 32.99997871 28.11719645 17.2524964
 33.93017182 25.76855298 34.07662551 26.12377425 18.98332919 13.70504553
 29.75130577 14.08245618 36.66838496  9.42992218 15.39829098 16.38059949
  8.13674968 19.23355789 38.86737442 27.71325959 24.28681416 17.02344926
 38.03981891  6.57716147 21.20739634 24.15864691 19.22136751 20.02583079
 15.72281764 26.27039269  8.89642718 26.59251088 29.12520923 16.78458923
  8.68635424 34.47348662 31.56930862 21.30448809 16.47327242 20.52293343
 22.84054023 23.25324333 19.32834221 37.0459777  24.29366892 19.26138453
 13.19474709  6.9174831  40.91101695 20.93104583 16.36981652 21.14627646
 39.65318001 20.56130518 35.80757432 26.57356025 20.10938353 19.71277372
 24.15586543 21.65389104 30.50234679 19.15218487 22.61455148 30.60026088
 26.32261595 20.39646938 28.25519618 20.53674089 26.04721145 18.4195752
 24.63530895 22.69959017 19.19323207 19.25652691 14.6268253  17.69199436
 23.75279201 16.00619317 20.24243978 26.13593095 20.44423499 17.41745467]
Error:
 20.935600218331768

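Saving and Loading Models

Training a model takes time, so a trained model can be saved ahead of time and loaded directly the next time it is needed.

import joblib (in old scikit-learn versions: from sklearn.externals import joblib, which was removed in 0.23)
- Save: joblib.dump(estimator, 'test.pkl')
- Load: estimator = joblib.load('test.pkl')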
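
A self-contained round trip might look like the sketch below. It uses synthetic data and a temporary directory (both assumptions for illustration) so that nothing is left on disk afterwards.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0])

estimator = Ridge(alpha=1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.pkl")
    joblib.dump(estimator, path)     # save the fitted model
    restored = joblib.load(path)     # load it back into a new variable

# The restored model makes the same predictions as the original
same = np.allclose(restored.predict(X), estimator.predict(X))
print(same)
```

The loaded object is a regular fitted estimator, so predict, coef_, and intercept_ work as usual.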

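import joblib
from sklearn.linear_model import Ridge

# 4. Machine learning: linear regression (ridge regression)
# 4.1 Train the model
estimator = Ridge(alpha=1)
estimator.fit(x_train, y_train)

# 4.2 Save the model (the ./data directory must already exist)
joblib.dump(estimator, "./data/test.pkl")

# 4.3 Load the model (the loaded model must be assigned to a variable)
estimator = joblib.load("./data/test.pkl")

# 5. Model evaluation
# 5.1 Get the predictions, coefficients, and intercept
y_predict = estimator.predict(x_test)
print("Predicted values:\n", y_predict)
print("Model coefficients:\n", estimator.coef_)
print("Model intercept:\n", estimator.intercept_)
# 5.2 Evaluate
# Mean squared error
error = mean_squared_error(y_test, y_predict)
print("Error:\n", error)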

Predicted values:
 [27.78873457 30.88050916 20.7278544  31.57167054 18.75977878 18.46287435
 20.719399   18.02915627 18.21657378 32.22520354 20.47916624 27.2511531
 15.05294942 19.27076022 36.15980162 18.42723014  7.79369197 17.3680282
 29.41241038 23.3281782  18.43319722 33.27590755 28.35135334 17.41391778
 34.21591336 26.0296379  34.5838712  26.07766082 19.11301251 12.78779461
 29.97371042 14.61585424 36.81099837  9.12517646 15.10411068 16.64292828
  8.01158958 19.38880315 39.11889581 27.46040948 24.24640334 16.94638444
 38.04105813  6.62389382 21.47631033 24.37881408 18.9059369  19.89761521
 15.7132391  26.46605334  8.2025957  26.86177914 29.18720561 16.85456495
  8.49458658 34.83395243 32.30959733 20.61181045 16.29762508 20.34090122
 22.830845   23.43394661 19.05120909 37.44873593 23.7104196  19.41216025
 13.00521048  6.97818359 40.99752892 20.87747386 16.7024726  20.82862838
 39.88275744 20.24968244 36.11441189 26.7758095  19.31349113 19.6245034
 24.0553401  20.6065828  30.48446232 19.0989152  22.55823821 30.74827306
 26.23282983 20.46955834 28.50915212 20.20310585 25.96357801 19.03376439
 24.9380452  22.82851545 19.181656   18.93235434 14.50560797 17.77224626
 23.97068811 16.0106808  20.47816964 26.18277209 20.61876916 17.35873382]
Model coefficients:
 [-0.71845222  1.10899367 -0.16686116  0.86602377 -1.99228506  2.73072103
 -0.16588061 -3.32715968  2.48093423 -1.60572264 -1.66849991  0.91120104
 -3.78188807]
Model intercept:
 22.57970297029704
Error:
 20.945827015197693