Ordinary Least Squares (OLS)
Linear Models
For regression tasks, a linear model fits the target variable with a linear function of the features. The general form is:
$$\hat{y}(w, x) = w_0 + w_1 x_1 + \cdots + w_p x_p$$
In scikit-learn's linear models, the attribute coef_ stores the coefficient vector $w = (w_1, \cdots, w_p)$, while intercept_ stores the intercept term $w_0$.
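As a quick illustration of how these attributes map onto the formula, the following sketch (the data here is made up) rebuilds a prediction by hand as $w_0 + w_1 x_1 + w_2 x_2$:
import numpy as np
from sklearn.linear_model import LinearRegression
# Toy data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
reg = LinearRegression().fit(X, y)
# Rebuild the prediction by hand from intercept_ and coef_
x_new = np.array([2.0, 5.0])
manual = reg.intercept_ + reg.coef_ @ x_new
print(manual, reg.predict([x_new])[0])  # both ≈ 20.0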
Linear Regression
LinearRegression fits a linear model by minimizing the residual sum of squares between the observed targets and the predicted values. The objective function is:
$$\min_{w} ||Xw - y||_2^2 = \min_{w} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
The code is as follows:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_
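For this toy data the fit is exact: reg.coef_ returns array([0.5, 0.5]) and reg.intercept_ is (up to floating point) 0, i.e. $\hat{y} = 0.5 x_1 + 0.5 x_2$.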
The parameters are estimated by ordinary least squares (OLS). Note in particular that OLS assumes the features are mutually independent. When features are multicollinear, this leads to:
- coefficient estimates that are highly sensitive to random errors in the observed target, making them unreliable;
- large variance in the estimates.
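A small sketch of this fragility on synthetic data: when two features are nearly identical, slightly different noise realizations produce wildly different coefficient estimates, even though the predictions barely change:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 1e-4 * rng.randn(100)  # x2 is almost perfectly collinear with x1
X = np.column_stack([x1, x2])
for seed in range(3):
    noise = 0.1 * np.random.RandomState(seed).randn(100)
    y = 2 * x1 + noise
    print(LinearRegression().fit(X, y).coef_)  # coefficients swing wildly between runs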
Example
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]  # all but the last 20 samples
diabetes_X_test = diabetes_X[-20:]  # the last 20 samples
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create and train a linear regression model on the training set
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions on the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
print("Coefficients:\n", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
The output is:
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
Visualize the result:
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Non-Negative Least Squares (NNLS)
If all coefficients must be constrained to be non-negative, the following small example fits a linear model with a non-negativity constraint on its regression coefficients.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
np.random.seed(42)
n_samples, n_features = 200, 50
X = np.random.randn(n_samples, n_features)
true_coef = 3 * np.random.randn(n_features)
# Threshold coefficients to render them non-negative
true_coef[true_coef < 0] = 0  # set negative coefficients to 0
y = np.dot(X, true_coef)
# Add some noise
y += 5 * np.random.normal(size=(n_samples,))
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
from sklearn.linear_model import LinearRegression
reg_nnls = LinearRegression(positive=True)  # positive=True enforces the non-negativity constraint
y_pred_nnls = reg_nnls.fit(X_train, y_train).predict(X_test)
r2_score_nnls = r2_score(y_test, y_pred_nnls)
print("NNLS R2 score", r2_score_nnls)
The output is:
NNLS R2 score 0.8225220806196526
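As an aside, positive=True makes scikit-learn solve a non-negative least squares problem; a rough sketch of the same fit using SciPy directly is below (it omits the intercept for simplicity, which scikit-learn handles by centering the data):
from scipy.optimize import nnls
coef_nnls, res_norm = nnls(X_train, y_train)  # non-negative coefficient vector and residual norm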
For comparison, fit an ordinary least squares model on the same split:
reg_ols = LinearRegression()
y_pred_ols = reg_ols.fit(X_train, y_train).predict(X_test)
r2_score_ols = r2_score(y_test, y_pred_ols)
print("OLS R2 score", r2_score_ols)
The output is:
OLS R2 score 0.7436926291700348
Visualize the comparison:
fig, ax = plt.subplots()
ax.plot(reg_ols.coef_, reg_nnls.coef_, linewidth=0, marker=".")
low_x, high_x = ax.get_xlim()
low_y, high_y = ax.get_ylim()
low = max(low_x, low_y)
high = min(high_x, high_y)
ax.plot([low, high], [low, high], ls="--", c=".3", alpha=0.5)  # alpha sets the line's transparency
ax.set_xlabel("OLS regression coefficients", fontweight="bold")
ax.set_ylabel("NNLS regression coefficients", fontweight="bold")
Comparing the OLS and NNLS regression coefficients shows that they are highly correlated; the difference is that under the non-negativity constraint, some of the NNLS coefficients shrink to 0.
Model Evaluation Metric
R² score
The coefficient of determination: the proportion of variance in the target that the model explains (1.0 is a perfect fit; a model that always predicts the mean of $y$ scores 0).
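Concretely, for $n$ samples with true targets $y_i$, predictions $\hat{y}_i$, and target mean $\bar{y}$, the score computed by r2_score is:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$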
Ridge Regression
Ridge regression is a biased-estimation regression method designed for the analysis of collinear data. It is essentially an improved least squares estimator: by giving up the unbiasedness of OLS, it trades some information and precision for regression coefficients that are more realistic and reliable, and it fits ill-conditioned data better than plain least squares. Its objective function is:
$$\min_{w} ||Xw - y||_2^2 + \alpha ||w||_2^2 = \min_{w} \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 + \alpha \sum_{j=1}^{p} w_j^2$$
Here $\alpha \geqslant 0$ is the regularization parameter: the larger $\alpha$ is, the more strongly the coefficients are shrunk toward 0; as $\alpha$ approaches 0, ridge regression approaches the ordinary least squares estimate.
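A minimal sketch of fitting a ridge model in scikit-learn on toy data, with $\alpha = 0.5$:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=0.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(reg.coef_)       # [0.34545455 0.34545455]
print(reg.intercept_)  # ≈ 0.1364
Increasing alpha shrinks coef_ further toward 0; with very small alpha the solution approaches the OLS fit.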
RidgeClassifier can be used for classification tasks, but this article does not discuss it and considers regression only.
Multiple Linear Regression
First, two concepts:
- Conditional dependence: the effect of $x_i$ on $y$ while holding all other features fixed (this is what a multiple-regression coefficient measures);
- Marginal dependence: the effect of $x_i$ on $y$ when the other variables are allowed to vary along with it.
A full example will follow in the next post; a quick sketch of the distinction is given below.
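The sketch uses made-up data: when $x_1$ and $x_2$ are correlated, the marginal slope of $y$ on $x_1$ alone differs from $x_1$'s conditional coefficient in the joint model:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
x1 = rng.randn(500)
x2 = 0.8 * x1 + 0.2 * rng.randn(500)  # x2 is correlated with x1
y = 1.0 * x1 + 2.0 * x2 + 0.1 * rng.randn(500)
# Conditional dependence: x1's coefficient with x2 held in the model (close to 1.0)
print(LinearRegression().fit(np.column_stack([x1, x2]), y).coef_)
# Marginal dependence: the slope when x2 is ignored (close to 1.0 + 2.0 * 0.8 = 2.6)
print(LinearRegression().fit(x1.reshape(-1, 1), y).coef_)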
Sources
- The scikit-learn documentation
- Machine learning linear models study check-in (机器学习线性模型打卡)