Ordinary Least Squares (OLS)
Linear Models
For regression tasks, a linear model fits the target variable with a linear function of the features. The general form is:
$$\hat{y}(w, x) = w_0 + w_1 x_1 + \cdots + w_p x_p$$
In scikit-learn's linear models, the attribute coef_ stores the coefficient vector $w = (w_1, \cdots, w_p)$, while intercept_ stores the intercept term $w_0$.
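As a quick illustration of how these attributes map onto the formula, the following sketch (the data here is made up) rebuilds a prediction by hand as $w_0 + w_1 x_1 + w_2 x_2$:
import numpy as np
from sklearn.linear_model import LinearRegression
# Toy data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
reg = LinearRegression().fit(X, y)
# Rebuild the prediction by hand from intercept_ and coef_
x_new = np.array([2.0, 5.0])
manual = reg.intercept_ + reg.coef_ @ x_new
print(manual, reg.predict([x_new])[0])  # both ≈ 20.0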
Linear Regression
LinearRegression fits a linear model by minimizing the residual sum of squares between the observed targets and the predicted values. The objective function is:
$$\min_{w} ||Xw - y||_2^2 = \min_{w} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
The code is as follows:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
reg.coef_
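For this toy data the fit is exact: reg.coef_ returns array([0.5, 0.5]) and reg.intercept_ is (up to floating point) 0, i.e. $\hat{y} = 0.5 x_1 + 0.5 x_2$.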
The parameters are estimated by ordinary least squares (OLS). Note in particular that OLS assumes the features are mutually independent. When features are multicollinear, this leads to:
- coefficient estimates that are highly sensitive to random errors in the observed target, making them unreliable;
- large variance in the estimates.
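A small sketch of this fragility on synthetic data: when two features are nearly identical, slightly different noise realizations produce wildly different coefficient estimates, even though the predictions barely change:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 1e-4 * rng.randn(100)  # x2 is almost perfectly collinear with x1
X = np.column_stack([x1, x2])
for seed in range(3):
    noise = 0.1 * np.random.RandomState(seed).randn(100)
    y = 2 * x1 + noise
    print(LinearRegression().fit(X, y).coef_)  # coefficients swing wildly between runs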
Example
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]  # all but the last 20 samples
diabetes_X_test = diabetes_X[-20:]  # the last 20 samples
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create and train a linear regression model on the training set
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions on the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
print("Coefficients:\n", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
The output is:
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
Visualize the result:
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Non-Negative Least Squares (NNLS)
If all coefficients must be constrained to be non-negative, the following small example fits a linear model with a non-negativity constraint on its regression coefficients.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
np.random.seed(42)
n_samples, n_features = 200, 50
X = np.random.randn(n_samples, n_features)
true_coef = 3 * np.random.randn(n_features)
# Threshold coefficients to render them non-negative
true_coef[true_coef < 0] = 0  # set negative coefficients to 0
y = np.dot(X, true_coef)
# Add some noise
y += 5 * np.random.normal(size=(n_samples,))
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
from sklearn.linear_model import LinearRegression
reg_nnls = LinearRegression(positive=True)  # positive=True enforces the non-negativity constraint
y_pred_nnls = reg_nnls.fit(X_train, y_train).predict(X_test)
r2_score_nnls = r2_score(y_test, y_pred_nnls)
print("NNLS R2 score", r2_score_nnls)
The output is:
NNLS R2 score 0.8225220806196526
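As an aside, positive=True makes scikit-learn solve a non-negative least squares problem; a rough sketch of the same fit using SciPy directly is below (it omits the intercept for simplicity, which scikit-learn handles by centering the data):
from scipy.optimize import nnls
coef_nnls, res_norm = nnls(X_train, y_train)  # non-negative coefficient vector and residual norm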
For comparison, fit an ordinary least squares model on the same split:
reg_ols = LinearRegression()
y_pred_ols = reg_ols.fit(X_train, y_train).predict(X_test)
r2_score_ols = r2_score(y_test, y_pred_ols)
print("OLS R2 score", r2_score_ols)
The output is:
OLS R2 score 0.7436926291700348
Visualize the comparison:
fig, ax = plt.subplots()
ax.plot(reg_ols.coef_, reg_nnls.coef_, linewidth=0, marker=".")
low_x, high_x = ax.get_xlim()
low_y, high_y = ax.get_ylim()
low = max(low_x, low_y)
high = min(high_x, high_y)
ax.plot([low, high], [low, high], ls="--", c=".3", alpha=0.5)  # alpha sets the line's transparency
ax.set_xlabel("OLS regression coefficients", fontweight="bold")
ax.set_ylabel("NNLS regression coefficients", fontweight="bold")
Comparing the OLS and NNLS regression coefficients shows that they are highly correlated; the difference is that under the non-negativity constraint, some of the NNLS coefficients shrink to 0.
Model Evaluation Metric
R² score
The coefficient of determination: the proportion of variance in the target that the model explains (1.0 is a perfect fit; a model that always predicts the mean of $y$ scores 0).
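Concretely, for $n$ samples with true targets $y_i$, predictions $\hat{y}_i$, and target mean $\bar{y}$, the score computed by r2_score is:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$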
Ridge Regression
Ridge regression is a biased-estimation regression method designed for the analysis of collinear data. It is essentially an improved least squares estimator: by giving up the unbiasedness of OLS, it trades some information and precision for regression coefficients that are more realistic and reliable, and it fits ill-conditioned data better than plain least squares. Its objective function is:
$$\min_{w} ||Xw - y||_2^2 + \alpha ||w||_2^2 = \min_{w} \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 + \alpha \sum_{j=1}^{p} w_j^2$$
Here $\alpha \geqslant 0$ is the regularization parameter: the larger $\alpha$ is, the more strongly the coefficients are shrunk toward 0; as $\alpha$ approaches 0, ridge regression approaches the ordinary least squares estimate.
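A minimal sketch of fitting a ridge model in scikit-learn on toy data, with $\alpha = 0.5$:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=0.5)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(reg.coef_)       # [0.34545455 0.34545455]
print(reg.intercept_)  # ≈ 0.1364
Increasing alpha shrinks coef_ further toward 0; with very small alpha the solution approaches the OLS fit.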
RidgeClassifier can be used for classification tasks, but this article does not discuss it and considers regression only.
Multiple Linear Regression
First, two concepts:
- Conditional dependence: the effect of $x_i$ on $y$ while holding all other features fixed (this is what a multiple-regression coefficient measures);
- Marginal dependence: the effect of $x_i$ on $y$ when the other variables are allowed to vary along with it.
A full example will follow in the next post; a quick sketch of the distinction is given below.
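The sketch uses made-up data: when $x_1$ and $x_2$ are correlated, the marginal slope of $y$ on $x_1$ alone differs from $x_1$'s conditional coefficient in the joint model:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
x1 = rng.randn(500)
x2 = 0.8 * x1 + 0.2 * rng.randn(500)  # x2 is correlated with x1
y = 1.0 * x1 + 2.0 * x2 + 0.1 * rng.randn(500)
# Conditional dependence: x1's coefficient with x2 held in the model (close to 1.0)
print(LinearRegression().fit(np.column_stack([x1, x2]), y).coef_)
# Marginal dependence: the slope when x2 is ignored (close to 1.0 + 2.0 * 0.8 = 2.6)
print(LinearRegression().fit(x1.reshape(-1, 1), y).coef_)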
Sources
- The scikit-learn documentation
- Machine learning linear models study check-in (机器学习线性模型打卡)