sklearn linear regression docs:
- linear_model
  https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
- LinearRegression
  https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
boston = datasets.load_boston()
boston_features = boston.data
boston_target = boston.target
Fitting a straight line
boston_features
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
4.0300e+00],
...,
[6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
5.6400e+00],
[1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
6.4800e+00],
[4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
7.8800e+00]])
features = boston_features[:, 0:2]
features
array([[6.3200e-03, 1.8000e+01],
[2.7310e-02, 0.0000e+00],
[2.7290e-02, 0.0000e+00],
...,
[6.0760e-02, 0.0000e+00],
[1.0959e-01, 0.0000e+00],
[4.7410e-02, 0.0000e+00]])
rgs = LinearRegression()
model = rgs.fit(features, boston_target)
# View the intercept
model.intercept_
# 22.485628113468223
# Show the feature weights (coefficients)
model.coef_
# array([-0.35207832, 0.11610909])
# First value of the target vector, multiplied by 1000
# (the Boston target is the median house price in units of $1000)
boston_target[0] * 1000 # 24000.0
# Predicted value for the first sample, also converted to dollars
model.predict(features)[0] * 1000
# 24573.366631705547
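Since `load_boston` is gone from current scikit-learn, here is a minimal, self-contained sketch of the same fit-and-inspect workflow on made-up synthetic data (the true intercept and coefficients below are assumptions baked into the fake data), including a check that `predict` is just the intercept plus the weighted feature sum:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3 + 2*x1 - 1*x2 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + 0.01 * rng.normal(size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # roughly 3
print(model.coef_)       # roughly [2, -1]

# For a linear model, predict() is exactly intercept_ + X @ coef_
manual = model.intercept_ + X[0] @ model.coef_
print(np.isclose(manual, model.predict(X[:1])[0]))  # True
```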
Handling interaction effects between features
from sklearn.preprocessing import PolynomialFeatures
# Create interaction features
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)
features_interaction = interaction.fit_transform(features)
rgs = LinearRegression()
model = rgs.fit(features_interaction, boston_target)
# Features of the first sample
features[0]
# array([6.32e-03, 1.80e+01])
import numpy as np
# Multiply each sample's first feature by its second feature
interaction_term = np.multiply(features[:, 0], features[:, 1])
# View the first sample's interaction feature
interaction_term[0] # 0.11376
# Observe the first sample's transformed values
features_interaction[0]
# array([6.3200e-03, 1.8000e+01, 1.1376e-01])
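The interaction column that `PolynomialFeatures(interaction_only=True)` appends can be checked against a manual element-wise product. A small sketch on a made-up 2×2 matrix (with only two input columns, the sole interaction term is x1·x2):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)
X_inter = interaction.fit_transform(X)

# Output columns are [x1, x2, x1*x2]: [[2, 3, 6], [4, 5, 20]]
print(X_inter)
print(np.allclose(X_inter[:, 2], X[:, 0] * X[:, 1]))  # True
```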
Fitting nonlinear relationships
features = boston_features[:, 0:1]
# Create polynomial features x^2 and x^3
polynomial = PolynomialFeatures(degree=3, include_bias=False)
features_polynomial = polynomial.fit_transform(features)
# Create a linear regression object
rgs = LinearRegression()
model = rgs.fit(features_polynomial, boston_target)
# Features of the first sample
features[0]
# array([0.00632])
# Raised to the second power
features[0] ** 2
# array([3.99424e-05])
# Raised to the third power
features[0] ** 3
# array([2.52435968e-07])
# Observe all three features of the first sample: x, x^2, x^3
features_polynomial[0]
# array([6.32000000e-03, 3.99424000e-05, 2.52435968e-07])
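The same expansion can be verified on small made-up numbers: without `interaction_only`, `degree=3` on a single column yields exactly `[x, x**2, x**3]`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)

# Each row expands to [x, x**2, x**3]: [[2, 4, 8], [3, 9, 27]]
print(x_poly)
print(np.allclose(x_poly, np.hstack([x, x**2, x**3])))  # True
```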
Reducing variance with regularization
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_std = scaler.fit_transform(boston_features)
rgs = Ridge(alpha=0.5)
model = rgs.fit(features_std, boston_target)
from sklearn.linear_model import RidgeCV
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])
# Fit the ridge regression, selecting alpha by cross-validation
model_cv = regr_cv.fit(features_std, boston_target)
# View the model coefficients
model_cv.coef_
array([-0.91987132, 1.06646104, 0.11738487, 0.68512693, -2.02901013,
2.68275376, 0.01315848, -3.07733968, 2.59153764, -2.0105579 ,
-2.05238455, 0.84884839, -3.73066646])
# View the alpha selected by cross-validation
model_cv.alpha_
# 1.0
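The point of alpha is that larger values shrink the coefficients harder, trading some bias for lower variance. A minimal sketch on synthetic data (the feature count and true weights are made up) showing the shrinkage, plus `RidgeCV` picking from a candidate grid:

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=100)
X_std = StandardScaler().fit_transform(X)

# Larger alpha -> stronger shrinkage toward zero
small = Ridge(alpha=0.1).fit(X_std, y)
large = Ridge(alpha=100.0).fit(X_std, y)
print(np.linalg.norm(small.coef_) > np.linalg.norm(large.coef_))  # True

# RidgeCV cross-validates over the supplied grid and exposes the winner
cv = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X_std, y)
print(cv.alpha_)  # one of 0.1, 1.0, 10.0
```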
Reducing features with lasso regression
Goal: simplify the linear regression model by reducing the number of features it uses
from sklearn.linear_model import Lasso
scaler = StandardScaler()
features_std = scaler.fit_transform(boston_features)
rgs = Lasso(alpha=0.5)
model = rgs.fit(features_std, boston_target)
# View the coefficients
# Many coefficients are 0, meaning their features are not used by the model
model.coef_
array([-0.11526463, 0. , -0. , 0.39707879, -0. ,
2.97425861, -0. , -0.17056942, -0. , -0. ,
-1.59844856, 0.54313871, -3.66614361])
# With a sufficiently large alpha, the model will not use any features at all
rgs_10 = Lasso(alpha=10)
model_10 = rgs_10.fit(features_std, boston_target)
model_10.coef_
array([-0., 0., -0., 0., -0., 0., -0., 0., -0., -0., -0., 0., -0.])
- Exploiting this property, you could put, say, 100 features into the feature matrix and then tune the lasso's alpha hyperparameter until the model uses only the 10 most important ones;
- This reduces the model's variance while improving interpretability (fewer features are easier to explain)
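That feature-selection behavior is easy to see on synthetic data. In the sketch below (made-up data where only 3 of 10 features actually drive the target), the number of nonzero coefficients falls as alpha grows, hitting 0 once alpha is large enough:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only 3 of the 10 features actually drive the target
y = 5 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + 0.1 * rng.normal(size=100)
X_std = StandardScaler().fit_transform(X)

# As alpha increases, lasso zeroes out more coefficients
for alpha in (0.01, 0.5, 10.0):
    model = Lasso(alpha=alpha).fit(X_std, y)
    n_used = int(np.sum(model.coef_ != 0))
    print(alpha, n_used)
```

At alpha=0.5 the three informative features survive and the noise features are zeroed; at alpha=10 every coefficient is zero, matching the all-zero array shown above.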
2023-04-02 (Sun), light rain, fresh air