Lasso Regression

好好学习_rich

Article link: https://blog.csdn.net/Four2017/article/details/128535582

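Lasso regression is a linear model that performs shrinkage estimation. By adding a penalty function it produces a more parsimonious model: it shrinks some regression coefficients, which amounts to forcing the sum of the absolute values of the coefficients below a fixed value, and it sets some coefficients exactly to 0. It is also a biased estimator suited to data with multicollinearity.

The objective function is:

$\min_w \frac{1}{2n_{samples}}||Xw-y||^2_2+\alpha ||w||_1=\\ \min_w \frac{1}{2n_{samples}}\sum_{i=1}^{n_{samples}}(\hat{y}_i-y_i)^2+\alpha\sum_{j=1}^{p} |w_j| \tag{1}$

Here $\alpha$ is a constant that controls the strength of the penalty, $||w||_1$ is the L1 norm of the coefficient vector $w$, and $p$ is the number of features.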
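To make objective (1) concrete, here is a minimal NumPy sketch that evaluates it for a given coefficient vector. The function name `lasso_objective` and the toy arrays are made up for illustration, and the intercept is left out for simplicity.

```python
import numpy as np


def lasso_objective(X, y, w, alpha):
    """Evaluate objective (1) for coefficient vector w (no intercept term)."""
    n_samples = X.shape[0]
    residual = X @ w - y
    data_fit = residual @ residual / (2 * n_samples)  # squared-error term
    penalty = alpha * np.abs(w).sum()                 # alpha * L1 norm of w
    return data_fit + penalty


# Toy data, chosen only so the arithmetic is easy to follow by hand
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, 2.0, 3.0])
w_toy = np.array([1.0, 1.0])

obj_value = lasso_objective(X_toy, y_toy, w_toy, alpha=0.1)
print(obj_value)
```

For this toy input the residuals are 0, -1 and -1, so the data-fit term is 2 / 6 and the penalty is 0.1 * 2.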

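from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split
alpha = 0.1  # regularization strength
lasso = Lasso(alpha=alpha)

# Fit on the training set, then predict on the test set
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(lasso)
print("r^2 on test data : %f" % r2_score_lasso)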

The evaluation metric is r2_score, the coefficient of determination $R^2$.
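For reference, $R^2$ is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand in plain Python; `r2_manual` is a hypothetical helper name, and the four sample values are arbitrary.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed without any library."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


r2_example = r2_manual([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(r2_example, 4))
```

A perfect prediction gives 1, and a model that always predicts the mean of `y_true` gives 0.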

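Choosing the regularization parameter

Cross-validation

scikit-learn offers two cross-validated Lasso estimators: LassoCV and LassoLarsCV.

LassoCV
based on the coordinate descent algorithm
LassoLarsCV
based on the Least Angle Regression algorithm

For high-dimensional datasets with many collinear features, LassoCV is usually the better choice. When the number of samples is very small, smaller than the number of features, LassoLarsCV tends to be faster.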
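To give a feel for what coordinate descent does, here is a minimal NumPy sketch that minimizes objective (1) one coefficient at a time with a soft-thresholding update. It is an illustration only, not scikit-learn's implementation: it has no intercept, runs a fixed number of sweeps instead of checking convergence, and the function names and synthetic data are assumptions made for this example.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1 / (2 n)) * ||Xw - y||^2 + alpha * ||w||_1 by cyclic coordinate updates."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0) / n_samples  # per-feature curvature
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n_samples
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w


# Synthetic data: only features 0 and 3 truly matter
rng = np.random.default_rng(0)
X_cd = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y_cd = X_cd @ true_w + 0.1 * rng.normal(size=100)

w_cd = lasso_coordinate_descent(X_cd, y_cd, alpha=0.5)
print(w_cd)
```

The soft-thresholding step is what sets weak coefficients exactly to 0, while the surviving coefficients are shrunk toward zero.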

LassoCV example:

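import time

import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X, y are assumed to be an already loaded feature matrix and target vector
start_time = time.time()
model = make_pipeline(StandardScaler(), LassoCV(cv=20)).fit(X, y)
fit_time = time.time() - start_time

# y-axis limits for the plot; these values are specific to the dataset used
ymin, ymax = 2300, 3800
lasso = model[-1]  # the fitted LassoCV step of the pipeline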
plt.semilogx(lasso.alphas_, lasso.mse_path_, linestyle=":")
plt.plot(
    lasso.alphas_,
    lasso.mse_path_.mean(axis=-1),
    color="black",
    label="Average across the folds",
    linewidth=2,
)
plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha: CV estimate")

plt.ylim(ymin, ymax)
plt.xlabel(r"$\alpha$")
plt.ylabel("Mean square error")
plt.legend()
_ = plt.title(
    f"Mean square error on each fold: coordinate descent (train time: {fit_time:.2f}s)"
)

LassoLarsCV example:

from sklearn.linear_model import LassoLarsCV

start_time = time.time()
model = make_pipeline(StandardScaler(), LassoLarsCV(cv=20)).fit(X, y)
fit_time = time.time() - start_time

lasso = model[-1]
plt.semilogx(lasso.cv_alphas_, lasso.mse_path_, ":")
plt.semilogx(
    lasso.cv_alphas_,
    lasso.mse_path_.mean(axis=-1),
    color="black",
    label="Average across the folds",
    linewidth=2,
)
plt.axvline(lasso.alpha_, linestyle="--", color="black", label="alpha CV")

plt.ylim(ymin, ymax)
plt.xlabel(r"$\alpha$")
plt.ylabel("Mean square error")
plt.legend()
_ = plt.title(f"Mean square error on each fold: Lars (train time: {fit_time:.2f}s)")
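The two snippets above rely on an `X` and `y` that are not defined in this post. As a self-contained sketch, the comparison can be reproduced on synthetic data; the dataset from `make_regression`, its parameters, and `cv=5` are assumptions chosen only to keep the run short.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (an assumption; not the data used in the post)
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

cv_results = {}
for name, estimator in [("LassoCV", LassoCV(cv=5)), ("LassoLarsCV", LassoLarsCV(cv=5))]:
    start = time.time()
    fitted = make_pipeline(StandardScaler(), estimator).fit(X_syn, y_syn)
    elapsed = time.time() - start
    best_alpha = fitted[-1].alpha_  # alpha selected by cross-validation
    cv_results[name] = (best_alpha, elapsed)
    print(f"{name}: alpha = {best_alpha:.4f}, fit time = {elapsed:.2f}s")
```

The selected alphas and timings depend on the data, so the numbers from this run say nothing general about which estimator is faster.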

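AIC/BIC criteria

The information criteria are computed from the training set alone.

Difficulties:

The degrees of freedom must be estimated properly.
The criteria are derived for large samples.
The number of samples must exceed the number of features.

The formula is:

$AIC=-2\log(\hat{L})+2d \tag{2}$

where $\hat{L}$ is the maximized value of the model's likelihood function and $d$ is the number of parameters, i.e. the degrees of freedom.

$BIC$ is obtained by replacing the constant 2 in equation (2) with $\log(N)$:

$BIC=-2\log(\hat{L})+\log(N)d \tag{3}$

where $N$ is the number of samples (written $n$ in equations (4) to (6)).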
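Equations (2) and (3) translate directly into code. The sketch below uses only the standard library; the function names and the sample log-likelihood of -120 are made up for illustration.

```python
import math


def aic(log_likelihood, d):
    """Equation (2): AIC = -2 log(L) + 2 d."""
    return -2.0 * log_likelihood + 2.0 * d


def bic(log_likelihood, d, n_samples):
    """Equation (3): BIC = -2 log(L) + log(N) d."""
    return -2.0 * log_likelihood + math.log(n_samples) * d


# Hypothetical fit: maximized log-likelihood -120 with 3 parameters on 100 samples
example_aic = aic(-120.0, d=3)
example_bic = bic(-120.0, d=3, n_samples=100)
print(example_aic, example_bic)
```

Because $\log(N) > 2$ once $N \ge 8$, BIC penalizes each extra parameter more heavily than AIC on all but tiny datasets, so it tends to pick sparser models.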

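For a linear Gaussian model, the maximized log-likelihood is:

$\log(\hat{L})=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{2\sigma^2} \tag{4}$

Substituting equation (4) into equation (2) gives:

$AIC=n\log(2\pi\sigma^2)+\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sigma^2}+2d \tag{5}$

where $\sigma^2$ is the noise variance, treated as a constant and estimated by equation (6):

$\sigma^2=\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{n-p} \tag{6}$

where $p$ is the number of features. This estimate is valid only when n_samples > n_features.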
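As a sanity check on the derivation, the sketch below evaluates equation (5) directly and compares it with equation (2) applied to the log-likelihood (4). It is a simplified illustration: the synthetic data, the function names, and the choice to use ordinary least squares residuals for both the fit and the variance estimate (6) are all assumptions of this example.

```python
import numpy as np


def gaussian_log_likelihood(residuals, sigma2):
    """Equation (4) for given residuals and noise variance."""
    n = residuals.size
    rss = float(residuals @ residuals)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - rss / (2 * sigma2)


def aic_closed_form(residuals, sigma2, d):
    """Equation (5): AIC written out for the linear Gaussian model."""
    n = residuals.size
    rss = float(residuals @ residuals)
    return n * np.log(2 * np.pi * sigma2) + rss / sigma2 + 2 * d


# Synthetic data with more samples than features, as equation (6) requires
rng = np.random.default_rng(0)
n, p = 50, 4
X_ic = rng.normal(size=(n, p))
y_ic = X_ic @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=n)

# Ordinary least squares fit and its residuals
w_ols, *_ = np.linalg.lstsq(X_ic, y_ic, rcond=None)
residuals = y_ic - X_ic @ w_ols

sigma2_hat = float(residuals @ residuals) / (n - p)  # equation (6)
d = p  # degrees of freedom: all p coefficients are free in this OLS fit

aic_from_5 = aic_closed_form(residuals, sigma2_hat, d)
aic_from_2 = -2 * gaussian_log_likelihood(residuals, sigma2_hat) + 2 * d
print(aic_from_5, aic_from_2)
```

The two values agree up to floating-point rounding, which confirms that (5) follows from (2) and (4).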

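import time

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline

# X, y are assumed to be an already loaded feature matrix and target vector
start_time = time.time()
# Fit a Lasso path with LARS and pick alpha by the AIC criterion
lasso_lars_ic = make_pipeline(StandardScaler(), LassoLarsIC(criterion="aic")).fit(X, y)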
fit_time = time.time() - start_time

results = pd.DataFrame(
    {
        "alphas": lasso_lars_ic[-1].alphas_,
        "AIC criterion": lasso_lars_ic[-1].criterion_,
    }
).set_index("alphas")
alpha_aic = lasso_lars_ic[-1].alpha_

The code above computes the $AIC$. The code below switches the criterion to $BIC$ and refits.

lasso_lars_ic.set_params(lassolarsic__criterion="bic").fit(X, y)
results["BIC criterion"] = lasso_lars_ic[-1].criterion_
alpha_bic = lasso_lars_ic[-1].alpha_

# Bold the minimum value of each column
def highlight_min(x):
    x_min = x.min()
    return ["font-weight: bold" if v == x_min else "" for v in x]


results.style.apply(highlight_min)
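Since `X` and `y` are not defined in this post, here is a self-contained sketch of the same AIC/BIC selection on synthetic data. The dataset from `make_regression` and its parameters are assumptions for illustration; the number of samples is kept larger than the number of features, as the criteria require.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (an assumption; not the data used in the post)
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

selected = {}
for criterion in ("aic", "bic"):
    pipe = make_pipeline(StandardScaler(), LassoLarsIC(criterion=criterion))
    ic_model = pipe.fit(X_syn, y_syn)[-1]
    n_nonzero = int((ic_model.coef_ != 0).sum())  # features kept at the selected alpha
    selected[criterion] = (ic_model.alpha_, n_nonzero)
    print(f"{criterion.upper()}: alpha = {ic_model.alpha_:.4f}, non-zero coefficients = {n_nonzero}")
```

With 200 samples the BIC penalty per parameter, $\log(200)$, is larger than AIC's 2, so BIC keeps at most as many features as AIC here.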