Plotting with LogisticRegressionCV

qq_37353305

Category: ML with Python. Tags: python, data mining, logistic regression, machine learning

Article link: https://blog.csdn.net/qq_37353305/article/details/122664167

LogisticRegressionCV Plot

Introduction
Data
Code

Introduction

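This post is mainly about plotting for logistic regression with lasso ( $L_1$ penalty) in Python. For logistic regression we use two classes from scikit-learn: LogisticRegression and LogisticRegressionCV. LogisticRegressionCV differs in that it uses cross-validation to choose the best penalty coefficient, and it exposes the path of penalty coefficients it searched. With LogisticRegression you have to specify the strength of the $L_1$ penalty yourself. Below is the formula given in the official documentation (the elastic-net form; with $\rho = 1$ only the $L_1$ term remains):
$\min _{w, c} \frac{1-\rho}{2} w^{T} w+\rho\|w\|_{1}+C \sum_{i=1}^{n} \log \left(\exp \left(-y_{i}\left(X_{i}^{T} w+c\right)\right)+1\right)$
The $C$ in these classes controls the penalty strength, but it multiplies the loss function rather than the penalty term, so it is effectively what we usually call $1/\lambda$ . The larger $C$ is, the weaker the penalty.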
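
To spell out why $C$ plays the role of $1/\lambda$, here is a short derivation. It assumes $\rho = 1$ (the pure $L_1$ case used in this post) and defines $\lambda$ against the summed loss rather than the averaged loss; both are my assumptions, since the post does not state them. Starting from
$\min _{w, c}\|w\|_{1}+C \sum_{i=1}^{n} \log \left(\exp \left(-y_{i}\left(X_{i}^{T} w+c\right)\right)+1\right)$
and dividing the objective by $C > 0$, which does not change the minimizer, gives
$\min _{w, c} \sum_{i=1}^{n} \log \left(\exp \left(-y_{i}\left(X_{i}^{T} w+c\right)\right)+1\right)+\frac{1}{C}\|w\|_{1}$
which is the usual lasso form with $\lambda = 1/C$.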

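Data

The data come from the book "The Hunger Games". The dataset lists the characters in the book together with some of their features, such as whether they have a name, their age, and whether they volunteered for the games. The label is whether the character is still alive after the first day of the games; it is binary (0 or 1). We fit a logistic + lasso model and use it for prediction.

Code

Required libraries:

# libraries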
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
import sklearn

Set the random seed:

np.random.seed(0)

Load and inspect the data:

# load data
df_hunger = pd.read_csv("hunger_games.csv")
#
# A simple view of the first 10 rows of data 
#
print(df_hunger.head(10))


Features and labels:

# features and labels
x, y = df_hunger.drop(['id','surv_day1','name','age_cat'], axis=1), df_hunger['surv_day1']
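
If hunger_games.csv is not at hand, the rest of the code can still be exercised on a stand-in. The sketch below builds a purely synthetic DataFrame with the column names that appear in this post. The values are random placeholders, the row count and the `age_cat` categories are hypothetical, and the real file may well contain feature columns that are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for hunger_games.csv: random placeholder values only,
# NOT the real data. Column names are taken from the code in this post.
rng = np.random.default_rng(0)
n = 24  # hypothetical number of rows
age = rng.integers(12, 19, size=n)

df_demo = pd.DataFrame({
    "id": np.arange(1, n + 1),
    "name": [f"tribute_{i}" for i in range(1, n + 1)],  # placeholder names
    "has_name": rng.integers(0, 2, size=n),
    "age": age,
    "age_cat": np.where(age < 15, "young", "old"),  # hypothetical categories
    "volunteer": rng.integers(0, 2, size=n),
    "surv_day1": rng.integers(0, 2, size=n),
})

# Same split into features and label as used for the real data
x_demo = df_demo.drop(["id", "surv_day1", "name", "age_cat"], axis=1)
y_demo = df_demo["surv_day1"]
print(x_demo.head())
```

Because the labels are random, any model fitted on this stand-in is meaningless; it only serves to check that the code runs.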

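Create the model with LogisticRegressionCV and fit it to the data. When the parameter Cs is an integer, it is the number of candidate $C$ values on the penalty path, which scikit-learn spaces logarithmically between 1e-4 and 1e4. scoring is the measure that cross-validation optimizes; I chose RMSE here, but others such as accuracy work as well. The scorer is applied to the predicted class labels, so for 0-1 labels this RMSE is the square root of the misclassification rate. The other available metrics are listed at https://scikit-learn.org/stable/modules/model_evaluation.html :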

# create and fit logistic model with cross validation criterion for lambda
modelcv = LogisticRegressionCV(cv=5, random_state=0, max_iter=10000, penalty='l1', solver = 'saga', Cs=100,
                             scoring = 'neg_root_mean_squared_error')
modelcv.fit(x,y)
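
To make the meaning of an integer Cs concrete, here is a minimal sketch on synthetic data. The data from `make_classification`, the name `model_demo`, and the liblinear solver (swapped in for saga only to keep the example fast) are my assumptions, not part of the original analysis. It checks that the fitted grid `Cs_` matches a log-spaced grid from 1e-4 to 1e4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary classification data (assumption: stands in for the real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Cs=5 asks for 5 candidate values of C
model_demo = LogisticRegressionCV(
    cv=5, Cs=5, penalty="l1", solver="liblinear", random_state=0
)
model_demo.fit(X_demo, y_demo)

# The candidates are spaced logarithmically between 1e-4 and 1e4
print(model_demo.Cs_)
print(np.allclose(model_demo.Cs_, np.logspace(-4, 4, 5)))
```

With Cs=100 as in the post, the grid therefore has 100 points over the same range, and lambda = 1/C spans 1e-4 to 1e4 as well.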

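Collect the score path and the lambda path. Note that the model attribute scores_ is a dict object, so we convert its values to a list and take the first entry, an array with one row per fold and one column per candidate $C$. lambda is the reciprocal of $C$: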

# lambda and score paths
score_path = (list(modelcv.scores_.values())[0]).transpose()
score_ave_path = list(modelcv.scores_.values())[0].mean(axis = 0)
lambda_path = 1/modelcv.Cs_

Next, fit a full model (without CV) for each lambda on the path, so that we can plot how the coefficients are selected later:

# refit on the full data (no CV splitting) for every lambda on the path
coefs = []
for a in lambda_path:
    model_full = LogisticRegression(random_state=0, max_iter=10000, penalty='l1', solver='saga', C=1/a)
    model_full.fit(x, y)
    coefs.append(model_full.coef_)
# one row per lambda, one column per feature (avoids hard-coding the path length)
coefs = np.array(coefs).reshape(len(lambda_path), x.shape[1])
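
As a rough illustration of what such a coefficient path contains, the sketch below repeats the loop on synthetic data with a short lambda grid and counts the nonzero coefficients at each step. The data, the grid `lambda_demo`, and the liblinear solver (used instead of saga only for speed) are my assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (assumption: stands in for the real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Short illustrative grid, from a weak penalty to a strong one
lambda_demo = np.logspace(-2, 3, 6)

# Fit one L1-penalized model per lambda and stack the coefficient rows
coefs_demo = np.vstack([
    LogisticRegression(
        penalty="l1", solver="liblinear", C=1 / lam, random_state=0
    ).fit(X_demo, y_demo).coef_
    for lam in lambda_demo
])

# Number of variables still in the model at each lambda
n_nonzero = (coefs_demo != 0).sum(axis=1)
for lam, k in zip(lambda_demo, n_nonzero):
    print(f"lambda={lam:g}: {k} nonzero coefficients")
```

As lambda grows, coefficients are pushed to exactly zero, which is what makes the path useful for variable selection.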

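Plotting: the first plot shows how the RMSE changes along the lambda path, and the second shows how the coefficients change along the lambda path. The second plot can be used for variable selection: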

# plot
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))

axes[0].semilogx(lambda_path,  -score_path, ':')
axes[0].plot(
    lambda_path,
    -score_ave_path,
    "k",
    label="Average across the folds",
    linewidth=2
)
axes[0].axvline(
    1/modelcv.C_, linestyle="--", color="k", label="lambda: CV estimate"
)

axes[0].legend(loc="lower right")
axes[0].set(xlabel="Lambda", ylabel="Root mean square error")
axes[0].set_title("CV Root mean square error by lambda")
axes[0].axis("tight")

axes[1].semilogx(lambda_path,  coefs, ':')
axes[1].axvline(1/modelcv.C_, linestyle="--", color="k", label="lambda: CV estimate")

axes[1].legend()
axes[1].set(xlabel="Lambda", ylabel="Coefficient value")
axes[1].set_title("Coefficient shrinkage by lambda")

Print the selected variables and their estimated coefficients:

print("The optimal lambda is:", 1/modelcv.C_)
print("Variables selected by the model:", x.columns[modelcv.coef_[0]!=0])
print("The corresponding coefficients of the selected variables:",modelcv.coef_[0][modelcv.coef_[0]!=0])

Output:
The optimal lambda is: [1.59228279]
Variables selected by the model: Index(['has_name', 'age', 'volunteer'], dtype='object')
The corresponding coefficients of the selected variables: [1.94517941 0.00695115 0.21605118]

Let's pick one specific character, Katniss, and look at the model's predicted probability that she survives the first day:

# index Katniss
Katniss = df_hunger[df_hunger["name"]=="Katniss"].drop(['id','surv_day1','name','age_cat'], axis=1)
print("Katniss’ first-day survival probability is: ", round(modelcv.predict_proba(Katniss)[0][1]*100,2),"%")

Output:
Katniss’ first-day survival probability is:  83.39 %
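
For reference, the predicted probability of a binary logistic model is the sigmoid of the linear score, intercept plus coefficients times features. The sketch below checks this against `predict_proba` on synthetic data. The data, the fixed `C=1.0`, and the liblinear solver are my assumptions; this is not a recomputation of the Katniss result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (assumption: stands in for the real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

model_demo = LogisticRegression(
    penalty="l1", solver="liblinear", C=1.0, random_state=0
).fit(X_demo, y_demo)

row = X_demo[:1]  # a single observation, kept 2-D for predict_proba

# Linear score and its sigmoid, computed by hand
z = row @ model_demo.coef_[0] + model_demo.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_sklearn = model_demo.predict_proba(row)[0, 1]

print(np.isclose(p_manual[0], p_sklearn))
```

The same computation with `modelcv.coef_` and `modelcv.intercept_` applied to Katniss' feature row should reproduce the probability printed above.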