Logistic regression, also called logistic regression analysis, is a generalized linear regression model commonly used in data mining, automated disease diagnosis, economic forecasting, and similar fields. Logistic regression estimates the probability of an event occurring from a given set of independent variables; since the result is a probability, the dependent variable ranges between 0 and 1.
The above is excerpted from:
https://baike.baidu.com/item/logistic%E5%9B%9E%E5%BD%92/2981575
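To make the "probability between 0 and 1" point concrete before the code: logistic regression passes a linear combination of the features through the logistic (sigmoid) function. A minimal sketch (an illustration, not part of the excerpt above):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and z = 0 gives exactly 0.5 -- the decision boundary.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))
```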
A short hands-on walkthrough follows.
The dataset used here is the breast cancer dataset.
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
%matplotlib inline
# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=200)
# Standardize the data: fit the scaler on the training set only,
# then apply that same transform to both splits (fitting a second
# scaler on the test set would leak test-set statistics)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standar_X_train = scaler.transform(X_train)
standar_X_test = scaler.transform(X_test)
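The scale-then-fit steps can also be bundled so the scaling is applied automatically and consistently. A sketch using scikit-learn's `Pipeline` (an alternative to the manual steps above, not the post's original approach):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=200)

# The pipeline fits the scaler on the training data only and
# reapplies the same transform at prediction time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

This is especially useful inside cross-validation, where each fold gets its own scaler fit.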
# Use grid search to find the best hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
params = {"C": [*range(1, 5)],
          "solver": ["liblinear", "newton-cg", "sag", "lbfgs"]}
gs = GridSearchCV(LogisticRegression(penalty="l2", max_iter=10000),
                  param_grid=params, cv=5)
gs.fit(standar_X_train,y_train)
gs.best_params_
# Fit the model with the best parameters found
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty="l2",C = gs.best_params_["C"],solver=gs.best_params_["solver"],max_iter=10000).fit(standar_X_train,y_train)
[lr.score(standar_X_train,y_train),lr.score(standar_X_test,y_test)]
The results are:
{'C': 1, 'solver': 'liblinear'}
score: [0.9899497487437185, 0.9649122807017544]
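Accuracy alone hides how the errors split between the two classes, which matters for a diagnosis task. A sketch of a fuller evaluation (reproducing the split and model from the steps above, then adding a confusion matrix and per-class report):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=200)
scaler = StandardScaler().fit(X_train)

# Best parameters found by the grid search above
lr = LogisticRegression(penalty="l2", C=1, solver="liblinear",
                        max_iter=10000).fit(scaler.transform(X_train), y_train)
y_pred = lr.predict(scaler.transform(X_test))

# Rows: true class, columns: predicted class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```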
The grid search above is actually not fine-grained for C: it only tried integer values. To examine fractional values, we can draw score curves and watch how the training and test scores change with C.
# Training- and test-set score lists for the L1 and L2 penalties
lrL1TrainList = []
lrL1TestList = []
lrL2TrainList = []
lrL2TestList = []
c_range = np.linspace(0.1, 2, 20)
# Of the solvers above, only liblinear (coordinate descent) supports the L1 penalty
for i in c_range:
    lr_l1 = LogisticRegression(penalty="l1", C=i, solver="liblinear",
                               max_iter=1000).fit(standar_X_train, y_train)
    lr_l2 = LogisticRegression(penalty="l2", C=i, solver="newton-cg",
                               max_iter=1000).fit(standar_X_train, y_train)
    lrL2TrainList.append(lr_l2.score(standar_X_train, y_train))
    lrL2TestList.append(lr_l2.score(standar_X_test, y_test))
    lrL1TrainList.append(lr_l1.score(standar_X_train, y_train))
    lrL1TestList.append(lr_l1.score(standar_X_test, y_test))
plt.plot(c_range, lrL2TrainList, color="green", label="l2_train")
plt.plot(c_range, lrL2TestList, color="blue", label="l2_test")
plt.plot(c_range, lrL1TrainList, color="lightgreen", label="l1_train")
plt.plot(c_range, lrL1TestList, color="lightblue", label="l1_test")
plt.legend()
The plot is shown below:
As the plot shows, with the L2 penalty the gap between the training and test scores is small, and C = 1.0 is indeed a reasonable choice.
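Beyond the score curves, one practical difference between the two penalties is worth checking: L1 drives some coefficients to exactly zero (implicit feature selection), while L2 only shrinks them. A sketch (rebuilding the split and scaler from the steps above; C = 0.1 is an illustrative strong-regularization value):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=200)
Xs_train = StandardScaler().fit(X_train).transform(X_train)

lr_l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear",
                           max_iter=1000).fit(Xs_train, y_train)
lr_l2 = LogisticRegression(penalty="l2", C=0.1, solver="liblinear",
                           max_iter=1000).fit(Xs_train, y_train)

# L1 zeroes out some of the 30 features; L2 keeps them all non-zero
print("non-zero L1 coefficients:", int(np.sum(lr_l1.coef_ != 0)))
print("non-zero L2 coefficients:", int(np.sum(lr_l2.coef_ != 0)))
```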