I. Classification Prediction with Logistic Regression: Study Notes

Latest recommended article published on 2024-08-04 12:35:46

lesen_l

Article link: https://blog.csdn.net/readPython/article/details/108106554

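This article covers the principles and application of logistic regression, and uses hands-on code to show how to train and predict with a logistic regression model in sklearn, with a focus on classifying the iris dataset.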

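Classification Prediction with Logistic Regression

I. Learning objectives
II. Code practice
III. A brief introduction to the principles of logistic regression

I. Learning objectives

Understand the theory behind logistic regression
Learn to call sklearn's logistic regression functions and apply them to prediction on the iris dataset

II. Code practice

1. Code workflow

Demo practice
1. Import libraries
2. Train the model
3. Inspect the model parameters
4. Visualize the data and the model
5. Predict with the model
Logistic regression classification on the iris dataset
1. Import libraries
2. Read/load the data
3. Inspect the data
4. Visualize the data
5. Train and predict with a logistic regression model on binary classification
6. Train and predict with a logistic regression model on three-class (multi-class) classification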

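2. Demo practice

Import libraries

# Import the base library
import numpy as np
# numpy is the fundamental package for scientific computing in Python

# Import the plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# matplotlib and seaborn are plotting packages

# Import the logistic regression model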
from sklearn.linear_model import LogisticRegression

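Train the model

# Demo: classification with LogisticRegression

# Construct the dataset
x_fearures=np.array([[-1,-2],[-2,-1],[-3,-2],[1,3],[2,1],[3,2]])
y_label=np.array([0,0,0,1,1,1])

# Create the logistic regression model
lr_clf=LogisticRegression()

# Fit the logistic regression model to the constructed dataset
lr_clf=lr_clf.fit(x_fearures,y_label)
# The fitted linear term is z=w0+w1*x1+w2*x2, and the model outputs p(y=1|x)=1/(1+exp(-z))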

Inspect the model parameters

# View the weights w of the fitted model
print("the weight of Logistic Regression:",lr_clf.coef_)
# View the intercept w0 of the fitted model
print("the intercept(w0) of Logistic regression:",lr_clf.intercept_)
#the weight of Logistic Regression: [[ 0.73462087  0.6947908 ]]
#the intercept(w0) of Logistic regression: [-0.03643213]

Visualize the data and the model

# Visualize the constructed sample points
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1],c=y_label,s=50,cmap='viridis')
plt.title('Dataset')
plt.show()

[Figure: scatter plot of the constructed dataset]

# Visualize the decision boundary
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1],c=y_label,s=50,cmap='viridis')
plt.title('Dataset')

nx,ny=200,100
x_min,x_max=plt.xlim()
y_min,y_max=plt.ylim()
x_grid,y_grid=np.meshgrid(np.linspace(x_min,x_max,nx),np.linspace(y_min,y_max,ny))

z_proba=lr_clf.predict_proba(np.c_[x_grid.ravel(),y_grid.ravel()])
z_proba=z_proba[:,1].reshape(x_grid.shape)
plt.contour(x_grid,y_grid,z_proba,[0.5],linewidths=2.,colors='blue')

plt.show()

[Figure: the dataset with the decision boundary drawn in blue]
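The contour drawn at probability 0.5 is a straight line, because p=0.5 exactly when w0+w1*x1+w2*x2=0. The sketch below solves that equation for x2, using the weights copied from the printed output of the demo model; the helper names `boundary_x2` and `proba_class1` are made up for this illustration and are not part of sklearn.

```python
import math

# Parameters copied from the printed output of the demo model
w0, w1, w2 = -0.03643213, 0.73462087, 0.6947908

def boundary_x2(x1):
    """x2 coordinate on the p=0.5 boundary for a given x1 (hypothetical helper)."""
    return -(w0 + w1 * x1) / w2

def proba_class1(x1, x2):
    """p(y=1|x) computed by hand from the printed parameters."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x1 + w2 * x2)))

# Every point on the line should get a class-1 probability of 0.5
for x1 in (-3.0, 0.0, 3.0):
    x2 = boundary_x2(x1)
    print(x1, round(x2, 3), round(proba_class1(x1, x2), 3))
```

Points above the line get a probability greater than 0.5 and points below it get less, which is why the two clusters fall on opposite sides of the blue contour.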

# Visualize predictions for new samples

plt.figure()
# new point 1
x_fearures_new1=np.array([[0,-1]])
plt.scatter(x_fearures_new1[:,0],x_fearures_new1[:,1],s=50)
# Pass the label positionally: the `s` keyword of annotate was renamed to `text` in newer matplotlib
plt.annotate('New point 1',xy=(0,-1),xytext=(-2,0),color='blue',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

# new point 2
x_fearures_new2=np.array([[1,2]])
plt.scatter(x_fearures_new2[:,0],x_fearures_new2[:,1],s=50)
plt.annotate('New point 2',xy=(1,2),xytext=(-1.5,2.5),color='red',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

# Training samples
plt.scatter(x_fearures[:,0],x_fearures[:,1],c=y_label,s=50,cmap='viridis')
plt.title('Dataset')

# Visualize the decision boundary
plt.contour(x_grid,y_grid,z_proba,[0.5],linewidths=2.,colors='blue')

plt.show()

[Figure: the two new points, the training samples, and the decision boundary]

Predict with the model

# Use the trained model to predict the two new points
y_label_new1_predict=lr_clf.predict(x_fearures_new1)
y_label_new2_predict=lr_clf.predict(x_fearures_new2)
print('the new point 1 predict:\n',y_label_new1_predict)
print('the new point 2 predict:\n',y_label_new2_predict)
# Logistic regression is a probabilistic model (p=p(y=1|x,\theta), see Section III), so predict_proba returns the class probabilities
y_label_new1_predict_proba=lr_clf.predict_proba(x_fearures_new1)
y_label_new2_predict_proba=lr_clf.predict_proba(x_fearures_new2)
print('the new point 1 predict Probability of each class:\n',y_label_new1_predict_proba)
print('the new point 2 predict Probability of each class:\n',y_label_new2_predict_proba)

#the new point 1 predict:
 #[0]
#the new point 2 predict:
 #[1]
#the new point 1 predict Probability of each class:
 #[[ 0.67507358  0.32492642]]
#the new point 2 predict Probability of each class:
 #[[ 0.11029117  0.88970883]]

The trained logistic regression model predicts X_new1 as class 0 (lower-left of the decision boundary) and X_new2 as class 1 (upper-right of the decision boundary). The decision boundary, where the predicted probability equals 0.5, is the blue line in the figure above.
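As a sanity check on these numbers, the class-1 probabilities can be recomputed by hand from the printed weights and intercept. This is only a sketch: the parameters are copied from the printed output (so they are rounded to 8 digits), and `sigmoid` and `proba_class1` are illustrative helper names rather than sklearn functions.

```python
import math

# Weights and intercept copied from the printed output of the demo model
w1, w2 = 0.73462087, 0.6947908
w0 = -0.03643213

def sigmoid(z):
    """Logistic function 1/(1+exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def proba_class1(x1, x2):
    """p(y=1|x) for a two-feature sample, using the printed parameters."""
    return sigmoid(w0 + w1 * x1 + w2 * x2)

p_new1 = proba_class1(0, -1)   # new point 1
p_new2 = proba_class1(1, 2)    # new point 2
print(round(p_new1, 4), round(p_new2, 4))   # 0.3249 0.8897
```

Both values agree with the second column of the predict_proba output, and the class label follows from comparing each value with 0.5.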

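3. Logistic regression on the iris dataset: code practice

We use the iris dataset to try out the method. The dataset contains 5 variables (4 feature variables and 1 target class variable) and 150 samples in total. The target variable is the class of the flower; every sample belongs to one of three species of the genus Iris: Iris-setosa, Iris-versicolor and Iris-virginica. The 4 features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm); these morphological features were historically used to identify the species.

Variable	Description
sepal length	Sepal length (cm)
sepal width	Sepal width (cm)
petal length	Petal length (cm)
petal width	Petal width (cm)
target	The three iris classes: setosa (0), versicolor (1), virginica (2)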

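Import libraries

# Import the base libraries
import numpy as np
import pandas as pd
# pandas is a fast, powerful, flexible and easy-to-use open-source tool for data analysis and manipulation

# Import the plotting libraries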
import matplotlib.pyplot as plt
import seaborn as sns

Read/load the data

# Load the iris data bundled with sklearn and convert it to a pandas DataFrame
from sklearn.datasets import load_iris
data=load_iris() # load the dataset (features and labels)
iris_target=data.target # labels for each sample
iris_features=pd.DataFrame(data=data.data,columns=data.feature_names) # convert the features to a DataFrame

A quick look at the data

# Use .info() to view overall information about the data
iris_features.info()

'''
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
'''

# For a quick look at the data, use .head() and .tail()

# View the first rows
iris_features.head()

Output:

	sepal length	sepal width	petal length	petal width
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

# View the last rows
iris_features.tail()

Output:

	sepal length	sepal width	petal length	petal width
145	6.7	3.0	5.2	2.3
146	6.3	2.5	5.0	1.9
147	6.5	3.0	5.2	2.0
148	6.2	3.4	5.4	2.3
149	5.9	3.0	5.1	1.8

# View the corresponding class labels
iris_target

'''
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
'''
# Here 0, 1 and 2 stand for the three iris classes 'setosa', 'versicolor' and 'virginica'

# Use value_counts to count the samples per class
pd.Series(iris_target).value_counts()

'''
Output:
2    50
1    50
0    50
dtype: int64
'''

# Compute summary statistics for the features
iris_features.describe()

Output:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

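The summary statistics show the range over which each numeric feature varies.

Visualize the data

# Combine the labels and the features
iris_all=iris_features.copy() # copy() makes a deep copy by default, so the original data is not modified
iris_all['target']=iris_target

# Pairwise scatter plots of the features, colored by label
sns.pairplot(data=iris_all,diag_kind='hist',hue='target')
plt.show()

[Figure: pair plot of the four iris features, colored by target class]
The figure above shows, in 2D, how the classes are distributed for each pair of features and roughly how well each pair separates them.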

for col in iris_features.columns:
    sns.boxplot(x='target',y=col,saturation=0.5,palette='pastel',data=iris_all)
    plt.title(col)
    plt.show()

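[Figure: box plots of each feature grouped by target class]

The box plots likewise show how the distribution of each feature differs across the classes.

# Draw a 3D scatter plot of the first three features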
from mpl_toolkits.mplot3d import Axes3D

fig=plt.figure(figsize=(10,8))
ax=fig.add_subplot(111,projection='3d')

iris_all_class0=iris_all[iris_all['target']==0].values
iris_all_class1=iris_all[iris_all['target']==1].values
iris_all_class2=iris_all[iris_all['target']==2].values
#'setosa'(0),'versicolor'(1),'virginica'(2)
ax.scatter(iris_all_class0[:,0],iris_all_class0[:,1],iris_all_class0[:,2],label='setosa')
ax.scatter(iris_all_class1[:,0],iris_all_class1[:,1],iris_all_class1[:,2],label='versicolor')
ax.scatter(iris_all_class2[:,0],iris_all_class2[:,1],iris_all_class2[:,2],label='virginica')
plt.legend()

plt.show()

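[Figure: 3D scatter plot of the first three features for the three classes]

Train and predict with a logistic regression model on binary classification

# To evaluate the model properly, split the data into a training set and a test set, train on the training set, and validate on the test set.
from sklearn.model_selection import train_test_split
# Select the samples of classes 0 and 1 (excluding class 2)
iris_features_part=iris_features.iloc[:100]
iris_target_part=iris_target[:100]
# Hold out 20% of the data as the test set (80%/20% split)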
x_train,x_test,y_train,y_test=train_test_split(iris_features_part,iris_target_part,test_size=0.2,random_state=2020)

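# Import the logistic regression model from sklearn
from sklearn.linear_model import LogisticRegression

# Define the logistic regression model
clf=LogisticRegression(random_state=0,solver='lbfgs')

# Train the logistic regression model on the training set
clf.fit(x_train,y_train)

'''
Output: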
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)
'''

# View the weights w
print('the weight of Logistic Regression:',clf.coef_)

# View the intercept w0
print('the intercept(w0) of Logistic Regression:',clf.intercept_)

'''
Output:
the weight of Logistic Regression: [[ 0.45244919 -0.81010583  2.14700385  0.90450733]]
the intercept(w0) of Logistic Regression: [-6.57504448]
'''

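# Use the trained model to predict on the training set and the test set
train_predict=clf.predict(x_train)
test_predict=clf.predict(x_test)

from sklearn import metrics
# Evaluate with accuracy: the fraction of predicted samples that are predicted correctly
# The first line is the training-set accuracy, the second the test-set accuracy
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))

# View the confusion matrix (counts of each combination of true and predicted labels)
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predicted labels
confusion_matrix_result=metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result,annot=True,cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

'''
Output: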
The accuracy of the Logistic Regression is 1.0
The accuracy of the Logistic Regression is: 1.0
The confusion matrix result:
 [[ 9  0]
 [ 0 11]]
'''

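[Figure: heatmap of the binary-classification confusion matrix]
The accuracy is 1, meaning that every sample was predicted correctly.

Train and predict with a logistic regression model on three-class (multi-class) classification

# Hold out 20% of the data as the test set (80%/20% split)
x_train,x_test,y_train,y_test=train_test_split(iris_features,iris_target,test_size=0.2,random_state=2020)
# Define the logistic regression model
clf=LogisticRegression(random_state=0,solver='lbfgs')
# Train the logistic regression model on the training set
clf.fit(x_train,y_train)

'''
Output: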
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)
'''

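# View the weights w
print('The weight of Logistic Regression:\n',clf.coef_)
# View the intercepts w0
print('The intercept(w0) of Logistic Regression:\n',clf.intercept_)
# This is a three-class problem, so we get the parameters of three logistic regression models; combining the three gives the three-class classifier

'''
Output: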
The weight of Logistic Regression:
 [[-0.43538857  0.87888013 -2.19176678 -0.94642091]
 [-0.39434234 -2.6460985   0.76204684 -1.35386989]
 [-0.00806312  0.11304846  2.52974343  2.3509289 ]]
The intercept(w0) of Logistic Regression:
 [  6.30620875   8.25761672 -16.63629247]
'''
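To make the "three models combined" idea concrete, here is a minimal sketch of one-vs-rest scoring using the weights printed above. It assumes the model was fit in one-vs-rest mode, as the printed `multi_class='ovr'` suggests; newer sklearn versions default to a multinomial (softmax) formulation for this solver, where the combination rule is different. The sample is the first row shown by `iris_features.head()`, and `ovr_proba` is an illustrative helper name, not an sklearn function.

```python
import math

# Weights and intercepts copied from the printed three-class output above
W = [[-0.43538857, 0.87888013, -2.19176678, -0.94642091],
     [-0.39434234, -2.6460985, 0.76204684, -1.35386989],
     [-0.00806312, 0.11304846, 2.52974343, 2.3509289]]
b = [6.30620875, 8.25761672, -16.63629247]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ovr_proba(x):
    """One-vs-rest: score each class with its own sigmoid, then normalize to sum to 1."""
    scores = [sigmoid(bk + sum(w * xi for w, xi in zip(wk, x)))
              for wk, bk in zip(W, b)]
    total = sum(scores)
    return [s / total for s in scores]

x = [5.1, 3.5, 1.4, 0.2]   # first row of iris_features.head()
proba = ovr_proba(x)
pred = max(range(len(proba)), key=lambda k: proba[k])
print(pred, [round(p, 3) for p in proba])
```

The predicted class is 0 (setosa), which matches the label of that first sample.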

# Use the trained model to predict on the training set and the test set
train_predict=clf.predict(x_train)
test_predict=clf.predict(x_test)

# Logistic regression is a probabilistic model (p=p(y=1|x,\theta), see Section III), so predict_proba returns the class probabilities
train_predict_proba=clf.predict_proba(x_train)
test_predict_proba=clf.predict_proba(x_test)

print('The test predict Probability of each class:\n',test_predict_proba)
# The first column is the predicted probability of class 0, the second of class 1, and the third of class 2

# Evaluate with accuracy (the fraction of predicted samples that are predicted correctly)
# The first line is the training-set accuracy, the second the test-set accuracy
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))
print('The accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))

'''
Output:
The test predict Probability of each class:
 [[  1.32525870e-04   2.41745142e-01   7.58122332e-01]
 [  7.02970475e-01   2.97026349e-01   3.17667822e-06]
 [  3.37367886e-02   7.25313901e-01   2.40949311e-01]
 [  5.66207138e-03   6.53245545e-01   3.41092383e-01]
 [  1.06817066e-02   6.72928600e-01   3.16389693e-01]
 [  8.98402870e-04   6.64470713e-01   3.34630884e-01]
 [  4.06382037e-04   3.86192249e-01   6.13401369e-01]
 [  1.26979439e-01   8.69440588e-01   3.57997319e-03]
 [  8.75544317e-01   1.24437252e-01   1.84312617e-05]
 [  9.11209514e-01   8.87814689e-02   9.01671605e-06]
 [  3.86067682e-04   3.06912689e-01   6.92701243e-01]
 [  6.23261939e-03   7.19220636e-01   2.74546745e-01]
 [  8.90760124e-01   1.09235653e-01   4.22292409e-06]
 [  2.32339490e-03   4.47236837e-01   5.50439768e-01]
 [  8.59945211e-04   4.22804376e-01   5.76335679e-01]
 [  9.24814068e-01   7.51814638e-02   4.46852786e-06]
 [  2.01307999e-02   9.35166320e-01   4.47028801e-02]
 [  1.71215635e-02   5.07246971e-01   4.75631465e-01]
 [  1.83964097e-04   3.17849048e-01   6.81966988e-01]
 [  5.69461042e-01   4.30536566e-01   2.39269631e-06]
 [  8.26025475e-01   1.73971556e-01   2.96936737e-06]
 [  3.05327704e-04   5.15880492e-01   4.83814180e-01]
 [  4.69978972e-03   2.90561777e-01   7.04738434e-01]
 [  8.61077168e-01   1.38915993e-01   6.83858427e-06]
 [  6.99887637e-04   2.48614010e-01   7.50686102e-01]
 [  5.33421842e-02   8.31557126e-01   1.15100690e-01]
 [  2.34973018e-02   3.54915328e-01   6.21587370e-01]
 [  1.63311193e-03   3.48301765e-01   6.50065123e-01]
 [  7.72156866e-01   2.27838662e-01   4.47157219e-06]
 [  9.30816593e-01   6.91640361e-02   1.93708074e-05]]
The accuracy of the Logistic Regression is: 0.958333333333
The accuracy of the Logistic Regression is: 0.8
'''
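The link between the probability matrix and the predicted labels is simply "take the column with the largest value in each row". A small sketch, using the first three rows of the printed output (the variable names are chosen for this illustration):

```python
# First three rows of the printed test_predict_proba output above
rows = [
    [1.32525870e-04, 2.41745142e-01, 7.58122332e-01],
    [7.02970475e-01, 2.97026349e-01, 3.17667822e-06],
    [3.37367886e-02, 7.25313901e-01, 2.40949311e-01],
]

# The predicted class is the column index with the highest probability
predicted = [max(range(len(r)), key=lambda k: r[k]) for r in rows]

# Each row is a probability distribution over the three classes
row_sums = [sum(r) for r in rows]

print(predicted)   # [2, 0, 1]
```

With the full array in hand, `test_predict_proba.argmax(axis=1)` in numpy expresses the same rule.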

# View the confusion matrix
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predicted labels
confusion_matrix_result=metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

# Visualize the result as a heatmap
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result,annot=True,cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

'''
Output:
The confusion matrix result:
 [[10  0  0]
 [ 0  7  3]
 [ 0  3  7]]
'''

[Figure: heatmap of the three-class confusion matrix]
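The test accuracy of 0.8 can be read directly off the confusion matrix, along with per-class recall and precision. The sketch below copies the printed matrix; this particular matrix is symmetric, so the numbers are the same whichever axis holds the true labels. The variable names are chosen for this illustration.

```python
# Three-class confusion matrix copied from the printed output above
# (treated here as rows = true labels, columns = predicted labels)
cm = [[10, 0, 0],
      [0, 7, 3],
      [0, 3, 7]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

# Recall of class k: correct predictions of k / samples whose true label is k
recall = [cm[k][k] / sum(cm[k]) for k in range(n_classes)]

# Precision of class k: correct predictions of k / samples predicted as k
precision = [cm[k][k] / sum(cm[i][k] for i in range(n_classes))
             for k in range(n_classes)]

print(accuracy)    # 0.8
print(recall)      # [1.0, 0.7, 0.7]
print(precision)   # [1.0, 0.7, 0.7]
```

All six errors are confusions between classes 1 and 2 (versicolor and virginica); class 0 (setosa) is classified perfectly.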

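III. A brief introduction to the principles of logistic regression

Despite having "regression" in its name, logistic regression is actually a classification method. It is mainly used for two-class problems (the output takes only two values, one per class), so it uses the logistic function (also called the sigmoid function), which has the form:

$$\operatorname{logi}(z)=\frac{1}{1+e^{-z}}$$

The graph of the function can be drawn as follows: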

import numpy as np
import matplotlib.pyplot as plt
x=np.arange(-5,5,0.01)
y=1/(1+np.exp(-x))

plt.plot(x,y)
plt.xlabel('z')
plt.ylabel('y')
plt.grid()
plt.show()

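[Figure: the logistic (sigmoid) curve, y plotted against z]

The figure above shows that the logistic function is monotonically increasing and takes the value 0.5 at z=0; its values lie in the interval (0,1).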
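These properties are easy to confirm numerically. The following sketch evaluates the function on a grid from -5 to 5 (the same range as the plot) and checks each property in turn; the grid spacing and the extra symmetry check are choices made for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1/(1+exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

zs = [i / 10 for i in range(-50, 51)]   # grid from -5 to 5 in steps of 0.1
ys = [sigmoid(z) for z in zs]

value_at_zero = sigmoid(0.0)                                # 0.5
is_increasing = all(a < b for a, b in zip(ys, ys[1:]))      # strictly increasing on the grid
in_unit_interval = all(0.0 < v < 1.0 for v in ys)           # stays inside (0,1)
is_symmetric = all(abs(sigmoid(-z) - (1 - sigmoid(z))) < 1e-12 for z in zs)

print(value_at_zero, is_increasing, in_unit_interval, is_symmetric)
```

The symmetry sigmoid(-z) = 1 - sigmoid(z) is what lets a single output serve both classes: the probability of class 0 is one minus the probability of class 1.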
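The basic regression equation is

$$z=w_0+\sum_{i=1}^{N}w_i x_i$$

Substituting the regression equation into the logistic function gives:

$$p=p(y=1|x,\theta)=h_\theta(x)=\frac{1}{1+e^{-(w_0+\sum_{i=1}^{N}w_i x_i)}}$$

So $p(y=1|x,\theta)=h_\theta(x)$ and $p(y=0|x,\theta)=1-h_\theta(x)$.
In principle, logistic regression implements a decision boundary: for the function $y=\frac{1}{1+e^{-z}}$, when z≥0, y≥0.5 and the sample is classified as 1; when z<0, y<0.5 and the sample is classified as 0. The corresponding y value can be read as the predicted probability of class 1.
Training the model amounts to using the data to solve for the specific w of the model, which gives a logistic regression model fitted to the features of the current data.
For multi-class problems, combining several binary logistic regressions yields a multi-class classifier.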
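To illustrate what "solving for w" can look like, here is a minimal sketch that fits w0 and w by plain batch gradient descent on the mean log-loss, using the six-point demo dataset from Section II. This is not how sklearn trains the model (it uses the lbfgs solver with L2 regularization by default), so the resulting weights will differ from the printed ones; the learning rate, iteration count and function names are assumptions made for this example.

```python
import math

# Demo dataset from Section II (two features, labels 0/1)
X = [[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]]
y = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(w0, w, x):
    """z = w0 + sum_i w_i * x_i"""
    return w0 + sum(wj * xj for wj, xj in zip(w, x))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent on the mean log-loss (no regularization)."""
    w0, w = 0.0, [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_iter):
        grad_w0, grad_w = 0.0, [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(linear(w0, w, xi)) - yi   # p(y=1|x) - y
            grad_w0 += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        w0 -= lr * grad_w0
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return w0, w

w0, w = fit_logistic(X, y)
# Classify as 1 when z >= 0, i.e. when the predicted probability is >= 0.5
pred = [1 if linear(w0, w, xi) >= 0 else 0 for xi in X]
print(pred)   # [0, 0, 0, 1, 1, 1]
```

Because this toy dataset is linearly separable, the unregularized weights keep growing with more iterations; the regularization in sklearn's default settings is what keeps its reported weights moderate.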