Machine Learning: Logistic Regression

醋酸洋红就是我

Last modified: 2023-06-22 23:36:35


First published: 2023-06-08 21:43:37

Article link: https://blog.csdn.net/qq_40527560/article/details/131095876


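Logistic Regression

Logistic regression is a classification model in machine learning. Despite its name it is a classification algorithm, and it is a go-to tool for binary classification problems.

The input to logistic regression is the output of a linear regression.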

Activation function

sigmoid function
It maps the output of the linear regression onto the interval (0, 1).
g(z) = 1/(1+e^(-z))
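To make the mapping concrete, here is a minimal sketch of the sigmoid function with NumPy. The input values are made up for illustration, and the 0.5 threshold is the conventional default rather than something fixed by this article.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value strictly between 0 and 1."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-regression outputs for five samples
linear_output = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
probabilities = sigmoid(linear_output)
print(probabilities)

# Thresholding at 0.5 (an assumed default) turns probabilities into class labels
labels = (probabilities >= 0.5).astype(int)
print(labels)
```

Large negative scores land near 0, large positive scores land near 1, and a score of exactly 0 maps to 0.5.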

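Loss and optimization

Loss (log-likelihood loss)

y is the true label
cost(h_θ(x), y) = -log(h_θ(x))      if y = 1
cost(h_θ(x), y) = -log(1 - h_θ(x))  if y = 0
When y = 1, we want h_θ(x) to be as large as possible
When y = 0, we want h_θ(x) to be as small as possible

Full loss function
cost(h_θ(x), y) = Σ_i [ -y_i·log(h_θ(x_i)) - (1 - y_i)·log(1 - h_θ(x_i)) ]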
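The loss can be checked numerically. The sketch below sums -y·log(h) - (1-y)·log(1-h) over all samples, where h stands for h_θ(x). The helper name log_loss_sum and the sample values are illustrative assumptions, and the clipping constant is only there to avoid log(0).

```python
import numpy as np

def log_loss_sum(y_true, h):
    """Sum of -y*log(h) - (1-y)*log(1-h) over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs
    h = np.clip(np.asarray(h, dtype=float), 1e-15, 1 - 1e-15)
    return float(np.sum(-y_true * np.log(h) - (1 - y_true) * np.log(1 - h)))

y = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]   # high h for y=1, low h for y=0
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]   # the opposite

loss_good = log_loss_sum(y, confident_and_right)
loss_bad = log_loss_sum(y, confident_and_wrong)
print(loss_good, loss_bad)
```

Predictions that agree with the labels give a small loss, while confident wrong predictions are penalized heavily.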

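Optimization

Gradient descent
Minimizing the loss raises the predicted probability for samples that truly belong to class 1 and lowers it for samples that truly belong to class 0.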
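As a rough illustration of this optimization step, the sketch below runs plain batch gradient descent on the mean log-likelihood loss. The function name fit_logistic_gd, the learning rate, the iteration count, and the toy data are all assumptions chosen for the example. scikit-learn's solvers work differently, so this is not what LogisticRegression does internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean log-likelihood loss."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)            # current predicted probabilities
        grad = X.T @ (h - y) / len(y)     # gradient of the mean loss
        theta -= lr * grad                # step against the gradient
    return theta

# Made-up 1-D data: class 1 tends to have larger x
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = fit_logistic_gd(X, y)
X_bias = np.hstack([np.ones((len(X), 1)), X])
proba = sigmoid(X_bias @ theta)
print(np.round(proba, 3))
```

After training, the predicted probabilities for the class-1 samples sit above 0.5 and those for the class-0 samples sit below it.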

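sklearn.linear_model.LogisticRegression(solver='liblinear',penalty='l2',C=1.0)
solver selects the optimization algorithm, penalty selects the type of regularization, and C is the inverse of the regularization strength (a smaller C means stronger regularization).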

Cancer classification: predicting benign vs. malignant breast tumors

# 1 Load the data
import pandas as pd
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                   'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                   'Normal Nucleoli', 'Mitoses', 'Class']

data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',names=names)
data.head()


import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

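# 2 Basic data processing
# 2.1 Handle missing values (marked with '?' in this dataset)
data=data.replace(to_replace='?',value=np.nan)
data=data.dropna()

# 2.2 Select the features and the target
x=data.iloc[:,1:-1]
y=data['Class']

# 2.3 Split the data
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=22,test_size=0.2)

# 3 Feature engineering (standardization)
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
# Reuse the mean and standard deviation learned from the training set
x_test=transfer.transform(x_test)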

# 4 Machine learning (logistic regression)
estimator=LogisticRegression()
estimator.fit(x_train,y_train)

# 5 Model evaluation
# 5.1 Accuracy
ret=estimator.score(x_test,y_test)
print('Accuracy:\n',ret)

# 5.2 Predictions
y_pre=estimator.predict(x_test)
print('Predictions:\n',y_pre)

Accuracy:
0.9854014598540146
Predictions:
[2 4 4 2 2 2 2 2 2 2 2 2 2 4 2 2 4 4 4 2 4 2 4 4 4 2 4 2 2 2 2 2 4 2 2 2 4
2 2 2 2 4 2 4 4 4 4 2 4 4 2 2 2 2 2 4 2 2 2 2 4 4 4 4 2 4 2 2 4 2 2 2 2 4
2 2 2 2 2 2 4 4 4 2 4 4 4 4 2 2 2 4 2 4 2 2 2 2 2 2 4 2 2 4 2 2 4 2 4 4 2
2 2 2 4 2 2 2 2 2 2 4 2 4 2 2 2 4 2 4 2 2 2 4 2 2 2]

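Classification evaluation methods

Confusion matrix

In a classification task, the confusion matrix is built from the four possible combinations of predicted and true results: true positive (TP), false negative (FN), false positive (FP), and true negative (TN).

Accuracy (is the prediction right?)
The proportion of samples whose prediction matches the true label
(TP+TN)/(TP+FN+FP+TN)

Precision (how accurate are the positive predictions?)
Among the samples predicted as positive, the proportion that are truly positive
(TP)/(TP+FP)

Recall (how complete are the positive predictions?)
Among the samples that are truly positive, the proportion predicted as positive
(TP)/(TP+FN)

F1-score
Reflects the robustness of the model by combining precision and recall
2*Precision*Recall/(Precision+Recall) = (2TP)/(2TP+FN+FP)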
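The four metrics can be computed directly from the confusion-matrix counts, with F1 taken as 2·precision·recall/(precision+recall). The sketch below does this in plain Python. The label lists are made up, and treating 4 (malignant) as the positive class is an assumption that follows the 2/4 labels used in the dataset above.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) for the given positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels: 2 = benign, 4 = malignant (positive class, by assumption)
y_true = [4, 4, 4, 2, 2, 2, 2, 2]
y_pred = [4, 4, 2, 4, 2, 2, 2, 2]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive=4)
accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

Here one malignant case is missed and one benign case is flagged, so precision and recall are both 2/3 even though accuracy is 0.75.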

Classification report API
sklearn.metrics.classification_report(y_true,y_pred,labels=[],target_names=None)
labels lists the class values to include in the report, and target_names gives the display names for those classes

from sklearn.metrics import classification_report

# 5.3 Evaluate with precision and recall
ret=classification_report(y_test,y_pre,labels=(2,4),target_names=('benign','malignant'))
print(ret)

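ROC curve and AUC
Sample imbalance (class imbalance)
As a rule of thumb, a class ratio above about 4:1 is treated as imbalanced. In that case single-threshold metrics such as precision and recall can give a misleading picture on their own, so the ROC curve and AUC are used for evaluation.

TPR (recall) and FPR
TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

The probabilistic meaning of AUC: pick one positive and one negative sample at random; AUC is the probability that the positive sample receives a higher score than the negative one.
AUC lies in [0, 1]. The closer to 1 the better; a value close to 0.5 means the model is effectively guessing at random.
AUC in this form only evaluates binary classifiers.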
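The probabilistic reading of AUC can be checked by brute force: enumerate every (positive, negative) pair and count how often the positive sample scores higher. The sketch below does that. The helper name auc_by_pairs and the scores are made up for illustration, and counting ties as half a win is the usual convention, stated here as an assumption.

```python
from itertools import product

def auc_by_pairs(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5  # ties count as half a win (assumed convention)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]

auc = auc_by_pairs(y_true, y_score)
print(auc)
```

Of the 9 positive-negative pairs, the only one ranked the wrong way is the positive at 0.6 against the negative at 0.7, which gives 8/9.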

sklearn.metrics.roc_auc_score(y_true,y_score)

from sklearn.metrics import roc_auc_score

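# 5.4 Compute the AUC
y_test=np.where(y_test>3,1,0) # set y_test to 1 where it is greater than 3 (malignant), otherwise 0
roc_auc_score(y_test,y_pre)

y_score is the predicted score, for example the predicted probability of the positive class or a decision value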
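Because y_score is meant to be a score rather than a hard label, one option is to pass the positive-class probability from predict_proba. The sketch below shows this under some assumptions: it uses scikit-learn's bundled load_breast_cancer data instead of the UCI file above, so that it runs without a download, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled dataset (an assumption; not the same file as the UCI CSV above)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=22, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of model.classes_[1]
y_score = model.predict_proba(X_test)[:, 1]
auc_from_proba = roc_auc_score(y_test, y_score)

# Passing hard labels also runs, but collapses the ROC curve to one threshold
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))
print(auc_from_proba, auc_from_labels)
```

Both calls are valid; the probability-based score uses the full ranking of the test samples, whereas the label-based one only reflects the single 0.5 cut-off.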

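Drawing the ROC curve

Suppose there are 6 display records and 2 of them were clicked
Display sequence (1:1,2:0,3:1,4:0,5:0,6:0): the number before the colon is the record index; after it, 1 means clicked and 0 means not clicked

The area under the ROC curve (its integral) equals the AUC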
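The construction can be traced by hand for the six-record example. The sketch below assumes the display sequence is already sorted by descending predicted score with no ties. That ordering is an assumption, since the original figures with the scores are not available. It then walks down the list, emitting one (FPR, TPR) point per record and integrating with the trapezoid rule.

```python
def roc_points(labels_by_desc_score):
    """ROC points for labels sorted by descending score (no ties assumed)."""
    n_pos = sum(labels_by_desc_score)
    n_neg = len(labels_by_desc_score) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for label in labels_by_desc_score:
        if label == 1:
            tp += 1   # lowering the threshold past a positive raises TPR
        else:
            fp += 1   # lowering it past a negative raises FPR
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under(points):
    """Trapezoid-rule area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

clicks = [1, 0, 1, 0, 0, 0]   # the sequence 1:1, 2:0, 3:1, 4:0, 5:0, 6:0
points = roc_points(clicks)
auc = area_under(points)
print(points)
print(auc)
```

Under that ordering the area is 0.875, which matches the pair-counting view: 7 of the 8 positive-negative pairs are ranked correctly.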

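Class-imbalanced data

Oversampling: increase the number of samples in the minority class so that the classes become balanced
Undersampling: reduce the number of samples in the majority class so that the classes become balanced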

Preparing imbalanced data

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from collections import Counter

X,y = make_classification(n_samples=5000, n_features=2, 
                          n_informative=2, n_redundant=0, 
                          n_repeated=0, n_classes=3, 
                          n_clusters_per_class=1, # number of clusters that make up each class
                          weights=[0.01, 0.05, 0.94], random_state=0)

Counter(y)

Counter({2: 4674, 1: 262, 0: 64})

Visualizing the data

plt.scatter(X[:, 0], X[:, 1], c=y) # c=y colors each point by its class label
plt.show()

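Oversampling
Random oversampling
Randomly select samples from the minority class and duplicate them to build the resampled dataset
Drawbacks: the larger dataset makes training more expensive, and the duplicated samples can cause overfitting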

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
Counter(y_resampled)

Counter({2: 4674, 1: 4674, 0: 4674})

# Visualize the data
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()
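To show what random oversampling does under the hood, here is a minimal NumPy sketch that duplicates randomly chosen minority samples until every class matches the largest one. The function name random_oversample and the toy data are assumptions for illustration. This is not the imblearn implementation.

```python
from collections import Counter
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random samples of each smaller class up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing samples with replacement from the same class
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Made-up data: 7 / 2 / 1 samples in classes 0 / 1 / 2
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 7 + [1] * 2 + [2] * 1)

X_over, y_over = random_oversample(X_toy, y_toy)
print(Counter(y_over.tolist()))
```

Every row in the result is a copy of an original row, which is why the scatter plot of randomly oversampled data looks the same as the original.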

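SMOTE oversampling
For each minority-class sample, randomly pick one of its nearest neighbors (itself a minority-class sample), then pick a random point on the line segment between the two as a new synthetic minority-class sample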

from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
Counter(y_resampled)

Counter({2: 4674, 1: 4674, 0: 4674})

# Visualize the data
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()
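The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imblearn implementation: the function name smote_like, the choice of k=3, and the toy points are assumptions, and it picks the base sample at random instead of cycling through every minority sample.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per point
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))           # a random minority sample
        nb = neighbors[base, rng.integers(k)]     # one of its k nearest neighbors
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Made-up minority samples at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=10)
print(np.round(synthetic, 3))
```

Because each new point lies on a segment between two existing minority samples, the synthetic points fill in the region between them instead of stacking on top of the originals.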

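Undersampling
Random undersampling
Randomly select some samples from the majority class and remove them from the dataset
Drawback: some information is lost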

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
Counter(y_resampled)

Counter({0: 64, 1: 64, 2: 64})

# Visualize the data
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
plt.show()
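Random undersampling is the mirror image of random oversampling: keep a random subset of each larger class so that all classes shrink to the size of the smallest one. The NumPy sketch below illustrates this. The function name random_undersample and the toy data are assumptions, and it is not the imblearn implementation.

```python
from collections import Counter
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        # Sample without replacement so no row is kept twice
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    return X[keep], y[keep]

# Made-up data: 7 / 3 / 2 samples in classes 0 / 1 / 2
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0] * 7 + [1] * 3 + [2] * 2)

X_under, y_under = random_undersample(X_toy, y_toy)
print(Counter(y_under.tolist()))
```

The discarded majority samples are simply gone, which is the information loss named as the drawback above.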
