Logistic Regression: Principles and scikit-learn Implementation

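This article looks at logistic regression in applications such as ad click prediction and medical diagnosis. It covers how the model works, the sigmoid activation function, the log-likelihood loss, L1/L2 regularization, the scikit-learn API, and a worked example. It also covers classification evaluation methods such as the confusion matrix and precision, and discusses the ROC curve and the AUC metric.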

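I. Introduction to Logistic Regression

1. Use cases

Ad click-through prediction, spam detection, disease diagnosis, financial fraud detection, fake account detection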

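2. How logistic regression works

Input: the result of a linear regression function
Output: a class label
In essence: it solves a classification problem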

2.1 Input

$$h(w) = w_1 x_1 + w_2 x_2 + \dots + b$$

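2.2 Activation function

  • The sigmoid function. The regression result is fed into the sigmoid function, whose output is a probability value between 0 and 1. The default decision threshold is 0.5: an output of at least 0.5 is classified as 1, otherwise as 0.

$$g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

Here $\theta^T x$ is the linear result $h(w)$ from 2.1, with the weights and the bias collected into $\theta$.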
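The two steps above (a linear score, then the sigmoid and a 0.5 threshold) can be sketched in a few lines of NumPy. The weights, bias, and samples below are hypothetical values chosen only for illustration; they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and samples (illustration only)
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[2.0, 1.0],
              [-1.0, 3.0],
              [0.0, -1.0]])

z = X @ w + b                        # linear part: w_1*x_1 + w_2*x_2 + b
proba = sigmoid(z)                   # estimated probability of class 1
labels = (proba >= 0.5).astype(int)  # apply the default 0.5 threshold

print("scores:", z)
print("probabilities:", proba)
print("labels:", labels)
```

A positive linear score always maps to a probability above 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the score.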


3. Loss and optimization

3.1 Loss (log-likelihood loss)

$$cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$


  • The complete loss function, summed over all $m$ samples
    $$cost(h_\theta(x), y) = \sum\limits_{i = 1}^m \left[ -y_i \log(h_\theta(x_i)) - (1 - y_i)\log(1 - h_\theta(x_i)) \right]$$
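As a rough illustration of the summed loss, the sketch below evaluates it for two sets of hypothetical predicted probabilities: one that is confidently right and one that is confidently wrong. The function name `log_loss_sum` and the clipping constant `eps` are assumptions added here to keep the logarithm finite; they are not part of the article.

```python
import numpy as np

def log_loss_sum(y_true, p_pred, eps=1e-15):
    """Summed log-likelihood loss over all samples (eps avoids log(0))."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.sum(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p)))

# Hypothetical labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_good = log_loss_sum(y_true, confident_right)
loss_bad = log_loss_sum(y_true, confident_wrong)
print("loss when mostly right:", loss_good)
print("loss when mostly wrong:", loss_bad)
```

The loss is small when high probabilities are assigned to the true class and grows quickly when the model is confidently wrong.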

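3.2 Optimization

Minimizing the loss (for example with gradient descent) raises the predicted probability for samples whose true class is 1 and lowers it for samples whose true class is 0.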
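One way to see this effect is a minimal batch gradient descent sketch on a toy one-dimensional dataset. This is an assumption-laden illustration, not the solver scikit-learn uses: the function `fit_logistic_gd`, the learning rate, the iteration count, and the data are all made up for the example, and the gradient is averaged over the samples rather than summed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Plain batch gradient descent on the averaged log-likelihood loss."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)        # current predicted probabilities
        grad_w = X.T @ (p - y) / m    # gradient with respect to the weights
        grad_b = np.mean(p - y)       # gradient with respect to the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: negative x values are class 0, positive are class 1
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = fit_logistic_gd(X, y)
p = sigmoid(X @ w + b)
print("weights:", w, "bias:", b)
print("predicted probabilities:", p)
```

After training, the probabilities for the class-1 samples sit above 0.5 and those for the class-0 samples sit below it.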

II. The Logistic Regression API


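L1 and L2 regularization
L1 regularization can drive the coefficients of some terms (for example high-order terms) exactly to 0, as in Lasso regression.
L2 regularization shrinks those coefficients to very small values without making them exactly 0, as in ridge regression.

Ridge regression is linear regression with L2 regularization, and alpha sets the regularization strength.
The larger alpha is (stronger regularization), the smaller the coefficients.
The smaller alpha is (weaker regularization), the larger the coefficients.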
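The same behavior can be checked on scikit-learn's `LogisticRegression`, with one difference worth noting: its parameter `C` is the inverse of the regularization strength, so a smaller `C` plays the role of a larger alpha. The sketch below uses a synthetic dataset from `make_classification` and arbitrary `C` values, both assumptions made for illustration. The `penalty` and `solver` arguments follow the long-standing API and may be spelled differently in newer releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 4 informative features and 16 pure-noise features
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# L1 vs. L2 at the same strength; liblinear supports both penalties
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("coefficients exactly 0 with L1:", n_zero_l1)
print("coefficients exactly 0 with L2:", n_zero_l2)

# Strong vs. weak L2 regularization (default penalty): smaller C, smaller coefficients
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)
norm_strong = float(np.linalg.norm(strong.coef_))
norm_weak = float(np.linalg.norm(weak.coef_))
print("coefficient norm, C=0.01:", norm_strong)
print("coefficient norm, C=100:", norm_weak)
```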

III. Case Study

# Predict breast cancer (benign vs. malignant tumors)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# 1. Load the dataset
names = ['Sample code number',
         'Clump Thickness',
         'Uniformity of Cell Size',
         'Uniformity of Cell Shape',
         'Marginal Adhesion',
         'Single Epithelial Cell Size',
         'Bare Nuclei',
         'Bland Chromatin',
         'Normal Nucleoli',
         'Mitoses',
         'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)

# 2. Basic data processing
# Missing values are marked with '?'; replace them with NaN and drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Features: every column except the sample ID (first) and the label (last)
x = data.iloc[:, 1:-1]
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=15, test_size=0.2)

# 3. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation fitted on the training set
x_test = transfer.transform(x_test)

# 4. Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# 5. Evaluate the model
y_pre = estimator.predict(x_test)

score = estimator.score(x_test, y_test)
print("Accuracy:", score)

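IV. Classification Evaluation Methods

4.1 Confusion matrix

In a classification task, the predicted result and the true result can combine in four different ways, which together form the confusion matrix.

| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |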
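The four cells can be read off with `sklearn.metrics.confusion_matrix`. The labels below are hypothetical, with 1 as the positive class. Passing `labels=[1, 0]` orders the rows and columns so that the result matches the layout of the table above (positive first).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes, positive class first
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel().tolist()

print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```

Without the `labels` argument, scikit-learn sorts the classes in ascending order, which would put the negative class in the first row and column.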

4.2 Precision, accuracy, recall, and F1-score

Formulas

Accuracy:
$$Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$$

Precision:
$$Precision = \frac{TP}{TP + FP}$$

Recall:
$$Recall = \frac{TP}{TP + FN}$$

F1-score:
$$F1 = \frac{2TP}{2TP + FN + FP} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

Precision: of the samples predicted as positive, the fraction that really are positive. It measures how accurate the positive predictions are.
Recall: of the samples that really are positive, the fraction that were found. It measures how completely the positives are found.
F1-score: the harmonic mean of precision and recall. It is used to assess how robust the model is.
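To make the formulas concrete, the sketch below counts TP, FN, FP, and TN on a small set of hypothetical labels, applies the formulas by hand, and compares the results with the corresponding functions in `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

# Apply the formulas directly
accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fn + fp)

print("by hand:", accuracy, precision, recall, f1)
print("sklearn:", accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```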

4.3 The classification report API and its use


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Load the dataset
names = ['Sample code number',
         'Clump Thickness',
         'Uniformity of Cell Size',
         'Uniformity of Cell Shape',
         'Marginal Adhesion',
         'Single Epithelial Cell Size',
         'Bare Nuclei',
         'Bland Chromatin',
         'Normal Nucleoli',
         'Mitoses',
         'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)

# 2. Basic data processing
# Missing values are marked with '?'; replace them with NaN and drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Features: every column except the sample ID (first) and the label (last)
x = data.iloc[:, 1:-1]
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

# 3. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation fitted on the training set
x_test = transfer.transform(x_test)

# 4. Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# 5. Evaluate the model
y_pre = estimator.predict(x_test)

score = estimator.score(x_test, y_test)
print("Accuracy:", score)

# Per-class precision, recall, F1-score, and support
ret = classification_report(y_test, y_pre)
print(ret)
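The report is easier to read with named classes. The sketch below is a standalone illustration on hypothetical labels rather than on the downloaded data. It assumes the dataset's class coding of 2 for benign and 4 for malignant, and uses the `labels`, `target_names`, and `output_dict` parameters of `classification_report`.

```python
from sklearn.metrics import classification_report

# Hypothetical labels using the dataset's coding: 2 = benign, 4 = malignant
y_true = [2, 2, 2, 2, 2, 2, 4, 4, 4, 4]
y_pred = [2, 2, 2, 2, 2, 4, 4, 4, 4, 2]

# Text report with readable class names
print(classification_report(y_true, y_pred, labels=[2, 4],
                            target_names=["benign", "malignant"]))

# The same numbers as a nested dict, convenient for further processing
report = classification_report(y_true, y_pred, labels=[2, 4],
                               target_names=["benign", "malignant"],
                               output_dict=True)
print("malignant recall:", report["malignant"]["recall"])
```

In a medical setting the recall of the malignant class is often the number to watch, since a missed malignant case is usually the more costly error.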


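4.4 ROC Curve and AUC

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

Plotting TPR against FPR as the classification threshold varies gives the ROC curve, and the area under that curve is the AUC metric. The closer AUC is to 1, the better the classifier. An AUC of 0.5 is no better than random guessing, and an AUC below 0.5 is worse than random guessing.

  • ROC curve
    The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal, the classifier predicts 1 with the same probability whether the sample's true class is 1 or 0. In that case AUC = 0.5.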
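# API overview
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_score)
"""
AUC: the area under the ROC curve
y_true: the true labels, with positive samples encoded as 1 and negative samples as 0
y_score: the predicted scores; these can be probability estimates of the positive class, confidence values, or the return value of a classifier method
"""

AUC is mainly used to evaluate binary classification problems with imbalanced classes.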
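A small sketch of how the curve and the area relate: `roc_curve` returns the (FPR, TPR) points obtained by sweeping the threshold over the scores, and the trapezoidal area under those points matches `roc_auc_score`. The four labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted scores for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Area under the ROC points, computed with the trapezoidal rule
auc_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc, "manual area:", auc_manual)
```

Here one negative sample (score 0.4) outranks one positive sample (score 0.35), so three of the four positive-negative pairs are ordered correctly and the AUC is 0.75.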

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load the dataset
names = ['Sample code number',
         'Clump Thickness',
         'Uniformity of Cell Size',
         'Uniformity of Cell Shape',
         'Marginal Adhesion',
         'Single Epithelial Cell Size',
         'Bare Nuclei',
         'Bland Chromatin',
         'Normal Nucleoli',
         'Mitoses',
         'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)

# 2. Basic data processing
# Missing values are marked with '?'; replace them with NaN and drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Features: every column except the sample ID (first) and the label (last)
x = data.iloc[:, 1:-1]
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)

# 3. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation fitted on the training set
x_test = transfer.transform(x_test)

# 4. Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# 5. Evaluate the model
y_pre = estimator.predict(x_test)

score = estimator.score(x_test, y_test)
print("Accuracy:", score)

ret = classification_report(y_test, y_pre)
print(ret)

# Recode the true labels as 0/1 for roc_auc_score:
# malignant (Class 4) becomes the positive class 1, benign (Class 2) becomes 0
y_test = np.where(y_test > 3, 1, 0)
# y_pre holds hard class labels (2 or 4), so this AUC comes from a single operating point;
# estimator.predict_proba(x_test)[:, 1] could be passed instead for a score-based AUC
auc = roc_auc_score(y_true=y_test, y_score=y_pre)

print("AUC:", auc)
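The script above scores AUC on hard predicted labels. A hedged variation is to pass predicted probabilities instead, which lets the metric use the full ranking of the samples. To stay runnable offline, the sketch below uses scikit-learn's bundled `load_breast_cancer` dataset, which is an assumption made for this example and is not the same file as the one downloaded in the article. In the bundled dataset, label 1 means benign and label 0 means malignant.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled dataset (assumption): a stand-in for the article's downloaded file
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression().fit(X_train, y_train)

# AUC from hard 0/1 predictions: a single operating point on the ROC curve
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))
# AUC from the predicted probability of class 1: uses the full ranking
auc_from_proba = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```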
