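I. Introduction to Logistic Regression
1. Use cases
Ad click-through prediction, spam detection, disease diagnosis, financial fraud detection, fake account detection
2. How logistic regression works
Input: the output of a linear regression function
Output: a class label
In essence: it solves a classification problem, not a regression problem
2.1 Input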
$$h(w) = w_1x_1 + w_2x_2 + \dots + b$$
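2.2 Activation function
- The sigmoid function. The regression result is fed into the sigmoid function. The output is a probability value between 0 and 1, and the default decision threshold is 0.5.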
$$g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
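The two formulas above can be chained in a few lines. The following is a minimal sketch rather than any library's implementation; the weights, bias, and sample values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def linear_score(weights, bias, features):
    """Weighted sum w_1*x_1 + w_2*x_2 + ... + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(weights, bias, features, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    probability = sigmoid(linear_score(weights, bias, features))
    return 1 if probability >= threshold else 0

# Hypothetical parameters and sample, chosen only for illustration
weights = [0.8, -0.4]
bias = -0.2
sample = [2.0, 1.0]

z = linear_score(weights, bias, sample)  # 0.8*2.0 - 0.4*1.0 - 0.2, about 1.0
p = sigmoid(z)                           # about 0.73
print(z, p, predict_class(weights, bias, sample))
```

Because the probability is above the 0.5 threshold, this sample is assigned to class 1.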
3. Loss and optimization
3.1 Loss (log-likelihood loss)
$$cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
- The complete loss function, summed over all m samples
$$cost(h_\theta(x), y) = \sum_{i=1}^{m} \left[ -y_i \log(h_\theta(x_i)) - (1 - y_i) \log(1 - h_\theta(x_i)) \right]$$
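To make the loss concrete, here is a small sketch that evaluates both the per-sample form and the summed form. The probabilities and labels are made-up values, not results from any model.

```python
import math

def log_loss_single(probability, label):
    """Loss for one sample: -log(p) if label == 1, -log(1 - p) if label == 0."""
    return -math.log(probability) if label == 1 else -math.log(1.0 - probability)

def log_loss_total(probabilities, labels):
    """Sum of -y*log(p) - (1 - y)*log(1 - p) over all samples."""
    return sum(
        -y * math.log(p) - (1 - y) * math.log(1.0 - p)
        for p, y in zip(probabilities, labels)
    )

# Hypothetical predicted probabilities h(x) and true labels y
probabilities = [0.9, 0.2, 0.6]
labels = [1, 0, 1]

confident_and_right = log_loss_single(0.9, 1)  # small loss
confident_and_wrong = log_loss_single(0.1, 1)  # large loss
total = log_loss_total(probabilities, labels)
print(confident_and_right, confident_and_wrong, total)
```

A confident correct prediction costs little, while a confident wrong prediction is penalized heavily. The sketch assumes every probability lies strictly between 0 and 1, since log(0) is undefined.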
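3.2 Optimization
Minimizing the loss raises the predicted probability for samples whose true class is 1 and lowers it for samples whose true class is 0.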
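The text does not name an optimizer, so the sketch below assumes plain batch gradient descent on a single feature. The data, learning rate, and iteration count are all hypothetical; the point is only to show the loss falling while class-1 probabilities rise and class-0 probabilities fall.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(w, b, xs, ys):
    """Log-likelihood loss summed over all samples."""
    loss = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return loss

def gradient_step(w, b, xs, ys, learning_rate):
    """One batch update; for each sample d(loss)/d(w*x + b) equals (p - y)."""
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
    grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys))
    return w - learning_rate * grad_w, b - learning_rate * grad_b

# Hypothetical one-feature data: negative x is class 0, positive x is class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0
loss_before = total_loss(w, b, xs, ys)
for _ in range(100):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.1)
loss_after = total_loss(w, b, xs, ys)

print(loss_before, loss_after)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```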
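II. Logistic Regression API
L1 and L2 regularization
L1 regularization can drive the coefficients of high-order terms to exactly 0 (Lasso regression).
L2 regularization shrinks the coefficients of high-order terms to very small values (Ridge regression).
Ridge: linear regression with L2 regularization
alpha: the regularization strength
The larger alpha is (stronger regularization), the smaller the coefficients
The smaller alpha is (weaker regularization), the larger the coefficients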
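The relationship between alpha and coefficient size can be checked directly. The sketch below fits scikit-learn's `Ridge` on synthetic data (generated here only for illustration) and prints the L2 norm of the coefficient vector for three alpha values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data, generated only for illustration
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

coef_norms = {}
for alpha in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X, y)
    # L2 norm of the learned coefficient vector
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(alpha, coef_norms[alpha])
```

Note that scikit-learn's `LogisticRegression` expresses the same idea through the parameter `C`, which is the inverse of the regularization strength: a smaller `C` means stronger regularization.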
III. Case study
# Predict breast cancer (benign vs. malignant)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# 1. Load the dataset
names = ['Sample code number',
         'Clump Thickness',
         'Uniformity of Cell Size',
         'Uniformity of Cell Shape',
         'Marginal Adhesion',
         'Single Epithelial Cell Size',
         'Bare Nuclei',
         'Bland Chromatin',
         'Normal Nucleoli',
         'Mitoses',
         'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)
# 2. Basic data processing
# Missing values are marked with '?': replace them with NaN, then drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Features are every column except the sample ID and the label.
# 'Bare Nuclei' is read as strings because of the '?' markers, so cast to float.
x = data.iloc[:, 1:-1].astype(float)
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=15, test_size=0.2)
# 3. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation learned from the training set
x_test = transfer.transform(x_test)
# 4. Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
# 5. Evaluate the model
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print("Accuracy:", score)
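One way to keep the scaling and training steps together is scikit-learn's `make_pipeline`, which fits the scaler on the training data only and reuses those statistics at prediction time. The sketch below is an alternative arrangement, not the script's own method, and it uses synthetic data as a stand-in for the nine features so that it runs without a download.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 features (no download needed)
X, y = make_classification(n_samples=500, n_features=9, random_state=15)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=15, test_size=0.2
)

# fit() standardizes with training statistics only; score() reuses them on the test set
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print("Accuracy:", accuracy)
```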
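IV. Classification evaluation methods
4.1 Confusion matrix
In a classification task, there are four possible combinations of predicted result and true result. Together they form the confusion matrix.
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |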
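The four cells can be counted with a short loop. This is a minimal sketch with hypothetical labels; the function name `confusion_counts` is introduced here only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 3 1 1 5
```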
4.2 Precision, accuracy, recall, and F1-score
Formulas:
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2TP}{2TP + FN + FP} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
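Precision
: of the samples predicted positive, the fraction that are truly positive; it measures how accurate the positive predictions are
Recall (detection rate)
: of the truly positive samples, the fraction that were found; it measures how completely the positives are found
F1-score
: the harmonic mean of precision and recall; it measures the robustness of the model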
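As a worked example, the four metrics can be computed from the confusion-matrix counts. The counts below are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fn + fp)
    return accuracy, precision, recall, f1

# Hypothetical counts: 3 TP, 1 FN, 1 FP, 5 TN
accuracy, precision, recall, f1 = classification_metrics(tp=3, fn=1, fp=1, tn=5)
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75

# The count-based and rate-based forms of F1 agree
f1_from_rates = 2 * precision * recall / (precision + recall)
```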
4.3 Classification report API and implementation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# 1. Load the dataset
names = ['Sample code number',
         'Clump Thickness',
         'Uniformity of Cell Size',
         'Uniformity of Cell Shape',
         'Marginal Adhesion',
         'Single Epithelial Cell Size',
         'Bare Nuclei',
         'Bland Chromatin',
         'Normal Nucleoli',
         'Mitoses',
         'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)
# 2. Basic data processing
# Missing values are marked with '?': replace them with NaN, then drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Features are every column except the sample ID and the label.
# 'Bare Nuclei' is read as strings because of the '?' markers, so cast to float.
x = data.iloc[:, 1:-1].astype(float)
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)
# 3. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation learned from the training set
x_test = transfer.transform(x_test)
# 4. Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
# 5. Evaluate the model
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print("Accuracy:", score)
# Per-class precision, recall, F1-score and support
ret = classification_report(y_test, y_pre)
print(ret)
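The report is easier to read when the numeric classes are given names. The sketch below uses small hypothetical label lists in the dataset's encoding (2 = benign, 4 = malignant) and the `labels`, `target_names`, and `output_dict` parameters of `classification_report`.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels: 2 = benign, 4 = malignant
y_true = [2, 2, 2, 2, 2, 2, 4, 4, 4, 4]
y_pred = [2, 2, 2, 2, 2, 4, 4, 4, 4, 2]

# labels fixes the row order; target_names replaces 2/4 with readable names
print(classification_report(
    y_true, y_pred, labels=[2, 4], target_names=["benign", "malignant"]
))

# output_dict=True returns the same numbers as a nested dict
report = classification_report(
    y_true, y_pred, labels=[2, 4],
    target_names=["benign", "malignant"], output_dict=True,
)
malignant_recall = report["malignant"]["recall"]
print(malignant_recall)
```

For a screening task like this one, the recall of the malignant class is usually the number to watch, since a missed malignant case is the costly error.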
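4.4 ROC curve and AUC
TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
Plotting TPR against FPR gives the ROC curve, and the area under that curve is the AUC metric. The closer AUC is to 1, the better the classifier; an AUC close to 0.5 is no better than random guessing.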
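- ROC curve
The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal, the classifier predicts 1 with the same probability regardless of whether a sample's true class is 1 or 0, and in that case AUC = 0.5.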
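To show how the curve and its area arise, the sketch below builds the ROC points by hand: each distinct score is used as a threshold, the resulting (FPR, TPR) pair is recorded, and the area is accumulated with the trapezoid rule. The labels and scores are hypothetical, and the function names are introduced here only for illustration.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) pairs obtained by using each distinct score as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 = positive) and classifier scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
auc = auc_trapezoid(points)
print(points)
print(auc)  # 0.75
```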
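# API overview
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_score)
"""
Computes AUC, the area under the ROC curve
y_true: the true labels, with positive samples encoded as 1 and negative samples as 0
y_score: the predicted scores; can be the estimated probability of the positive class, a confidence value, or the return value of a classifier method
"""
AUC is mainly used to evaluate binary classifiers on imbalanced data.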
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
# 1. Load the dataset
names = ['Sample code number',
         'Clump Thickness',
         'Uniformity of Cell Size',
         'Uniformity of Cell Shape',
         'Marginal Adhesion',
         'Single Epithelial Cell Size',
         'Bare Nuclei',
         'Bland Chromatin',
         'Normal Nucleoli',
         'Mitoses',
         'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=names)
# 2. Basic data processing
# Missing values are marked with '?': replace them with NaN, then drop those rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Features are every column except the sample ID and the label.
# 'Bare Nuclei' is read as strings because of the '?' markers, so cast to float.
x = data.iloc[:, 1:-1].astype(float)
y = data["Class"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)
# 3. Feature engineering: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation learned from the training set
x_test = transfer.transform(x_test)
# 4. Train the model
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
# 5. Evaluate the model
y_pre = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print("Accuracy:", score)
ret = classification_report(y_test, y_pre)
print(ret)
# roc_auc_score expects binary true labels: map malignant (Class 4) to 1
# (the positive class) and benign (Class 2) to 0
y_test = np.where(y_test > 3, 1, 0)
auc = roc_auc_score(y_true=y_test, y_score=y_pre)
print("AUC:", auc)
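The script above passes the hard class predictions `y_pre` as `y_score`. Since `y_score` may also be a probability estimate, a common alternative is to pass the positive-class column of `predict_proba`, which lets the ROC curve use every threshold instead of a single operating point. The sketch below compares the two on synthetic, imbalanced stand-in data; it is an illustration under those assumptions, not a rerun of the breast cancer experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 90% class 0), for illustration only
X, y = make_classification(
    n_samples=1000, n_features=9, weights=[0.9], random_state=22
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=22, test_size=0.2, stratify=y
)

estimator = LogisticRegression().fit(x_train, y_train)

# Hard 0/1 predictions give the ROC curve a single operating point
auc_from_labels = roc_auc_score(y_test, estimator.predict(x_test))
# Column 1 of predict_proba is the estimated probability of class 1
auc_from_proba = roc_auc_score(y_test, estimator.predict_proba(x_test)[:, 1])
print(auc_from_labels, auc_from_proba)
```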