scikit-learn Machine Learning Notes: Logistic Regression
The logistic regression formula
Formula:
$$
\begin{array}{c} h_{\theta}(x)=g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}} \\ g(z)=\frac{1}{1+e^{-z}} \end{array}
$$
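Output: a probability between 0 and 1; by default, 0.5 is used as the decision threshold.
Note: g(z) is the sigmoid function.
[Figure: graph of the sigmoid function]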
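To make the formula and the 0.5 threshold concrete, here is a minimal NumPy sketch. The helper names `sigmoid` and `predict_class` are hypothetical and chosen only for illustration; they are not part of scikit-learn.

```python
import numpy as np


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_class(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) = g(theta^T x) reaches the threshold, else 0."""
    return int(sigmoid(np.dot(theta, x)) >= threshold)


# The curve rises from near 0, passes through 0.5 at z = 0, and approaches 1.
z = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(z)
print(probs)

# Illustrative parameters and input: theta^T x = 1.0, so the probability exceeds 0.5.
print(predict_class(np.array([1.0, -1.0]), np.array([2.0, 1.0])))
```

Because g(0) = 0.5, thresholding the probability at 0.5 is the same as checking whether θᵀx is non-negative.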
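The loss function of logistic regression
The underlying principle is the same as in linear regression, but because this is a classification problem the loss function is different. It has no closed-form solution, so it is minimized iteratively, for example by gradient descent.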
$$
\operatorname{cost}\left(h_{\theta}(x), y\right)=\left\{\begin{array}{ll} -\log \left(h_{\theta}(x)\right) & \text { if } y=1 \\ -\log \left(1-h_{\theta}(x)\right) & \text { if } y=0 \end{array} \longrightarrow-\log P(Y \mid X)\right.
$$
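A small numeric sketch may help show how this piecewise cost behaves. The function name `sample_cost` and the probabilities 0.9 and 0.1 are illustrative assumptions, not values from the notes.

```python
import numpy as np


def sample_cost(h, y):
    """Cost of one prediction h = h_theta(x) for a true label y in {0, 1}."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)


# True label is 1 in both cases.
confident_right = sample_cost(0.9, 1)  # model assigns 0.9 to the true class
confident_wrong = sample_cost(0.1, 1)  # model assigns only 0.1 to the true class
print(confident_right, confident_wrong)
```

The cost stays close to zero when the model assigns high probability to the true class and grows without bound as that probability approaches zero.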
The complete loss function:
$$
\operatorname{cost}\left(h_{\theta}(x), y\right)=\sum_{i=1}^{m}\left[-y_{i} \log \left(h_{\theta}\left(x_{i}\right)\right)-\left(1-y_{i}\right) \log \left(1-h_{\theta}\left(x_{i}\right)\right)\right]
$$
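The smaller the value of the cost, the more closely the predicted probabilities match the true labels, and in general the more accurate the predicted classes.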
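As a rough illustration of minimizing this loss by gradient descent, the sketch below uses plain NumPy on a tiny made-up dataset. The data, the learning rate `lr`, the iteration count and the function names are all assumptions for illustration; scikit-learn's own solvers work differently and add regularization.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def total_cost(theta, X, y):
    """Summed cross-entropy loss over all m samples."""
    h = sigmoid(X @ theta)
    return float(np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)))


def gradient_descent(X, y, lr=0.05, n_iter=2000):
    """Plain batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y)  # gradient of the summed loss
        theta -= lr * grad
    return theta


# Hypothetical toy data: a bias column of ones plus one feature.
# The classes overlap, so the optimal theta stays finite.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0, 0, 1, 0, 1, 1])

initial_cost = total_cost(np.zeros(2), X, y)
theta = gradient_descent(X, y)
final_cost = total_cost(theta, X, y)
print(initial_cost, final_cost, theta)
```

The gradient Xᵀ(h − y) has the same form as in linear regression; only the hypothesis h changes, which is the sense in which the two methods share a principle.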
The sklearn logistic regression API
• sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0)
• Logistic regression classifier
• coef_: the regression coefficients
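The effect of `C` and the shape of `coef_` can be sketched as follows. In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` shrinks the coefficients more. The toy data and the two `C` values are assumptions for illustration, and the L2 penalty is left at its default rather than passed explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: six samples, two features, binary labels.
X = np.array([[0.0, 0.2], [0.5, 0.1], [1.0, 0.9],
              [2.0, 1.8], [2.5, 2.2], [3.0, 2.9]])
y = np.array([0, 0, 0, 1, 1, 1])

strong = LogisticRegression(C=0.01).fit(X, y)   # strong regularization
weak = LogisticRegression(C=100.0).fit(X, y)    # weak regularization

# For a binary problem, coef_ holds one row with one weight per feature.
coef_shape = weak.coef_.shape
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(coef_shape, strong_norm, weak_norm)
```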
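LogisticRegression case study: predicting benign/malignant breast cancer tumors
Data description
(1) 699 samples with 11 columns. The first column is an id used for lookup, the next 9 columns are medical features related to the tumor, and the last column is a numeric code for the tumor type (2 for benign, 4 for malignant).
(2) The data contain 16 missing values, marked with "?".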
Using pandas
• pd.read_csv('', names=column_names)
• column_names: specifies the column names, ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
• replace(to_replace='', value=): replaces values in the data
• dropna(): returns the data with rows containing missing values removed
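The three calls above can be tried on a small in-memory sample before touching the real file. The three rows and the reduced column list below are made up for illustration; only the "?" marker convention comes from the dataset description.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row excerpt with a reduced set of columns.
raw = io.StringIO("1000025,5,1,2\n1057013,8,?,4\n1017122,8,10,4\n")
column_names = ["Sample code number", "Clump Thickness", "Bare Nuclei", "Class"]

df = pd.read_csv(raw, names=column_names)

# Mark "?" as NaN so that dropna() can remove the affected rows.
cleaned = df.replace(to_replace="?", value=np.nan).dropna()

# The column was read as text because of the "?" marker; convert it back to integers.
cleaned = cleaned.astype({"Bare Nuclei": int})
print(cleaned)
```

A column that contains "?" is read as text, so converting it back to a numeric type after cleaning keeps later steps such as standardization explicit about what they receive.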
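Benign/malignant breast tumor classification workflow
1. Download the data from the web (tool: pandas)
2. Handle missing values and standardize the data
3. Run the LogisticRegression estimator workflow
Code example: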
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np


def logistic_regression():
    '''
    Logistic regression case study
    :return: None
    '''
    # Read the data
    colnames = ['Sample code number', 'Clump Thickness',
                'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli',
                'Mitoses', 'Class']
    data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=colnames)
    # Handle missing values: mark "?" as NaN, then drop those rows
    data = data.replace("?", np.nan)
    data = data.dropna()
    # Split into training and test sets (9 feature columns, 'Class' as target)
    X_train, X_test, y_train, y_test = train_test_split(data[colnames[1:10]], data[colnames[10]], test_size=0.25)
    # Standardize the features, fitting the scaler on the training set only
    std = StandardScaler()
    X_train = std.fit_transform(X_train)
    X_test = std.transform(X_test)
    # Instantiate the model
    lg = LogisticRegression()
    # Train
    lg.fit(X_train, y_train)
    # Predict and evaluate
    y_pre = lg.predict(X_test)
    acc = lg.score(X_test, y_test)
    report = classification_report(y_true=y_test, y_pred=y_pre, labels=[2, 4], target_names=['benign', 'malignant'])
    # Print the results
    print("Predictions:", y_pre)
    print("Accuracy:", acc)
    print("Classification report:", report)
    return None


if __name__ == '__main__':
    logistic_regression()
Results:
Predictions: [2 4 2 4 2 2 2 4 2 2 2 4 2 2 2 2 2 4 2 2 2 4 2 4 2 4 2 2 2 2 2 2 4 2 2 4 2
4 2 2 4 4 4 2 2 2 2 2 2 4 4 2 4 2 2 4 2 2 4 2 2 2 2 4 4 2 4 2 4 2 4 4 2 4
4 4 2 2 2 2 2 2 2 4 4 4 4 2 4 4 4 2 2 4 2 2 4 2 4 4 2 2 2 2 2 2 4 2 2 2 2
2 2 2 4 2 2 4 4 2 4 4 2 2 2 2 4 4 2 4 2 2 4 4 2 4 2 4 4 4 2 2 2 2 2 4 2 2
2 2 4 2 2 2 2 4 4 2 2 4 2 2 4 4 4 2 4 2 4 2 2]
Accuracy: 0.9649122807017544
Classification report:               precision    recall  f1-score   support
      benign       0.95      0.99      0.97       103
   malignant       0.98      0.93      0.95        68
    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.97      0.96      0.96       171
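The table above is the output of `classification_report`, which lists precision, recall and F1 per class. A confusion matrix is a different summary that counts how often each true class was predicted as each class. A minimal sketch with `sklearn.metrics.confusion_matrix` follows; the six labels are made up for illustration and are not taken from the run above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels using the dataset's coding: 2 = benign, 4 = malignant.
y_true = [2, 2, 2, 4, 4, 4]
y_pred = [2, 2, 4, 4, 4, 2]

# Rows are true classes and columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)
```

In this layout the off-diagonal entry in the second row counts malignant tumors predicted as benign, the error that matters most in a screening setting.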