Hands-On Machine Learning (2nd Edition), Section 2: Binary Classification

菜椒爱菜鸟

Article link: https://blog.csdn.net/z15005953031/article/details/118606287

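This article uses a stochastic gradient descent (SGD) classifier to perform binary classification on the MNIST handwritten digit dataset and evaluates the model with cross-validation, PR and ROC curves, and a confusion matrix. It also compares the result against a random forest classifier, showing the ROC curve and AUC score, and discusses when to choose the ROC curve or the PR curve.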

Using a stochastic gradient descent classifier on the handwritten digits example

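mnist = fetch_openml('mnist_784', version=1)  # download the MNIST dataset from OpenML
x = mnist['data']
y = mnist['target']
y = y.astype(np.uint8)  # cast the labels to integers

# print(x.iloc[0].values, y[0])
image = x.iloc[0].values.reshape(28, 28)  # a Series has no reshape, but its .values array does
# plt.imshow(image, cmap='binary')
# plt.axis('off')
# plt.show()
X_train, X_test, y_train, y_test = x[:60000], x[60000:], y[:60000], y[60000:]

# Set up a simple binary classifier: "5" versus "not 5"
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

sgd_clf = SGDClassifier(random_state=42)  # SGD classifier: quickly fits a classifier defined by a loss function
'''
Each iteration draws one sample at random from the training set. With a very large training set, the model may reach an acceptable loss before every sample has been used.
The drawback is that a single sample can be noisy, so not every iteration moves toward the overall optimum.
'''
sgd_clf.fit(X_train, y_train_5)
# print(sgd_clf.predict([x.iloc[0].values]))  # each image passed in must be a flattened 1D vector (wrapped in a list to form one row)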
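
To make the "one random sample per iteration" idea concrete, here is a minimal sketch in plain Python. It assumes hinge loss (the default loss of SGDClassifier) and leaves out regularization and the learning-rate schedule, so it is an illustration rather than what scikit-learn actually runs. The function names, the toy data, and the fixed learning rate are all made up for this example.

```python
import random

def sgd_hinge_fit(X, y, lr=0.1, n_steps=200, seed=42):
    """Toy SGD for a linear classifier with hinge loss; labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_steps):
        i = rng.randrange(len(X))  # draw ONE training sample at random
        margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
        if margin < 1:  # hinge loss is active for this sample: take a gradient step
            w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
            b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 if the decision score is positive, otherwise -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable toy data (not MNIST)
X_toy = [(2, 1), (1, 2), (2, 2), (3, 1), (-1, -2), (-2, -1), (-2, -2), (-1, -3)]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = sgd_hinge_fit(X_toy, y_toy)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X_toy, y_toy)) / len(X_toy)
print(accuracy)
```

Because each step looks at a single sample, an individual update can point in a poor direction. On well-separated data such as this toy set the noise does not matter, but on real data it is the reason the loss curve of SGD is jagged.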

Hand-written k-fold cross-validation

# The split needs both the data and the labels
skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    print(train_index, test_index)
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train_5.iloc[train_index]  # positional indexing, consistent with X_train.iloc
    X_test_fold = X_train.iloc[test_index]
    y_test_fold = y_train_5.iloc[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9669  0.91625 0.96785
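
The point of a stratified split is that every fold keeps the class proportions of the whole dataset. The sketch below shows that idea in plain Python by dealing the indices of each class round-robin into the folds. It is only an illustration: it does not shuffle, it is not how StratifiedKFold is implemented, and the function name and toy labels are invented for this example.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into n_splits folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return [sorted(fold) for fold in folds]

# Hypothetical labels: 3 positives ("is a 5") among 30 samples, i.e. 10% positives
labels = [True] * 3 + [False] * 27
folds = stratified_folds(labels, n_splits=3)

fold_sizes = [len(fold) for fold in folds]
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(fold_sizes, positives_per_fold)
```

Each fold ends up with 10 samples and exactly one positive, so the 10% positive rate of the full set is preserved in every fold. A plain unstratified split of the same data could easily leave a fold with no positives at all.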

Evaluating the model: PR curve, ROC curve, and confusion matrix

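# Cross-validation with the library helper: returns the score of each fold
arry = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# print(arry)


# A confusion matrix needs predictions first. Use cross-validated predictions, so each prediction comes from a model that never saw that sample during training
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)  # returns the prediction made for each sample while it was in the held-out fold

crossarry = confusion_matrix(y_train_5, y_train_pred)
# Recall is defined relative to the actual samples: the share of actual positives that were predicted correctly. Precision is defined relative to the predictions: the share of samples predicted positive that are truly positive.
recall = crossarry[1][1] / (crossarry[1][1] + crossarry[1][0])
print('recall', recall)
# y_5 = y_train_5
# # Confusion matrix rows are actual classes (row 0 negative, row 1 positive); columns are predicted classes (column 0 negative, column 1 positive)
# print(confusion_matrix(y_train_5, y_5))
# The same metrics with the library helpers
print(precision_score(y_train_5, y_train_pred))
print(recall_score(y_train_5, y_train_pred))
print(f1_score(y_train_5, y_train_pred))  # F1 is high only when both recall and precision are high

y_scores = sgd_clf.decision_function([x.iloc[0].values])  # decision score behind the prediction: how the model rates the input sample
print('y_scores', y_scores)
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

# To choose a suitable threshold for this classifier, get the decision score of every training instance
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
print('scores', y_scores, 'y5', y_train_5)  # scores: decision scores; y5: whether each label is a 5 (True/False)
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)  # precision and recall for every possible threshold
print(precisions, recalls, thresholds)
# plot_precision_recall_vs_threshold(precisions, recalls, thresholds)  # precision and recall versus threshold
# plt.show()
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]  # np.argmax returns the index of the first maximum; on this 1D boolean array that is the first True, i.e. the lowest threshold with precision >= 0.90
print(threshold_90_precision)
y_train_pred_90 = (y_scores >= threshold_90_precision)
print(precision_score(y_train_5, y_train_pred_90), recall_score(y_train_5, y_train_pred_90))  # 0.9000345901072293 0.4799852425751706

'''
ROC curve
1. Plot: the dashed line is the ROC curve of a purely random classifier; a good classifier stays as far from that line as possible (toward the top-left corner).
2. One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, while a purely random classifier has a ROC AUC of 0.5.
'''
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
# plot_roc_curve(fpr, tpr)
# plt.show()
print(roc_auc_score(y_train_5, y_scores))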
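
The trade-off behind the threshold choice above can be checked by hand on a few numbers. The sketch below computes precision, recall, and F1 directly from the confusion-matrix counts at two thresholds. The six scores and labels are invented for illustration and are not MNIST results; `metrics_at` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical decision scores and true labels (True means "is a 5")
scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]
y_true = [False, False, True, False, True, True]

def metrics_at(threshold):
    """Return (precision, recall, f1) when predicting positive for score >= threshold."""
    y_pred = [s >= threshold for s in scores]
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # actual positive, predicted positive
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # actual negative, predicted positive
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # actual positive, predicted negative
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

low = metrics_at(0.0)   # permissive threshold
high = metrics_at(1.5)  # strict threshold
print(low)
print(high)
```

With the permissive threshold every positive is found (recall 1.0) but one negative slips in (precision 0.75). With the strict threshold precision rises to 1.0 while recall drops to 2/3, which is the same pattern as the 90% precision and roughly 48% recall reported above.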

Comparison with a random forest

forest_clf = RandomForestClassifier(random_state=42)  # random forest
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
print(y_probas_forest)  # for each image, the probability of the negative class and of the positive class
y_scores_forest = y_probas_forest[:, 1]  # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
print(roc_auc_score(y_train_5, y_scores_forest))  # 0.9983436731328145

Result: [figure: ROC curves of the SGD classifier and the random forest classifier]
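
One way to read the AUC numbers is as a ranking score: ROC AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The sketch below computes it that way in plain Python. It is a brute-force illustration on invented scores, not the algorithm roc_auc_score uses, and the function name is hypothetical.

```python
def roc_auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, y_true) if t]
    neg = [s for s, t in zip(scores, y_true) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and three hypothetical scorers
y_true = [False, False, True, False, True, True]
noisy_scores = [-3.0, -1.0, 0.5, 1.0, 2.0, 4.0]    # one positive ranked below a negative
perfect_scores = [-3.0, -1.0, 2.5, 1.0, 2.0, 4.0]  # every positive above every negative
constant_scores = [0.0] * 6                        # uninformative: all samples tied

auc_noisy = roc_auc_by_ranking(y_true, noisy_scores)
auc_perfect = roc_auc_by_ranking(y_true, perfect_scores)
auc_constant = roc_auc_by_ranking(y_true, constant_scores)
print(auc_noisy, auc_perfect, auc_constant)
```

The perfect ranking gives 1, the uninformative scorer gives 0.5, and the noisy scorer lands in between at 8/9. Seen this way, the random forest's AUC of about 0.998 means it almost always scores a real 5 above a non-5.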

Full code

from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier


def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more entry than thresholds, so drop the last one
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    # add the legend, axis label, and grid
    plt.legend()
    plt.xlabel("Threshold")
    plt.grid(True)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # Dashed diagonal




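mnist = fetch_openml('mnist_784', version=1)  # download the MNIST dataset from OpenML
x = mnist['data']
y = mnist['target']
y = y.astype(np.uint8)  # cast the labels to integers

# print(x.iloc[0].values, y[0])
image = x.iloc[0].values.reshape(28, 28)  # a Series has no reshape, but its .values array does
# plt.imshow(image, cmap='binary')
# plt.axis('off')
# plt.show()
X_train, X_test, y_train, y_test = x[:60000], x[60000:], y[:60000], y[60000:]

# Set up a simple binary classifier: "5" versus "not 5"
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)

sgd_clf = SGDClassifier(random_state=42)  # SGD classifier: quickly fits a classifier defined by a loss function
'''
Each iteration draws one sample at random from the training set. With a very large training set, the model may reach an acceptable loss before every sample has been used.
The drawback is that a single sample can be noisy, so not every iteration moves toward the overall optimum.
'''
sgd_clf.fit(X_train, y_train_5)
# print(sgd_clf.predict([x.iloc[0].values]))  # each image passed in must be a flattened 1D vector (wrapped in a list to form one row)

'''
Implementing cross-validation
StratifiedKFold uses stratified splitting (the idea of stratified random sampling): the class proportions in each validation fold match those of the original sample,
so StratifiedKFold needs the labels passed in when it splits.
'''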
'''
Hand-written cross-validation (disabled; remove the surrounding triple quotes to run it)

# The split needs both the data and the labels
skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    print(train_index, test_index)
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train_5.iloc[train_index]  # positional indexing, consistent with X_train.iloc
    X_test_fold = X_train.iloc[test_index]
    y_test_fold = y_train_5.iloc[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9669  0.91625 0.96785

'''

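# Cross-validation with the library helper: returns the score of each fold
arry = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# print(arry)


# A confusion matrix needs predictions first. Use cross-validated predictions, so each prediction comes from a model that never saw that sample during training
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)  # returns the prediction made for each sample while it was in the held-out fold

crossarry = confusion_matrix(y_train_5, y_train_pred)
# Recall is defined relative to the actual samples: the share of actual positives that were predicted correctly. Precision is defined relative to the predictions: the share of samples predicted positive that are truly positive.
recall = crossarry[1][1] / (crossarry[1][1] + crossarry[1][0])
print('recall', recall)
# y_5 = y_train_5
# # Confusion matrix rows are actual classes (row 0 negative, row 1 positive); columns are predicted classes (column 0 negative, column 1 positive)
# print(confusion_matrix(y_train_5, y_5))
# The same metrics with the library helpers
print(precision_score(y_train_5, y_train_pred))
print(recall_score(y_train_5, y_train_pred))
print(f1_score(y_train_5, y_train_pred))  # F1 is high only when both recall and precision are high

y_scores = sgd_clf.decision_function([x.iloc[0].values])  # decision score behind the prediction: how the model rates the input sample
print('y_scores', y_scores)
threshold = 8000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)

# To choose a suitable threshold for this classifier, get the decision score of every training instance
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
print('scores', y_scores, 'y5', y_train_5)  # scores: decision scores; y5: whether each label is a 5 (True/False)
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)  # precision and recall for every possible threshold
print(precisions, recalls, thresholds)
# plot_precision_recall_vs_threshold(precisions, recalls, thresholds)  # precision and recall versus threshold
# plt.show()
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]  # np.argmax returns the index of the first maximum; on this 1D boolean array that is the first True, i.e. the lowest threshold with precision >= 0.90
print(threshold_90_precision)
y_train_pred_90 = (y_scores >= threshold_90_precision)
print(precision_score(y_train_5, y_train_pred_90), recall_score(y_train_5, y_train_pred_90))  # 0.9000345901072293 0.4799852425751706

'''
ROC curve
1. Plot: the dashed line is the ROC curve of a purely random classifier; a good classifier stays as far from that line as possible (toward the top-left corner).
2. One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, while a purely random classifier has a ROC AUC of 0.5.
'''
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
# plot_roc_curve(fpr, tpr)
# plt.show()
print(roc_auc_score(y_train_5, y_scores))

####### How to choose between the ROC curve and the PR curve
'''
Choose the PR curve when the positive class is rare, or when you care more about false positives than about false negatives. Otherwise choose the ROC curve.
'''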


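# Evaluate a random forest with the same PR / ROC tools


'''
1. predict_proba: the model predicts the probability that the input sample belongs to each class. The probabilities sum to 1, and each position corresponds to the class label at the same position in classes_.
   For example, for a classifier whose class labels are [2 4 6 8], the four probabilities are listed in that order.
2. predict: the model predicts the class the input sample belongs to.
3. decision_function: the returned value expresses the model's confidence in the sample. A value greater than 0 means the positive class is considered more likely than the negative class; otherwise it is considered less likely.

'''
forest_clf = RandomForestClassifier(random_state=42)  # random forest
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
print(y_probas_forest)  # for each image, the probability of the negative class and of the positive class
y_scores_forest = y_probas_forest[:, 1]  # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
print(roc_auc_score(y_train_5, y_scores_forest))  # 0.9983436731328145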
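
The rule of thumb in the comments about choosing between the PR curve and the ROC curve is easier to accept with a small calculation. The numbers below are entirely hypothetical: a dataset with 10 positives and 990 negatives, and a classifier that at some threshold finds 8 of the positives while raising 40 false alarms.

```python
positives, negatives = 10, 990  # hypothetical, heavily imbalanced dataset
true_pos, false_pos = 8, 40     # hypothetical classifier output at one threshold

tpr = true_pos / positives                     # recall: shared by the ROC and PR curves
fpr = false_pos / negatives                    # the other ROC axis
precision = true_pos / (true_pos + false_pos)  # the other PR axis

print(f"TPR={tpr:.3f} FPR={fpr:.3f} precision={precision:.3f}")
```

On the ROC curve this point looks excellent, since the true positive rate is 0.8 at a false positive rate of about 0.04. On the PR curve the same point shows a precision of only about 0.17, meaning five out of six alarms are wrong. When positives are rare, the large pool of negatives keeps the false positive rate small, so the PR curve is the more honest picture.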
