Hands-On ML Study Notes (Building a Classifier with sklearn)

Chen_shu_bct

Tags: sklearn, python, machine learning

Article link: https://blog.csdn.net/Chen_shu_bct/article/details/120676452

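Building a Classifier with sklearn

First, we need to get a dataset.

The sklearn.datasets module makes it easy to fetch a number of classic datasets.

See the official sklearn documentation for the available parameters.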

from sklearn.datasets import fetch_openml
# as_frame=True returns pandas objects, which the .iloc indexing below relies on
mnist = fetch_openml('mnist_784', as_frame=True)

X,y = mnist['data'],mnist['target']

X.shape,y.shape

((70000, 784), (70000,))

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
some_digit = X.iloc[36000]
some_digit_image = some_digit.values.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

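Before going any further, we should split the data into a training set and a test set.

The MNIST dataset is already split for us, so a simple slice is all we need.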

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

import numpy as np

random_index = np.random.permutation(60000)
X_train= X_train.iloc[random_index]

y_train = y_train[X_train.index]

X_train

	pixel1	pixel2	pixel3	pixel4	pixel5	pixel6	pixel7	pixel8	pixel9	pixel10	...	pixel775	pixel776	pixel777	pixel778	pixel779	pixel780	pixel781	pixel782	pixel783	pixel784
16016	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
9782	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
53731	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
52733	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
52868	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
29464	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
52821	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
8675	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
811	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
23148	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

60000 rows × 784 columns

y_train

16016    7
9782     8
53731    4
52733    8
52868    1
        ..
29464    3
52821    0
8675     8
811      3
23148    6
Name: class, Length: 60000, dtype: category
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']

Training a Binary Classifier

First, we split the labels into two classes: 5 and not 5.

y_train_5 = (y_train == '5') # True for all 5s, False for all other digits.
y_test_5 = (y_test == '5')

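Here we use an SGD classifier. SGD stands for stochastic gradient descent; it updates the model using one training instance at a time, so training is much faster on a large dataset.

To get reproducible results, use the same random_state on every run.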

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

SGDClassifier(random_state=42)

Next, let's try the trained model on the digit we plotted earlier.

sgd_clf.predict([some_digit])

array([False])

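Good, the prediction is correct: this image is not a 5.

Now let's evaluate the performance of the trained model.

Cross-Validation

Here we use cross_val_score() to cross-validate the model.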

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

array([0.97   , 0.96695, 0.9672 ])

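The accuracy varies a little between folds, but it is close to 97% in all three.

However, we need to consider the class distribution of the data. Only about 10% of the images are 5s, so a classifier that always answers "not 5" would already be about 90% accurate. Accuracy alone therefore says little about how good this model is.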
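
To make this concrete, here is a minimal sketch of such a baseline. It uses synthetic labels in which exactly 10% of the instances are positive (an assumption standing in for the share of 5s in MNIST) and a hypothetical always_negative() function that never predicts the positive class.

```python
# Synthetic labels: exactly 1 in 10 instances is positive (stand-in for "is a 5").
labels = [i % 10 == 5 for i in range(1000)]


def always_negative(n):
    """Hypothetical baseline: predict 'not 5' for every instance."""
    return [False] * n


predictions = always_negative(len(labels))

# Accuracy is the fraction of predictions that match the labels.
correct = sum(p == t for p, t in zip(predictions, labels))
baseline_accuracy = correct / len(labels)
print(baseline_accuracy)
```

sklearn's DummyClassifier with strategy="most_frequent" can play the same role inside cross_val_score() if you want to run the comparison on the real data.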

Let's look at the trained model from a different angle.

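Confusion Matrix

The concept of a confusion matrix is easy to look up, so it is not explained in detail here.

This section focuses on how to use a confusion matrix in sklearn to evaluate a model.

To build a confusion matrix, we first need a set of predictions.

We get them with cross_val_predict().

Like cross_val_score(), cross_val_predict() performs K-fold cross-validation. Instead of returning evaluation scores, it returns the prediction made for each instance while that instance was in the held-out fold.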

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

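With the predictions in hand, we can build a confusion matrix from the actual labels and the predicted labels.

We use the confusion_matrix() function, passing y_train_5 (actual) first and y_train_pred (predicted) second.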

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)

array([[54000,   579],
       [ 1338,  4083]], dtype=int64)

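54000 is the number of non-5s correctly classified as non-5s: the true negatives (TN)
579 is the number of non-5s wrongly classified as 5s: the false positives (FP)
1338 is the number of 5s wrongly classified as non-5s: the false negatives (FN)
4083 is the number of 5s correctly classified as 5s: the true positives (TP)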
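
For a binary problem, the four counts can be unpacked directly from the matrix. The sketch below uses a tiny made-up label set rather than the MNIST predictions, and the variable names tn, fp, fn and tp are only illustrative.

```python
from sklearn.metrics import confusion_matrix

# Tiny made-up example: rows of the matrix are actual classes, columns are predictions.
y_true = [False, True, False, True]
y_pred = [True, True, True, False]

# For binary labels the layout is [[TN, FP], [FN, TP]],
# so ravel() yields the counts in exactly that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```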

Precision and Recall

precision = TP/(TP+FP)
recall = TP/(TP+FN)

sklearn also provides functions to compute precision and recall.

They are precision_score() and recall_score().

They take the same arguments as confusion_matrix().

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5,y_train_pred)

0.8758043758043758

recall_score(y_train_5, y_train_pred)

0.7531820697288323

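Precision is acceptable here, but recall is rather low.

It is hard to make both high at the same time, so we introduce the F1 score, which combines the two into a single number: their harmonic mean.

F1 = 2/((1/precision)+(1/recall))

sklearn provides the corresponding function, f1_score().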

from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)

0.8098780124962809
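
As a cross-check, all three scores can be reproduced by hand from the four counts in the confusion matrix above. This is only a sketch of the arithmetic, not how sklearn computes them internally.

```python
# Counts taken from the confusion matrix shown earlier.
tn, fp, fn, tp = 54000, 579, 1338, 4083

precision = tp / (tp + fp)             # of everything predicted as 5, how much really is 5
recall = tp / (tp + fn)                # of all real 5s, how many were found
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, F1 ends up closer to the recall than to the precision.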

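This score is a useful reference. The model's F1 score is not very high, which shows that accuracy by itself can be misleading.

The F1 score is held down mainly by the low recall.

One way to change this balance is to change the decision threshold. First, let's look at the decision score of a single instance.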

y_scores = sgd_clf.decision_function([some_digit])
y_scores

array([-5011.31033273])

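The Precision/Recall Trade-off

decision_function() computes a decision score for each instance you pass in. The model then compares the score with a threshold to decide whether to predict True or False.

sgd_clf uses a threshold of 0 by default, so the score of about -5011 above puts this digit in the non-5 class.

Raising the threshold lowers recall, while lowering the threshold generally lowers precision.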
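
The trade-off is easy to see on a toy example. The sketch below uses six made-up scores and labels (not the MNIST scores) and a hypothetical helper, precision_recall_at(), that applies a threshold by hand.

```python
import numpy as np

# Made-up decision scores and true labels (True means "is a 5").
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, True, False, True, False, True])


def precision_recall_at(threshold):
    """Hypothetical helper: precision and recall when predicting True for score > threshold."""
    predicted = scores > threshold
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    return tp / (tp + fp), tp / (tp + fn)


low = precision_recall_at(0.0)   # the default threshold
high = precision_recall_at(3.0)  # a stricter threshold
print(low, high)
```

Moving the threshold from 0 to 3 removes both false positives, so precision rises, but it also drops one of the true 5s, so recall falls.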

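So how do we choose a suitable threshold?

We use cross_val_predict() again, but this time we ask it to return decision scores instead of predictions.

This gives us the decision score of every instance in the training set.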

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, 
                            method="decision_function")

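With these scores, we can compute precision and recall for any threshold.

We can also treat precision and recall as functions of the threshold and plot them.

The precision_recall_curve() function computes both for every possible threshold.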

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
    plt.xlim([-20000,20000])
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

[Figure: precision and recall plotted against the decision threshold]

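The recall curve is noticeably smoother than the precision curve. When the threshold goes up, recall can only go down or stay the same, whereas precision usually goes up but can occasionally drop.

From these curves we can pick a threshold that gives the trade-off we want and obtain a reasonably satisfactory model (although this model's numbers are far from impressive).

For example, find the point where precision is about 90%; the corresponding threshold is roughly 5000.

So setting the threshold to 5000 gives us a model with approximately 90% precision.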
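
Rather than reading the threshold off the plot, it can also be looked up programmatically. The sketch below does this on a handful of made-up scores, not the MNIST scores; target_precision and the other variable names are illustrative. It relies on precision_recall_curve() returning thresholds in increasing order and on np.argmax() returning the first index where the condition holds.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and decision scores standing in for y_train_5 and y_scores.
y_true = np.array([False, True, False, True, False, True])
y_scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

target_precision = 0.90
# precisions has one more entry than thresholds, so drop the last one before searching.
idx = np.argmax(precisions[:-1] >= target_precision)
threshold = thresholds[idx]

# Predict True only for scores at or above the chosen threshold.
y_pred = y_scores >= threshold
print(threshold, y_pred.sum())
```

One caveat: if no threshold reaches the target precision, np.argmax() silently returns index 0, so it is worth checking precisions[idx] before trusting the result.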

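Another way to choose a threshold is to plot precision directly against recall and look for the turning point on the curve where the area it encloses is largest.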

plt.plot(recalls,precisions)
plt.show()

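As the plot shows, we can pick the point where recall and precision are both about 0.8 for the final model.

The enclosed area is largest there.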

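The ROC Curve

The ROC curve is commonly used with binary classifiers.

It plots the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is just recall; the false positive rate is the fraction of negative instances that are wrongly classified as positive.

To plot the ROC curve, we need the TPR and FPR at every threshold.

The reasoning is this: if a small increase in the false positive rate buys a large increase in the true positive rate, the classifier is a good one. So the further the ROC curve is from the diagonal (toward the top-left corner), the better the classifier.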
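
For a single threshold, both rates follow directly from the confusion matrix. As a small worked example, here they are for the counts obtained earlier at the default threshold of 0, which corresponds to one point on the ROC curve.

```python
# Counts from the binary confusion matrix shown earlier (threshold 0).
tn, fp, fn, tp = 54000, 579, 1338, 4083

tpr = tp / (tp + fn)  # true positive rate, identical to recall
fpr = fp / (fp + tn)  # share of actual negatives flagged as positive

print(round(tpr, 4), round(fpr, 4))
```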

We use the roc_curve() function, again passing the labels and the decision scores.

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Then we simply plot the curve with plt.

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
plot_roc_curve(fpr, tpr)
plt.show()

[Figure: ROC curve of the SGD classifier, with the diagonal drawn as a dashed line]

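Next, let's run a different kind of algorithm (a random forest here) for comparison.

RandomForestClassifier has no decision_function() method, but it does provide predict_proba(), which returns the probability of each class. We can use the probability of the positive class as the score.

All we need to do is set the method parameter of cross_val_predict() to "predict_proba".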

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

y_probas_forest

array([[1.  , 0.  ],
       [0.94, 0.06],
       [0.97, 0.03],
       ...,
       [0.87, 0.13],
       [0.95, 0.05],
       [1.  , 0.  ]])

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend()
plt.show()

The plot clearly shows that the random forest performs better than the SGD classifier.
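
A single number often used to back up such a visual comparison is the area under the ROC curve (AUC): 1.0 for a perfect classifier, 0.5 for random guessing. The sketch below calls sklearn's roc_auc_score() on made-up scores for two hypothetical models; on the real data, the same call would take y_train_5 together with y_scores or y_scores_forest.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for two hypothetical models.
y_true = [0, 0, 1, 1]
scores_a = [0.1, 0.4, 0.35, 0.8]  # ranks one negative above one positive
scores_b = [0.1, 0.2, 0.7, 0.8]   # ranks every positive above every negative

auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)
print(auc_a, auc_b)
```

The AUC equals the fraction of (positive, negative) pairs that the model ranks in the right order, which is why the second model, with no misordered pair, gets the higher value.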

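Multiclass Classification

Multiclass classification is, for the most part, built on top of binary classification.

Take the digit classifier as an example.

The first approach is OvA (one-versus-all, also called one-versus-rest): train one binary classifier per class, so one decides 0 or not 0, another decides 1 or not 1, and so on, for 10 binary classifiers in total.
To classify an image, run it through all ten classifiers and pick the class whose classifier outputs the highest decision score.

The second approach is OvO (one-versus-one): train one binary classifier for every pair of classes, so one distinguishes 0 from 1, another 0 from 2, another 0 from 3, and so on. With n classes this takes n(n-1)/2 binary classifiers, which is 45 for the 10 digits. That is a lot of classifiers, but each one only needs to be trained on the part of the dataset that contains its two classes.

When you use a binary classification algorithm for a multiclass task, sklearn detects this and automatically runs OvA (except for SVM classifiers, which use OvO).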
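
To see how many binary problems each strategy creates for the ten digits, the sketch below simply enumerates them. It only counts tasks and trains nothing; the names ova_tasks and ovo_tasks are illustrative.

```python
from itertools import combinations

classes = [str(d) for d in range(10)]  # the ten digit labels

# OvA: one "this class vs. everything else" problem per class.
ova_tasks = [(c, "rest") for c in classes]

# OvO: one problem per unordered pair of classes.
ovo_tasks = list(combinations(classes, 2))

n = len(classes)
print(len(ova_tasks), len(ovo_tasks), n * (n - 1) // 2)
```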

sgd_clf.fit(X_train, y_train) # y_train, not y_train_5
sgd_clf.predict([some_digit])

array(['9'], dtype='<U1')

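This result comes from running each of the binary classifiers and picking the class with the highest decision score.

The result is 9, so the prediction is correct.

We can use decision_function() again to get the decision scores, this time one per class.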

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores

array([[-33182.02870179, -25247.26747169, -23244.16894156,
         -1954.00257716,  -4535.15002173,  -8577.10934213,
        -21427.92781335, -16451.52619852,  -7230.20381898,
         -1069.86641745]])

The score for class 9 is the highest.
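
Picking the winner from these scores is just an argmax. The sketch below repeats the scores printed above and assumes the classes are ordered '0' to '9', which is how sgd_clf.classes_ would list them for these labels.

```python
# Decision scores copied from the output above, one per class.
scores = [-33182.02870179, -25247.26747169, -23244.16894156,
          -1954.00257716, -4535.15002173, -8577.10934213,
          -21427.92781335, -16451.52619852, -7230.20381898,
          -1069.86641745]
classes = [str(d) for d in range(10)]  # assumed to match the order of sgd_clf.classes_

# The predicted class is the one whose score is largest (here, the least negative).
best_index = max(range(len(scores)), key=lambda i: scores[i])
predicted = classes[best_index]
print(predicted)
```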

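To force Scikit-Learn to use the OvO or the OvA strategy, use the OneVsOneClassifier or OneVsRestClassifier class.

Just create an instance, pass the binary classifier you want to its constructor, and train it as usual.

Here is an example using SGDClassifier.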

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
len(ovo_clf.estimators_)

ovo_clf.predict([some_digit])

array(['4'], dtype=object)

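A random forest can just as easily be used as a multiclass classifier.

Because of the way random forests work, they can classify instances into multiple classes directly, so there is no need for OvO or OvA; we just train it.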

forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])

array(['9'], dtype=object)

To evaluate these classifiers, we can again use cross-validation, here with cross_val_score().

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.8693 , 0.853  , 0.85235])

This is a decent accuracy. Each digit makes up roughly 10% of the dataset, so random guessing would score only about 10%.

We can still do better simply by scaling the inputs.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

array([0.89795, 0.90665, 0.8992 ])

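Scaling the inputs improved the accuracy. Running this cell also prints a warning (not reproduced here). It is most likely sklearn's ConvergenceWarning, which means that SGD hit the maximum number of iterations before converging; if so, the fix is to raise max_iter rather than to lower it.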
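
StandardScaler standardizes each feature to zero mean and unit variance. A minimal sketch on a made-up matrix (not the MNIST pixels) shows that its output matches the formula (x - mean) / std computed by hand with numpy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: two features on very different scales.
X = np.array([[0.0, 100.0],
              [2.0, 300.0],
              [4.0, 500.0]])

# Subtract each column's mean, then divide by its (population) standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

After scaling, both columns contain the same values, so neither feature dominates the gradient updates merely because of its units.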

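Error Analysis

We can analyze the errors of a multiclass model with the same tool as before: the confusion matrix.

As before, we first get the predictions and then call confusion_matrix().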

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

array([[5590,    0,   17,    7,    7,   41,   33,    7,  220,    1],
       [   0, 6424,   42,   21,    3,   44,    5,   11,  181,   11],
       [  30,   27, 5270,   84,   73,   21,   61,   37,  345,   10],
       [  27,   18,  113, 5251,    1,  198,   24,   43,  391,   65],
       [   8,   13,   41,   13, 5228,   11,   40,   17,  296,  175],
       [  29,   19,   29,  162,   48, 4471,   77,   19,  506,   61],
       [  28,   17,   43,    2,   42,   91, 5555,    5,  135,    0],
       [  18,   14,   54,   28,   46,   12,    5, 5725,  162,  201],
       [  16,   67,   39,   91,    3,  123,   32,    9, 5426,   45],
       [  19,   20,   32,   66,  126,   37,    1,  184,  328, 5136]],
      dtype=int64)

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

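This confusion matrix looks quite good, since most images fall on the main diagonal, which means they were classified correctly.

We can also see that some squares on the diagonal are slightly darker (5, 8 and 9, for example), which suggests more errors for these digits (or simply fewer images of them).

Another way to look at the distribution of errors is to divide each value in the confusion matrix by the number of images in the corresponding actual class, which gives the rate at which each digit is mistaken for every other digit.

We then fill the diagonal with zeros (black) so that only the errors stand out in the plot.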

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

import matplotlib.pyplot as plt
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

[Figure: row-normalized confusion matrix with the diagonal set to zero]

The plot makes it clear that 5s are often misclassified as 8s, so it may be worth targeting the 5-versus-8 distinction for improvement.
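
The same conclusion can be checked numerically instead of by eye. The sketch below copies the confusion matrix printed above, normalizes each row, zeroes the diagonal, and looks up the largest remaining entry.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual digit, columns: predicted digit).
conf_mx = np.array([
    [5590,    0,   17,    7,    7,   41,   33,    7,  220,    1],
    [   0, 6424,   42,   21,    3,   44,    5,   11,  181,   11],
    [  30,   27, 5270,   84,   73,   21,   61,   37,  345,   10],
    [  27,   18,  113, 5251,    1,  198,   24,   43,  391,   65],
    [   8,   13,   41,   13, 5228,   11,   40,   17,  296,  175],
    [  29,   19,   29,  162,   48, 4471,   77,   19,  506,   61],
    [  28,   17,   43,    2,   42,   91, 5555,    5,  135,    0],
    [  18,   14,   54,   28,   46,   12,    5, 5725,  162,  201],
    [  16,   67,   39,   91,    3,  123,   32,    9, 5426,   45],
    [  19,   20,   32,   66,  126,   37,    1,  184,  328, 5136],
])

# Turn counts into per-class rates, then blank out the correct predictions.
norm_conf_mx = conf_mx / conf_mx.sum(axis=1, keepdims=True)
np.fill_diagonal(norm_conf_mx, 0)

# Locate the largest remaining entry: the most frequent confusion.
actual, predicted = np.unravel_index(np.argmax(norm_conf_mx), norm_conf_mx.shape)
print(actual, predicted, round(norm_conf_mx[actual, predicted], 3))
```

The largest off-diagonal rate sits in row 5, column 8: a little over 9% of the actual 5s are predicted as 8s.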
