Chapter 3: MNIST Classification

Main workflow: MNIST

Load the data: mnist = fetch_openml('mnist_784', version=1)
Reshape the first instance's feature vector into an image: some_digit_image = some_digit.values.reshape(28, 28)
Split into training and test sets: X_train / X_test / y_train / y_test

Binary Classification

Build boolean target/non-target vectors: y_train_5 = (y_train == 5), y_test_5 = (y_test == 5)
Train an SGD classifier: instantiate it (with a random seed), fit it on the data, and predict True or False
Evaluating the classifier: cross-validation
StratifiedKFold(n_splits, random_state); split() samples D (X_train) in a stratified way, preserving the True/False ratio of y_train_5
Measure fold accuracy as sum(y_pred == y_test_fold) / len(y_pred), or use cross_val_score to evaluate accuracy
But with skewed classes (few positives), even a classifier that always guesses False gets high accuracy, so accuracy alone cannot evaluate a classifier
Confusion matrix (rows are actual classes, columns are predictions; each cell counts how often one class is confused with another)
Get clean out-of-fold predictions: y_train_pred = cross_val_predict(classifier, D, boolean vector, cv=3)
Confusion matrix of actual f vs. predicted h: confusion_matrix(y_train_5, y_train_pred)
Compute precision, recall, and their harmonic mean (F1): precision_score(f, pred), recall_score(f, pred), f1_score(f, pred)
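
A minimal sketch of these evaluation calls, assuming sgd_clf, X_train and y_train_5 are defined as in the code listing at the end of these notes:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Out-of-fold predictions: every instance is predicted by a model that never saw it in training
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
# Rows are actual classes, columns are predicted classes
confusion_matrix(y_train_5, y_train_pred)
# Precision, recall, and their harmonic mean (F1)
precision_score(y_train_5, y_train_pred)
recall_score(y_train_5, y_train_pred)
f1_score(y_train_5, y_train_pred)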
Adjusting the decision threshold for the positive class: raising the threshold turns some false positives into true negatives, so precision rises, while some true positives become false negatives, so recall falls
Choosing the threshold: get decision scores for every instance with y_scores = cross_val_predict(sgd_clf, D, f, cv=3, method='decision_function')
Plot precision and recall as functions of the threshold: precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
Plot precision against recall (the PR curve)
Find the threshold for 90% precision: threshold_90_precision = thresholds[np.argmax(precisions >= 0.9)]
ROC curve: plot TPR against FPR with fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
Check the ROC AUC (the closer to 1, the better): roc_auc_score(y_train_5, y_scores)
Rule of thumb: if you care more about false positives (precision), prefer the PR curve; if you care more about false negatives (recall), prefer the ROC curve
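
A sketch of the threshold, PR-curve and ROC steps above, under the same assumption that sgd_clf, X_train and y_train_5 exist; the plotting calls are only one possible layout:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score
import numpy as np
import matplotlib.pyplot as plt

# Decision scores for every training instance instead of hard True/False predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method='decision_function')
# Precision and recall for every candidate threshold
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
plt.plot(thresholds, precisions[:-1], label='precision')
plt.plot(thresholds, recalls[:-1], label='recall')
plt.legend()
plt.show()
# Lowest threshold that reaches at least 90% precision
threshold_90_precision = thresholds[np.argmax(precisions >= 0.9)]
# ROC curve (TPR against FPR) and its area under the curve
fpr, tpr, roc_thresholds = roc_curve(y_train_5, y_scores)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')  # diagonal = purely random classifier
plt.show()
roc_auc_score(y_train_5, y_scores)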
Comparing classifiers: RandomForestClassifier
Create an instance and get the probability matrix: y_probas_forest = cross_val_predict(forest_clf, D, f, cv=3, method='predict_proba')
Take the positive-class scores and plot: y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
Check the ROC AUC: roc_auc_score(f, y_scores_forest)
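
A sketch of the comparison, assuming X_train and y_train_5 from the listing at the end; forest_clf is instantiated here with random_state=42 only for reproducibility:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

forest_clf = RandomForestClassifier(random_state=42)
# RandomForestClassifier has no decision_function, so use class probabilities as scores
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method='predict_proba')
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
roc_auc_score(y_train_5, y_scores_forest)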

Multiclass Classification

When a binary classifier is used for multiclass classification, Scikit-Learn automatically applies OvR or OvO
SVM classifier:
The strategy can be forced: ovr_clf = OneVsRestClassifier(SVC())
Instantiate, fit on the data (D, label vector f), then predict from the features: svm_clf.predict([some_digit])
Get the decision scores: some_digit_scores = svm_clf.decision_function([some_digit]); classes_ gives the original class labels
Performance improvement: feature scaling, hyperparameter tuning (see the sketch below)
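
A sketch of the multiclass step, assuming X_train, y_train and some_digit from the listing at the end; the subset hint in the comment is only a speed suggestion, not part of the notes:

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Fit an SVM on the full 10-class labels; Scikit-Learn picks the multiclass strategy automatically
# (this is slow on all 60,000 images; a subset such as the first few thousand keeps it quick)
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
svm_clf.predict([some_digit])
some_digit_scores = svm_clf.decision_function([some_digit])  # one score per class
svm_clf.classes_  # original class labels, ordered to match the scores

# Force a one-versus-the-rest strategy instead of the default
ovr_clf = OneVsRestClassifier(SVC())
ovr_clf.fit(X_train, y_train)
ovr_clf.predict([some_digit])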

Error Analysis

Multiclass predictions: y_train_pred = cross_val_predict(classifier, D, label vector f, cv=3)
Confusion matrix: conf_mx = confusion_matrix(y_train, y_train_pred)
View the matrix directly with matshow; darker cells indicate weaker performance
View the error-rate matrix with matshow: conf_mx / conf_mx.sum(axis=1, keepdims=True)
Analyze the errors and improve: gather more data, engineer new features that separate the confused classes, or preprocess the images to make patterns stand out; plot example instances to see where the problems are
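
A sketch of the error-analysis pass, again assuming sgd_clf, X_train and y_train; zeroing the diagonal is an extra step (not in the notes above) that makes the off-diagonal errors stand out:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

# Out-of-fold predictions on the full 10-class problem
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
plt.matshow(conf_mx, cmap=plt.cm.gray)   # raw counts: a dark diagonal cell marks a weak class
plt.show()

# Divide each row by the class size to compare error rates, then zero the diagonal
# so that only the errors remain visible
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()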

Multilabel Classification

Multilabel boolean array: y_multilabel = np.c_[f1, f2] (two boolean target vectors)
KNeighborsClassifier().fit(D, y_multilabel)
KNeighborsClassifier().predict([some_digit])
Computing the F1 score:
Get predictions: pred = cross_val_predict(classifier, D, y_multilabel, cv=3)
Compute the macro-averaged F1: f1_score(y_multilabel, pred, average='macro')
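
A sketch of the multilabel setup, assuming X_train, y_train and some_digit; the two labels chosen here (digit >= 7, digit is odd) are just one illustrative pair for f1 and f2:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# Two boolean labels per image: "large digit" (7, 8 or 9) and "odd digit"
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])

# Macro-averaged F1 over both labels (cross_val_predict with KNN on 60,000 images is slow)
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average='macro')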

Multioutput Classification

Add noise to every pixel of D, use the clean data as the labels y, and train the classifier: KNeighborsClassifier().fit(D, y)
Predict from the noisy test set: KNeighborsClassifier().predict([X_test_mod[some_index]])
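
A sketch of the multioutput (denoising) setup, assuming X_train and X_test as DataFrames from the listing at the end; some_index = 0 is an arbitrary choice:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Add random pixel noise to the inputs; the clean images themselves become the labels
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)

# "Classify" one noisy test image: the prediction is a full 784-pixel clean image
some_index = 0
clean_digit = knn_clf.predict([X_test_mod.iloc[some_index]])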

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
X, y = mnist['data'], mnist['target']
X.shape
(70000, 784)
y.shape
(70000,)
some_digit = X.iloc[0]
some_digit_image = some_digit.values.reshape(28, 28)
plt.imshow(some_digit_image, cmap='binary')
plt.axis('off')
plt.show()

y[0]
'5'
y = y.astype(np.uint8)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
SGDClassifier(random_state=42)
sgd_clf.predict([some_digit])
array([ True])
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train_5.iloc[train_index]
    X_test_fold = X_train.iloc[test_index]
    y_test_fold = y_train_5.iloc[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
0.9669
0.91625
0.96785
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')