Chapter 1 introduced the two most common supervised learning tasks: regression and classification. This chapter tackles classification.
MNIST
MNIST is the "Hello World!" of machine learning datasets, and it can be fetched directly through Scikit-Learn. A bit of setup first:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
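Later figure-saving calls use a save_fig() helper; the book's notebook defines one roughly like this (a minimal sketch, assuming its images/<chapter> folder layout):

```python
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # Save the current matplotlib figure as images/classification/<fig_id>.png
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
```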
A helper that sorts the train and test halves by target (fetch_openml returns MNIST unsorted):
def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]
Fetch the data:
try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True)
    mnist.target = mnist.target.astype(np.int8)  # fetch_openml() returns targets as strings
    sort_by_target(mnist)  # fetch_openml() returns an unsorted dataset
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
mnist["data"], mnist["target"]
Datasets loaded by Scikit-Learn usually have a dictionary-like structure:
- DESCR: a description of the dataset
- data: one row per instance, one column per feature
- target: the array of labels

Assign them: X, y = mnist["data"], mnist["target"]
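A quick shape check confirms what we have: 70,000 images, each with 28 × 28 = 784 pixel features.

```python
X.shape  # (70000, 784): one row per image, one column per pixel
y.shape  # (70000,)
```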
Take a look at one of the digits:
some_digit = X[36000]
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap = mpl.cm.binary,interpolation="nearest")
plt.axis("off")
plt.show()
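The same steps, wrapped as a plot_digit() helper that later sections reuse (this matches the book's notebook):

```python
def plot_digit(data):
    # Reshape a flat 784-pixel vector back into a 28x28 image and draw it
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=mpl.cm.binary, interpolation="nearest")
    plt.axis("off")
```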
Check its label: y[36000]
Split the sets: X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Shuffle the training set, so that all cross-validation folds get a similar mix of digits:
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
Training a Binary Classifier
Create the target vectors:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
Pick a classifier and train it. Here we use Scikit-Learn's SGDClassifier (stochastic gradient descent classifier). SGD handles training instances independently, one at a time, which makes it well suited to online learning. Create and train it:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=5, tol=-np.infty, random_state=42)
# max_iter=5 and tol=-np.infty reproduce the book's settings; random_state=42 makes the results reproducible
sgd_clf.fit(X_train,y_train_5)
Make a prediction: sgd_clf.predict([some_digit]), and it guesses right. Now let's look at performance.
Performance Measures
Measuring Accuracy Using Cross-Validation
(You can also implement cross-validation yourself; the book walks through it.)
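Roughly, the do-it-yourself version looks like this, a sketch of the book's approach using stratified folds and clone():

```python
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

skfolds = StratifiedKFold(n_splits=3)  # stratified folds preserve the class ratio

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)  # fresh, untrained copy of the model for each fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    print(sum(y_pred == y_test_fold) / len(y_pred))  # accuracy on this fold
```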
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")  # 3 folds
Accuracy above 95%! But is it really that good? Consider a dumb classifier that marks every single image as "not 5":
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
Its accuracy:
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Over 90%, simply because only about 10% of the images are 5s. The lesson: accuracy is generally not the preferred performance measure for classifiers, especially on skewed datasets (where some classes are much more frequent than others).
Confusion Matrix
A confusion matrix counts how many times instances of class A were classified as class B: look at row A, column B. Building it requires predictions to compare with the actual targets. Rather than touching the test set, use cross_val_predict(), which performs K-fold cross-validation and returns the predictions made on each fold:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
Then confusion_matrix() produces the matrix; just pass it the target classes and the predicted classes:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
In the result, rows are actual classes and columns are predicted classes:

| | Predicted non-5 | Predicted 5 |
|---|---|---|
| Actual non-5 | true negatives (TN) | false positives (FP) |
| Actual 5 | false negatives (FN) | true positives (TP) |
Precision, the accuracy of the positive predictions: precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.
Recall, also called sensitivity or the true positive rate (TPR): recall = TP / (TP + FN), where FN is the number of false negatives.
Precision and Recall
Scikit-Learn has functions for both:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)
recall_score(y_train_5, y_train_pred)
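As a sanity check, both numbers can be recomputed straight from the confusion matrix:

```python
# For the binary case, ravel() flattens the matrix to TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_train_5, y_train_pred).ravel()
print(tp / (tp + fp))  # same value as precision_score
print(tp / (tp + fn))  # same value as recall_score
```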
The F1 score is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). It is only high when both are high.
Call it with:
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
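Computing the harmonic mean by hand gives the same number:

```python
p = precision_score(y_train_5, y_train_pred)
r = recall_score(y_train_5, y_train_pred)
print(2 * p * r / (p + r))  # matches f1_score(y_train_5, y_train_pred)
```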
Precision/Recall Tradeoff
The tradeoff cuts two ways:
- High precision / low recall: what gets flagged is almost always what I want, but some of what I want inevitably gets flagged as unwanted.
- High recall / low precision: everything I want gets flagged, but a good chunk of what's flagged is stuff I didn't want.
The balance hinges on a decision threshold. You can inspect the raw decision score the classifier uses: y_scores = sgd_clf.decision_function([some_digit])
SGDClassifier uses a threshold of 0; based on y_scores you can raise it yourself:
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
At 200,000 the 5 gets missed: raising the threshold lowered recall. To pick a threshold sensibly, first use cross_val_predict() to get the decision scores of every instance in the training set:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
method="decision_function")
With the scores in hand, precision_recall_curve() computes precision and recall for all possible thresholds:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
Plot them:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])
plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])
plt.show()
The precision curve is bumpier than the recall curve because precision can sometimes drop when you raise the threshold, whereas recall can only ever fall. Another way to find a good tradeoff is to plot precision directly against recall:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.show()
Aim for a point just before precision drops off sharply. If you need 90% precision, the first plot puts the threshold at roughly 70,000:
y_train_pred_90 = (y_scores > 70000)
precision_score(y_train_5, y_train_pred_90)
recall_score(y_train_5, y_train_pred_90)
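Rather than eyeballing the plot, you can read the lowest threshold that reaches 90% precision directly off the arrays returned by precision_recall_curve():

```python
# Index of the first threshold whose precision crosses 0.90
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90_precision)
```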
And there it is: a 90%-precision classifier, paid for with recall.
The ROC Curve
The ROC (receiver operating characteristic) curve is another common tool for binary classifiers. It plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR). FPR = 1 − TNR, where TNR (the true negative rate, or specificity) is the ratio of negative instances correctly classified as negative; so the ROC curve is sensitivity versus 1 − specificity.
First compute TPR and FPR for various thresholds with roc_curve():
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
Plot it:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()
The higher the TPR (recall), the more false positives (FPR) the classifier produces. The dashed diagonal represents a purely random classifier; a good classifier keeps its curve as far toward the top-left corner as possible. One way to compare classifiers is to measure the area under the curve (AUC): a perfect classifier scores 1, a purely random one 0.5. Scikit-Learn computes it:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
So, ROC curve or precision/recall curve? Prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives; otherwise use the ROC curve.
Let's train a RandomForestClassifier and compare its ROC curve and ROC AUC score with the SGDClassifier's. RandomForestClassifier has no decision_function(); instead it has predict_proba(), which returns an array with one row per instance and one column per class, each holding the probability that the instance belongs to that class:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=10,random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
method="predict_proba")
Since there are no decision scores, use the positive class's probability as the score:
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)
Plot it alongside the SGD curve:
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
plt.show()
This curve looks much better than the SGDClassifier's (closer to the top-left corner), and its ROC AUC score is higher too:
roc_auc_score(y_train_5, y_scores_forest)
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)
recall_score(y_train_5, y_train_pred_forest)
Its precision and recall turn out to be pretty good as well.
Multiclass Classification
Multiclass (multinomial) classifiers can distinguish between more than two classes. Some algorithms (random forests, naive Bayes classifiers) handle multiple classes directly; others (support vector machines, linear classifiers) are strictly binary. But there are strategies for doing multiclass classification with binary classifiers:
- OvA (one-versus-all): train one binary classifier per class; to classify an image, get each classifier's decision score and pick the class with the highest one.
- OvO (one-versus-one): train a binary classifier for every pair of classes, N(N−1)/2 in total.

Some algorithms (notably SVMs) scale poorly with the size of the training set, so OvO, where each classifier only sees the two classes it must distinguish, suits them better; for most other algorithms OvA is preferred. Scikit-Learn applies OvA automatically (OvO for SVMs). Try it with the SGDClassifier:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and picked the class with the highest score. decision_function() shows all 10 scores:
some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
np.argmax(some_digit_scores)
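The winning index maps back to an actual label through the classifier's classes_ attribute:

```python
sgd_clf.classes_[np.argmax(some_digit_scores)]  # the predicted class
```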
To force OvA or OvO, use the OneVsRestClassifier or OneVsOneClassifier classes. An OvO example:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, tol=-np.infty, random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
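Forcing OvA works the same way; a sketch with the same hyperparameters, for symmetry:

```python
from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SGDClassifier(max_iter=5, tol=-np.infty, random_state=42))
ovr_clf.fit(X_train, y_train)
len(ovr_clf.estimators_)  # 10: one binary classifier per class
```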
Back to the OvO model: len(ovo_clf.estimators_) confirms it, 45 classifiers, one per pair of classes (10 × 9 / 2). Training a RandomForestClassifier is just as easy:
forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])
Random forests classify into multiple classes natively; calling predict_proba() makes that clear: forest_clf.predict_proba([some_digit])
Check the accuracy with cross-validation: cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
Simply scaling the inputs lifts the accuracy further:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
Error Analysis
Following the normal workflow, you would now fine-tune hyperparameters with a grid search to get a good model. We'll skip that here and look directly at one way to improve what we have: error analysis.
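For reference, the skipped step would look something like this; the parameter grid below is made up for illustration, not taken from the book:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [1e-5, 1e-4, 1e-3]}  # hypothetical SGD regularization values
grid_search = GridSearchCV(sgd_clf, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_train_scaled, y_train)
grid_search.best_params_
```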
Start with the confusion matrix: get predictions with cross_val_predict(), then call confusion_matrix():
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
Matplotlib's matshow() turns the confusion matrix into an image:
def plot_confusion_matrix(matrix):
    """If you prefer color and a colorbar"""
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
Now put the focus on the errors: divide each value in the confusion matrix by the number of images in the corresponding actual class, so you compare error rates rather than absolute counts:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
# fill the diagonal with zeros to keep only the errors
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
The 8 and 9 rows and columns look messy, and 3s and 5s get confused with each other. Let's inspect individual 3s and 5s:
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=mpl.cm.binary, **options)
    plt.axis("off")
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()
Some of these 3s and 5s are scrawled so badly that even a human can't tell them apart, never mind the classifier. It doesn't help that SGDClassifier is a linear model: it assigns one weight per pixel, and since 3s and 5s differ in only a few pixels, a slight shift or rotation is enough to flip the prediction.
Multilabel Classification
Take face recognition with three known people, say Xiaohong, Xiaoming, and Xiaoqiang: a photo showing the first two should yield [1, 1, 0], several labels on at once rather than a single class. A small example:
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
The first label flags large digits (7 or above), the second flags odd ones. Train a KNeighborsClassifier on both and predict: knn_clf.predict([some_digit]), and both labels come out right.
How you evaluate a multilabel classifier depends on the project. One option is the F1 score averaged across all labels, which assumes every label is equally important:
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
average="macro" weights all labels equally; if each label should instead count in proportion to its support (the number of instances with that label), use average="weighted".
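The weighted variant is a one-argument change:

```python
# Weight each label's F1 score by its support instead of equally
f1_score(y_multilabel, y_train_knn_pred, average="weighted")
```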
Multioutput Classification
The last variety is multioutput-multiclass (or just multioutput) classification: each label can itself be multiclass.
Example: a system that removes noise from images. It takes a noisy digit image and outputs a clean one as an array of pixel intensities (0 to 255), MNIST-style. The output has one label per pixel, and each label can take many values, which is what makes it multioutput.
Create the training and test sets by adding noise with NumPy's randint(); the targets are the original, clean images:
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
Look at a sample. Yikes:
some_index = 5500
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
save_fig("noisy_digit_example_plot")
plt.show()
And clean it up:
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)
save_fig("cleaned_digit_example_plot")
That's about it for Chapter 3. The Titanic and spam-filter exercises are still pending; I'll write those up when I find the time.