Demo Debugging
Implementing cross-validation. Reference:
https://github.com/apachecn/hands-on-ml-2e-zh/blob/master/docs/3.md
StratifiedKFold reference: https://blog.csdn.net/weixin_44110891/article/details/95240937
StratifiedKFold is used like KFold, but it performs stratified sampling, ensuring that the class proportions in each training and validation fold match those of the original dataset. For this reason StratifiedKFold is generally preferred.
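A minimal sketch of what stratification means, on toy data (the labels here are an assumed example, not from the notes): with an 80/20 class mix, every fold produced by StratifiedKFold keeps that same 4:1 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy dataset: 8 samples of class 0, 2 of class 1 (80% / 20%).
y = np.array([0] * 8 + [1] * 2)
X = np.arange(10).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    # Each training fold keeps the original 4:1 class ratio.
    print(np.bincount(y[train_idx]))  # [4 1]
```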
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X.shape  # (70000, 784)
y.shape  # (70000,)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# The StratifiedKFold class performs stratified sampling (explained in Chapter 2),
# producing folds that contain a representative ratio of each class.
# On each iteration, the code below creates a clone of the classifier, trains the
# clone on the training folds, and makes predictions on the test fold. It then
# counts the number of correct predictions and prints their ratio.
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9502, 0.96565 and 0.96495

# (For comparison) evaluate the SGDClassifier with cross_val_score(),
# using K-fold cross-validation with k=3:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# array([0.9502, 0.96565, 0.96495])
With 3-fold cross-validation, each training split gets 2/3 of the data. (The book's era of scikit-learn defaulted to 3 folds; since scikit-learn 0.22 the default cv is 5, so cv=3 must be passed explicitly to reproduce these numbers.)
Error
cannot import name 'fetch_mldata' from 'sklearn.datasets'
Reference: https://blog.csdn.net/qq_34769162/article/details/107961825
Different scikit-learn versions require different imports:
from sklearn.datasets import fetch_openml

dataset = fetch_openml("mnist_784")
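One detail worth noting (an observation from the scikit-learn docs, not from the original notes): fetch_mldata was removed in scikit-learn 0.22, and its replacement fetch_openml may return a pandas DataFrame in recent versions, which is what later makes `X_train[train_index]` fail. Passing `as_frame=False` restores NumPy-array behavior; also, the labels come back as strings.

```python
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays, like the old fetch_mldata did; without
# it, recent scikit-learn versions may return a pandas DataFrame (the cause of
# the indexing error further down in these notes).
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist["data"], mnist["target"]  # X.shape == (70000, 784)
# Note: y contains strings here, so the 5-detector target must be built as:
# y_train_5 = (y[:60000] == '5')
```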
Error
The number of classes has to be greater than one; got 1 class
    sgd_clf.fit(X_train, y_train_5)
Reference: https://blog.csdn.net/Asher117/article/details/107444781
Change y_train_5 to y_train.
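An alternative explanation worth considering (my assumption, not verified against the original run): fetch_openml returns the MNIST labels as strings, so `(y_train == 5)` is False for every sample and the binary target collapses to a single class. Comparing against the string `'5'` (or casting y to int) keeps the binary 5-detector task instead of switching to multiclass. A toy illustration:

```python
import pandas as pd

# Toy string labels, mimicking what fetch_openml returns for MNIST targets.
y_train = pd.Series(['5', '0', '4', '1', '9'])

print((y_train == 5).sum())    # 0 -> every comparison is False: one class
print((y_train == '5').sum())  # 1 -> string comparison: two classes again
```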
Error
Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.
    File "D:\Projects\anacondaProjects\Cao_Liver_transplantation_Complication_prediction\cross_val.py", line 25, in <module>
        skfolds = StratifiedKFold(n_splits=3, random_state=42)
Fix as the message suggests:
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
Error
None of [Int64Index([], dtype='int64', length=40000)] are in the [columns]
    X_train_folds = X_train[train_index]
Reference
https://blog.csdn.net/zhongkeyuanchongqing/article/details/120796789
`X[train_index]` is NumPy-array syntax; with a pandas DataFrame, use positional indexing with .iloc when building the folds:
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
Results:
0.87155 0.8499 0.8632
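Putting the fixes together, a minimal self-contained sketch of the corrected loop on a toy DataFrame (the data here is randomly generated for illustration, not the notes' dataset): .iloc selects rows by position, which is exactly what the KFold indices are.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Toy DataFrame standing in for the real training data.
X = pd.DataFrame(np.random.RandomState(0).randn(60, 4))
y = pd.Series([0] * 30 + [1] * 30)

sgd_clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in skfolds.split(X, y):
    clone_clf = clone(sgd_clf)
    # .iloc does positional row selection, matching the NumPy indices.
    X_tr, X_te = X.iloc[train_index], X.iloc[test_index]
    y_tr, y_te = y.iloc[train_index], y.iloc[test_index]
    clone_clf.fit(X_tr, y_tr)
    acc = (clone_clf.predict(X_te) == y_te).mean()
    print(round(acc, 3))
```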
Clone
Reference
https://blog.csdn.net/ningyanggege/article/details/82687325
clone_clf = clone(sgd_clf)
Why use clone() rather than defining the model directly: defining it directly would require creating two separate instances of the class before fitting (otherwise the second fit would overwrite the first model's learned parameters); clone() performs exactly that fresh-instance step.
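A quick sketch of what clone() actually gives you: a brand-new, unfitted object that carries the same hyperparameters, so each cross-validation fold trains an independent model.

```python
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
clone_clf = clone(sgd_clf)

# clone() copies hyperparameters but not fitted state: a new object,
# identical settings.
print(clone_clf is sgd_clf)                            # False
print(clone_clf.get_params() == sgd_clf.get_params())  # True
```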
Computing F1
Like cross_val_score(), cross_val_predict() uses K-fold cross-validation. But instead of returning an evaluation score, it returns the predictions made on each test fold.
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
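A toy illustration of that behavior (random data, assumed for the example): cross_val_predict returns exactly one out-of-fold prediction per training sample, aligned with the input order, so the result can be compared element-wise against the true labels.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.array([0] * 30 + [1] * 30)

# One prediction per sample, each made by a model that never saw that sample.
y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3)
print(y_pred.shape)  # (60,) -- same length as y
```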
Use the confusion_matrix() function to get a confusion matrix:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)
# array([[53272,  1307],
#        [ 1077,  4344]])
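How to read that matrix: each row is an actual class and each column a predicted class, so for a binary problem the layout is [[TN, FP], [FN, TP]]; here TN=53272, FP=1307, FN=1077, TP=4344. A tiny hand-checkable example:

```python
from sklearn.metrics import confusion_matrix

# 3 actual negatives (one misclassified), 2 actual positives (one missed).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[2 1]     row 0: TN=2, FP=1
#  [1 1]]    row 1: FN=1, TP=1
```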
Scikit-Learn provides functions to compute classifier metrics, including precision, recall, and F1.
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1307)
# 0.76871350203503808
recall_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1077)
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)
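F1 is the harmonic mean of precision and recall, so it can be checked by hand from the confusion-matrix counts above (TP=4344, FP=1307, FN=1077):

```python
# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = 2 * P * R / (P + R)  (the harmonic mean of P and R).
precision = 4344 / (4344 + 1307)
recall = 4344 / (4344 + 1077)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ~0.785
```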