Data Processing Notes 3: Stratified Sampling and k-Fold Cross-Validation

Debugging the demo

Implementing cross-validation by hand, following:

https://github.com/apachecn/hands-on-ml-2e-zh/blob/master/docs/3.md

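StratifiedKFold reference: https://blog.csdn.net/weixin_44110891/article/details/95240937

StratifiedKFold is used like KFold, but it samples in a stratified way: the class proportions in each training and validation split match those of the original dataset. For classification it is therefore usually the better choice.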
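A small sketch of that difference, using made-up imbalanced labels (90 negatives followed by 10 positives). The toy data and the helper positive_rates are assumptions for illustration and are not part of the original demo:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (assumed for illustration): 10% positives, sorted by class.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features do not matter for the split itself


def positive_rates(splitter):
    """Return the share of positive labels in each validation fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


kfold_rates = positive_rates(KFold(n_splits=5))
stratified_rates = positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_rates)
print("StratifiedKFold:", stratified_rates)
```

Without shuffling, plain KFold cuts the sorted data into consecutive blocks, so all positives land in the last fold, while StratifiedKFold keeps roughly 10% positives in every fold.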

from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')

X, y = mnist["data"], mnist["target"]

X.shape

(70000, 784)

y.shape

(70000,)

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

y_train_5 = (y_train == 5) # True for all 5s, False for all other digits.

y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)

sgd_clf.fit(X_train, y_train_5)

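# The StratifiedKFold class performs stratified sampling (see the explanation in Chapter 2), so each fold contains a representative ratio of every class.

# At each iteration, the code below creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. It then counts the correct predictions and prints the ratio of correct predictions.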

from sklearn.model_selection import StratifiedKFold

from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):

    clone_clf = clone(sgd_clf)

    X_train_folds = X_train[train_index]

    y_train_folds = (y_train_5[train_index])

    X_test_fold = X_train[test_index]

    y_test_fold = (y_train_5[test_index])

    clone_clf.fit(X_train_folds, y_train_folds)

    y_pred = clone_clf.predict(X_test_fold)

    n_correct = sum(y_pred == y_test_fold)

    print(n_correct / len(y_pred)) # prints 0.9502, 0.96565 and 0.96495

# (For comparison) evaluate the SGDClassifier model with cross_val_score(), using K-fold cross-validation with k=3

from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

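# array([ 0.9502 , 0.96565, 0.96495])

Here cv=3 means 3-fold cross-validation: each round trains on 2/3 of the training set and evaluates on the remaining 1/3. (cv=3 is passed explicitly; the default has been 5 folds since scikit-learn 0.22.)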
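To check that the hand-written loop and cross_val_score() really do the same thing, here is a minimal sketch on synthetic data. make_classification stands in for MNIST to keep it fast; that substitution and the variable names are assumptions, not the original setup:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for MNIST (an assumption, to keep the example small).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
clf = SGDClassifier(random_state=42)

manual_scores, fold_sizes = [], []
for train_index, test_index in StratifiedKFold(n_splits=3).split(X, y):
    fold_clf = clone(clf)  # fresh, unfitted copy for this fold
    fold_clf.fit(X[train_index], y[train_index])
    y_pred = fold_clf.predict(X[test_index])
    manual_scores.append(np.mean(y_pred == y[test_index]))
    fold_sizes.append((len(train_index), len(test_index)))

# For a classifier and an integer cv, cross_val_score uses unshuffled
# StratifiedKFold internally, so the splits are the same as above.
library_scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")

print(fold_sizes)  # (train size, test size) per fold
print(np.allclose(manual_scores, library_scores))
```

Each fold trains on about two thirds of the samples and tests on the rest, and the two sets of scores should agree because the classifier has a fixed random_state.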

Error

cannot import name 'fetch_mldata' from 'sklearn.datasets'

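Reference: https://blog.csdn.net/qq_34769162/article/details/107961825

fetch_mldata was removed in newer scikit-learn releases, so the import depends on the installed version. Use fetch_openml instead: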

from sklearn.datasets import fetch_openml

dataset = fetch_openml("mnist_784")

Error

The number of classes has to be greater than one; got 1 class

    sgd_clf.fit(X_train, y_train_5)

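Reference: https://blog.csdn.net/Asher117/article/details/107444781

Change y_train_5 to y_train. Note that this turns the task into a 10-class problem, so the accuracies further down are for the multiclass task and are not comparable to the values quoted above. The error itself most likely comes from fetch_openml returning the labels as strings, so y_train == 5 is False everywhere; converting first with y = y.astype(np.uint8) keeps the binary 5-vs-rest target.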
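A likely mechanism behind the "got 1 class" error, assuming the labels arrive as strings (which is how fetch_openml delivers the mnist_784 target). The short list below is a hand-made stand-in for the real label column:

```python
# Labels as strings (a hand-made stand-in for the real label column).
labels = ["5", "0", "4", "1", "5"]

# Comparing a string with the integer 5 is always False in Python,
# so the "is it a 5?" target ends up containing a single class.
broken_target = [label == 5 for label in labels]

# Converting to integers first restores both classes.
int_labels = [int(label) for label in labels]
fixed_target = [label == 5 for label in int_labels]

print(set(broken_target))  # {False}
print(fixed_target)        # [True, False, False, False, True]
```

With the real data, the equivalent of the conversion step would be y = y.astype(np.uint8) before building y_train_5.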

Error

Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.

  File "D:\Projects\anacondaProjects\Cao_Liver_transplantation_Complication_prediction\cross_val.py", line 25, in <module>

    skfolds = StratifiedKFold(n_splits=3, random_state=42)

Fix it as the message suggests:

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
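A quick sketch of what random_state buys once shuffle=True is set: the shuffled folds become reproducible. The tiny label array and the helper shuffled_test_folds are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Tiny toy labels (assumed): 8 negatives and 4 positives.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))


def shuffled_test_folds(seed):
    """Return the test indices of each fold for a given seed."""
    splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in splitter.split(X, y)]


# Same seed, same folds; every sample appears in exactly one test fold.
same_seed_matches = shuffled_test_folds(42) == shuffled_test_folds(42)
all_test_indices = sorted(i for fold in shuffled_test_folds(42) for i in fold)

print(same_seed_matches)
print(all_test_indices)
```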

Error

None of [Int64Index([],\n           dtype='int64', length=40000)] are in the [columns]

X_train_folds = X_train[train_index]

Reference

https://blog.csdn.net/zhongkeyuanchongqing/article/details/120796789

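Indexing with [train_index] is NumPy array syntax. X_train is a pandas DataFrame here, so rows have to be selected by position with .iloc inside the loop:

    X_train_folds, X_test_fold = X_train.iloc[train_index], X_train.iloc[test_index]

    y_train_folds, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]

Result: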
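The difference between the two indexing styles can be reproduced on a tiny DataFrame. The column names and values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the MNIST feature table.
X = pd.DataFrame({"pixel1": [0, 10, 20, 30], "pixel2": [1, 11, 21, 31]})
train_index = np.array([0, 2, 3])

# Square brackets on a DataFrame look up COLUMNS, so integer row
# positions are not found and pandas raises a KeyError.
try:
    X[train_index]
    bracket_error = None
except KeyError as exc:
    bracket_error = exc

# .iloc selects rows by position, which is what the fold indices mean.
rows_by_position = X.iloc[train_index]

# Alternative: convert to a NumPy array once and keep the original syntax.
rows_as_array = X.to_numpy()[train_index]

print(type(bracket_error).__name__)
print(rows_by_position)
```

Another option, if the DataFrame is not needed at all, is to ask fetch_openml for NumPy arrays with as_frame=False (available in recent scikit-learn versions).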

0.87155

0.8499

0.8632

Clone

Reference

https://blog.csdn.net/ningyanggege/article/details/82687325

    clone_clf = clone(sgd_clf)

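Why the model is cloned instead of being reused directly:

Without clone, a new estimator object would have to be constructed for every fold before calling fit; refitting the same object overwrites the parameters learned on the previous fold. clone does that construction for you: it returns a new, unfitted estimator with the same hyperparameters.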
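A minimal sketch of what clone returns, on a four-sample toy dataset (the data is made up for illustration):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

# Four made-up samples, just enough to fit a classifier.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

fitted = SGDClassifier(random_state=42).fit(X, y)
fresh = clone(fitted)

# The clone is a different object with identical hyperparameters...
is_new_object = fresh is not fitted
same_params = fresh.get_params() == fitted.get_params()

# ...but it carries none of the learned state (no coef_ until it is fit).
fitted_has_coef = hasattr(fitted, "coef_")
fresh_has_coef = hasattr(fresh, "coef_")

print(is_new_object, same_params, fitted_has_coef, fresh_has_coef)
```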

Computing F1

Like cross_val_score(), cross_val_predict() performs K-fold cross-validation, but instead of returning evaluation scores it returns the predictions made on each test fold.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Passing these predictions to confusion_matrix() gives the confusion matrix:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

array([[53272, 1307],

        [ 1077, 4344]])
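The same two calls can be tried end to end on synthetic data. make_classification replaces MNIST here, which is an assumption to keep the sketch fast; the layout of the matrix follows scikit-learn's convention of rows for actual classes and columns for predicted classes:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic binary problem standing in for the "5 vs. not 5" task.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SGDClassifier(random_state=42)

# One out-of-fold prediction per sample: each sample is predicted by a
# model that never saw it during training.
y_pred = cross_val_predict(clf, X, y, cv=3)

cm = confusion_matrix(y, y_pred)
tn, fp, fn, tp = cm.ravel()  # [[TN, FP], [FN, TP]] for binary labels

print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```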

Scikit-Learn provides functions to compute classifier metrics, including precision, recall, and F1.

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred) # == 4344 / (4344 + 1307)

#0.76871350203503808

recall_score(y_train_5, y_train_pred) # == 4344 / (4344 + 1077)

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)
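The three metrics can also be checked by hand from the confusion matrix quoted above, which is laid out as [[TN, FP], [FN, TP]]. This is plain arithmetic and needs no library:

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 53272, 1307, 1077, 4344

precision = tp / (tp + fp)  # of everything predicted "5", how much was right
recall = tp / (tp + fn)     # of all real 5s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

F1 is the harmonic mean of precision and recall, so it is only high when both are high.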
