Demo Debugging
Implementing cross-validation. Reference:
https://github.com/apachecn/hands-on-ml-2e-zh/blob/master/docs/3.md
StratifiedKFold reference: https://blog.csdn.net/weixin_44110891/article/details/95240937
StratifiedKFold is used like KFold, but it performs stratified sampling, ensuring that the class proportions in each training and validation fold match those of the original dataset. For this reason StratifiedKFold is generally preferred.
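A minimal sketch of what stratification means, on toy data (the labels here are an assumed example, not from the notes): with an 80/20 class mix, every fold produced by StratifiedKFold keeps that same 4:1 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy dataset: 8 samples of class 0, 2 of class 1 (80% / 20%).
y = np.array([0] * 8 + [1] * 2)
X = np.arange(10).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    # Each training fold keeps the original 4:1 class ratio.
    print(np.bincount(y[train_idx]))  # [4 1]
```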
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X.shape  # (70000, 784)
y.shape  # (70000,)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# The StratifiedKFold class performs stratified sampling (explained in Chapter 2),
# producing folds that contain a representative ratio of each class.
# On each iteration, the code below creates a clone of the classifier, trains the
# clone on the training folds, and makes predictions on the test fold. It then
# counts the number of correct predictions and prints their ratio.
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9502, 0.96565 and 0.96495

# (For comparison) evaluate the SGDClassifier with cross_val_score(),
# using K-fold cross-validation with k=3:
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# array([0.9502, 0.96565, 0.96495])
With 3-fold cross-validation, each training split gets 2/3 of the data. (The book's era of scikit-learn defaulted to 3 folds; since scikit-learn 0.22 the default cv is 5, so cv=3 must be passed explicitly to reproduce these numbers.)
Error
cannot import name 'fetch_mldata' from 'sklearn.datasets'
Reference: https://blog.csdn.net/qq_34769162/article/details/107961825
Different scikit-learn versions require different imports:
from sklearn.datasets import fetch_openml

dataset = fetch_openml("mnist_784")
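One detail worth noting (an observation from the scikit-learn docs, not from the original notes): fetch_mldata was removed in scikit-learn 0.22, and its replacement fetch_openml may return a pandas DataFrame in recent versions, which is what later makes `X_train[train_index]` fail. Passing `as_frame=False` restores NumPy-array behavior; also, the labels come back as strings.

```python
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays, like the old fetch_mldata did; without
# it, recent scikit-learn versions may return a pandas DataFrame (the cause of
# the indexing error further down in these notes).
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist["data"], mnist["target"]  # X.shape == (70000, 784)
# Note: y contains strings here, so the 5-detector target must be built as:
# y_train_5 = (y[:60000] == '5')
```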
Error
The number of classes has to be greater than one; got 1 class
    sgd_clf.fit(X_train, y_train_5)
Reference: https://blog.csdn.net/Asher117/article/details/107444781
Change y_train_5 to y_train.
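An alternative explanation worth considering (my assumption, not verified against the original run): fetch_openml returns the MNIST labels as strings, so `(y_train == 5)` is False for every sample and the binary target collapses to a single class. Comparing against the string `'5'` (or casting y to int) keeps the binary 5-detector task instead of switching to multiclass. A toy illustration:

```python
import pandas as pd

# Toy string labels, mimicking what fetch_openml returns for MNIST targets.
y_train = pd.Series(['5', '0', '4', '1', '9'])

print((y_train == 5).sum())    # 0 -> every comparison is False: one class
print((y_train == '5').sum())  # 1 -> string comparison: two classes again
```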
Error
Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.
    File "D:\Projects\anacondaProjects\Cao_Liver_transplantation_Complication_prediction\cross_val.py", line 25, in <module>
        skfolds = StratifiedKFold(n_splits=3, random_state=42)
Fix as the message suggests:
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
Error
None of [Int64Index([], dtype='int64', length=40000)] are in the [columns]
    X_train_folds = X_train[train_index]
Reference
https://blog.csdn.net/zhongkeyuanchongqing/article/details/120796789
`X[train_index]` is NumPy-array syntax; with a pandas DataFrame, use positional indexing with .iloc when building the folds:
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
Results:
0.87155 0.8499 0.8632
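Putting the fixes together, a minimal self-contained sketch of the corrected loop on a toy DataFrame (the data here is randomly generated for illustration, not the notes' dataset): .iloc selects rows by position, which is exactly what the KFold indices are.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

# Toy DataFrame standing in for the real training data.
X = pd.DataFrame(np.random.RandomState(0).randn(60, 4))
y = pd.Series([0] * 30 + [1] * 30)

sgd_clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in skfolds.split(X, y):
    clone_clf = clone(sgd_clf)
    # .iloc does positional row selection, matching the NumPy indices.
    X_tr, X_te = X.iloc[train_index], X.iloc[test_index]
    y_tr, y_te = y.iloc[train_index], y.iloc[test_index]
    clone_clf.fit(X_tr, y_tr)
    acc = (clone_clf.predict(X_te) == y_te).mean()
    print(round(acc, 3))
```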
Clone
Reference
https://blog.csdn.net/ningyanggege/article/details/82687325
clone_clf = clone(sgd_clf)
Why use clone() rather than defining the model directly: defining it directly would require creating two separate instances of the class before fitting (otherwise the second fit would overwrite the first model's learned parameters); clone() performs exactly that fresh-instance step.
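A quick sketch of what clone() actually gives you: a brand-new, unfitted object that carries the same hyperparameters, so each cross-validation fold trains an independent model.

```python
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
clone_clf = clone(sgd_clf)

# clone() copies hyperparameters but not fitted state: a new object,
# identical settings.
print(clone_clf is sgd_clf)                            # False
print(clone_clf.get_params() == sgd_clf.get_params())  # True
```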
Computing F1
Like cross_val_score(), cross_val_predict() uses K-fold cross-validation. But instead of returning an evaluation score, it returns the predictions made on each test fold.
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
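A toy illustration of that behavior (random data, assumed for the example): cross_val_predict returns exactly one out-of-fold prediction per training sample, aligned with the input order, so the result can be compared element-wise against the true labels.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.array([0] * 30 + [1] * 30)

# One prediction per sample, each made by a model that never saw that sample.
y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3)
print(y_pred.shape)  # (60,) -- same length as y
```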
Use the confusion_matrix() function to get a confusion matrix:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)
# array([[53272,  1307],
#        [ 1077,  4344]])
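How to read that matrix: each row is an actual class and each column a predicted class, so for a binary problem the layout is [[TN, FP], [FN, TP]]; here TN=53272, FP=1307, FN=1077, TP=4344. A tiny hand-checkable example:

```python
from sklearn.metrics import confusion_matrix

# 3 actual negatives (one misclassified), 2 actual positives (one missed).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[2 1]     row 0: TN=2, FP=1
#  [1 1]]    row 1: FN=1, TP=1
```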
Scikit-Learn provides functions to compute classifier metrics, including precision, recall, and F1.
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1307)
# 0.76871350203503808
recall_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1077)
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)
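F1 is the harmonic mean of precision and recall, so it can be checked by hand from the confusion-matrix counts above (TP=4344, FP=1307, FN=1077):

```python
# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = 2 * P * R / (P + R)  (the harmonic mean of P and R).
precision = 4344 / (4344 + 1307)
recall = 4344 / (4344 + 1077)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ~0.785
```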