K-Fold Cross-Validation

ning_ww

Article link: https://blog.csdn.net/bb_sy_w/article/details/107574611

Splitting the dataset and scoring

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the dataset: columns 1, 3 and 4 are the features, column 80 (SalePrice) is the target
data = pd.read_csv('train.csv')
data = data.iloc[:, [1, 3, 4, 80]]
data = data.fillna(0)  # crude imputation: missing values become 0

X_train, X_val, y_train, y_val = train_test_split(
    data.iloc[:, :3], data.SalePrice, test_size=0.3, random_state=0)
# random_state seeds the shuffle, so the split is reproducible

# Fit three trees of increasing depth
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_3 = DecisionTreeRegressor(max_depth=10)
regr_1.fit(X_train, y_train)
regr_2.fit(X_train, y_train)
regr_3.fit(X_train, y_train)

# Score: R^2 on the training set, then on the validation set
for regr in [regr_1, regr_2, regr_3]:
    print(regr.score(X_train, y_train), regr.score(X_val, y_val))

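The original dataset has 81 columns. Only 3 of them are used as features here, no real preprocessing was done, and missing values are simply filled with 0, so the scores are low.
Still, the pattern is clear: as max_depth grows, the fit on the training set keeps improving, while the validation score rises slightly from depth 2 to depth 5 and then drops sharply at depth 10, a sign of severe overfitting.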

max_depth	train_score	val_score
2	0.240195091296191	0.23574989781877642
5	0.4587461320459466	0.2986072801772752
10	0.8603522174841458	0.12554005906320287
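
The same depth sweep can be tried without train.csv. This is a minimal sketch on synthetic data: the make_regression dataset, the noise level, and the extra unlimited-depth tree are all assumptions for illustration, so the numbers will not match the table above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: 3 noisy features).
X, y = make_regression(n_samples=500, n_features=3, noise=50.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in [2, 5, 10, None]:  # None lets the tree grow until leaves are pure
    regr = DecisionTreeRegressor(max_depth=depth, random_state=0)
    regr.fit(X_train, y_train)
    # Store (train R^2, validation R^2) for this depth.
    results[depth] = (regr.score(X_train, y_train), regr.score(X_val, y_val))
    print(depth, results[depth])
```

A widening gap between the two numbers in each pair is the overfitting signal to look for.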

Cross-validation metrics

The simplest approach: cross_val_score

from sklearn.model_selection import cross_val_score

def print_cv_scores(regr, X, y):
    # For a regressor, cross_val_score returns R^2 per fold by default, not accuracy
    cv_scores = cross_val_score(regr, X, y, cv=5)
    print(cv_scores, "R^2: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))

for regr in [regr_1, regr_2, regr_3]:
    print_cv_scores(regr, data.iloc[:, :3], data.SalePrice)

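This prints the mean score plus or minus two standard deviations of the fold scores, often described as a 95% confidence interval. The 95% comes from the normal distribution: about 95% of values lie within roughly two (more precisely 1.96) standard deviations of the mean. With only 5 fold scores, which are not independent of each other, it is better read as a rough indication of spread than as a rigorous confidence interval.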
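
Where the factor of 2 comes from can be checked with a few lines. This is only a sketch: the fold scores below are made up for illustration and are not results from the housing data.

```python
from statistics import NormalDist

import numpy as np

# Hypothetical fold scores, invented for this example only.
fold_scores = np.array([0.21, 0.25, 0.19, 0.30, 0.24])

# For a normal distribution, the central 95% lies within z standard
# deviations of the mean, where z is the 97.5th percentile (about 1.96).
z = NormalDist().inv_cdf(0.975)

mean = fold_scores.mean()
half_width = 2 * fold_scores.std()  # the "* 2" used in the print above
print("z for 95%%: %.3f" % z)
print("R^2: %0.2f (+/- %0.2f)" % (mean, half_width))
```

Rounding 1.96 up to 2 is where the "std * 2" shortcut comes from.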

K-fold cross-validation

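All samples are split into k groups, called folds. The prediction function is trained on k - 1 folds, and the one remaining fold is used for testing. This is repeated k times, so each fold serves as the test set exactly once.
As a general rule, most authors and empirical evidence suggest that 5-fold or 10-fold cross-validation works best.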
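
The splitting rule described above can be sketched by hand with NumPy. The helper name k_fold_indices is hypothetical and this is not how scikit-learn implements it; it only illustrates the idea with contiguous, unshuffled folds.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (k >= 2)."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal groups
    for i in range(k):
        test_idx = folds[i]
        # Train on the other k - 1 folds.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Every sample lands in the test set exactly once and in the training set k - 1 times.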

A simple example

An example of 2-fold cross-validation on a dataset with 4 samples:

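import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
# Each iteration yields the train indices and the test indices of one split
for train, test in kf.split(X):
    print("%s  %s" % (train, test))
# Prints:
# [2 3]  [0 1]
# [0 1]  [2 3]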

Applying it to our own dataset

# K-fold cross-validation on the housing data
from sklearn.model_selection import KFold

kf = KFold(n_splits=2)

X = data.iloc[:, :3].to_numpy()
y = data.SalePrice.to_numpy()
for train, test in kf.split(X):
    # train and test are integer index arrays into X and y
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    print(X_train.shape, X_test.shape)
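
The loop above only slices the data. A natural next step is to fit and score a model inside it. The sketch below does that on synthetic data (an assumption, so it runs without train.csv) and checks that it gives the same per-fold scores as cross_val_score when both use the same KFold object.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (assumption, not the housing data).
X, y = make_regression(n_samples=200, n_features=3, noise=20.0, random_state=0)

kf = KFold(n_splits=5)
regr = DecisionTreeRegressor(max_depth=5, random_state=0)

manual_scores = []
for train, test in kf.split(X):
    regr.fit(X[train], y[train])  # refit from scratch on k - 1 folds
    manual_scores.append(regr.score(X[test], y[test]))  # R^2 on the held-out fold
manual_scores = np.array(manual_scores)

# cross_val_score runs an equivalent loop when given the same splitter.
auto_scores = cross_val_score(regr, X, y, cv=kf)
print(manual_scores)
print(auto_scores)
```

Fixing random_state on the tree keeps the two runs comparable.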

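Repeated K-fold cross-validation

RepeatedKFold repeats K-Fold n times. Use it when you need to run KFold n times, with a different split in each repetition.
An example of 2-fold K-Fold repeated 2 times: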

# Example: 2-fold cross-validation repeated twice
from sklearn.model_selection import RepeatedKFold

random_state = 123455
rkf = RepeatedKFold(n_splits=2,n_repeats=2,random_state = random_state)
for train, test in rkf.split(data):
    print("%s  %s" % (train, test))
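
On the full dataset the printed index arrays are long and hard to read. Here is a smaller sketch on four toy samples (the data and the seed 0 are assumptions for illustration) showing that the number of splits is n_splits * n_repeats and that each repetition is a complete 2-fold partition.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(4).reshape(-1, 1)  # four toy samples
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)

splits = list(rkf.split(X))
print(rkf.get_n_splits())  # n_splits * n_repeats
for train, test in splits:
    print(train, test)
```

The first two splits belong to the first repetition and the last two to the second.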
