[Machine Learning (10)] Model Evaluation: Dataset Splitting Methods (Hold-Out and Cross-Validation)

lys_828

Category: Machine Learning. Tags: machine learning, python, deep learning, data analysis, data mining

Article link: https://blog.csdn.net/lys_828/article/details/104677832

Dataset splitting methods

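1) Basic rule for splitting: keep the training set and the validation set mutually exclusive
Explanation: as far as possible, samples used for validation should not appear in the training set, so that performance on the validation set reflects how well the model generalizes (for example, the questions on a final exam should not be the exact problems worked through in class)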

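2) Hold-out method:
      Split the dataset directly into two mutually exclusive sets, one used as the training set and the other as the validation set
      Common split ratios: 7:3, 7.5:2.5, 8:2
      Problem: the result depends on the random sample (one exam may happen to draw only questions I can answer, while the next happens to draw ones I cannot), so a single test result may not reflect my true level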

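3) Cross-validation (CV)
Split the dataset into k mutually exclusive subsets of similar size. In each round, k-1 subsets are used for training and the remaining subset for validation. After k rounds, the reported score is the mean of the k validation results, which is why the method is also called k-fold cross-validation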
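
To make the three points above concrete, here is a minimal sketch on synthetic data. The dataset from `make_regression`, the plain `LinearRegression` model, and names such as `holdout_scores` and `times_validated` are assumptions for illustration only, not the data or pipeline used in this article. It shows that a hold-out score moves with the random split, and that k-fold validates every sample exactly once before averaging.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split

# Synthetic regression data (an assumption; not the article's dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=30.0, random_state=0)

# Hold-out: repeat the split with different seeds and watch the score move.
holdout_scores = []
for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_scores.append(r2_score(y_va, model.predict(X_va)))
print('hold-out R2 per seed:', np.round(holdout_scores, 4))

# k-fold: k mutually exclusive subsets, each used once for validation.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
times_validated = np.zeros(len(X), dtype=int)
fold_scores = []
for train_idx, valid_idx in kf.split(X):
    # Training and validation indices never overlap within a fold.
    assert not set(train_idx.tolist()) & set(valid_idx.tolist())
    times_validated[valid_idx] += 1
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(r2_score(y[valid_idx], model.predict(X[valid_idx])))

print('every sample validated exactly once:', bool((times_validated == 1).all()))
print('k-fold mean R2:', round(float(np.mean(fold_scores)), 4))
```

The hold-out loop reports one number per split, while the k-fold loop reports a single average over all folds, which is the property the rest of this post relies on.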

Splitting the data with the hold-out method and scoring the model

Here the data is split directly

from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing; random_state fixes the shuffle
training, testing = train_test_split(df, test_size=0.25, random_state=1)
# Features: every column except the target and the row identifier
x_train = training.drop(columns=['average_price', 'id'])
y_train = training['average_price']
x_test = testing.drop(columns=['average_price', 'id'])
y_test = testing['average_price']
print(f'the shape of the training set is {x_train.shape}')
print(f'the shape of the testing set is {x_test.shape}')

--> Output (the data is split 0.75:0.25):

the shape of the training set is (673, 7)
the shape of the testing set is (225, 7)
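
The two row counts above can be checked by hand. A small sketch, assuming (as far as I know from scikit-learn's behavior) that `train_test_split` rounds the test count up when `test_size` is a fraction and gives the remaining rows to the training set:

```python
import math

n_samples = 673 + 225   # total rows implied by the two shapes printed above
test_size = 0.25

# 0.25 * 898 = 224.5, which is rounded up for the test set
n_test = math.ceil(test_size * n_samples)
# the training set receives whatever is left
n_train = n_samples - n_test

print(f'test rows: {n_test}, train rows: {n_train}')
```

This is why the split is not exactly 0.75:0.25 when the row count is not divisible by four.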

Check how the model performs on the hold-out validation set

import warnings
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
warnings.filterwarnings('ignore')  # silence library warnings in the output
# pipe_lm is the model pipeline, assumed to be defined before this section
pipe_lm.fit(x_train, y_train)
y_predict = pipe_lm.predict(x_test)
print(f'mean squared error is: {mean_squared_error(y_test, y_predict)}')
print(f'mean absolute error is: {mean_absolute_error(y_test, y_predict)}')
print(f'R Squared is: {r2_score(y_test, y_predict)}')

--> Output:

mean squared error is: 37995892.98761668
mean absolute error is: 4396.432366811368
R Squared is: 0.5719194373282056

Splitting the data with cross-validation and scoring the model

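from sklearn.model_selection import KFold

# x and y are assumed to be the features and target taken from df,
# using the same columns as in the hold-out split above
x = df.drop(columns=['average_price', 'id'])
y = df['average_price']

k = 10
kf = KFold(n_splits=k, shuffle=True)  # no random_state, so folds differ between runs

mse = []
mae = []
r_s2 = []

for train_index, test_index in kf.split(x):  # split into k folds
    # kf.split yields positional indices, so select rows with iloc
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    pipe_lm.fit(x_train, y_train)
    y_predict = pipe_lm.predict(x_test)  # predict on the held-out fold
    k_mse = mean_squared_error(y_test, y_predict)
    mse.append(k_mse)
    print(f'mean squared error is {k_mse}')
    k_mae = mean_absolute_error(y_test, y_predict)
    mae.append(k_mae)
    print(f'mean absolute error is {k_mae}')
    k_r_s2 = r2_score(y_test, y_predict)
    r_s2.append(k_r_s2)
    print(f'R Squared is {k_r_s2}')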

--> Output (one block of results per fold, ten in total):

mean squared error is 33944053.2775839
mean absolute error is 4091.3521198501926
R Squared is 0.5092029932534381
mean squared error is 35114434.65383525
mean absolute error is 4058.507125221178
R Squared is 0.5689849523694475
mean squared error is 18197156.68175975
mean absolute error is 3369.9995687271
R Squared is 0.776182527685076
mean squared error is 34535243.46299915
mean absolute error is 4192.820971274831
R Squared is 0.5779088133315295
mean squared error is 40233080.92316229
mean absolute error is 4097.433595440258
R Squared is 0.5424187136274303
mean squared error is 30563555.203029394
mean absolute error is 4036.1654178309395
R Squared is 0.599523380094302
mean squared error is 45418168.17460259
mean absolute error is 4849.161828850654
R Squared is 0.5349900325354779
mean squared error is 39047300.59608697
mean absolute error is 4684.974341683951
R Squared is 0.6127954527533661
mean squared error is 40672733.401221015
mean absolute error is 4738.685116862728
R Squared is 0.5256179983368192
mean squared error is 37635022.702838585
mean absolute error is 4099.567148872294
R Squared is 0.4709723577684112

Finally, take the mean of the ten results

import numpy as np
print(f'mean squared error is {np.array(mse).mean()}')
print(f'mean absolute error is {np.array(mae).mean()}')
print(f'R Squared is {np.array(r_s2).mean()}')

--> Output:

mean squared error is 35536074.90771189
mean absolute error is 4221.866723461413
R Squared is 0.5718597221755297
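
The manual loop and averaging can also be written more compactly with scikit-learn's `cross_validate`. The sketch below is an alternative, not the code used in this post: the synthetic data from `make_regression` and the `StandardScaler` plus `LinearRegression` pipeline named `pipe` are stand-ins I am assuming, since `df` and `pipe_lm` are not defined in this section.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and pipeline (assumptions, not the article's df / pipe_lm).
X, y = make_regression(n_samples=300, n_features=7, noise=25.0, random_state=0)
pipe = make_pipeline(StandardScaler(), LinearRegression())

# Same fold setup as the manual loop, with a seed so the run is repeatable.
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# Error metrics are exposed as negated scores so that higher is always better.
scoring = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']
results = cross_validate(pipe, X, y, cv=cv, scoring=scoring)

mse_mean = -results['test_neg_mean_squared_error'].mean()
mae_mean = -results['test_neg_mean_absolute_error'].mean()
r2_mean = results['test_r2'].mean()

print(f'mean squared error is {mse_mean}')
print(f'mean absolute error is {mae_mean}')
print(f'R Squared is {r2_mean}')
```

Each entry in `results` holds one value per fold, so the per-fold numbers remain available if the spread matters as well as the mean.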

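Comparing the results

The R² scores from the two methods are essentially the same (about 0.5719 for both), while the MAE and MSE from cross-validation are lower than those from the hold-out split (MAE 4221.87 vs. 4396.43, MSE 35536074.91 vs. 37995892.99). The per-fold R² values range from about 0.47 to 0.78, which shows how much a single random split can swing the score. Averaging over ten folds smooths this out, so cross-validation can be regarded as an effective way to limit the influence of random sampling on the evaluation.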
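
The spread between folds can be quantified from the ten R² values printed earlier. A small sketch that only reuses numbers already shown in this post (the variable names are mine):

```python
import numpy as np

# The ten per-fold R Squared values from the cross-validation output above.
fold_r2 = np.array([
    0.5092029932534381, 0.5689849523694475, 0.776182527685076,
    0.5779088133315295, 0.5424187136274303, 0.599523380094302,
    0.5349900325354779, 0.6127954527533661, 0.5256179983368192,
    0.4709723577684112,
])
holdout_r2 = 0.5719194373282056  # the single hold-out score

print(f'mean of folds: {fold_r2.mean()}')
print(f'std of folds:  {fold_r2.std()}')
print(f'min / max:     {fold_r2.min()} / {fold_r2.max()}')
print(f'hold-out minus fold mean: {holdout_r2 - fold_r2.mean()}')
```

Any single fold behaves like one hold-out split, so the gap between the minimum and maximum gives a rough idea of how far a single hold-out score could land from the averaged one.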
