train_test_split(): a function for randomly splitting data into training and test sets

黑蛋最爱啾啾

Article link: https://blog.csdn.net/jiushinayang/article/details/81098186

1. Official documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split

2. train_test_split() is a function in the model_selection module of the sklearn package that randomly splits data into train subsets and test subsets.

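Prototype: train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Parameters:

*arrays: the data to split. One or more indexable sequences of the same length (lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames).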

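test_size: size of the test set. If a float, it must be between 0.0 and 1.0 and gives the proportion of the data placed in the test set; if an int, it is the absolute number of test samples; if None, it is set to the complement of train_size. The default is None, and if train_size is also None, 0.25 is used. Older scikit-learn releases documented the default as 0.25 directly, so check the documentation for the installed version.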

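train_size: size of the training set. If a float, it must be between 0.0 and 1.0 and gives the proportion of the data placed in the training set; if an int, it is the absolute number of training samples; if None, it is set to the complement of test_size.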

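random_state: controls the shuffling applied before the split. An int, a RandomState instance, or None. If an int, it is the seed of the random number generator; if a RandomState instance, it is the random number generator itself; if None, the global NumPy random state (np.random) is used, so the split changes from run to run.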

shuffle: bool, default True. Whether to shuffle the data before splitting. If shuffle=False, stratify must be None.

stratify: array-like or None, default None. If not None, the data is split in a stratified fashion, using this array as the class labels.
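
To illustrate the stratify parameter, here is a minimal sketch. The toy data (20 samples, 16 of class 0 and 4 of class 1) is made up for this example and is not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 80% class 0 and 20% class 1.
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 16 + [1] * 4)

# stratify=y keeps the 80/20 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print(train_counts)  # [12  3]
print(test_counts)   # [4 1]
```

Without stratify, a small test set like this one could end up with no class 1 samples at all, which is the usual reason to pass the labels here for imbalanced data.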

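Returns: the train and test subsets of each input array, as a list of length 2 * len(arrays), ordered train then test for each array.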
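
The following sketch shows how the size parameters and the return value behave together. The 10-sample data set and the chosen sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus 10 labels.
X = np.arange(20).reshape((10, 2))
y = list(range(10))

# Float test_size: a fraction of the samples (0.2 * 10 = 2 test samples).
parts = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = parts
print(len(parts), X_train.shape, X_test.shape)  # 4 (8, 2) (2, 2)

# Integer test_size: an absolute number of test samples.
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X, y, test_size=4, random_state=0
)
print(len(Xi_train), len(Xi_test))  # 6 4

# Neither size given: the test set falls back to 25% of the data, rounded up.
Xd_train, Xd_test = train_test_split(X, random_state=0)
print(len(Xd_train), len(Xd_test))  # 7 3
```

Two input arrays yield four outputs and one input array yields two, which is why the results are usually unpacked directly into named variables.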

3. Examples

(1) Fixing the random seed (random_state) makes the split into training and test sets exactly the same on every call.

# Import packages
import numpy as np
from sklearn import model_selection


# Build the data set
x, y = np.arange(10).reshape((5, 2)), range(5)

# x = array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])
# y = range(0, 5)


# Split without fixing the random seed (random_state): the same code can produce different training sets.
x_train0, x_test0, y_train0, y_test0 = model_selection.train_test_split(x, y, test_size=0.33)
x_train1, x_test1, y_train1, y_test1 = model_selection.train_test_split(x, y, test_size=0.33)

# One possible result (it varies from run to run):
# x_train0 = array([[2, 3], [6, 7], [4, 5]])
# x_train1 = array([[4, 5], [8, 9], [0, 1]])


# Split with a fixed random seed (random_state): the same code produces the same training set.
x_train2, x_test2, y_train2, y_test2 = model_selection.train_test_split(x, y, test_size=0.33, random_state=42)
x_train3, x_test3, y_train3, y_test3 = model_selection.train_test_split(x, y, test_size=0.33, random_state=42)

# x_train2 = array([[4, 5], [0, 1], [6, 7]])
# x_train3 = array([[4, 5], [0, 1], [6, 7]])
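
Rather than comparing printed arrays by eye, the reproducibility claim can be checked programmatically. This is a small sketch with variable names chosen for illustration. It also passes a RandomState instance, which the parameter description allows in place of an integer seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x, y = np.arange(10).reshape((5, 2)), range(5)

# Same integer seed: identical splits on every call.
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.33, random_state=42)
same_with_seed = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_with_seed)  # True

# A freshly created RandomState(42) gives the same first split as the seed 42.
rng = np.random.RandomState(42)
c_train, c_test, _, _ = train_test_split(x, y, test_size=0.33, random_state=rng)
same_with_instance = np.array_equal(a_train, c_train)
print(same_with_instance)  # True
```

Note that a RandomState instance advances each time it is used, so reusing the same rng object for a second call does not repeat the split. An integer seed is the simpler choice when the goal is a repeatable split.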

(2) The shuffle parameter defaults to True, which shuffles the order of the data before splitting.

# Import packages
import numpy as np
from sklearn import model_selection


# Build the data set
x, y = np.arange(10).reshape((5, 2)), range(5)

# x = array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])
# y = range(0, 5)


# Change only the shuffle parameter
x_train4, x_test4, y_train4, y_test4 = model_selection.train_test_split(x, y, test_size=0.33, random_state=42, shuffle=True)
x_train5, x_test5, y_train5, y_test5 = model_selection.train_test_split(x, y, test_size=0.33, random_state=42, shuffle=False)

# x_train4 = array([[4, 5], [0, 1], [6, 7]])  the rows are shuffled
# x_train5 = array([[0, 1], [2, 3], [4, 5]])  the rows keep the order of the original data set
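
As a further sketch of shuffle=False, the code below shows that the split is a plain head/tail cut, and that combining shuffle=False with stratify is rejected, as the parameter description states. The labels in y are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape((5, 2))
y = [0, 0, 0, 1, 1]  # hypothetical class labels

# shuffle=False: the first rows go to training, the last rows to testing.
x_train, x_test = train_test_split(x, test_size=2, shuffle=False)
print(x_train.tolist())  # [[0, 1], [2, 3], [4, 5]]
print(x_test.tolist())   # [[6, 7], [8, 9]]

# shuffle=False together with stratify raises a ValueError.
try:
    train_test_split(x, y, test_size=2, shuffle=False, stratify=y)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

Keeping the original order this way is mainly useful when row order carries meaning, for example time-ordered records where the test set should come after the training set.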

References:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split

https://blog.csdn.net/yaj13346943285/article/details/71630189?locationNum=6&fps=1