sklearn is a Python module for machine learning.
train_test_split is a function for splitting a dataset into train and test sets.
1. Function prototype
1.1 Parameters:
def train_test_split(*arrays, **options):
"""
Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.
test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. By default, the value is set to 0.25.
The default will change in version 0.21. It will remain 0.25 only
if ``train_size`` is unspecified, otherwise it will complement
the specified ``train_size``.
train_size : float, int, or None, default None
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False
then stratify must be None.
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as
the class labels.
"""
*arrays: indexable sequences that share the same length / shape[0]; lists, numpy arrays, scipy-sparse matrices, and pandas dataframes are all allowed.
test_size: one of three types: float, int, or None. Optional.
float: between 0.0 and 1.0; the proportion of the dataset to put in the test split.
int: the absolute number of test samples.
None: set to the complement of train_size.
default: 0.25, but only when train_size is also unspecified; if train_size is set, test_size is computed as its complement.
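The float and int behaviors of test_size described above can be sketched with a toy array (the data here are arbitrary, chosen only to make the counts easy to check):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

# Float: a proportion of the dataset -> ceil(10 * 0.3) = 3 test samples.
Xf_train, Xf_test = train_test_split(X, test_size=0.3, random_state=0)
print(len(Xf_train), len(Xf_test))  # 7 3

# Int: an absolute number of test samples.
Xi_train, Xi_test = train_test_split(X, test_size=4, random_state=0)
print(len(Xi_train), len(Xi_test))  # 6 4
```

Note that a fractional test size is rounded up, which matters on small datasets (as in section 2.2 below).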
train_size: one of three types: float, int, or None.
float: between 0.0 and 1.0; the proportion of the dataset to put in the train split.
int: the absolute number of training samples.
None: set to the complement of test_size.
default: None.
random_state: one of three types: int, RandomState instance, or None.
int: used as the seed of the random number generator, so every call produces the same split.
RandomState instance: used directly as the random number generator itself (not as a seed).
None: the generator is the RandomState instance used by np.random, so the split varies between runs.
The same seed always produces the same sequence of random numbers; different seeds produce different sequences.
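The reproducibility point above can be verified with a short sketch (arbitrary toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)

# The same integer seed produces the identical split on every call.
a_train, a_test = train_test_split(X, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=42)
print(np.array_equal(a_test, b_test))  # True

# With random_state=None the generator comes from np.random,
# so repeated calls generally give different splits.
```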
shuffle: boolean, optional, default True. Whether to shuffle the data before splitting. If shuffle=False, stratify must be None.
stratify: array-like or None, default None. If not None, the data is split in a stratified fashion, using this array as the class labels.
When None, the class-label proportions in the resulting train and test sets are left to chance.
When not None, the class-label proportions in both splits match those of the input array, which is useful for imbalanced datasets.
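A minimal sketch of stratified splitting on an imbalanced dataset (the 80/20 class counts below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# The 4:1 class ratio is preserved: 60/15 in train, 20/5 in test.
print((y_train == 0).sum(), (y_train == 1).sum())  # 60 15
print((y_test == 0).sum(), (y_test == 1).sum())    # 20 5
```

Without stratify=y, the minority class could by chance end up over- or under-represented in the test set.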
1.2 Return value:
Returns
-------
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
.. versionadded:: 0.16
If the input is sparse, the output will be a
``scipy.sparse.csr_matrix``. Else, output type is the same as the input type.
Returns a list of length 2 * len(arrays), containing the train-test splits of the inputs.
Added in version 0.16: if the input is sparse, the output will be a scipy.sparse.csr_matrix; otherwise the output type matches the input type.
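The 2 * len(arrays) rule can be checked by passing three arrays at once (toy data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = np.arange(5)
z = list("abcde")

# Three inputs -> a list of 2 * 3 = 6 pieces, in order:
# X_train, X_test, y_train, y_test, z_train, z_test.
result = train_test_split(X, y, z, test_size=0.4, random_state=0)
print(len(result))  # 6
```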
1.3 Examples
Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
2. Hands-on tests
2.1 Prepare the data
10 data points and 10 labels in total.
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10)
y = np.array(['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a', 'b', 'c'])
print("x:", x)
print("y:", y)
print("############")
2.2 test_size=0.23, random_state=2
test_size=0.23 is rounded up, so the test set has ceil(10 * 0.23) = 3 samples. Because random_state is set to the integer 2, every run produces the same split.
# Note the return order: x_train, x_test, y_train, y_test.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=2)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
2.3 test_size=0.23, random_state=None
Every run produces a different split. Leaving random_state unset has the same effect, since None is the default.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=None)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
Running this twice gives two different results.
2.4 test_size=0.23, random_state=None, shuffle=True
shuffle=True is the default, so the data are shuffled before splitting. Because random_state is None here, the result still changes on every run; with an integer random_state the split would be reproducible even with shuffling.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=None, shuffle=True)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
2.5 test_size=0.23, random_state=None, shuffle=True, stratify=y
stratify=y splits the data according to the label proportions in y. In the original data the ratio a:b:c is 3:4:3.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=None, shuffle=True, stratify=y)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
The resulting splits keep the label proportions: a:b:c is 2:3:2 in the training set and 1:1:1 in the test set.
Without stratify, the splits can come out noticeably imbalanced; the effect is mild on a dataset this small, but it becomes more pronounced as the data grow.
That's all.