sklearn is a Python module for machine learning.
train_test_split is a function for splitting a dataset into train and test sets.
1. Function prototype
1.1 Parameters:
def train_test_split(*arrays, **options):
"""
Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.
test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. By default, the value is set to 0.25.
The default will change in version 0.21. It will remain 0.25 only
if ``train_size`` is unspecified, otherwise it will complement
the specified ``train_size``.
train_size : float, int, or None, default None
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False
then stratify must be None.
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as
the class labels.
"""
*arrays: indexable sequences that share the same length / shape[0]; lists, numpy arrays, scipy-sparse matrices, and pandas dataframes are all allowed.
test_size: one of three types: float, int, or None. Optional.
float: between 0.0 and 1.0; the proportion of the dataset to put in the test split.
int: the absolute number of test samples.
None: set to the complement of train_size.
default: 0.25, but only when train_size is also unspecified; if train_size is set, test_size is computed as its complement.
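The float and int behaviors of test_size described above can be sketched with a toy array (the data here are arbitrary, chosen only to make the counts easy to check):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

# Float: a proportion of the dataset -> ceil(10 * 0.3) = 3 test samples.
Xf_train, Xf_test = train_test_split(X, test_size=0.3, random_state=0)
print(len(Xf_train), len(Xf_test))  # 7 3

# Int: an absolute number of test samples.
Xi_train, Xi_test = train_test_split(X, test_size=4, random_state=0)
print(len(Xi_train), len(Xi_test))  # 6 4
```

Note that a fractional test size is rounded up, which matters on small datasets (as in section 2.2 below).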
train_size: one of three types: float, int, or None.
float: between 0.0 and 1.0; the proportion of the dataset to put in the train split.
int: the absolute number of training samples.
None: set to the complement of test_size.
default: None.
random_state: one of three types: int, RandomState instance, or None.
int: used as the seed of the random number generator, so every call produces the same split.
RandomState instance: used directly as the random number generator itself (not as a seed).
None: the generator is the RandomState instance used by np.random, so the split varies between runs.
The same seed always produces the same sequence of random numbers; different seeds produce different sequences.
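The reproducibility point above can be verified with a short sketch (arbitrary toy data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)

# The same integer seed produces the identical split on every call.
a_train, a_test = train_test_split(X, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=42)
print(np.array_equal(a_test, b_test))  # True

# With random_state=None the generator comes from np.random,
# so repeated calls generally give different splits.
```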
shuffle: boolean, optional, default True. Whether to shuffle the data before splitting. If shuffle=False, stratify must be None.
stratify: array-like or None, default None. If not None, the data is split in a stratified fashion, using this array as the class labels.
When None, the class-label proportions in the resulting train and test sets are left to chance.
When not None, the class-label proportions in both splits match those of the input array, which is useful for imbalanced datasets.
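A minimal sketch of stratified splitting on an imbalanced dataset (the 80/20 class counts below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# The 4:1 class ratio is preserved: 60/15 in train, 20/5 in test.
print((y_train == 0).sum(), (y_train == 1).sum())  # 60 15
print((y_test == 0).sum(), (y_test == 1).sum())    # 20 5
```

Without stratify=y, the minority class could by chance end up over- or under-represented in the test set.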
1.2 Return value:
Returns
-------
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
.. versionadded:: 0.16
If the input is sparse, the output will be a
``scipy.sparse.csr_matrix``. Else, output type is the same as the input type.
Returns a list of length 2 * len(arrays), containing the train-test splits of the inputs.
Added in version 0.16: if the input is sparse, the output will be a scipy.sparse.csr_matrix; otherwise the output type matches the input type.
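The 2 * len(arrays) rule can be checked by passing three arrays at once (toy data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = np.arange(5)
z = list("abcde")

# Three inputs -> a list of 2 * 3 = 6 pieces, in order:
# X_train, X_test, y_train, y_test, z_train, z_test.
result = train_test_split(X, y, z, test_size=0.4, random_state=0)
print(len(result))  # 6
```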
1.3 Examples
Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
2. Hands-on tests
2.1 Prepare the data
10 data points and 10 labels in total.
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10)
y = np.array(['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a', 'b', 'c'])
print("x:", x)
print("y:", y)
print("############")
2.2 test_size=0.23, random_state=2
test_size=0.23 is rounded up, so the test set has ceil(10 * 0.23) = 3 samples. Because random_state is set to the integer 2, every run produces the same split.
# Note the return order: x_train, x_test, y_train, y_test.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=2)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
2.3 test_size=0.23, random_state=None
Every run produces a different split. Leaving random_state unset has the same effect, since None is the default.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=None)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
Running this twice gives two different results.
2.4 test_size=0.23, random_state=None, shuffle=True
shuffle=True is the default, so the data are shuffled before splitting. Because random_state is None here, the result still changes on every run; with an integer random_state the split would be reproducible even with shuffling.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=None, shuffle=True)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
2.5 test_size=0.23, random_state=None, shuffle=True, stratify=y
stratify=y splits the data according to the label proportions in y. In the original data the ratio a:b:c is 3:4:3.
x_train, x_test, y_train, y_test = train_test_split(x, y,
    test_size=0.23, random_state=None, shuffle=True, stratify=y)
print("x_train:", x_train.shape, x_train)
print("y_train:", y_train.shape, y_train)
print("x_test:", x_test.shape, x_test)
print("y_test:", y_test.shape, y_test)
The resulting splits keep the label proportions: a:b:c is 2:3:2 in the training set and 1:1:1 in the test set.
Without stratify, the splits can come out noticeably imbalanced; the effect is mild on a dataset this small, but it becomes more pronounced as the data grow.
That's all.