sklearn.model_selection.train_test_split usage

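I had never used sklearn before, so I am learning it by practicing. I am writing down the common usage patterns here so that I can gradually build up a summary.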

  • sklearn.model_selection.train_test_split usage
When doing machine learning in Python, the train_test_split function in sklearn.model_selection is commonly used to split a dataset into training data (training samples) and testing data (testing samples).
How to call train_test_split:
sklearn.model_selection.train_test_split(*arrays, **options)
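Commonly used arguments of train_test_split:
arrays: the objects to split. Lists, numpy arrays, matrices, or pandas DataFrames, all of the same length.
test_size: can be given in two ways. 1: a float strictly between 0.0 and 1.0, the proportion of the data that goes into the test set. 2: an integer, the absolute number of test samples, which must be smaller than the number of samples in the dataset. If test_size is not given, it is set to the complement of train_size. If train_size is not given either, test_size defaults to 0.25.
train_size: analogous to test_size, but for the training set.
random_state: the seed for the random number generator that shuffles the data before splitting. Passing an integer makes the split reproducible. If it is not given, the global numpy.random state is used (which can be seeded with numpy.random.seed).
shuffle and stratify are used less often. shuffle (default True) controls whether the data is shuffled before splitting. stratify splits the data in a stratified fashion, keeping the class proportions of the given labels in both sets.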
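As a minimal sketch of the arguments described above, the snippet below splits two same-length arrays in one call and fixes random_state so the split is reproducible. The names X and y and the toy values are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 12 samples with 2 features each, plus 12 labels.
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# Several arrays of the same length are split with the same row selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Repeating the call with the same seed gives the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)      # (9, 2) (3, 2)
print(np.array_equal(y_test, y_test2))  # True
```

Because the rows of X and y are selected together, each feature row stays paired with its label after the split.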
train_test_split usage examples:
The dataset has 4 columns and 12 rows (records).
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> 
>>> namelist = pd.DataFrame({
...    "name" : ["Suzuki", "Tanaka", "Yamada", "Watanabe", "Yamamoto",
...              "Okada", "Ueda", "Inoue", "Hayashi", "Sato",
...              "Hirayama", "Shimada"],
...    "age": [30, 40, 55, 29, 41, 28, 42, 24, 33, 39, 49, 53],
...    "department": ["HR", "Legal", "IT", "HR", "HR", "IT",
...                   "Legal", "Legal", "IT", "HR", "Legal", "Legal"],
...    "attendance": [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
... })
>>> print(namelist)
 
    age  attendance department      name
0    30           1         HR    Suzuki
1    40           1      Legal    Tanaka
2    55           1         IT    Yamada
3    29           0         HR  Watanabe
4    41           1         HR  Yamamoto
5    28           1         IT     Okada
6    42           1      Legal      Ueda
7    24           0      Legal     Inoue
8    33           0         IT   Hayashi
9    39           1         HR      Sato
10   49           1      Legal  Hirayama
11   53           1      Legal   Shimada
Set the testing data to 0.3 (test_size=0.3) to split the data into testing and training sets. With 12 rows, this gives 4 testing rows and 8 training rows.
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=0.3)
>>> print(namelist_train)
 
    age  attendance department      name
10   49           1      Legal  Hirayama
1    40           1      Legal    Tanaka
7    24           0      Legal     Inoue
2    55           1         IT    Yamada
4    41           1         HR  Yamamoto
3    29           0         HR  Watanabe
9    39           1         HR      Sato
6    42           1      Legal      Ueda
 
>>> print(namelist_test)
 
    age  attendance department     name
0    30           1         HR   Suzuki
8    33           0         IT  Hayashi
11   53           1      Legal  Shimada
5    28           1         IT    Okada
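A fractional test_size does not always divide the row count evenly: 0.3 × 12 = 3.6. The output above contains 4 test rows, which matches rounding the test count up. The sketch below checks that arithmetic, using a plain list of 12 integers as a stand-in (an assumption for brevity) for the 12 rows of namelist.

```python
import math
from sklearn.model_selection import train_test_split

rows = list(range(12))  # stand-in for the 12 rows of namelist

train, test = train_test_split(rows, test_size=0.3)

# 0.3 * 12 = 3.6 is rounded up to 4 test rows; the other 8 go to training.
expected_test = math.ceil(0.3 * len(rows))
print(len(train), len(test), expected_test)  # 8 4 4
```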

Next, specify the testing data as an absolute count: test_size=5.
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=5)
>>> print(namelist_train)
 
   age  attendance department      name
3   29           0         HR  Watanabe
4   41           1         HR  Yamamoto
6   42           1      Legal      Ueda
1   40           1      Legal    Tanaka
9   39           1         HR      Sato
8   33           0         IT   Hayashi
7   24           0      Legal     Inoue
 
>>> print(namelist_test)
 
    age  attendance department      name
2    55           1         IT    Yamada
10   49           1      Legal  Hirayama
5    28           1         IT     Okada
11   53           1      Legal   Shimada
0    30           1         HR    Suzuki
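An integer test_size is taken as an exact row count, and a count larger than the dataset is rejected. The following sketch illustrates both cases on a stand-in list of 12 items (an assumption; the original uses the namelist DataFrame). The variable too_large_rejected is a hypothetical name used only for this illustration.

```python
from sklearn.model_selection import train_test_split

rows = list(range(12))  # stand-in for the 12 rows of namelist

# An integer test_size is the exact number of test rows.
train, test = train_test_split(rows, test_size=5)
print(len(train), len(test))  # 7 5

# Asking for more test rows than the dataset contains raises ValueError.
try:
    train_test_split(rows, test_size=20)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True
print(too_large_rejected)  # True
```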
Next, set the training data to 0.5 (train_size=0.5).
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=None, train_size=0.5)
>>> print(namelist_train)
 
    age  attendance department      name
5    28           1         IT     Okada
2    55           1         IT    Yamada
3    29           0         HR  Watanabe
4    41           1         HR  Yamamoto
10   49           1      Legal  Hirayama
0    30           1         HR    Suzuki
 
>>> print(namelist_test)
 
    age  attendance department     name
6    42           1      Legal     Ueda
7    24           0      Legal    Inoue
9    39           1         HR     Sato
11   53           1      Legal  Shimada
8    33           0         IT  Hayashi
1    40           1      Legal   Tanaka
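train_size and test_size can also be given together. The sketch below, again on a stand-in list of 12 items (an assumption), compares giving only train_size with giving both. In the second call the two sizes do not add up to the full dataset, so some rows end up in neither part.

```python
from sklearn.model_selection import train_test_split

rows = list(range(12))  # stand-in for the 12 rows of namelist

# Only train_size given: the test set is the complement (6 and 6).
train, test = train_test_split(rows, train_size=0.5)
print(len(train), len(test))  # 6 6

# Both given: 6 training rows, 3 test rows, and the remaining 3 rows are unused.
train2, test2 = train_test_split(rows, train_size=0.5, test_size=0.25)
print(len(train2), len(test2))  # 6 3
```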
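Next are the shuffle and stratify options. The example shows shuffle=False, which keeps the rows in their original order: the first 9 rows become the training set and the last 3 the testing set.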
>>> namelist_train, namelist_test = train_test_split(namelist, shuffle=False)
>>> print(namelist_train)
 
   age  attendance department      name
0   30           1         HR    Suzuki
1   40           1      Legal    Tanaka
2   55           1         IT    Yamada
3   29           0         HR  Watanabe
4   41           1         HR  Yamamoto
5   28           1         IT     Okada
6   42           1      Legal      Ueda
7   24           0      Legal     Inoue
8   33           0         IT   Hayashi
 
>>> print(namelist_test)
 
    age  attendance department      name
9    39           1         HR      Sato
10   49           1      Legal  Hirayama
11   53           1      Legal   Shimada
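For stratify, a minimal sketch follows. It reuses only the attendance values from namelist (nine 1s and three 0s) as a plain list. Stratifying on this column is my own choice for illustration, not something the original post does.

```python
from sklearn.model_selection import train_test_split

# The attendance column of namelist: 9 ones and 3 zeros (ratio 3:1).
attendance = [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]

# stratify takes the labels whose proportions should be preserved.
train, test = train_test_split(
    attendance, test_size=4, stratify=attendance, random_state=0
)

# The 3:1 ratio of 1s to 0s is kept in both parts.
print(train.count(1), train.count(0))  # 6 2
print(test.count(1), test.count(0))    # 3 1
```

Without stratify, a random split of so few rows could easily place all three 0s in the same part.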
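  • summary
  • train_test_split(arrays, options): arrays specifies the objects to split, i.e. the dataset.
  • train_test_split(arrays, options): options specifies how to split, e.g. proportion, randomness, stratification.
  • Blogs written by Japanese authors tend to be clear and detailed. Besides Chinese blogs, Japanese blogs are also worth consulting more often.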