sklearn.model_selection

最新推荐文章于 2023-07-01 22:35:50 发布

csdn_lzw

最新推荐文章于 2023-07-01 22:35:50 发布

阅读量628

点赞数

分类专栏：数据挖掘

数据挖掘专栏收录该内容

2 篇文章 1 订阅

订阅专栏

sklearn 中文文档
http://sklearn.apachecn.org/cn/stable/modules/model_evaluation.html

一、 sklearn.model_selection.KFold

K折交叉验证：sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)

思路：将训练/测试数据集划分n_splits个互斥子集，每次用其中一个子集当作验证集，剩下的n_splits-1个作为训练集，进行n_splits次训练和测试，得到n_splits个结果

注意点：对于不能均等份的数据集，其前n_samples % n_splits子集拥有n_samples // n_splits + 1个样本，其余子集都只有n_samples // n_splits样本

参数说明：

n_splits：表示划分几等份

shuffle：在每次划分时，是否进行洗牌

①若为Falses时，其效果等同于random_state等于整数，每次划分的结果相同

②若为True时，每次划分的结果都不一样，表示经过洗牌，随机取样的

random_state：随机种子数

属性：

①get_n_splits(X=None, y=None, groups=None)：获取参数n_splits的值

②split(X, y=None, groups=None)：将数据集划分成训练集和测试集，返回索引生成器

通过一个不能均等划分的栗子，设置不同参数值，观察其结果

①设置shuffle=False，运行两次，发现两次结果相同

In [1]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=False)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
   ...:
train_index:[ 3  4  5  6  7  8  9 10 11] , test_index: [0 1 2]
train_index:[ 0  1  2  6  7  8  9 10 11] , test_index: [3 4 5]
train_index:[ 0  1  2  3  4  5  8  9 10 11] , test_index: [6 7]
train_index:[ 0  1  2  3  4  5  6  7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]
 
In [2]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=False)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
   ...:
train_index:[ 3  4  5  6  7  8  9 10 11] , test_index: [0 1 2]
train_index:[ 0  1  2  6  7  8  9 10 11] , test_index: [3 4 5]
train_index:[ 0  1  2  3  4  5  8  9 10 11] , test_index: [6 7]
train_index:[ 0  1  2  3  4  5  6  7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]

②设置shuffle=True时，运行两次，发现两次运行的结果不同

In [3]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
   ...:
train_index:[ 0  1  2  4  5  6  7  8 10] , test_index: [ 3  9 11]
train_index:[ 0  1  2  3  4  5  9 10 11] , test_index: [6 7 8]
train_index:[ 2  3  4  5  6  7  8  9 10 11] , test_index: [0 1]
train_index:[ 0  1  3  4  5  6  7  8  9 11] , test_index: [ 2 10]
train_index:[ 0  1  2  3  6  7  8  9 10 11] , test_index: [4 5]
 
In [4]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
   ...:
train_index:[ 0  1  2  3  4  5  7  8 11] , test_index: [ 6  9 10]
train_index:[ 2  3  4  5  6  8  9 10 11] , test_index: [0 1 7]
train_index:[ 0  1  3  5  6  7  8  9 10 11] , test_index: [2 4]
train_index:[ 0  1  2  3  4  6  7  9 10 11] , test_index: [5 8]
train_index:[ 0  1  2  4  5  6  7  8  9 10] , test_index: [ 3 11]

③设置shuffle=True和random_state=整数，发现每次运行的结果都相同

In [5]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
   ...:
train_index:[ 0  1  2  3  5  7  8  9 10] , test_index: [ 4  6 11]
train_index:[ 0  1  3  4  5  6  7  9 11] , test_index: [ 2  8 10]
train_index:[ 0  2  3  4  5  6  8  9 10 11] , test_index: [1 7]
train_index:[ 0  1  2  4  5  6  7  8 10 11] , test_index: [3 9]
train_index:[ 1  2  3  4  6  7  8  9 10 11] , test_index: [0 5]
 
In [6]: from sklearn.model_selection import KFold
   ...: import numpy as np
   ...: X = np.arange(24).reshape(12,2)
   ...: y = np.random.choice([1,2],12,p=[0.4,0.6])
   ...: kf = KFold(n_splits=5,shuffle=True,random_state=0)
   ...: for train_index , test_index in kf.split(X):
   ...:     print('train_index:%s , test_index: %s ' %(train_index,test_index))
   ...:
   ...:
train_index:[ 0  1  2  3  5  7  8  9 10] , test_index: [ 4  6 11]
train_index:[ 0  1  3  4  5  6  7  9 11] , test_index: [ 2  8 10]
train_index:[ 0  2  3  4  5  6  8  9 10 11] , test_index: [1 7]
train_index:[ 0  1  2  4  5  6  7  8 10 11] , test_index: [3 9]
train_index:[ 1  2  3  4  6  7  8  9 10 11] , test_index: [0 5]

④n_splits属性值获取方式

In [8]: kf.split(X)
Out[8]: <generator object _BaseKFold.split at 0x00000000047FF990>
 
In [9]: kf.get_n_splits()
Out[9]: 5
 
In [10]: kf.n_splits
Out[10]: 5

二、 sklearn.cross_validation.cross_val_score:

原文：https://blog.csdn.net/Dream_angel_Z/article/details/47048285
调用形式是scores = cross_validation.cross_val_score(clf, raw_data, raw_target, cv=5, score_func=None)

clf:表示的是不同的分类器，可以是任何的分类器。比如支持向量机分类器。clf = svm.SVC(kernel=’linear’, C=1)；
raw_data：原始数据；
raw_target:原始类别标号；
cv：代表的就是不同的cross validation的方法了。引用scikit-learn上的一句话（When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.）如果cv是一个int数字的话，那么默认使用的是KFold或者StratifiedKFold交叉，如果如果指定了类别标签则使用的是StratifiedKFold。
cross_val_score:这个函数的返回值就是对于每次不同的的划分raw_data时，在test_data上得到的分类的准确率。至于准确率的算法可以通过score_func参数指定，如果不指定的话，是用clf默认自带的准确率算法。

>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...    clf, iris.data, iris.target, cv=5)
...
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])
--------------------- 
作者：拾毅者 
来源：CSDN 
原文：https://blog.csdn.net/Dream_angel_Z/article/details/47048285 
版权声明：本文为博主原创文章，转载请附上博文链接！

通过scores.mean()求出平均值，得到平均精度。还可以通过指定scoring来设置准确率算法

>>> from sklearn import metrics
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target,
...     cv=5, scoring='f1_weighted')
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

csdn_lzw

关注

0
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
sklearn.model_selection

sklearn 中文文档http://sklearn.apachecn.org/cn/stable/modules/model_evaluation.html一、 sklearn.model_selection.KFoldK折交叉验证：sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)思路：...
复制链接

扫一扫