Sklearn methods for splitting a dataset into training and test sets, implemented in Python
K-fold cross-validation: KFold, GroupKFold, StratifiedKFold
1: Split the full training set S into k disjoint subsets; if S contains m training samples, each subset holds m/k of them: {s1, s2, …, sk}
2: In each round, take one of the subsets as the test set and the remaining k-1 as the training set
3: Train the learner on those k-1 training subsets to obtain a model
4: Apply the test set to the trained model and record its classification accuracy
5: Average the k accuracies; this mean serves as the estimate of the model's true classification accuracy
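The five steps above are essentially what sklearn's cross_val_score performs in one call; a minimal sketch (the iris data and logistic-regression model are illustrative choices, not part of the original):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# Illustrative data and model; any estimator with fit/predict works.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k=5: the data is split into 5 folds; each fold serves once as the test set.
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Step 5: the mean of the k per-fold accuracies estimates model performance.
print(scores.mean())
```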
KFold
from sklearn.model_selection import KFold
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
# n_splits=2, shuffle=False, random_state=None: the order is not shuffled, so the two
# folds are Train_index:[3 4 5], Test_index:[0 1 2] and Train_index:[0 1 2], Test_index:[3 4 5]
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(x):
    print("Train_index:", train_index, ",Test_index:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)
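By default KFold keeps the sample order; passing shuffle=True with a fixed random_state shuffles the samples before folding while keeping the result reproducible. A small sketch:

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])

# shuffle=True randomizes which samples land in each fold; random_state pins the result.
kf = KFold(n_splits=2, shuffle=True, random_state=0)
splits = [(train, test) for train, test in kf.split(x)]
for train_index, test_index in splits:
    print("Train_index:", train_index, ",Test_index:", test_index)

# Rebuilding the splitter with the same seed reproduces the identical folds.
splits2 = [(train, test)
           for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(x)]
```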
GroupKFold
from sklearn.model_selection import GroupKFold
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 2, 3, 4, 5, 6])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(x, y, groups)
print(group_kfold)
for train_index, test_index in group_kfold.split(x, y, groups):
    print("Train_index:", train_index, ",Test_index:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)
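In the example above every sample has its own group, so GroupKFold behaves much like a plain split. Its real purpose is keeping all samples that share a group in the same fold, so a group never straddles the train/test boundary. A sketch with repeated group labels (the grouping here is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 2, 2, 3, 3])  # three groups of two samples each

gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(x, y, groups):
    # Each test fold contains exactly one whole group, never a partial one.
    print("Test groups:", set(groups[test_index]))
```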
Leave-one-out family: LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut
Leave-one-out validation: given N samples, each sample in turn serves as the test set while the other N-1 samples form the training set. Looping N times yields N classifiers and N test results, and the average of those N results measures the model's performance. When N is not very large, results come quickly; in KFold, by contrast, k << N.
from sklearn.model_selection import LeaveOneOut
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
loo = LeaveOneOut()
loo.get_n_splits(x)
print(loo)
for train_index, test_index in loo.split(x):
    # Train_index:[1 2 3 4 5], Test_index:[0], then cycles through all 6 samples
    print("Train_index:", train_index, ",Test_index:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)
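LeaveOneGroupOut, also listed above, works analogously at the group level: each iteration holds out all samples of exactly one group. A sketch (the group labels are illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 2, 2, 3, 3])  # three groups, so three splits

logo = LeaveOneGroupOut()
n_splits = logo.get_n_splits(x, y, groups)
for train_index, test_index in logo.split(x, y, groups):
    # Each iteration holds out the two samples of one group as the test set.
    print("Train_index:", train_index, ",Test_index:", test_index)
```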
Leave-p-out validation (LeavePOut):
Given N samples, each round leaves out p of them as the test set and uses the remaining N-p as the training set, for C(N, p) splits in total; when p > 1 the test sets overlap.
from sklearn.model_selection import LeavePOut
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
lpo = LeavePOut(p=3)  # leave any 3 out: C(6, 3) = (6*5*4)/(3*2*1) = 20 splits
print(lpo)
for train_index, test_index in lpo.split(x, y):
    print("Train_index:", train_index, ",Test_index:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)
# With LeavePOut(p=2): Train_index: [2 3 4 5], Test_index: [0 1]
#                      Train_index: [1 3 4 5], Test_index: [0 2]  (the test sets overlap at index 0)
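The C(6, 3) = 20 count worked out in the comment can be checked directly against math.comb (a quick sanity sketch):

```python
import numpy as np
from math import comb
from sklearn.model_selection import LeavePOut

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])

lpo = LeavePOut(p=3)
n = sum(1 for _ in lpo.split(x, y))  # count the generated train/test pairs
print(n, comb(6, 3))                 # both should be 20
```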
Random split methods:
ShuffleSplit, StratifiedShuffleSplit
The ShuffleSplit iterator generates a user-specified number of independent train/test splits: it first shuffles all the samples, then carves out a train/test pair. The random_state seed controls the random number generator so that results are reproducible. ShuffleSplit is an alternative to KFold cross-validation that gives finer control over the number of iterations and the train/test proportions.
StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified splits: each split preserves the class proportions of the full dataset.
ShuffleSplit
from sklearn.model_selection import ShuffleSplit
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
ss = ShuffleSplit(n_splits=3, test_size=1, random_state=0)  # test_size can be an absolute count or a fraction
ss.get_n_splits(x)
for train_index, test_index in ss.split(x):
    print("Train_index:", train_index, ",Test_index:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)
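As the comment notes, test_size accepts either an absolute sample count or a fraction; a sketch showing both forms yielding the same test-set size (the values here are illustrative):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])

# test_size=2 -> exactly 2 test samples; test_size=1/3 of 6 samples -> also 2.
ss_count = ShuffleSplit(n_splits=3, test_size=2, random_state=0)
ss_frac = ShuffleSplit(n_splits=3, test_size=1/3, random_state=0)

sizes_count = [len(test) for _, test in ss_count.split(x)]
sizes_frac = [len(test) for _, test in ss_frac.split(x)]
print(sizes_count, sizes_frac)
```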
StratifiedShuffleSplit
# StratifiedShuffleSplit shuffles the data before splitting it into training and test
# sets, and keeps the class proportions in each split the same as in the full dataset.
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 1, 3, 2])
"""If any class in y had only a single member, this would raise
ValueError: The least populated class in y has only 1 member, which is too few.
The minimum number of groups for any class cannot be less than 2.
"""
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
sss.get_n_splits(x, y)
print(sss)
for train_index, test_index in sss.split(x, y):
    print("Train_index:", train_index, ",Test_index:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(x_train, x_test, y_train, y_test)
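The stratification guarantee can be checked with collections.Counter: with test_size=0.5 and two samples per class, each test set should receive exactly one sample of every class (a minimal sketch):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 1, 3, 2])  # classes 1, 2, 3 with two samples each

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(x, y):
    # With test_size=0.5, each class contributes exactly one of its two samples.
    print(Counter(y[test_index]))
```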