The difference between StratifiedKFold and KFold in scikit-learn, and the split method
Test code:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold

# Build an 8 x 5 array whose rows are easy to tell apart by value
test_array = np.zeros((8, 5))
for idx in range(test_array.shape[0]):
    test_array[idx] = np.arange((idx + 1) * 10 + 1, (idx + 1) * 10 + 6)
print(test_array)
print(f'{"-" * 30}')

# All samples get the same label; StratifiedKFold stratifies on this y
test_array_lab = np.ones((8,))

# shuffle=True without random_state, so the exact splits change every run
skf = StratifiedKFold(n_splits=4, shuffle=True)
kf = KFold(n_splits=4, shuffle=True)

print('StratifiedKFold result')
for train, val in skf.split(test_array, test_array_lab):
    print(f'train idx: {train}, val idx: {val}')
print(f'{"-" * 30}')

print('KFold result')
for train, val in kf.split(test_array, test_array_lab):
    print(f'train idx: {train}, val idx: {val}')
This runs a 4-fold split over the 8 samples and prints the training-set and validation-set indices for each fold.
Output (the exact indices vary between runs, because shuffle=True is used without a fixed random_state):
test_array:
[[11. 12. 13. 14. 15.]
[21. 22. 23. 24. 25.]
[31. 32. 33. 34. 35.]
[41. 42. 43. 44. 45.]
[51. 52. 53. 54. 55.]
[61. 62. 63. 64. 65.]
[71. 72. 73. 74. 75.]
[81. 82. 83. 84. 85.]]
------------------------------
StratifiedKFold result:
train idx: [2 3 4 5 6 7], val idx: [0 1]
train idx: [0 1 2 3 4 7], val idx: [5 6]
train idx: [0 1 2 5 6 7], val idx: [3 4]
train idx: [0 1 3 4 5 6], val idx: [2 7]
------------------------------
KFold result:
train idx: [0 1 2 4 5 7], val idx: [3 6]
train idx: [0 1 3 5 6 7], val idx: [2 4]
train idx: [0 1 2 3 4 6], val idx: [5 7]
train idx: [2 3 4 5 6 7], val idx: [0 1]
StratifiedKFold: as the official documentation describes, it differs from KFold in that each train/validation split preserves the percentage of samples of each class. Because the class distribution inside every fold mirrors the overall distribution, the validation scores reflect model performance more faithfully, especially on imbalanced data. (In the test code above all labels are identical, so the two splitters behave essentially the same; the sketch below uses imbalanced labels to make the difference visible.)
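A minimal sketch, assuming an imbalanced label vector of six 0s and two 1s (made up purely for illustration, not part of the original test code): with n_splits=2, StratifiedKFold puts exactly one class-1 sample in each validation fold, while plain KFold ignores y and may not:

import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold

X = np.arange(16).reshape(8, 2)           # 8 samples, 2 features (illustrative data)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])    # imbalanced labels: six 0s, two 1s

# StratifiedKFold keeps the 3:1 class ratio inside every fold,
# so each validation fold contains exactly one class-1 sample here
for train_idx, val_idx in StratifiedKFold(n_splits=2, shuffle=True, random_state=0).split(X, y):
    print('stratified val labels:', y[val_idx])

# KFold ignores y entirely; a validation fold may get both class-1
# samples or none of them, depending on the shuffle
for train_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X, y):
    print('plain kfold val labels:', y[val_idx])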
KFold: as the official documentation describes, KFold splits the dataset into k folds of (nearly) equal size without looking at the labels at all; each fold is used once as the validation set while the remaining k-1 folds form the training set. With the default shuffle=False the folds are consecutive blocks of indices; shuffle=True randomizes which samples land in which fold.
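A minimal sketch (data made up for illustration) of the default shuffle=False behaviour, where the validation folds come out as consecutive index blocks:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(8, 1)   # 8 samples, 1 feature

# With the default shuffle=False, the validation folds are the
# consecutive blocks [0 1], [2 3], [4 5], [6 7]
for train_idx, val_idx in KFold(n_splits=4).split(X):
    print('val idx:', val_idx)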
The split method:
split does not return the data itself; for each fold it yields a pair of integer index arrays, (train_index, test_index), which you then use to slice your feature and label arrays.
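A minimal sketch of the usual pattern (the names X and y and the array shapes are just illustrative), slicing the fold data out with the returned indices:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(8, 5)   # 8 samples, 5 features
y = np.ones(8)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X, y):
    # Slice the actual fold data out with the returned index arrays
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    print('train shape:', X_train.shape, 'val shape:', X_val.shape)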