First, let's generate a training set:
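import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
# 100 image filenames, each with a random label in 0..4
# (no seed is set, so the label counts differ from run to run)
filename_label = {'filename': [str(i) + '.jpg' for i in range(100)], 'label': [np.random.randint(0, 5) for i in range(100)]}
train = pd.DataFrame(filename_label)
# Show how many samples each class has
print(train['label'].value_counts())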
'''
2 23
1 23
0 20
4 18
3 16
Name: label, dtype: int64
'''
train.head(10)
Next, use sklearn.model_selection.StratifiedKFold to split this dataset into 2K CSV files, i.e. K training sets plus K test sets:
n_splits = 5  # K
x = train['filename'].values
y = train['label'].values
# shuffle=True with a fixed random_state makes the folds reproducible
skf = StratifiedKFold(n_splits=n_splits, random_state=42, shuffle=True)
for index, (train_index, test_index) in enumerate(skf.split(x, y), start=1):
    # Rows selected for training in this fold
    res_train = pd.DataFrame()
    res_train['filename'] = train['filename'].iloc[train_index]
    res_train['label'] = train['label'].iloc[train_index]
    res_train.to_csv("train_{}.csv".format(index), index=False)
    # Rows held out for testing in this fold
    res_test = pd.DataFrame()
    res_test['filename'] = train['filename'].iloc[test_index]
    res_test['label'] = train['label'].iloc[test_index]
    res_test.to_csv("test_{}.csv".format(index), index=False)
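Because this is 5-fold cross-validation, each fold holds out one fifth of the rows for testing, so the ratio of training rows to test rows is 4:1 (about 80 rows versus 20 for the 100 samples here).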
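To check the split without writing any files, here is a minimal sketch that recomputes the folds in memory and inspects two properties: the size of each training/test part, and how closely each test fold's per-class counts follow the overall class distribution. The seeded generator `rng` and the names `fold_sizes` and `max_count_gap` are assumptions introduced only for this illustration; they are not part of the code above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for the `train` table; the seed is an assumption
# added only so that this sketch is reproducible.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "filename": [f"{i}.jpg" for i in range(100)],
    "label": rng.integers(0, 5, size=100),
})

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Ideal number of test samples per class in each fold
expected = train["label"].value_counts() / n_splits

fold_sizes = []      # (training rows, test rows) for each fold
max_count_gap = 0.0  # largest deviation from the ideal per-class count

for train_index, test_index in skf.split(train["filename"], train["label"]):
    fold_sizes.append((len(train_index), len(test_index)))
    # Per-class counts inside this test fold
    test_counts = train["label"].iloc[test_index].value_counts()
    test_counts = test_counts.reindex(expected.index, fill_value=0)
    gap = (test_counts - expected).abs().max()
    max_count_gap = max(max_count_gap, float(gap))

print(fold_sizes)
print(max_count_gap)
```

If stratification works as intended, every pair in `fold_sizes` should add up to 100 with the test part close to 20, and `max_count_gap` should stay below 1, meaning no class is over- or under-represented by a full sample in any test fold.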