Before running K-fold cross-validation, the data has to be split into folds. These notes record several ways of doing that split.
0. Example csv
We use an example.csv as the running example, where image_name is the feature, patient_id is the group, and target is the label.
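The original file contents are not shown here, so the following is only a minimal sketch of how a file with these three columns could be built. The image names, patient ids, and target values are all hypothetical toy data, chosen so that each class has enough samples for a 3-fold stratified split.

```python
import pandas as pd

# Hypothetical toy data: 12 images from 4 patients (3 images each).
# The values are made up for illustration; only the column names
# (image_name, patient_id, target) come from the notes.
n_images = 12
example_df = pd.DataFrame({
    'image_name': [f'img_{i:03d}' for i in range(n_images)],
    'patient_id': [f'patient_{i // 3}' for i in range(n_images)],
    'target': [0, 0, 1] * 4,  # 8 negatives, 4 positives
})

# Write without the index so that pd.read_csv('example.csv')
# returns exactly these three columns.
example_df.to_csv('example.csv', index=False)
```

Writing with index=False keeps the file free of an extra unnamed index column when it is read back with a plain pd.read_csv call.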
1. KFold
Too simple to be worth writing up.
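For completeness, here is a minimal sketch of what a plain KFold assignment might look like, following the same pattern as the StratifiedKFold section. The in-memory DataFrame is a hypothetical stand-in for example.csv, and the column name 'KFold' is an assumption made for this example.

```python
import pandas as pd
from sklearn.model_selection import KFold

k = 3  # split into 3 folds

# Hypothetical stand-in for example.csv (toy values, default RangeIndex).
df = pd.DataFrame({
    'image_name': [f'img_{i:03d}' for i in range(12)],
    'patient_id': [f'patient_{i // 3}' for i in range(12)],
    'target': [0, 0, 1] * 4,
})

# 'KFold' is an assumed column name; -1 marks rows not yet assigned.
df['KFold'] = -1
kf = KFold(n_splits=k, shuffle=True, random_state=2020)
for fold, (train_ids, valid_ids) in enumerate(kf.split(df), start=1):
    # valid_ids are positional indices; they match the labels here
    # because df has a default RangeIndex.
    df.loc[valid_ids, 'KFold'] = fold
```

KFold ignores both the labels and the groups, so every row lands in exactly one validation fold but class balance across folds is not guaranteed.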
2. StratifiedKFold
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

k = 3  # split into 3 folds
df = pd.read_csv('example.csv')
# Append an empty column that will hold each row's validation fold number
df.insert(len(df.columns), 'StratifiedKFold', np.nan)
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=2020)
# Stratification only depends on y, so X can be a dummy array of the right length
for fold, (train_ids, valid_ids) in enumerate(
        skf.split(X=np.zeros(len(df)), y=df['target']), start=1):
    df.loc[valid_ids, 'StratifiedKFold'] = fold
# Save (the index is written as the first column)
df.to_csv('example_skf.csv')
# Check the label counts of the training and validation sets when fold == 1 is used as the validation set
df = pd.read_csv('example_skf.csv', index_col=0)
train_df = df[df['StratifiedKFold'] != 1]
valid_df = df[df['Stratified