Dataset Splitting Methodology: Training, Validation, and Test Sets Explained

六月五日

Published 2025-02-23 12:49:44


Article link: https://blog.csdn.net/2401_86968005/article/details/145808162


Core Concept Definitions

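1. Training Set

Purpose: fitting the model's parameters
Share of data: typically 60-80% (I usually use 80% training, 10% validation, 10% test)
Key property: the largest subset; it directly determines the model weights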

2. Validation Set

Purpose: hyperparameter tuning and model selection
Share of data: 10-20%
Note: in cross-validation, the validation set changes from fold to fold

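3. Test Set

Purpose: final evaluation of model performance
Share of data: 10-20%
Rule: use it only once and never let it take part in training. Training on the test set is like studying from the answer key!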
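As a minimal sketch of the 80/10/10 ratio mentioned above, assuming scikit-learn and a toy dataset of 1,000 samples (the arrays and the `random_state` value are illustrative and not from the original):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 1,000 samples, one feature, binary labels
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Step 1: hold out 20% of the data
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Step 2: split the held-out 20% evenly into validation and test (10% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

The test portion should then be set aside untouched until the final evaluation.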

Seven Splitting Methods in Detail

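Method 1: Simple Split (Hold-Out)

from sklearn.model_selection import train_test_split

# Basic 60/20/20 split: hold out 40%, then halve it into validation and test.
# random_state is fixed only to make the split reproducible.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)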

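Method 2: Stratified Split

# Keep the class distribution the same in every subset
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=0.3)
# Stratify the second split too, so validation and test keep the same class ratios
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, stratify=y_temp, test_size=0.5)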
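To see what stratification buys you, here is a small check on an imbalanced toy label vector. The data and `random_state` are assumptions for illustration; only `train_test_split` and its `stratify` argument come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed): 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Both subsets keep the original 10% share of the minority class
train_minority = int(y_train.sum())  # minority samples among the 70 training samples
temp_minority = int(y_temp.sum())    # minority samples among the 30 held-out samples
print(train_minority / len(y_train), temp_minority / len(y_temp))
```

Without `stratify=y`, a random 30-sample hold-out could easily end up with very few or no minority samples.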

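Method 3: K-Fold Cross-Validation (K-Fold CV)

from sklearn.model_selection import KFold

# X and y are assumed to be NumPy arrays, so they can be indexed with integer index arrays
kf = KFold(n_splits=5)
for train_index, val_index in kf.split(X):
    # Each fold serves as the validation set exactly once
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]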
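The defining property of K-fold is that every sample is used for validation exactly once. A quick sketch that counts this on toy data (the 10-sample array, `shuffle=True`, and `random_state` are assumptions added for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # toy data (assumed): 10 samples
val_counts = np.zeros(len(X), dtype=int)

# shuffle=True guards against ordered data; random_state makes the folds reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, val_index in kf.split(X):
    # Training and validation indices never overlap within a fold
    assert set(train_index).isdisjoint(val_index)
    val_counts[val_index] += 1

print(val_counts)  # every entry is 1
```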

Method 4: Time-Based Split

# Split by timestamp; df['date'] is assumed to be a datetime64 column
train_end = '2023-06-30'
val_end = '2023-09-30'

train = df[df['date'] <= train_end]
val = df[(df['date'] > train_end) & (df['date'] <= val_end)]
test = df[df['date'] > val_end]
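A runnable version of the same cut-off logic on a toy frame, assuming pandas. The monthly records and the `value` column are made up for illustration; the cut-off dates are the ones used in the text.

```python
import pandas as pd

# Toy frame (assumed): one record on the first day of each month of 2023
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', '2023-12-01', freq='MS'),
    'value': range(12),
})

train_end = pd.Timestamp('2023-06-30')
val_end = pd.Timestamp('2023-09-30')

train = df[df['date'] <= train_end]
val = df[(df['date'] > train_end) & (df['date'] <= val_end)]
test = df[df['date'] > val_end]

print(len(train), len(val), len(test))  # 6 3 3
```

Because the three masks are mutually exclusive and ordered in time, no future record can leak into an earlier subset.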

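Method 5: Leave-One-Out

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    # Each iteration holds out exactly one sample as the test set
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]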
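Leave-one-out produces one split per sample, so a dataset of N samples means N model fits. A tiny sketch on an assumed 6-sample toy array makes the cost visible:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # toy data (assumed): 6 samples

loo = LeaveOneOut()
test_sizes = [len(test_index) for _, test_index in loo.split(X)]

print(loo.get_n_splits(X), test_sizes)  # 6 [1, 1, 1, 1, 1, 1]
```

For this reason it is usually only practical on small datasets.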

Method 6: Group Split

from sklearn.model_selection import GroupShuffleSplit

# All samples sharing a group ID (here, a patient) land on the same side of the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.2)
for train_idx, test_idx in gss.split(X, groups=patient_ids):
    X_train, X_test = X[train_idx], X[test_idx]
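The point of a group split is that no group appears on both sides. A minimal check, assuming 10 hypothetical patients with 5 samples each (the data and `random_state` are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (assumed): 10 patients with 5 samples each
patient_ids = np.repeat(np.arange(10), 5)
X = np.arange(len(patient_ids)).reshape(-1, 1)

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=patient_ids))

train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])
print(len(train_patients), len(test_patients))  # 8 2
```

Note that `test_size` here is a fraction of the groups, not of the samples, so the sample-level ratio can drift when group sizes are uneven.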

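Method 7: Nested Cross-Validation

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Illustrative estimator and grid (assumed values); substitute your own
svm = SVC()
params = {'C': [0.1, 1, 10]}

inner_cv = KFold(n_splits=3)  # inner loop: hyperparameter search
outer_cv = KFold(n_splits=5)  # outer loop: performance estimate

clf = GridSearchCV(estimator=svm, param_grid=params, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)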

Mathematical Principles and Evaluation Metrics

Data Leakage Detection Formula

$\text{Leakage Score} = \frac{|\text{Test Performance} - \text{Validation Performance}|}{\max(\text{Validation Performance}, \text{Test Performance})}$
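The formula translates directly into a few lines of Python. This is a sketch under two assumptions that the text does not state: both scores are non-negative "higher is better" metrics such as accuracy, and a zero denominator is treated as a zero gap. The function name and the example scores are hypothetical.

```python
def leakage_score(val_perf: float, test_perf: float) -> float:
    """Relative gap between validation and test performance, per the formula above."""
    denom = max(val_perf, test_perf)
    if denom == 0:
        return 0.0  # both scores are zero, so there is no gap (assumed convention)
    return abs(test_perf - val_perf) / denom

# Hypothetical scores: validation accuracy 0.90, test accuracy 0.81
score = leakage_score(0.90, 0.81)
print(f"{score:.2f}")  # 0.10
```

A score near 0 means validation and test agree; a large score is a hint, not proof, that the validation set leaked into training or tuning.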

Optimal Split Ratio Formula

$R_{\text{opt}} = \arg\min_R \left[ \lambda_1 \text{Variance}(R) + \lambda_2 \text{Bias}(R) \right]$
where:

$\lambda_1$, $\lambda_2$ are trade-off coefficients
$R$ denotes the combination of split ratios

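Code Implementation Examples

Advanced Time Series Validation

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # Check temporal order: every training sample precedes every test sample.
    # timestamps is assumed to be a NumPy array aligned row-for-row with X and sorted in time.
    assert timestamps[train_index].max() < timestamps[test_index].min()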
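Printing the indices on a small toy array shows the expanding-window pattern that `TimeSeriesSplit` produces. The 8-sample array and `n_splits=3` are assumptions chosen to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # toy data (assumed): 8 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in splits:
    print(train, test)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```

Each training window ends right before its test window begins, so the model never sees data from the future.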

Multimodal Data Splitting

# Split text and image data together so that paired samples stay aligned
(text_train, text_temp,
 img_train, img_temp) = train_test_split(
    text_data, image_data, test_size=0.3)

text_val, text_test, img_val, img_test = train_test_split(
    text_temp, img_temp, test_size=0.5)
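Passing both modalities to one `train_test_split` call applies the same shuffle to each, which is what keeps the pairs together. A sketch that verifies this on toy data follows; the captions, image ids, and the `aligned` helper are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy paired data (assumed): caption i belongs to image id i
text_data = [f"caption {i}" for i in range(20)]
image_data = np.arange(20)

text_train, text_temp, img_train, img_temp = train_test_split(
    text_data, image_data, test_size=0.3, random_state=0)
text_val, text_test, img_val, img_test = train_test_split(
    text_temp, img_temp, test_size=0.5, random_state=0)

def aligned(texts, imgs):
    """True if every caption still sits next to its own image id."""
    return all(t == f"caption {i}" for t, i in zip(texts, imgs))

print(aligned(text_train, img_train),
      aligned(text_val, img_val),
      aligned(text_test, img_test))
```

Splitting the two modalities in separate calls, even with the same ratio, would break this pairing unless the same `random_state` were used for both.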