Dataset Splitting Methodology: Training, Validation, and Test Sets Explained

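Core Concepts

1. Training Set

  • Purpose: fitting the model's parameters
  • Share of data: typically 60-80% (I usually go with 80% training, 10% validation, 10% test)
  • Key property: the largest subset, and the only one that directly shapes the model weights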

2. Validation Set

  • Purpose: hyperparameter tuning and model selection
  • Share of data: 10-20%
  • Note: in cross-validation, the validation set changes from fold to fold

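3. Test Set

  • Purpose: final evaluation of model performance
  • Share of data: 10-20%
  • Rule: use it only once and never let it take part in training. Training on the test set == learning straight from the answer key!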

Seven Splitting Methods in Detail

Method 1: Simple Split (Hold-Out)

from sklearn.model_selection import train_test_split

# Basic 6:2:2 split: hold out 40% first, then cut that 40% in half
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
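To make the mechanics of a hold-out split concrete, here is a minimal dependency-free sketch of the same idea using the 80/10/10 ratio mentioned earlier. The helper name `split_indices` and its parameters are illustrative assumptions, not part of scikit-learn; in practice `train_test_split` does this work for you.

```python
import random

def split_indices(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test.

    Hypothetical helper for illustration only.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(n_samples * ratios[0])
    n_val = int(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint by construction, which is the property that matters: no sample can end up in more than one subset.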

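Method 2: Stratified Split

# Keep the class distribution the same in every subset
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=0.3)
# Stratify the second split too, so validation and test match as well
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, stratify=y_temp, test_size=0.5)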
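A quick way to see what `stratify` buys you is to check class rates after the split. The sketch below uses made-up imbalanced data (90 negatives, 10 positives) and a fixed `random_state`; both are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (illustrative only): 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Fraction of positives in each subset; both should stay close to the
# 10% rate of the full data set
train_pos_rate = y_train.mean()
temp_pos_rate = y_temp.mean()
print(train_pos_rate, temp_pos_rate)
```

Without `stratify=y`, a small held-out set drawn from data this imbalanced could easily contain very few positives, or none at all.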

Method 3: K-Fold Cross-Validation (K-Fold CV)

from sklearn.model_selection import KFold

# X and y are assumed to be NumPy arrays so they can be indexed with index arrays
kf = KFold(n_splits=5)
for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
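The defining property of K-fold is that every sample is used for validation exactly once across the K folds. The following sketch checks that on a tiny made-up array; `shuffle=True` and `random_state=0` are choices made here, not something the original snippet sets.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples with two features each (illustrative data)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_counts = np.zeros(len(X), dtype=int)  # how often each sample is validated
fold_sizes = []
for train_index, val_index in kf.split(X):
    val_counts[val_index] += 1
    fold_sizes.append((len(train_index), len(val_index)))

print(val_counts)
print(fold_sizes)
```

Each fold trains on 8 samples and validates on the remaining 2, and `val_counts` ends up all ones.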

Method 4: Time-Based Split

# Split by timestamp (assumes df['date'] holds datetimes or ISO-8601 date strings)
train_end = '2023-06-30'
val_end = '2023-09-30'

train = df[df['date'] <= train_end]
val = df[(df['date'] > train_end) & (df['date'] <= val_end)]
test = df[df['date'] > val_end]
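Here is the same cutoff logic run end to end on a made-up daily series covering 2023. The DataFrame contents and the `value` column are assumptions for illustration; the cutoff dates are the ones used above.

```python
import pandas as pd

# One row per day of 2023 (illustrative data)
df = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-12-31", freq="D")})
df["value"] = range(len(df))

train_end = pd.Timestamp("2023-06-30")
val_end = pd.Timestamp("2023-09-30")

train = df[df["date"] <= train_end]
val = df[(df["date"] > train_end) & (df["date"] <= val_end)]
test = df[df["date"] > val_end]

print(len(train), len(val), len(test))
```

Because the three masks partition the date axis, every validation row is strictly later than every training row, and every test row is strictly later than every validation row.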

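Method 5: Leave-One-Out

from sklearn.model_selection import LeaveOneOut

# One split per sample: train on n-1 samples, test on the one left out
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]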

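Method 6: Group Split

from sklearn.model_selection import GroupShuffleSplit

# patient_ids holds one group label per sample; a group never straddles both sides
gss = GroupShuffleSplit(n_splits=1, test_size=0.2)
for train_idx, test_idx in gss.split(X, groups=patient_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]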
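The point of a group split is that no group (here, no patient) appears on both sides. A small check on made-up data, assuming 10 patients with 2 records each:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 20 toy records: 10 patients with 2 records each (illustrative data)
X = np.arange(40).reshape(20, 2)
y = np.tile([0, 1], 10)
patient_ids = np.repeat(np.arange(10), 2)

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=patient_ids))

train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])
print(sorted(test_patients))
```

Note that `test_size=0.2` applies to groups rather than rows, so 2 of the 10 patients (4 records here) land in the test set.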

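Method 7: Nested Cross-Validation

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Inner loop tunes hyperparameters; outer loop estimates generalization performance
inner_cv = KFold(n_splits=3)
outer_cv = KFold(n_splits=5)

svm = SVC()
params = {'C': [0.1, 1, 10]}  # example grid (an assumption; not given in the original)

clf = GridSearchCV(estimator=svm, param_grid=params, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)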

Mathematical Principles and Evaluation Metrics

Data Leakage Detection Formula

$$\text{Leakage Score} = \frac{|\text{Test Performance} - \text{Validation Performance}|}{\max(\text{Validation Performance}, \text{Test Performance})}$$
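The leakage score is just a relative gap between two numbers, so it is easy to compute by hand. A minimal sketch, assuming both metrics are positive and higher means better (accuracy, for example); the function name and the sample values are hypothetical, and the source gives no threshold for what counts as "too large":

```python
def leakage_score(val_perf, test_perf):
    """Relative gap between validation and test performance.

    Hypothetical helper implementing the formula above; assumes both
    metrics are positive (e.g. accuracy or F1).
    """
    denom = max(val_perf, test_perf)
    if denom <= 0:
        raise ValueError("performance values must be positive")
    return abs(test_perf - val_perf) / denom

# Example values are made up: 92% validation accuracy vs 80% test accuracy
score = leakage_score(val_perf=0.92, test_perf=0.80)
print(round(score, 3))
```

A score near zero means validation and test agree. A large score means something differs between the two evaluations, for example leakage into the validation data or over-tuning against it, and is a cue to inspect the pipeline rather than a proof of leakage.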

Optimal Split Ratio Formula

$$R_{opt} = \arg\min_R \left[ \lambda_1 \text{Variance}(R) + \lambda_2 \text{Bias}(R) \right]$$
where:

  • $\lambda_1$, $\lambda_2$ are the trade-off coefficients
  • $R$ denotes the combination of split ratios

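Code Examples

Advanced Time-Series Validation

from sklearn.model_selection import TimeSeriesSplit

# Assumes X is a NumPy structured array with a 'timestamp' field, sorted by time
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # Check temporal order: training data must end before test data begins
    assert X_train[-1]['timestamp'] < X_test[0]['timestamp']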
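To see how `TimeSeriesSplit` lays out its folds, it helps to print the sizes on a tiny array. The 12-row array below is made up for illustration, with the row index standing in for time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 toy observations, already in chronological order (illustrative data)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = []
for train_index, test_index in tscv.split(X):
    # Every training index precedes every test index
    assert train_index.max() < test_index.min()
    folds.append((len(train_index), len(test_index)))

print(folds)
```

The training window grows from fold to fold while the test window stays the same size and always sits right after it, so the model is never evaluated on data older than what it was trained on.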

Splitting Multimodal Data

# Split text and image data together so paired samples stay aligned
(text_train, text_temp, 
 img_train, img_temp) = train_test_split(
    text_data, image_data, test_size=0.3)

text_val, text_test, img_val, img_test = train_test_split(
    text_temp, img_temp, test_size=0.5)
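Passing several arrays to one `train_test_split` call applies the same shuffle to all of them, which is what keeps each text paired with its image. The sketch below verifies this on made-up captions and file names; the data and the helper `pairs_aligned` are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy paired data: caption i belongs with img_i.png (illustrative only)
text_data = [f"caption {i}" for i in range(20)]
image_data = [f"img_{i}.png" for i in range(20)]

text_train, text_temp, img_train, img_temp = train_test_split(
    text_data, image_data, test_size=0.3, random_state=0)
text_val, text_test, img_val, img_test = train_test_split(
    text_temp, img_temp, test_size=0.5, random_state=0)

def pairs_aligned(texts, images):
    """Hypothetical check: each caption's id matches its image's id."""
    return all(t.split()[-1] == i[len("img_"):-len(".png")]
               for t, i in zip(texts, images))

print(pairs_aligned(text_train, img_train),
      pairs_aligned(text_val, img_val),
      pairs_aligned(text_test, img_test))
```

Splitting the two modalities in separate calls, even with the same `test_size`, would only stay aligned if both calls used the same `random_state`, so a single joint call is the safer habit.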
