Dataset Splitting Methodology: Training, Validation, and Test Sets Explained
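Core Concepts
1. Training Set
- Purpose: fitting the model's parameters
- Share of data: typically 60-80% (I usually use 80% training, 10% validation, 10% test)
- Key property: the largest subset; it directly determines the model weights
2. Validation Set
- Purpose: hyperparameter tuning and model selection
- Share of data: 10-20%
- Note: in cross-validation, the validation fold changes from one iteration to the next
3. Test Set
- Purpose: final evaluation of model performance
- Share of data: 10-20%
- Rule: use it only once and never let it take part in training. Training on the test set amounts to fitting to the answers!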
Seven Splitting Methods in Detail
Method 1: Simple Split (Hold-Out)
from sklearn.model_selection import train_test_split
# Basic 6:2:2 split: hold out 40%, then halve it into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
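The hold-out snippet above leaves `X` and `y` undefined. A minimal self-contained sketch on synthetic data makes the resulting sizes easy to check; the arrays and the `random_state` values here are illustrative assumptions, not part of the original snippet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 100 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# First cut: 60% train, 40% held out.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42)
# Second cut: split the held-out 40% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across runs.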
Method 2: Stratified Split
from sklearn.model_selection import train_test_split
# Keep the class distribution the same in both subsets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=0.3)
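To see what `stratify=y` buys you, the following sketch splits a deliberately imbalanced label vector and compares the positive rate in each subset. The data (90 negatives, 10 positives) and the `random_state` value are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): 90 negatives, 10 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Both subsets should keep roughly the original 10% positive rate.
print("train positive rate:", y_train.mean())
print("held-out positive rate:", y_temp.mean())
```

Without `stratify`, a small held-out set could by chance contain very few positives or none, which makes validation metrics unreliable.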
Method 3: K-Fold Cross-Validation (K-Fold CV)
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
# Each fold serves as the validation set exactly once (X and y are NumPy arrays)
for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
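The loop above only produces the index splits. A minimal sketch of how the folds are typically used, fitting a fresh model per fold and averaging the validation scores, might look like this. The synthetic data, the choice of `LogisticRegression`, and the `shuffle`/`random_state` settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, val_index in kf.split(X):
    # Fit a fresh model on the training folds only.
    model = LogisticRegression()
    model.fit(X[train_index], y[train_index])
    # Score on the fold that was held out this round.
    scores.append(model.score(X[val_index], y[val_index]))

print("mean validation accuracy:", np.mean(scores))
```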
Method 4: Time-Based Split
# Split by timestamp so that validation and test data come strictly after the training data
train_end = '2023-06-30'
val_end = '2023-09-30'
train = df[df['date'] <= train_end]
val = df[(df['date'] > train_end) & (df['date'] <= val_end)]
test = df[df['date'] > val_end]
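The time-based split above assumes a DataFrame `df` with a `date` column. A self-contained sketch with synthetic daily data for 2023 (an assumption for illustration) shows the same cut-offs and checks that the three parts do not overlap in time.

```python
import pandas as pd

# Synthetic daily data for 2023 (assumption).
df = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-12-31", freq="D")})
df["value"] = range(len(df))

train_end = pd.Timestamp("2023-06-30")
val_end = pd.Timestamp("2023-09-30")

train = df[df["date"] <= train_end]
val = df[(df["date"] > train_end) & (df["date"] <= val_end)]
test = df[df["date"] > val_end]

# The three parts must cover all rows and be strictly ordered in time.
assert len(train) + len(val) + len(test) == len(df)
assert train["date"].max() < val["date"].min()
assert val["date"].max() < test["date"].min()
print(len(train), len(val), len(test))
```

Converting the cut-offs to `pd.Timestamp` avoids relying on implicit string-to-date comparison.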
Method 5: Leave-One-Out
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
# Each iteration holds out exactly one sample for testing
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
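Leave-one-out produces as many folds as there are samples, so it is usually combined with a scoring helper rather than a hand-written loop. The sketch below uses `cross_val_score` with `LeaveOneOut`; the synthetic data and the `LogisticRegression` model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set (assumption); LOO fits one model per sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] > 0).astype(int)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), X, y, cv=loo)

# Each fold scores a single sample, so every entry is 0.0 or 1.0.
print(len(scores), scores.mean())
```

Because the cost grows linearly with the number of samples, this is mostly practical for small data sets.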
Method 6: Group Split
from sklearn.model_selection import GroupShuffleSplit
gss = GroupShuffleSplit(n_splits=1, test_size=0.2)
# All samples from one patient land on the same side of the split
for train_idx, test_idx in gss.split(X, groups=patient_ids):
    X_train, X_test = X[train_idx], X[test_idx]
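The point of a group split is that no group appears on both sides. The sketch below verifies that on made-up data: 20 samples from 5 hypothetical patients, 4 samples each. The data and `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 20 samples from 5 hypothetical patients, 4 samples each (assumption).
X = np.arange(40).reshape(20, 2)
patient_ids = np.repeat(np.arange(5), 4)

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=patient_ids))

train_groups = set(patient_ids[train_idx])
test_groups = set(patient_ids[test_idx])

# No patient may contribute samples to both train and test.
assert train_groups.isdisjoint(test_groups)
print("test patients:", sorted(test_groups))
```

Note that `test_size` here is a fraction of the groups, not of the samples.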
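Method 7: Nested Cross-Validation
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
# svm (an estimator) and params (its parameter grid) are assumed to be defined elsewhere
inner_cv = KFold(n_splits=3)  # inner loop: hyperparameter search
outer_cv = KFold(n_splits=5)  # outer loop: performance estimate
clf = GridSearchCV(estimator=svm, param_grid=params, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)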
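The nested cross-validation snippet above depends on an estimator `svm` and a grid `params` that are not shown. A runnable sketch under assumed choices follows: the iris data set, an `SVC` estimator, and a small hypothetical parameter grid, none of which come from the original.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical search space, chosen only for illustration.
params = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
svm = SVC()

# Iris is ordered by class, so shuffle before splitting.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop picks hyperparameters; outer loop scores the whole procedure.
clf = GridSearchCV(estimator=svm, param_grid=params, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)
print(nested_score.mean())
```

The outer score estimates how well "tune, then fit" generalizes, since each outer test fold is never seen by the inner search.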
Mathematical Principles and Evaluation Metrics
Data Leakage Detection Formula
\[ \text{Leakage Score} = \frac{|\text{Test Performance} - \text{Validation Performance}|}{\max(\text{Validation Performance}, \text{Test Performance})} \]
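The leakage score is a relative gap between validation and test performance, so it translates directly into a few lines of code. The function name `leakage_score` and the example metric values are hypothetical; the sketch also assumes a non-negative, higher-is-better metric such as accuracy.

```python
def leakage_score(validation_perf: float, test_perf: float) -> float:
    """Relative gap between validation and test performance."""
    denom = max(validation_perf, test_perf)
    if denom <= 0:
        # The ratio is undefined when both scores are zero or negative.
        raise ValueError("performance values must include a positive score")
    return abs(test_perf - validation_perf) / denom


# Made-up example: validation accuracy 0.92, test accuracy 0.80.
print(round(leakage_score(0.92, 0.80), 3))
```

A score near 0 means the two sets agree; a large score suggests the validation estimate is too optimistic, possibly because of leakage. The original gives no threshold, so any cut-off is a judgment call.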
Optimal Split Ratio Formula
\[ R_{opt} = \arg\min_R \left[ \lambda_1 \text{Variance}(R) + \lambda_2 \text{Bias}(R) \right] \]
where:
- \(\lambda_1\), \(\lambda_2\) are trade-off coefficients
- \(R\) denotes the combination of split ratios
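Over a finite set of candidate ratios, the argmin can be evaluated by brute force. The original does not say how Variance(R) and Bias(R) are estimated, so the sketch below takes them as user-supplied functions. The helper `best_ratio` and the two toy stand-ins are hypothetical and only illustrate the mechanics.

```python
from typing import Callable, Iterable, Tuple

Ratio = Tuple[float, float, float]  # (train, validation, test) fractions


def best_ratio(candidates: Iterable[Ratio],
               variance: Callable[[Ratio], float],
               bias: Callable[[Ratio], float],
               lam1: float = 1.0,
               lam2: float = 1.0) -> Ratio:
    """Return the candidate minimizing lam1 * variance(R) + lam2 * bias(R)."""
    return min(candidates, key=lambda r: lam1 * variance(r) + lam2 * bias(r))


# Toy stand-ins (assumptions, not from the article): the variance term shrinks
# with the validation set size, the bias term with the training set size.
n = 1000
toy_variance = lambda r: 1.0 / (n * r[1])
toy_bias = lambda r: 1.0 / (n * r[0])

candidates = [(0.6, 0.2, 0.2), (0.7, 0.15, 0.15), (0.8, 0.1, 0.1)]
print(best_ratio(candidates, toy_variance, toy_bias))
print(best_ratio(candidates, toy_variance, toy_bias, lam2=10.0))
```

Raising `lam2` penalizes a small training set more heavily and moves the optimum toward larger training fractions.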
Code Examples
Advanced Time-Series Validation
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
# X must already be sorted by time
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # Check temporal order: every training row precedes every test row
    assert train_index[-1] < test_index[0]
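To make the expanding-window behavior of `TimeSeriesSplit` concrete, the sketch below prints the fold indices for a tiny array. The 12-row array and `n_splits=3` are assumptions chosen to keep the output short; rows are assumed to be sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered rows (assumption); the row index doubles as the time step.
X = np.arange(12).reshape(12, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for fold, (train_index, test_index) in enumerate(folds):
    # The training window always ends before the test window starts.
    assert train_index.max() < test_index.min()
    print(fold, train_index, test_index)
```

Each successive fold trains on everything before its test window, so later folds see more history than earlier ones.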
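Multimodal Data Splitting
from sklearn.model_selection import train_test_split
# Split text and image data together so that paired samples stay aligned
(text_train, text_temp,
 img_train, img_temp) = train_test_split(
    text_data, image_data, test_size=0.3)
text_val, text_test, img_val, img_test = train_test_split(
    text_temp, img_temp, test_size=0.5)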
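Passing both modalities to a single `train_test_split` call applies the same shuffle to each, which is what keeps pairs aligned. The sketch below checks this on made-up paired data; `text_data`, `image_data`, and the `random_state` values are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical paired data: caption i describes image i.
text_data = np.array([f"caption {i}" for i in range(20)])
image_data = np.arange(20)  # stand-in for image arrays or file paths

text_train, text_temp, img_train, img_temp = train_test_split(
    text_data, image_data, test_size=0.3, random_state=0)
text_val, text_test, img_val, img_test = train_test_split(
    text_temp, img_temp, test_size=0.5, random_state=0)

# Every caption must still sit next to its own image in every subset.
for texts, imgs in [(text_train, img_train), (text_val, img_val),
                    (text_test, img_test)]:
    for t, i in zip(texts, imgs):
        assert t == f"caption {i}"
print(len(text_train), len(text_val), len(text_test))
```

Splitting each modality in a separate call would shuffle them independently, even with the same seed being an easy thing to forget, and would silently break the pairing.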