Adversarial Validation
Cross validation is a standard method for estimating how well a model will perform.
However, when the sample distribution shifts between training and test data, cross validation can no longer reliably estimate performance on the test set, and the model ends up scoring far worse on the test set than on the training set.
Through a Kaggle competition example, this article shows how distribution shift affects modeling, how adversarial validation detects it, and what countermeasures are available.
Original article: https://zhuanlan.zhihu.com/p/93842847
Before reading on, if you are not familiar with AUC, study this first:
https://www.zhihu.com/question/39840928/answer/241440370
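As a quick self-contained illustration (not from the original post; it uses sklearn's `roc_auc_score` and made-up scores): AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores: positives (label 1) tend to score higher
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC = fraction of (positive, negative) pairs ranked correctly.
# Pairs here: (0.35 > 0.1) ok, (0.35 > 0.4) wrong, (0.8 > 0.1) ok,
# (0.8 > 0.4) ok  ->  3 of 4 pairs correct = 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```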
My own code is included below to make this concrete:
```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

# df_train: the given training set
# df_test: use the given test set as the validation set
df_train = train_lables
df_test = test_lables

# Define the new target: 0 = row comes from train, 1 = row comes from test
df_train['Is_Test'] = 0
df_test['Is_Test'] = 1

# Combine train and test into one dataset; reset the index so the
# positional fold indices produced below line up with the rows
df_adv = pd.concat([df_train, df_test]).reset_index(drop=True)

# features: the feature columns used for training
X = df_adv[features]
y = df_adv['Is_Test']
```
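On a toy pair of frames (the data here is illustrative, standing in for the real train/test tables), the construction above yields a single frame with a binary `Is_Test` target:

```python
import pandas as pd

# Toy stand-ins for the real train and test tables
df_train = pd.DataFrame({'f1': [1.0, 2.0, 3.0]})
df_test = pd.DataFrame({'f1': [4.0, 5.0]})

df_train['Is_Test'] = 0
df_test['Is_Test'] = 1
df_adv = pd.concat([df_train, df_test]).reset_index(drop=True)

print(df_adv['Is_Test'].tolist())  # [0, 0, 0, 1, 1]
```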
Model training
```python
# Define the model parameters
params = {
    'boosting_type': 'gbdt',
    'colsample_bytree': 1,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_samples': 100,
    'min_child_weight': 1,
    'min_split_gain': 0.0,
    'num_leaves': 20,
    'objective': 'binary',
    'random_state': 50,
    'subsample': 1.0,
    'subsample_freq': 0,
    'metric': 'auc',
    'num_threads': 8
}

best_loss = []
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
for index, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    lgb_model = lgb.LGBMClassifier(**params)
    # skf yields positional indices, so use .iloc
    train_x, test_x = X.iloc[train_idx], X.iloc[test_idx]
    train_y, test_y = y.iloc[train_idx], y.iloc[test_idx]
    eval_set = [(test_x, test_y)]
    # lightgbm >= 4 uses callbacks; older versions took
    # early_stopping_rounds=100, verbose=False in fit() instead
    lgb_model.fit(train_x, train_y, eval_set=eval_set, eval_metric='auc',
                  callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)])
    best_loss.append(lgb_model.best_score_['valid_0']['auc'])
    print(best_loss, np.mean(best_loss))
```
AUC results (each line appends one fold's score and shows the running mean; the final mean of about 0.566 is only slightly above 0.5, meaning the adversarial classifier can barely tell train from test, so the two distributions are close here):
[0.5548410714285714] 0.5548410714285714
[0.5548410714285714, 0.5788339285714286] 0.5668375
[0.5548410714285714, 0.5788339285714286, 0.5695142857142858] 0.5677297619047619
[0.5548410714285714, 0.5788339285714286, 0.5695142857142858, 0.5460357142857143] 0.56230625
[0.5548410714285714, 0.5788339285714286, 0.5695142857142858, 0.5460357142857143, 0.5811589285714286] 0.5660767857142857
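If the adversarial AUC had come out much higher (say 0.8 or above), the distributions would differ substantially, and one common countermeasure is to validate on the training rows that look most like the test set. The sketch below demonstrates this on synthetic shifted data; it uses `LogisticRegression` for brevity in place of the LightGBM model above, and all names and thresholds are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic shift: the feature's mean moves between train and test
df_train = pd.DataFrame({'f1': rng.normal(0.0, 1.0, 1000)})
df_test = pd.DataFrame({'f1': rng.normal(1.0, 1.0, 1000)})
df_train['Is_Test'] = 0
df_test['Is_Test'] = 1
df_adv = pd.concat([df_train, df_test]).reset_index(drop=True)

# Adversarial classifier: predicts whether a row comes from the test set
clf = LogisticRegression().fit(df_adv[['f1']], df_adv['Is_Test'])

# Score each training row by how "test-like" it looks
prob_test_like = clf.predict_proba(df_train[['f1']])[:, 1]

# Countermeasure: hold out the 20% most test-like training rows as validation
n_val = int(len(df_train) * 0.2)
val_idx = np.argsort(prob_test_like)[-n_val:]
print(len(val_idx))  # 200 validation rows drawn from the most test-like region
```

A validation set built this way tracks test-set performance more faithfully than a random split when the distributions diverge; another common remedy is to drop the features the adversarial model relies on most (via its feature importances).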