关于bagging和boosting的集成学习的例子很多,就不讲解了,本节讲解stacking集成学习。
Stacking
Stacking方法其实弄懂之后应该是比Boosting要简单的,毕竟小几十行代码可以写出一个Stacking算法。我先从一种“错误”但是容易懂的Stacking方法讲起。
Stacking模型本质上是一种分层的结构,这里简单起见,只分析二级Stacking.假设我们有3个基模型M1、M2、M3。
该图是一个基模型得到P1和T1的过程,采用的是5折交叉验证,所以循环了5次,拼接得到P1,测试集预测了5次,取平均得到T1。而这仅仅只是第二层输入的一列/一个特征,并不是整个训练集。再分析作者的代码也就很清楚了。也就是刚刚提到的两层循环。
python实现
用了一个泰坦尼克号的尝试了一下代码,从头到尾都是可以运行的。代码放在Github,针对其中一段关键的稍作分析
def get_oof(clf, x_train, y_train, x_test):
oof_train = np.zeros((ntrain,))
oof_test = np.zeros((ntest,))
oof_test_skf = np.empty((NFOLDS, ntest)) #NFOLDS行,ntest列的二维array
for i, (train_index, test_index) in enumerate(kf): #循环NFOLDS次
x_tr = x_train[train_index]
y_tr = y_train[train_index]
x_te = x_train[test_index]
clf.fit(x_tr, y_tr)
oof_train[test_index] = clf.predict(x_te)
oof_test_skf[i, :] = clf.predict(x_test) #固定行填充,循环一次,填充一行
oof_test[:] = oof_test_skf.mean(axis=0) #axis=0,按列求平均,最后保留一行
return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1) #转置,从一行变为一列
这里只实现了针对一个基模型做K折交叉验证,因为P1和T1都是多行一列的结构,这里是先存储为一行多列,最后进行转置。
但是Stacking方法其实在R中也有集成好的可以调用。
caretEnsemble包下的caretStack()方法
关键代码如下:
algorithmList <- c('lda', 'rpart', 'glm', 'knn', 'svmRadial')
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
有一篇博文讲的比较详细
h2o包的h2o.stack()方法
关键代码如下:
nfolds <- 5
glm1 <- h2o.glm(x = x, y = y, family = family,
training_frame = train,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
gbm1 <- h2o.gbm(x = x, y = y, distribution = "bernoulli",
training_frame = train,
seed = 1,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
rf1 <- h2o.randomForest(x = x, y = y, # distribution not used for RF
training_frame = train,
seed = 1,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
dl1 <- h2o.deeplearning(x = x, y = y, distribution = "bernoulli",
training_frame = train,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
models <- list(glm1, gbm1, rf1, dl1)
metalearner <- "h2o.glm.wrapper"
stack <- h2o.stack(models = models,
response_frame = train[,y],
metalearner = metalearner,
seed = 1,
keep_levelone_data = TRUE)
# Compute test set performance:
perf <- h2o.ensemble_performance(stack, newdata = test)
详情见h2o的Github网站
最后放一张H2O分享的图片总结一下