KAGGLE ENSEMBLING GUIDE

Latest recommended article published on 2021-09-05 17:34:34

MemRay


Reposted from: http://mlwave.com/kaggle-ensembling-guide/

Sharing a good article. The original is long, but worth the read.

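I. Creating ensembles from submission files

The simple approach: build the ensemble directly from submission files, for example results that other people have submitted.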

1. Voting ensemble
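
A minimal sketch of majority voting, assuming each submission file has already been loaded into a list of predicted class labels in the same row order. The function name `majority_vote` and the toy labels are illustrative and do not come from the original guide.

```python
from collections import Counter


def majority_vote(submissions):
    """Return the most common label per row across several submissions.

    `submissions` is a list of equally long label lists, one per model.
    On a tie, the label seen first wins, so list the strongest model first.
    """
    voted = []
    for row_labels in zip(*submissions):
        # most_common orders equal counts by first appearance
        voted.append(Counter(row_labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions from three models on five test rows
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0]
model_c = [0, 0, 1, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

Voting of this kind tends to help most when the individual models make different mistakes; three near-identical submissions will simply reproduce the same answers.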

2. Averaging
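
A small sketch of averaging, assuming each submission holds predicted probabilities (or regression values) in the same row order. The optional `weights` argument is an illustrative extension, not something this summary describes.

```python
def average_submissions(submissions, weights=None):
    """Weighted mean of the predictions, computed row by row."""
    if weights is None:
        weights = [1.0] * len(submissions)  # plain arithmetic mean
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*submissions)
    ]


# Hypothetical probabilities from two models on three test rows
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.2]
blend = average_submissions([model_a, model_b])
```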

3. Rank averaging
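
A sketch of rank averaging under the usual reading of the term: each model's scores are replaced by their ranks, the ranks are averaged, and the result is rescaled to the 0..1 range. This matters when models are calibrated differently and the metric only depends on ordering (AUC, for example). The helper names are hypothetical.

```python
def to_ranks(scores):
    """Rank scores from 0 (lowest) upward; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # extend the group while the next score equals the current one
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_average(submissions):
    """Average per-model ranks, then rescale to the 0..1 range."""
    n = len(submissions[0])
    ranked = [to_ranks(s) for s in submissions]
    mean_ranks = [sum(col) / len(ranked) for col in zip(*ranked)]
    if n == 1:
        return [0.5]
    return [r / (n - 1) for r in mean_ranks]


model_a = [0.10, 0.90, 0.40, 0.35]  # scores spread over the whole range
model_b = [0.51, 0.54, 0.53, 0.52]  # scores squeezed around 0.5
blend = rank_average([model_a, model_b])
```

With a plain mean, `model_b` would barely move the result because its scores span only 0.03. After ranking, both models contribute equally to the final ordering.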

II. Stacked Generalization & Blending

The advanced approach: combine multiple models.

1. Stacked generalization

The basic idea behind stacked generalization is to train a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.

The procedure is as follows:

1) Split the training set into two disjoint sets.
2) Train several base learners on the first part.
3) Test the base learners on the second part.
4) Using the predictions from 3) as the inputs, and the correct responses as the outputs, train a higher level learner.

Note that steps 1) to 3) are the same as cross-validation, but instead of using a winner-takes-all approach, we combine the base learners, possibly nonlinearly.
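
The four steps above can be sketched with scikit-learn. The synthetic dataset, the choice of base learners, and the helper `stacked_predict` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real training set
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# 1) Split the training set into two disjoint parts
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=0)

# 2) Train several base learners on the first part
base_learners = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]
for model in base_learners:
    model.fit(X_a, y_a)

# 3) Predict on the second part: one probability column per base learner
meta_features = np.column_stack(
    [model.predict_proba(X_b)[:, 1] for model in base_learners]
)

# 4) Train the higher-level learner on those predictions
meta_learner = LogisticRegression()
meta_learner.fit(meta_features, y_b)


def stacked_predict(X_new):
    """Run new rows through the base learners, then the meta learner."""
    cols = [model.predict_proba(X_new)[:, 1] for model in base_learners]
    return meta_learner.predict_proba(np.column_stack(cols))[:, 1]
```

The meta learner only ever sees predictions made on rows the base learners were not trained on, which is what keeps it from simply rewarding whichever base learner overfits the most.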

2. Blending

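The term blending is used interchangeably with stacking in many places. The original article treats it separately. I did not fully follow the distinction, but it seems to be this: the stacking method above trains on the first part of the data and tests on the second, much like cross-validation, so the whole dataset ends up being used to train the models. Blending instead sets aside an extra holdout portion for testing, so that the stacker and the generalizers never use the same data.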
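
One way to make the distinction concrete, as a sketch rather than a definition: in the first variant the meta-features are out-of-fold predictions covering every training row, in the second the stacker is fit only on a small holdout that the base learners never saw. The 10% holdout size, the dataset, and the models are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)


def make_base_learners():
    """Fresh, unfitted base learners (hypothetical choice of models)."""
    return [
        RandomForestClassifier(n_estimators=50, random_state=1),
        LogisticRegression(max_iter=1000),
    ]


# Cross-validation flavour: out-of-fold predictions for every training row
oof_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in make_base_learners()
])
stacker_oof = LogisticRegression().fit(oof_features, y)

# Holdout flavour: base learners never see the holdout, the stacker sees only it
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.1, random_state=1, stratify=y
)
fitted = [m.fit(X_fit, y_fit) for m in make_base_learners()]
hold_features = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in fitted])
stacker_blend = LogisticRegression().fit(hold_features, y_hold)
```

The trade-off shows up in the shapes: the out-of-fold stacker learns from 500 rows, while the holdout stacker learns from only 50 but is simpler to set up and harder to leak information into.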

3. Stacking with logistic regression

Nothing special here: logistic regression (LR) is used as the stacker.
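
A short sketch using scikit-learn's `StackingClassifier` with a logistic regression as the final estimator. The base models and the synthetic data are placeholders, not the setup from the original guide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=2)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=2)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ],
    final_estimator=LogisticRegression(),  # the linear stacker
    cv=5,  # stacker inputs come from 5-fold cross-validated predictions
)
stack.fit(X, y)

# One learned coefficient per base model in this binary setup
weights = stack.final_estimator_.coef_[0]
```

A practical upside of a linear stacker is that `weights` can be read directly as a rough indication of how much each base model is trusted.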

4. Stacking with non-linear algorithms

With the linear stackers covered, there are also non-linear ones: GBM, KNN, NN, RF and ET (Extremely Randomized Trees, i.e. ExtraTrees), among others.

Non-linear algorithms find useful interactions between the original features and the meta-model features.
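
For a non-linear stacker to find interactions between original features and meta-model features, it needs to see both. A sketch with scikit-learn's `StackingClassifier`, where `passthrough=True` forwards the original features alongside the base predictions; the models and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)

X, y = make_classification(n_samples=400, n_features=8, random_state=3)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=3)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=3)),
    ],
    final_estimator=GradientBoostingClassifier(random_state=3),  # non-linear stacker
    passthrough=True,  # stacker input = base predictions + original features
    cv=5,
)
stack.fit(X, y)

# 2 meta-model columns plus the 8 original feature columns
n_stacker_inputs = stack.transform(X[:1]).shape[1]
```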

5. Feature weighted linear stacking

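In 2009 a Netflix Prize team introduced a method called Feature-Weighted Linear Stacking (FWLS). Ordinary stacking uses a linear regression to combine the models with one fixed weight per model. In FWLS each weight is itself a linear combination of features, so the whole model expands into a sum of feature*model products, which makes it more expressive. (The figure below comes from the original paper. It is rather crude, but good enough to follow :D)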
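
A NumPy sketch of the feature*model expansion, assuming the blended prediction has the form b(x) = Σ_i Σ_j v_ij · f_j(x) · g_i(x), where g_i are base-model predictions and f_j are meta-features. The data is synthetic and the function name `fwls_design_matrix` is hypothetical.

```python
import numpy as np


def fwls_design_matrix(model_preds, meta_features):
    """Build every product g_i(x) * f_j(x).

    model_preds:   (n_rows, n_models) base-model predictions g_i(x)
    meta_features: (n_rows, n_meta) meta-features f_j(x); include a column
                   of ones to keep a plain per-model weight
    Returns an (n_rows, n_models * n_meta) matrix.
    """
    products = model_preds[:, :, None] * meta_features[:, None, :]
    return products.reshape(model_preds.shape[0], -1)


rng = np.random.default_rng(0)
n = 200
g = rng.normal(size=(n, 2))                       # predictions of 2 models
f = np.column_stack([np.ones(n), rng.random(n)])  # constant + 1 meta-feature

# Synthetic target: model 0 deserves more weight as the meta-feature grows
y = (0.2 + 0.6 * f[:, 1]) * g[:, 0] + 0.5 * g[:, 1]

A = fwls_design_matrix(g, f)
v, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
v = v.reshape(2, 2)                        # v[i, j] multiplies g_i(x) * f_j(x)
```

Because the expanded problem is still linear in `v`, it can be solved with any linear regression routine. Here the fit recovers the weights used to build the synthetic target.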

6. Quadratic linear stacking of models

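Similar to the feature*model combination above, this uses interactions to wrap another layer of model*model combinations around all the generalizer outputs. Taking the figure above as an example, it creates new features such as SVD*K-NN or SVD*RBM, much like quadratic terms. You could build cubic or quartic terms too if you like, though whether they help is another matter :D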
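
A sketch of building the model*model terms: one extra column per unordered pair of base-model outputs. The prediction values are made up, and `add_pairwise_products` is a hypothetical helper.

```python
from itertools import combinations

import numpy as np


def add_pairwise_products(preds, names):
    """Append one model*model column for every unordered pair of models."""
    columns = [preds]
    new_names = list(names)
    for i, j in combinations(range(preds.shape[1]), 2):
        columns.append((preds[:, i] * preds[:, j])[:, None])
        new_names.append(f"{names[i]}*{names[j]}")
    return np.hstack(columns), new_names


# Hypothetical predictions from three models on two rows
preds = np.array([[3.0, 4.0, 2.0],
                  [1.0, 5.0, 3.0]])
expanded, names = add_pairwise_products(preds, ["SVD", "K-NN", "RBM"])
```

The expanded matrix can then be handed to a linear stacker as usual. The number of added columns grows quadratically with the number of models, so this is best kept to a modest pool.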

7. Stacking classifiers with regressors and vice versa

Stacking can also be used for regression problems, with classifier outputs feeding a regressor, and vice versa.

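8. Stacking unsupervised learned features

Features learned without labels can be added to the stack as well. There are many ways to do this; the article gives K-Means and t-SNE as examples, which reduce the data to a few main features that are then fed into stacking.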
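
A sketch of the K-Means variant, assuming the unsupervised features are each row's distances to the cluster centres, appended to the original features. The cluster count, the data, and the final model are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, random_state=4)

# Fit K-Means without looking at the labels
kmeans = KMeans(n_clusters=4, n_init=10, random_state=4).fit(X)

# transform() returns each row's distance to every cluster centre
cluster_distances = kmeans.transform(X)

# Model input: original features plus the unsupervised ones
X_augmented = np.hstack([X, cluster_distances])
stacker = LogisticRegression(max_iter=1000).fit(X_augmented, y)
```

Since K-Means never uses the labels, it can be fit on training and test rows together, which is one reason such features are popular in competitions.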

9. Online Stacking

Some of the author's thoughts on online stacking. They are not much help for the main Kaggle competition formats.

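I won't relay the rest here. For example, the author built an automated ensembling pipeline that reached the top 10%, and even 5th place, on its own. The article also mentions someone who blended 1000+ models in a competition and took first place. In short, ensembling really is an art, and it keeps setting new records.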