KAGGLE ENSEMBLING GUIDE

Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle competitions.

The first part looks at creating ensembles from submission files. The second part looks at creating ensembles through stacked generalization/blending.

I explain why ensembling reduces the generalization error. Finally, I show different methods of ensembling, together with their results and code to try out for yourself.

“This is how you win ML competitions: you take other people’s work and ensemble them together.”

Vitaly Kuznetsov, NIPS 2014

Creating ensembles from submission files

The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files. You only need the predictions on the test set for these methods; there is no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up.

Voting ensembles

We first take a look at a simple majority vote ensemble. Let’s see why model ensembling reduces the error rate and why it works better to ensemble low-correlated model predictions.
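
As a rough illustration of majority voting over submission files, here is a minimal Python sketch. It assumes each submission is a CSV with a header row followed by two columns, a row id and a predicted label; that layout and the function name majority_vote are hypothetical and not taken from the article, and real competitions may use a different format.

```python
import csv
import io
from collections import Counter


def majority_vote(submissions):
    """Combine several submissions by majority vote.

    `submissions` is a list of CSV texts. Each text is assumed (hypothetically)
    to hold a header row followed by rows of the form: row id, predicted label.
    """
    votes = {}   # row id -> Counter of the labels predicted for that row
    order = []   # row ids in the order they were first seen
    for text in submissions:
        reader = csv.reader(io.StringIO(text))
        next(reader)  # skip the header row
        for row_id, label in reader:
            if row_id not in votes:
                votes[row_id] = Counter()
                order.append(row_id)
            votes[row_id][label] += 1
    # With an odd number of binary voters there are no ties to break.
    return [(row_id, votes[row_id].most_common(1)[0][0]) for row_id in order]


# Three tiny in-memory "submission files", each wrong on a different row.
a = "id,prediction\n1,1\n2,0\n3,1\n"
b = "id,prediction\n1,1\n2,1\n3,0\n"
c = "id,prediction\n1,0\n2,1\n3,1\n"
ensemble = majority_vote([a, b, c])
```

Because each of the three toy submissions errs on a different row, the vote outputs label "1" for every row. This is the low-correlation effect that the following sections explain.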

Error correcting codes

During space missions it is very important that all signals are correctly relayed.

If we have a signal in the form of a binary string like:

1110110011101111011111011011

and somehow this signal is corrupted (a bit is flipped) to:

1010110011101111011111011011

then lives could be lost.

A coding solution was found in error correcting codes. The simplest error correcting code is a repetition code: relay the signal multiple times in equally sized chunks and take a majority vote.

Original signal:
1110110011

Encoded:
10,3 101011001111101100111110110011

Decoding:
1010110011
1110110011
1110110011

Majority vote:
1110110011
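
The repetition code above can be sketched in a few lines of Python. The function names encode and decode are illustrative choices, and the sketch assumes the signal is a string of "0" and "1" characters sent an odd number of times.

```python
def encode(signal, repeats=3):
    """Repeat the whole signal `repeats` times back to back."""
    return signal * repeats


def decode(encoded, chunk_size, repeats=3):
    """Split the message into `repeats` chunks and majority-vote each bit."""
    chunks = [
        encoded[i * chunk_size:(i + 1) * chunk_size] for i in range(repeats)
    ]
    decoded = []
    for bits in zip(*chunks):  # the same bit position across all chunks
        ones = bits.count("1")
        decoded.append("1" if ones > repeats / 2 else "0")
    return "".join(decoded)


original = "1110110011"
# The received message from the example: the second bit of the first
# chunk was flipped in transit.
received = "101011001111101100111110110011"
repaired = decode(received, chunk_size=len(original))
```

Here the flipped bit is outvoted by the two intact copies, so repaired equals the original signal.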

Signal corruption is a very rare occurrence and often occurs in small bursts. It follows that a corrupted majority vote is even rarer.

As long as the corruption is not completely unpredictable (has a 50% chance of occurring), signals can be repaired.

A machine learning example

Suppose we have a test set of 10 samples. The ground truth is all positive (“1”):

1111111111

We furthermore have 3 binary classifiers (A, B, C), each with 70% accuracy. You can view these classifiers for now as pseudo-random number generators which output a “1” 70% of the time and a “0” 30% of the time.

We will now show how these pseudo-classifiers are able to obtain 78% accuracy through a voting ensemble.
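
One way to check this claim empirically is a small simulation. The sketch below treats each classifier as an independent pseudo-random generator, exactly as described above; the independence between classifiers, the sample count and the seed are assumptions made for illustration.

```python
import random


def simulate_majority_vote(accuracy=0.7, n_classifiers=3,
                           n_samples=100_000, seed=0):
    """Estimate majority-vote accuracy when the ground truth is always "1".

    Each classifier independently outputs the correct label with
    probability `accuracy` (an assumption of this sketch).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        # Count how many classifiers voted for the correct label.
        votes = sum(rng.random() < accuracy for _ in range(n_classifiers))
        if votes > n_classifiers / 2:
            correct += 1
    return correct / n_samples


estimate = simulate_majority_vote()
```

With enough samples the estimate should land close to the 78% figure, while a single classifier (n_classifiers=1) stays near 70%.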

A pinch of maths

For a majority vote with 3 members we can expect 4 outcomes:

All three are correct
  0.7 * 0.7 * 0.7
= 0.343

Two are correct
  0.7 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.7
= 0.441

Two are wrong
  0.3 * 0.3 * 0.7
+ 0.3 * 0.7 * 0.3
+ 0.7 * 0.3 * 0.3
= 0.189

All three are wrong
  0.3 * 0.3 * 0.3
= 0.027

We see that the most likely outcome (~44% of the time) is that the majority vote corrects a single error. This majority vote ensemble will be correct ~78% of the time on average (0.343 + 0.441 = 0.784).
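
The enumeration above is a special case of the binomial distribution: the vote is correct whenever more than half of the members are correct. A minimal sketch, assuming independent voters that share the same accuracy p and an odd number of voters n (the function name is an illustrative choice, and math.comb needs Python 3.8 or later):

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are correct.

    Assumes every voter is correct with the same probability p and that
    n is odd, so ties cannot occur.
    """
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n // 2 + 1, n + 1)
    )


three_voters = majority_vote_accuracy(0.7, 3)  # the case worked out above
```

Calling the function with a larger odd n shows how the accuracy changes as more equally accurate, independent voters are added.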