An Introduction to Ensemble Algorithms in Scikit-Learn

When we describe a model as an ensemble model in machine learning, all we are saying is that our predictor is made up of multiple base models, or iterations on a base model. These models will all have a say in the prediction output, with the goal being a more robust model, higher accuracy and better performance.


Ensemble modeling is loosely based on Condorcet’s Jury Theorem and the wisdom of crowds, the main idea being that more educated voices (or well-performing models) in a decision-making process are better than fewer. However, it’s also important to note that combining multiple poorly-performing models will most likely result in even worse performance overall. That being said, ensembles can drastically boost accuracy on your data set, and help smooth bias/variance depending on the method you choose.


The Methods

The three main ensemble techniques in use right now are bagging, boosting, and stacking. Stacking is not currently supported in Scikit-Learn, but at its core, stacking is just a fancy majority vote, which is supported. Here’s a high-level summary of each:


Majority voting is one of the simplest ways of combining outputs from multiple models. At their base, boosting, bagging and stacking ensembles all implement some form of majority vote, in addition to other design and optimization features. Here, we simply take the majority of two or more (typically different) model predictions. No weights are applied by default, and although you can specify weights for each model, doing so by hand is very difficult.


In stacking, instead of all our models having an equal say in the prediction, our algorithm learns the best way to weight our predictions across models. Weights are applied either to a base predictor’s vote, or to samples/features during training, depending on the method being used. The three main stacking techniques being used right now are feature-weighted linear stacking, quadratic weighted stacking, and a modeling framework called StackNet. I’ll briefly go over each technique, but it’s very difficult to sum up each method in only a few sentences, so I suggest further reading on each one if you’re interested.


  • In feature-weighted linear stacking, our model actually engineers some meta-features which it will stack with its predictions. The idea is that our ensemble learns which base model is best at predicting samples with a certain feature value.


  • Quadratic weighted stacking is similar to feature-weighted stacking, but here our ensemble creates non-linear combinations of our predictions and uses them as features for our second-stage model.


  • StackNet is a meta-modelling framework designed to resemble a feed-forward neural network. Instead of being trained via back propagation like traditional neural networks, StackNet is constructed iteratively, layer by layer, using stacked generalization.


Unfortunately there is no custom stacking implementation supported in Scikit-Learn, but quite a few ensembles implement stacking at various levels, and a bare-bones version of the idea can be hand-rolled, as sketched below.

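As a rough illustration only (the dataset, base models, and parameters below are arbitrary choices, not from the original post), a hand-rolled first pass at stacking can feed out-of-fold probability predictions from a few base models into a second-stage model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# First stage: out-of-fold class-probability predictions from each base model
base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=42)]
meta_features = np.hstack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")
    for m in base_models
])

# Second stage: a simple model learns how to weight the base predictions.
# Scoring on the same folds is a little optimistic; a proper setup would
# hold out a separate test set.
meta_model = LogisticRegression(max_iter=1000)
print(cross_val_score(meta_model, meta_features, y, cv=5).mean())
```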

Bagging involves building multiple models, usually of the same type, and training them only on subsets of the full dataset. The final prediction output is the average of votes across all sub-models. Bagging works best for models that tend to overfit (e.g. Decision Trees) as the sub-sampling helps compensate for some of the variance introduced by overly specific criteria.


In boosting, we are again using copies of the same base model; however, we use all of our data (as opposed to subsets) to train each model. Each new model iteration attempts to correct the prediction errors of the previous model, and our final prediction is again a vote across models. Votes and samples can be weighted by many factors, such as the model’s demonstrated accuracy, or how difficult a sample is to classify.

boosting中 ,我们再次使用相同基本模型的副本,但是我们使用所有数据(而不是子集)来训练每个模型。 每次新模型迭代都会尝试纠正前一个模型的预测误差,而我们的最终预测又一次是跨模型的投票。 可以通过许多因素来加权投票和样本,例如模型证明的准确性或样本分类的难度。

Okay great, so we have a rough understanding of what each ensemble does. Now how do we implement them? First let’s set up a scenario and grab some data. For this, we’re going to use Scikit-Learn’s Iris Dataset.


The Problem

We’re given a dataset containing labeled samples of three different kinds of irises (Setosa, Versicolour, and Virginica) and their features (Sepal Length, Sepal Width, Petal Length and Petal Width), and asked to build an algorithm that would classify new iris data, predicting the species based on those four features.


First things first, let’s load in the data and take a look at the features.

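A minimal sketch of that loading step (the variable names and sampling seed here are illustrative, not necessarily what the accompanying notebook uses):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data into a DataFrame so the features are easy to inspect
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Take a random sample of 5 rows to get a feel for the features
print(X.sample(5, random_state=42))
```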

Here we simply loaded in the data and took a random sample of 5 rows. Looks like we have all numeric features, so best practice would be to scale our X next, and then create a train/test split for model evaluation. I’m not going to show many of these intermediate steps, but feel free to check out the full Jupyter notebook that accompanies this blog here. You should also note that I am using cross validation to evaluate my models, which splits my data into k folds (removing the need for a train/test split), as well as calling fit() and handling scoring for me.


Okay fine, we’ve got data. Let’s see how well a K-Nearest Neighbors classifier performs with cross validation. By default, Scikit-Learn’s cross_val_score performs a 5-fold split. For more information on cross validation in sklearn, check this documentation.

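That baseline might look roughly like this, continuing from the loading snippet above (default estimator settings, which may differ from the original notebook):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Baseline: a default KNN classifier scored with 5-fold cross validation
knn = KNeighborsClassifier()
knn_scores = cross_val_score(knn, X, y, cv=5)
print(f"KNN accuracy: {knn_scores.mean():.3f} (+/- {knn_scores.std():.3f})")
```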

Pretty dang good for a base classifier with no additional parameter tuning, but I bet we can do better.


Majority Vote

Now let’s see what happens if we implement a couple more algorithms, and then have them vote on the results. For this we’re going to use Scikit-Learn’s Voting Classifier to combine our KNN algorithm with a Gaussian Naive Bayes algorithm, and a Decision Tree.


Remember, in a simple majority vote the votes are typically not weighted, meaning that the output is simply whichever class was predicted most often. However, Scikit-Learn’s implementation allows us to specify voting='soft', which tells our voter to predict class labels based on the argmax of the sums of the predicted probabilities. This is recommended for well-balanced ensembles, as it takes into account how sure each model is of its predictions.


Let’s see how our voter does compared to our KNN algorithm.

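A sketch of that comparison, reusing X, y, the imports, and the KNN baseline from above (the individual estimator settings are illustrative):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Soft voting: average the predicted class probabilities of the three models
voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("gnb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=42)),
    ],
    voting="soft",
)
voter_scores = cross_val_score(voter, X, y, cv=5)
print(f"Voting accuracy: {voter_scores.mean():.3f} vs KNN: {knn_scores.mean():.3f}")
```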

There you have it: by including a couple more models in our prediction process we’ve increased our accuracy! And by implementing a vote, we can be pretty certain our model is more generalizable.


Bagging

Remember in bagging algorithms our goal is to reduce overfitting, so we create multiple copies of our base model, and train them on subsets of our original data to reduce variance. The two bagging algorithms we are going to focus on here are the Bagged Decision Tree, and Random Forest Classifier. If you are unfamiliar with how a decision tree algorithm works, I recommend this article.


In a Bagged Decision Tree, we do exactly what we’d expect. We create multiple (you can specify the number) trees, each trained on a different subset of our data, and our final output is a vote between all the trees. Most likely, each tree in the ensemble ends up with different criteria at each decision node, and therefore our samples have to pass more total checks in order to make it to their final classification. Because of this, it makes sense that we can expect more robust results than from a simple decision tree.


A Random Forest takes the concept of a bagged tree even further. Here, each tree is trained on a subset of the data, just like before, but we also take only a subset of the features. For example, one tree in our Random Forest could be making predictions using only sepal length and petal width as criteria, while another tree could be using sepal width as well as petal length and width to make its predictions. The choice of which features to include, as well as the method for selecting samples, is random, hence the name.


Here’s what those two implementations look like in Scikit-Learn.

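Something along these lines, continuing from the snippets above (the estimator counts and random_state values are my own choices; note that older scikit-learn releases call the BaggingClassifier base-model argument base_estimator rather than estimator):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagged decision trees: each tree is fit on a bootstrap sample of the rows
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42,
)
print(f"Bagged trees: {cross_val_score(bagged_trees, X, y, cv=5).mean():.3f}")

# Random Forest: bootstrap samples plus a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, random_state=42)
print(f"Random Forest: {cross_val_score(forest, X, y, cv=5).mean():.3f}")
```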

The BaggingClassifier object can take most classifiers as a base (e.g. SVC), and you can create an effect similar to a Random Forest by setting bootstrap_features=True. This tells each new model to take only a subset of features, as well as samples, when training.

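For instance, a quick sketch of that forest-like setup (continuing from the previous snippet; parameter values are again arbitrary):

```python
from sklearn.svm import SVC

# Bagging an SVC while also resampling the features, for a forest-like effect
bagged_svc = BaggingClassifier(
    estimator=SVC(),
    n_estimators=50,
    bootstrap_features=True,
    random_state=42,
)
print(f"Bagged SVC: {cross_val_score(bagged_svc, X, y, cv=5).mean():.3f}")
```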

Boosting

In boosting, remember the concept is that we’re creating copies of a base model, such that each iteration is built to correct errors in the previous model. For more information on boosting, I suggest reading Harshdeep Singh’s article Understanding Gradient Boosting Machines. The two algorithms we are going to implement today using Scikit-Learn are AdaBoost and Stochastic Gradient Boosting.


AdaBoost was one of the first successful boosting ensemble algorithms, and consists of iterations on a decision tree. The first tree is created with equal weights applied to the samples in the data set. In subsequent iterations, samples are weighted according to how easy or hard they are to classify; this tells later models to pay more or less attention to them.


Of course, sklearn has an implementation for this.

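A minimal version, continuing with the same X and y (the random_state is my own addition):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# 100 boosting rounds instead of the default 50
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
print(f"AdaBoost: {cross_val_score(ada, X, y, cv=5).mean():.3f}")
```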

By default, the Scikit-Learn version creates 50 iterations; by setting n_estimators=100 we are simply telling the algorithm to make 100 copies instead.


Stochastic Gradient Boosting is almost a boost-bag hybrid. Here, the model still has weights that are being updated constantly, but each iteration only takes a subsample (drawn without replacement) of the dataset when training. Other versions of stochastic boosting involve Random Forest-like transformations to the data, where only subsamples of rows, as well as columns, are considered in each model. This is quite a broad topic, so if you want more information, here is another article that dives deeper into the different types of boosted models.


Here’s the Scikit-Learn implementation.

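A sketch of that, with the subsample fraction and other parameters as illustrative choices rather than the author’s exact settings:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# subsample < 1.0 makes the boosting stochastic: each tree is fit on a random
# fraction of the training rows, drawn without replacement
sgb = GradientBoostingClassifier(n_estimators=100, subsample=0.7, random_state=42)
print(f"Stochastic Gradient Boosting: {cross_val_score(sgb, X, y, cv=5).mean():.3f}")
```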

Woo hoo, you made it to the end and you now have a working understanding of what the different types of ensemble models are, and know how to implement them quickly in Python using Scikit-Learn! Now it’s your turn to try a few of these out on some real data. For best performance, as always, I would highly encourage reading the documentation for each model, and exploring some of the parameters. The best results always come from a well-tuned model, as opposed to throwing a mess of estimators into a pile and asking it for a prediction.


Have fun!


Translated from: https://medium.com/the-innovation/an-introduction-to-ensemble-algorithms-in-scikit-learn-a7d4f6845668
