Three Main Methods of Model Combination (Ensemble Learning)

Introduction

Ensemble learning is a common way to improve data modeling performance in classification or regression problems. As I have recently enrolled in a Kaggle competition hosted by Huawei (a recommendation system for user CTR prediction), I reviewed Chapter 14 of the PRML book. This blog is a summary of the model combination ideas, including bagging, boosting and stacking.

This blog focuses more on the theory than on the mathematical equations, since most of the methods are supported by Python/R libraries; readers interested in the proofs are referred to the book.

Model Combination vs. Bayesian Model Averaging

To start with, I would like to emphasize that model combination methods are different from Bayesian Model Averaging: Bayesian Model Averaging assumes the whole data set is generated by a single model drawn from a set of candidate models, and the probability distribution over models simply reflects the uncertainty about which model generated the data. As the data set size increases, this uncertainty decreases.

Model combination methods, on the other hand, allow different points within the data set to be generated by different models. Therefore, either averaging the results of all models, or selecting the right model for a specific region of the input space, can achieve better performance than a single model.

Bagging

Everything is decided by the committee.

It really comes down to one question: how does the committee of models make decisions? A natural strategy to start with is voting, which is common in classification problems, while weighted averaging is more often used in regression problems.
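As a quick illustration of these two strategies, here is a minimal sketch using scikit-learn's VotingClassifier and VotingRegressor; the estimators, data sets and weights below are my own arbitrary choices, not from the original post:

# Minimal sketch: hard voting for classification, weighted averaging for regression
from sklearn.datasets import load_iris, load_diabetes
from sklearn.ensemble import VotingClassifier, VotingRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Classification: each base model casts one vote, the majority class wins
X_cls, y_cls = load_iris(return_X_y=True)
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=1))],
    voting='hard').fit(X_cls, y_cls)

# Regression: the base predictions are combined as a weighted average
X_reg, y_reg = load_diabetes(return_X_y=True)
averager = VotingRegressor(
    estimators=[('lin', LinearRegression()),
                ('tree', DecisionTreeRegressor(random_state=1))],
    weights=[0.7, 0.3]).fit(X_reg, y_reg)  # arbitrary weights for illustration

print(voter.predict(X_cls[:3]), averager.predict(X_reg[:3]))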

In practice we only have a single data set, so we have to find a way to introduce variability between the models within the committee. One approach is to use bootstrap data sets: we resample the training set (with replacement) and construct a new model on each bootstrap sample. Suppose we generate $M$ bootstrap data sets; the committee prediction is then given by:
$$y_{com}(x) = \frac{1}{M}\sum_{m=1}^{M} y_m(x)$$

This procedure is known as bootstrap aggregation, or bagging. Random Forest is an example in which the base models are decision trees: the randomness means that, while growing each tree, a random subset of the features is considered, which keeps the trees less correlated with one another.
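To make the committee average concrete, here is a minimal sketch (my own illustration, not code from the book) that builds M bootstrap data sets, fits one decision tree on each, and averages their predictions exactly as in the formula above:

# Minimal sketch of bagging: M bootstrap samples, one decision tree per sample,
# committee prediction is the average of the individual predictions
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)
M = 10  # number of bootstrap data sets
models = []
for m in range(M):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    models.append(DecisionTreeRegressor(random_state=m).fit(X[idx], y[idx]))

# y_com(x) = (1/M) * sum_m y_m(x)
y_com = np.mean([tree.predict(X) for tree in models], axis=0)
print(y_com[:5])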

(The limitation of decision trees is that the division of the input space is based on hard splits, in which only one model is responsible for making predictions for any given value of the input variables. This can be softened by moving to a probabilistic framework; for further detail, see Section 14.5 of the book.)

One advantage of bagging is that it allows the base models to be trained in parallel. However, the analysis assumes that the errors of the individual models are uncorrelated, which is usually not the case; even so, bagging can still achieve better performance than any single model. This is where boosting comes in.

Boosting

As a variant of the committee idea, boosting is an iterative method: it trains the models sequentially and allocates higher weights to the data points misclassified by the previous models. Common methods include AdaBoost and GBDT.
[Figures: Boosting1 and Boosting2]
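As a hedged sketch of this sequential idea (using scikit-learn's AdaBoostClassifier on a toy data set of my own choosing, not code from the original references):

# Minimal sketch of boosting: AdaBoost trains weak learners (decision stumps by default)
# sequentially, up-weighting the points misclassified at the previous stage
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
print('AdaBoost accuracy: %.3f' % cross_val_score(boost, X, y, cv=3).mean())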

Stacking

Stacking is a very powerful method, frequently used in Kaggle competitions for a performance boost. The aim is to build an enlarged data set out of the predictions of uncorrelated committee models. It took me some reading to understand its mechanism, via the explanation here and this video.

The idea behind stacking and blending is similar: use the predictions of the base models (level 0) as the input features to construct a meta model (level 1). (Think about it: the meta model is doing a similar job to the last layer of a neural network.) The difference is that stacking uses K-fold cross-validation to produce these predictions, whereas blending uses a single hold-out set:
[Figure: K-fold stacking diagram]

The implementation steps of stacking, performed entirely on the training set, are as follows (a minimal sketch of the out-of-fold construction is given after the list):

  • Split the training data set into n folds
  • Train the first base model on n-1 of the folds and make predictions on the held-out fold; save these predictions
  • Repeat step 2 for each of the n folds (holding out a different fold each time) until the base model has produced predictions for the whole original training data set
  • Repeat steps 2 & 3 for every base model; eventually you will have predictions from all the base models on the whole training data set
  • Now construct the level-1 model, also called the meta model, which uses the predictions of all the base models as input features and the target values of the original training set as its target
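For steps 2-4, scikit-learn's cross_val_predict already returns exactly these out-of-fold predictions, so a minimal sketch of assembling the level-1 training matrix (with base models chosen arbitrarily for illustration) looks like this:

# Minimal sketch of stacking by hand: build the meta-model's training set from
# each base model's out-of-fold predictions over the whole training set
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
base_models = [KNeighborsClassifier(n_neighbors=1), RandomForestClassifier(random_state=1)]

# One column per base model: its predictions on every training point,
# always made by a model that never saw that point during fitting
meta_features = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in base_models])

# The level-1 (meta) model trains on these predictions with the original targets
meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_model.score(meta_features, y))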

The final model we obtain consists of the base models plus a meta model, combined into a two-layer model. As stacking is widely supported by scikit-learn and related libraries, the raw implementation code will not be given here; instead, here is a demo using mlxtend's StackingClassifier:

# Importing packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from mlxtend.classifier import StackingClassifier
import warnings
warnings.filterwarnings('ignore')

# Creating dataset
from sklearn import datasets
iris = datasets.load_iris()
X_train, y_train = iris.data[:, 1:3], iris.target  # using the entire dataset as the training set

# Defining base learners
mycl1 = KNeighborsClassifier(n_neighbors=1)
mycl2 = RandomForestClassifier(random_state=1)
mycl3 = GaussianNB()

# Defining meta model
mylr = LogisticRegression()

# Creating stacking classifier with the above models
stackingclf = StackingClassifier(classifiers=[mycl1, mycl2, mycl3], meta_classifier=mylr)

# Let's start!
print('Doing 3-fold cross validation here:\n')
for iterclf, iterlabel in zip([mycl1, mycl2, mycl3, stackingclf],
                              ['K-Nearest Neighbour Model',
                               'Random Forest Model',
                               'Naive Bayes Model',
                               'Stacking Classifier Model']):
    scores = model_selection.cross_val_score(iterclf, X_train, y_train, cv=3, scoring='accuracy')
    print('Accuracy: %.3f (+/- %.3f) [%s]' % (scores.mean(), scores.std(), iterlabel))

Paste the code into your Python interpreter and have a go!

Reference

  1. Kaggle 机器学习之模型融合(Stacking)心得
  2. Stacking and Blending in Ensemble Machine Learning
  3. Stacking in Machine Learning
  4. Pattern Recognition and Machine Learning, Christopher M. Bishop