Ensemble Learning: Blending


to2

Tags: data mining, machine learning

Article link: https://blog.csdn.net/to222/article/details/116646924


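The Blending Algorithm

Core Idea

Blending is an ensemble learning approach. It is conceptually simple and involves little mathematical theory: it combines several models in order to reduce variance and thereby improve predictive accuracy.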

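Algorithm Workflow

1. Split the dataset into a training set, a validation set, and a test set.
2. Build the Blending framework. Blending has two stages. The first stage can consist of several base models, which may be homogeneous or heterogeneous. The second-stage model fuses the output features of the first-stage models and produces the final output.
3. Train the first-stage models on the training set, then run the trained models on both the validation set and the test set to obtain their outputs on each. The feature vectors formed from the validation-set outputs serve as the training set for the second-stage model, and the first-stage outputs on the test set serve as its test data. The second-stage model's predictions on that test data are the final output.
The detailed Blending workflow is shown in the figure:
[Figure: detailed workflow of the Blending algorithm]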
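
The three steps above can also be written compactly. The notation below is an assumption introduced only for illustration (it does not appear in the original article): the three data splits are D_train, D_val and D_test, the K first-stage models are f_1, ..., f_K, and the second-stage model is g.

```latex
% Stage 1: each base model is fit on the training set only.
f_k \leftarrow \operatorname{fit}\bigl(D_{\text{train}}\bigr), \qquad k = 1, \dots, K

% Meta-features: the concatenated first-stage outputs for an input x.
z(x) = \bigl[\, f_1(x),\; f_2(x),\; \dots,\; f_K(x) \,\bigr]

% Stage 2: the second-stage model is fit on validation-set meta-features.
g \leftarrow \operatorname{fit}\bigl(\{\, (z(x_i),\, y_i) : (x_i, y_i) \in D_{\text{val}} \,\}\bigr)

% Final prediction for a test input x.
\hat{y}(x) = g\bigl(z(x)\bigr), \qquad x \in D_{\text{test}}
```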

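Advantages and disadvantages of Blending:

Advantages: Blending is largely an engineering practice with little mathematical theory, so it is easy to understand, and the model is highly extensible.
Disadvantages: Because Blending stitches several models together, it is often computationally expensive. In addition, the second stage is trained only on the validation set, which wastes data and can reduce the model's accuracy.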
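
To make the data-usage drawback concrete, here is a small sketch of the arithmetic. The sample counts are the ones printed by the iris example later in this article. The variable names are only illustrative.

```python
# Sample counts taken from the shapes printed in the iris example of this article.
n_train, n_val, n_test = 84, 36, 30

# All labeled samples that are not held out for testing.
n_labeled = n_train + n_val

# Fraction of the non-test data that the second-stage model is trained on.
val_share = n_val / n_labeled

print(f"The second stage is trained on {n_val} of {n_labeled} "
      f"non-test samples ({val_share:.0%}).")
```

With this split, only the validation portion of the non-test data reaches the second-stage model, and the first-stage models never see that portion.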

Blending implementation (using the iris dataset):

# Load the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline
import seaborn as sns

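# Create the data
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Load the iris dataset (all four features); X and y below refer to it
iris = datasets.load_iris()
X = iris.data
y = iris.target
## Split off the test set
X_train1,X_test,y_train1,y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
## Split the remaining data into a training set and a validation set
X_train,X_val,y_train,y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1, stratify=y_train1)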
print("The shape of training X:",X_train.shape)
print("The shape of training y:",y_train.shape)
print("The shape of test X:",X_test.shape)
print("The shape of test y:",y_test.shape)
print("The shape of validation X:",X_val.shape)
print("The shape of validation y:",y_val.shape)
'''
output:
The shape of training X: (84, 4)
The shape of training y: (84,)
The shape of test X: (30, 4)
The shape of test y: (30,)
The shape of validation X: (36, 4)
The shape of validation y: (36,)
'''

# Set up the first-layer classifiers
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

clfs = [SVC(probability = True),RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),KNeighborsClassifier()]

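# Set up the second-layer model
# LinearRegression is a regressor, so cross_val_score below reports R^2 rather than accuracy
from sklearn.linear_model import LinearRegression
lr = LinearRegression()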

# Compute the first-layer outputs on the validation set and the test set
for i,clf in enumerate(clfs):
    clf.fit(X_train,y_train)
    val_feature = clf.predict_proba(X_val)
    test_feature = clf.predict_proba(X_test)
    val_features = (np.c_[val_features,val_feature] if i != 0 else val_feature)
    test_features = (np.c_[test_features,test_feature] if i != 0 else test_feature)

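# Train the second-layer model on the first layer's validation-set outputs
lr.fit(val_features,y_val)
blending = lr
# Score the second layer on the first layer's test-set outputs
# Note: cross_val_score clones lr and refits it on folds of test_features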
from sklearn.model_selection import cross_val_score
cross_val_score(lr,test_features,y_test,cv=3)
'''
array([0.97764625, 0.97134138, 0.8147559 ])
'''

# Evaluate each model on its own
svc = SVC(probability = True)
rf = RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini')
knn = KNeighborsClassifier()
lr = LinearRegression()
# Map names to model objects instead of looking them up with eval()
models = {'svc': svc, 'rf': rf, 'knn': knn, 'lr': lr}
for model in models.values():
    model.fit(X_train1,y_train1)
for name, model in models.items():
    print(name, cross_val_score(model,X_test,y_test,cv=3))
'''
output:
svc [0.9 1.  0.7]
rf [1.  1.  0.9]
knn [1.  1.  0.7]
lr [0.89329241 0.91247512 0.84146873]
'''

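# Plot the decision boundaries with Mlxtend
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from mlxtend.plotting import plot_decision_regions
# Set the DPI before creating the figure so that it takes effect
plt.rcParams['savefig.dpi'] = 600 # resolution of saved images
plt.rcParams['figure.dpi'] = 600 # resolution of displayed figures
gs = gridspec.GridSpec(3, 2)
fig = plt.figure(figsize=(16, 18))

# Keep only two features so the decision regions can be drawn in 2D
X = iris.data[:,1:3]

labels = ['svc',
          'rf',
          'knn',
          'lr',
          'blending']
# Note: every model, including `blending` (the second-layer model), is refit here
# on the two raw features, so the blending panel does not show the two-stage pipeline
for clf, lab, grd in zip([svc, rf, knn, lr, blending],
                         labels,
                         itertools.product([0, 1, 2],
                         [0,1])):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    plot_decision_regions(X=X, y=y, clf=clf, legend=2, ax=ax)
    plt.title(lab)
plt.show()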

The decision boundaries are shown in the figure below:
[Figure: decision boundaries of svc, rf, knn, lr, and blending]
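
The evaluation and plotting code above works on the second-layer model alone. As a complement, the following is a minimal sketch of how the whole two-stage pipeline could be wrapped behind a single fit/predict interface and scored directly on the held-out test set. Several things here are assumptions that are not part of the original code: the `SimpleBlender` class is a hypothetical helper, `LogisticRegression` is swapped in as the second-stage model so that the output is a class label, and `random_state` values are fixed for repeatability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


class SimpleBlender:
    """Hypothetical helper: two-stage blending behind fit/predict."""

    def __init__(self, base_models, meta_model, val_size=0.3, random_state=1):
        self.base_models = base_models
        self.meta_model = meta_model
        self.val_size = val_size
        self.random_state = random_state

    def _meta_features(self, X):
        # Concatenate the class probabilities predicted by every base model.
        return np.hstack([m.predict_proba(X) for m in self.base_models])

    def fit(self, X, y):
        # Hold out a validation split that only the second stage trains on.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.val_size,
            random_state=self.random_state, stratify=y)
        for m in self.base_models:
            m.fit(X_tr, y_tr)
        self.meta_model.fit(self._meta_features(X_val), y_val)
        return self

    def predict(self, X):
        # First-stage outputs become the input of the second stage.
        return self.meta_model.predict(self._meta_features(X))


X, y = load_iris(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

blender = SimpleBlender(
    base_models=[SVC(probability=True, random_state=1),
                 RandomForestClassifier(n_estimators=5, random_state=1),
                 KNeighborsClassifier()],
    meta_model=LogisticRegression(max_iter=1000),  # assumption: classifier as stage 2
)
blender.fit(X_rest, y_rest)

# Score the fitted pipeline once on the untouched test set.
test_accuracy = accuracy_score(y_test, blender.predict(X_test))
print("Blending test accuracy:", test_accuracy)
```

Because such an object exposes `fit` and `predict`, it should also be usable wherever a single classifier is expected, for example when drawing decision regions for the full pipeline rather than for the second-layer model alone.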

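Summary

In this example, the first stage of Blending used an SVM, a random forest, and k-nearest neighbors as base classifiers, and the second stage used a linear regression model (LinearRegression in the code above). The results were reported with 3-fold cross-validation. The individual models were then evaluated on the same test set for comparison. Note that the LinearRegression scores are R^2 values while the classifiers' scores are accuracies, so the two are not directly comparable. The comparison suggests that Blending is not always effective, so whether to use it depends on the specific problem.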

References:
https://blog.csdn.net/qq_42008588/article/details/116422129
The main content of this article comes from the Datawhale open-source course.
