Ensemble Learning 2: Principles of Bagging with a Case Study


How bagging works

  1. Bootstrap sampling:

    • Randomly draw, with replacement, a sample of the same size as the original dataset (see the sketch after this list)
    • The feature columns can be resampled in the same way
  2. Reduces model variance

    • Because every base model is trained on a different sample, the models differ slightly from one another; aggregating them effectively lowers the variance of the final model and improves its generalization
  3. Differences from plain voting

    • The base models can all be the same type of model
    • At the aggregation step, the procedure is identical to voting
  4. Drawbacks

    • Bagging cannot reduce the model's bias: if the base model itself performs poorly, no amount of tuning the bagging ensemble will produce good results
    • Training takes longer, since many base models must be fit
    • Boosting, by contrast, can reduce both variance and bias
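
To make item 1 concrete, here is a minimal sketch of bootstrap sampling with NumPy and pandas; the toy DataFrame and variable names are illustrative only, not part of the case study below:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({'x1': range(10), 'x2': range(10, 20)})

# Draw row indices with replacement, keeping the sample the same size as the
# original data: some rows appear several times, others not at all (on average
# about 63.2% of the distinct rows land in each bootstrap sample).
idx = rng.integers(0, len(data), size=len(data))
boot_rows = data.iloc[idx]

# Feature columns can be resampled the same way (here with replacement,
# matching the bootstrap_features=True setting used later in the case study).
cols = rng.choice(data.columns.to_numpy(), size=data.shape[1], replace=True)
boot_features = data[cols]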

A bagging case study

Reading the data

We will classify wines by their chemical attributes, using the UCI wine dataset.

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.head()
   0      1     2     3     4    5     6     7     8     9     10    11    12    13
0  1  14.23  1.71  2.43  15.6  127  2.80  3.06  0.28  2.29  5.64  1.04  3.92  1065
1  1  13.20  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05  3.40  1050
2  1  13.16  2.36  2.67  18.6  101  2.80  3.24  0.30  2.81  5.68  1.03  3.17  1185
3  1  14.37  1.95  2.50  16.8  113  3.85  3.49  0.24  2.18  7.80  0.86  3.45  1480
4  1  13.24  2.59  2.87  21.0  118  2.80  2.69  0.39  1.82  4.32  1.04  2.93   735
df_wine.columns = ['label', 'Alcohol', 'Malic ac', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()
   label  Alcohol  Malic ac   Ash  Alcalinity of ash  Magnesium  Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  Color intensity   Hue  OD280/OD315 of diluted wines  Proline
0      1    14.23      1.71  2.43               15.6        127           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92     1065
1      1    13.20      1.78  2.14               11.2        100           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40     1050
2      1    13.16      2.36  2.67               18.6        101           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17     1185
3      1    14.37      1.95  2.50               16.8        113           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45     1480
4      1    13.24      2.59  2.87               21.0        118           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93      735

Splitting into training and test sets

from sklearn.model_selection import train_test_split

Y = df_wine.label
X = df_wine.iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
X_train.shape, X_test.shape
((124, 13), (54, 13))
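
A side note on the split: the dataset has three classes, so if you want the class proportions preserved in both halves you can pass stratify; this is a variant, not what the run above used:

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)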

Building the bagging classifier

from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np

tree = DecisionTreeClassifier()

# 100 trees, each fit on a bootstrap sample of both rows and feature columns
# (note: scikit-learn >= 1.2 renamed base_estimator to estimator)
bag_model = BaggingClassifier(base_estimator=tree,
                              n_estimators=100,
                              max_samples=1.0,          # each bootstrap sample is as large as the training set
                              max_features=1.0,         # draw as many features as there are columns
                              bootstrap=True,           # sample rows with replacement
                              bootstrap_features=True,  # sample features with replacement too
                              n_jobs=-1,
                              random_state=1)

# 10-fold stratified CV repeated 3 times: 30 accuracy scores per model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores_en = cross_val_score(bag_model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
n_scores_tree = cross_val_score(DecisionTreeClassifier(), X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report mean accuracy and its standard deviation across the 30 folds
print('Bagging:Accuracy: %.3f (%.3f)' % (np.mean(n_scores_en), np.std(n_scores_en)))
print('tree   :Accuracy: %.3f (%.3f)' % (np.mean(n_scores_tree), np.std(n_scores_tree)))
Bagging:Accuracy: 0.979 (0.041)
tree   :Accuracy: 0.922 (0.075)

On the training data, the cross-validated accuracy improves by about 5 percentage points (0.922 → 0.979), which is modest, but the standard deviation of the scores drops by nearly half (0.075 → 0.041): bagging effectively reduces the model's variance.
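
To see how the ensemble size drives this effect, here is a quick sketch that reuses the cv object from above and sweeps n_estimators (the exact numbers will vary):

# compare mean accuracy and score spread as the ensemble grows
for n in [1, 10, 50, 100]:
    model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=n, random_state=1)
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
    print('n_estimators=%3d: %.3f (%.3f)' % (n, np.mean(scores), np.std(scores)))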

Better generalization

from sklearn.metrics import accuracy_score

# single decision tree as the baseline
base_tree = DecisionTreeClassifier()
base_tree.fit(X_train, y_train)
y_test_pred = base_tree.predict(X_test)
print('tree   :Accuracy: %.3f' % (accuracy_score(y_pred=y_test_pred, y_true=y_test)))

# bagging ensemble, evaluated on the same held-out test set
bag_model.fit(X_train, y_train)
y_test_pred = bag_model.predict(X_test)
print('Bagging:Accuracy: %.3f' % (accuracy_score(y_pred=y_test_pred, y_true=y_test)))
tree   :Accuracy: 0.944
Bagging:Accuracy: 0.981

On the test set, too, bagging is more accurate than the single decision tree (0.981 vs 0.944).
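
Bagging also offers a built-in estimate of generalization that needs no separate test set: each tree can be scored on the rows left out of its bootstrap sample. A minimal sketch using the standard oob_score option of BaggingClassifier:

oob_model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=100,
                              bootstrap=True,   # required for out-of-bag scoring
                              oob_score=True,
                              random_state=1)
oob_model.fit(X_train, y_train)
print('OOB    :Accuracy: %.3f' % oob_model.oob_score_)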

Reference: "Python Machine Learning"
