Bagging principles
- Bootstrap sampling (see the from-scratch sketch after this list):
  - Randomly draw samples from the original dataset with replacement, keeping the sample count equal to the original dataset's
  - The feature attributes can also be sampled
- Reduces model variance
  - Each base model is trained on a different bootstrap sample, so the models differ slightly from one another; aggregating them averages out much of the individual noise. For n independent base models each with variance sigma^2, the variance of their average is sigma^2/n, so the final model's variance drops and generalization improves.
- Differences from a plain voting ensemble
  - The base models can all be of the same type
  - The aggregation step is the same majority vote used by a voting ensemble
- Drawbacks
  - Bagging cannot reduce model bias: if the base model itself performs poorly, no amount of tuning the bagging ensemble will yield good results
  - Training takes longer
  - (Boosting, by contrast, can reduce both variance and bias)
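To make the bootstrap-plus-majority-vote mechanics concrete, here is a minimal from-scratch sketch of a bagging classifier. The class name SimpleBagging and its defaults are illustrative assumptions, not part of any library:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleBagging:
    """Minimal bagging sketch: bootstrap resampling + majority vote."""
    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        n = len(X)
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # bootstrap: draw n row indices *with replacement*
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier().fit(X[idx], y[idx])
            self.estimators_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # shape (n_estimators, n_samples): one row of predictions per base model
        preds = np.array([est.predict(X) for est in self.estimators_])
        # majority vote per sample (labels assumed to be small non-negative ints)
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, preds.astype(int))
```

Usage mirrors any sklearn-style estimator, e.g. `SimpleBagging(n_estimators=100).fit(X_train, y_train).predict(X_test)`.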
Bagging example
Reading the data
The task: classify wines by their chemical attributes (the UCI Wine dataset).
import pandas as pd
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
header=None)
df_wine.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
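If the UCI URL is ever unreachable, the same dataset also ships with scikit-learn; a sketch using sklearn.datasets.load_wine (note its class labels run 0-2 rather than the 1-3 in the UCI file):

```python
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
df_alt = wine.frame  # 13 feature columns plus a `target` column (labels 0-2)
```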
df_wine.columns = ['label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()
|   | label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
Train/test split
Y = df_wine.label
X = df_wine.iloc[:,1:]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size = 0.30,random_state = 1)
X_train.shape,X_test.shape
((124, 13), (54, 13))
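Since the three wine classes are not perfectly balanced, one might also stratify the split so both sets keep the original class proportions; a sketch using train_test_split's stratify parameter:

```python
# Stratified variant: preserves the class proportions of Y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
```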
Building the bagging classifier
from sklearn.model_selection import cross_val_score,RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
tree = DecisionTreeClassifier()
# note: in scikit-learn >= 1.2 this parameter is named `estimator`
# instead of `base_estimator`
bag_model = BaggingClassifier(base_estimator=tree,
                              n_estimators=100,         # number of base models
                              max_samples=1.0,          # each bootstrap sample is as large as the training set
                              max_features=1.0,
                              bootstrap=True,           # sample rows with replacement
                              bootstrap_features=True,  # also sample features with replacement
                              n_jobs=-1,
                              random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores_en = cross_val_score(bag_model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
n_scores_tree = cross_val_score(DecisionTreeClassifier(), X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Bagging:Accuracy: %.3f (%.3f)' % (np.mean(n_scores_en), np.std(n_scores_en)))
print('tree :Accuracy: %.3f (%.3f)' % (np.mean(n_scores_tree), np.std(n_scores_tree)))
Bagging:Accuracy: 0.979 (0.041)
tree :Accuracy: 0.922 (0.075)
On the cross-validated training data, bagging raises mean accuracy by roughly 5 percentage points over a single tree, a modest gain, but the standard deviation of the scores drops by nearly half (from 0.075 to 0.041): bagging effectively reduces model variance.
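A bagging-specific bonus: each bootstrap sample leaves out roughly 37% of the training rows, so BaggingClassifier can score every base model on its own left-out rows via the oob_score option. A minimal sketch reusing X_train/y_train from above (the parameter values are illustrative):

```python
# Out-of-bag (OOB) evaluation: each base model is validated on the
# ~37% of training rows absent from its bootstrap sample, giving a
# generalization estimate without a separate validation set.
oob_model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=100,
                              bootstrap=True,   # OOB requires bootstrap sampling
                              oob_score=True,   # collect OOB accuracy during fit
                              random_state=1)
oob_model.fit(X_train, y_train)
print('OOB accuracy: %.3f' % oob_model.oob_score_)
```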
Improved generalization
from sklearn.metrics import accuracy_score
base_tree = DecisionTreeClassifier()
base_tree.fit(X_train,y_train)
y_test_pred = base_tree.predict(X_test)
print('tree :Accuracy: %.3f' % (accuracy_score(y_pred=y_test_pred,y_true=y_test)))
bag_model.fit(X_train,y_train)
y_test_pred = bag_model.predict(X_test)
print('Bagging:Accuracy: %.3f' % (accuracy_score(y_pred=y_test_pred,y_true=y_test)))
tree :Accuracy: 0.944
Bagging:Accuracy: 0.981
On the held-out test set, bagging's accuracy is likewise higher than that of the single decision tree.
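To see how the ensemble size drives the variance reduction, one can sweep n_estimators under the same cross-validation setup; a sketch (the grid of values is an illustrative assumption):

```python
# Sweep the ensemble size to watch the score variance shrink as more
# base models are averaged (diminishing returns are expected).
for n in [1, 10, 50, 100, 200]:
    model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=n, n_jobs=-1, random_state=1)
    scores = cross_val_score(model, X_train, y_train,
                             scoring='accuracy', cv=cv, n_jobs=-1)
    print('n_estimators=%3d: %.3f (%.3f)' % (n, scores.mean(), scores.std()))
```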
Reference: Python Machine Learning