Bagging principles
- Bootstrap sampling (see the from-scratch sketch after this list):
  - Randomly draw samples from the original dataset with replacement, keeping the sample count equal to the original dataset's
  - The feature attributes can also be sampled
- Reduces model variance
  - Each base model is trained on a different bootstrap sample, so the models differ slightly from one another; aggregating them averages out much of the individual noise. For n independent base models each with variance sigma^2, the variance of their average is sigma^2/n, so the final model's variance drops and generalization improves.
- Differences from a plain voting ensemble
  - The base models can all be of the same type
  - The aggregation step is the same majority vote used by a voting ensemble
- Drawbacks
  - Bagging cannot reduce model bias: if the base model itself performs poorly, no amount of tuning the bagging ensemble will yield good results
  - Training takes longer
  - (Boosting, by contrast, can reduce both variance and bias)
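To make the bootstrap-plus-majority-vote mechanics concrete, here is a minimal from-scratch sketch of a bagging classifier. The class name SimpleBagging and its defaults are illustrative assumptions, not part of any library:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleBagging:
    """Minimal bagging sketch: bootstrap resampling + majority vote."""
    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        n = len(X)
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # bootstrap: draw n row indices *with replacement*
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier().fit(X[idx], y[idx])
            self.estimators_.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # shape (n_estimators, n_samples): one row of predictions per base model
        preds = np.array([est.predict(X) for est in self.estimators_])
        # majority vote per sample (labels assumed to be small non-negative ints)
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, preds.astype(int))
```

Usage mirrors any sklearn-style estimator, e.g. `SimpleBagging(n_estimators=100).fit(X_train, y_train).predict(X_test)`.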
Bagging example
Reading the data
The task: classify wines by their chemical attributes (the UCI Wine dataset).
import pandas as pd
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
header=None)
df_wine.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
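If the UCI URL is ever unreachable, the same dataset also ships with scikit-learn; a sketch using sklearn.datasets.load_wine (note its class labels run 0-2 rather than the 1-3 in the UCI file):

```python
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
df_alt = wine.frame  # 13 feature columns plus a `target` column (labels 0-2)
```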
df_wine.columns = ['label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()
|   | label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
Train/test split
Y = df_wine.label
X = df_wine.iloc[:,1:]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size = 0.30,random_state = 1)
X_train.shape,X_test.shape
((124, 13), (54, 13))
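Since the three wine classes are not perfectly balanced, one might also stratify the split so both sets keep the original class proportions; a sketch using train_test_split's stratify parameter:

```python
# Stratified variant: preserves the class proportions of Y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y)
```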
Building the bagging classifier
from sklearn.model_selection import cross_val_score,RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
tree = DecisionTreeClassifier()
# note: in scikit-learn >= 1.2 this parameter is named `estimator`
# instead of `base_estimator`
bag_model = BaggingClassifier(base_estimator=tree,
                              n_estimators=100,         # number of base models
                              max_samples=1.0,          # each bootstrap sample is as large as the training set
                              max_features=1.0,
                              bootstrap=True,           # sample rows with replacement
                              bootstrap_features=True,  # also sample features with replacement
                              n_jobs=-1,
                              random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores_en = cross_val_score(bag_model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
n_scores_tree = cross_val_score(DecisionTreeClassifier(), X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Bagging:Accuracy: %.3f (%.3f)' % (np.mean(n_scores_en), np.std(n_scores_en)))
print('tree :Accuracy: %.3f (%.3f)' % (np.mean(n_scores_tree), np.std(n_scores_tree)))
Bagging:Accuracy: 0.979 (0.041)
tree :Accuracy: 0.922 (0.075)
On the cross-validated training data, bagging raises mean accuracy by roughly 5 percentage points over a single tree, a modest gain, but the standard deviation of the scores drops by nearly half (from 0.075 to 0.041): bagging effectively reduces model variance.
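A bagging-specific bonus: each bootstrap sample leaves out roughly 37% of the training rows, so BaggingClassifier can score every base model on its own left-out rows via the oob_score option. A minimal sketch reusing X_train/y_train from above (the parameter values are illustrative):

```python
# Out-of-bag (OOB) evaluation: each base model is validated on the
# ~37% of training rows absent from its bootstrap sample, giving a
# generalization estimate without a separate validation set.
oob_model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=100,
                              bootstrap=True,   # OOB requires bootstrap sampling
                              oob_score=True,   # collect OOB accuracy during fit
                              random_state=1)
oob_model.fit(X_train, y_train)
print('OOB accuracy: %.3f' % oob_model.oob_score_)
```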
Improved generalization
from sklearn.metrics import accuracy_score
base_tree = DecisionTreeClassifier()
base_tree.fit(X_train,y_train)
y_test_pred = base_tree.predict(X_test)
print('tree :Accuracy: %.3f' % (accuracy_score(y_pred=y_test_pred,y_true=y_test)))
bag_model.fit(X_train,y_train)
y_test_pred = bag_model.predict(X_test)
print('Bagging:Accuracy: %.3f' % (accuracy_score(y_pred=y_test_pred,y_true=y_test)))
tree :Accuracy: 0.944
Bagging:Accuracy: 0.981
On the held-out test set, bagging's accuracy is likewise higher than that of the single decision tree.
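To see how the ensemble size drives the variance reduction, one can sweep n_estimators under the same cross-validation setup; a sketch (the grid of values is an illustrative assumption):

```python
# Sweep the ensemble size to watch the score variance shrink as more
# base models are averaged (diminishing returns are expected).
for n in [1, 10, 50, 100, 200]:
    model = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=n, n_jobs=-1, random_state=1)
    scores = cross_val_score(model, X_train, y_train,
                             scoring='accuracy', cv=cv, n_jobs=-1)
    print('n_estimators=%3d: %.3f (%.3f)' % (n, scores.mean(), scores.std()))
```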
Reference: Python Machine Learning