Bagging is short for bootstrap aggregating: instead of fitting the individual classifiers in the ensemble on the same training set, each classifier is fit on a bootstrap sample (a random sample drawn with replacement) from the initial training set.
A random forest is a special case of bagging in which a randomly drawn subset of the features is also used when fitting each decision tree.
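As a minimal sketch of the bootstrap sampling idea (drawing n samples with replacement from a training set of size n), using NumPy on a toy array of sample indices:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.arange(10)  # a toy "training set" of 10 sample indices

# One bootstrap sample: draw len(data) indices with replacement.
boot = rng.choice(data, size=len(data), replace=True)

# With replacement, some samples repeat and others are left out;
# on average about 63.2% of the unique samples appear in each draw.
print(sorted(boot.tolist()))
print(len(np.unique(boot)))
```

Each base classifier in the ensemble would be fit on a different such draw, which is what makes the individual trees differ from one another.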
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
df_wine = pd.read_csv("xxx\\wine.data",
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
df_wine = df_wine[df_wine['Class label'] != 1]
y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'OD280/OD315 of diluted wines']].values
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=1,
                                                    stratify=y)
# scikit-learn already implements bagging as BaggingClassifier, which can be imported
# from the ensemble submodule. Here we use an unpruned decision tree as the base
# classifier and create an ensemble of 500 trees, each fit on a different bootstrap
# sample of the training set:
tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=None,
                              random_state=1)
bag = BaggingClassifier(base_estimator=tree,  # the base classifier (renamed to `estimator` in scikit-learn >= 1.2)
                        n_estimators=500,
                        max_samples=1.0,  # number of samples drawn from X to train each base estimator; an int is an absolute count, a float draws max_samples * X.shape[0] samples
                        max_features=1.0,  # number of features drawn from X to train each base estimator; an int is an absolute count, a float draws max_features * X.shape[1] features
                        bootstrap=True,  # draw samples with replacement
                        bootstrap_features=False,  # draw features without replacement
                        n_jobs=1,  # number of jobs to run in parallel for fit and predict
                        random_state=1)  # seed used by the random number generator
# Compare the performance of the bagging classifier with that of a single unpruned decision tree:
tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
# The unpruned decision tree predicts the class labels of all training samples correctly.
# However, the much lower test accuracy indicates that the model has high variance (overfitting).
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))
bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)
bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print('Bagging train/test accuracies %.3f/%.3f'
      % (bag_train, bag_test))
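As a side note not in the original listing, BaggingClassifier also supports out-of-bag (OOB) evaluation: since each tree sees only a bootstrap sample, the samples it never saw provide a free validation estimate. A minimal sketch, using a synthetic dataset from make_classification purely to keep the example self-contained (the estimator is passed positionally so the snippet works across scikit-learn versions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=1)

bag = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                        n_estimators=100,
                        oob_score=True,   # score each tree on the samples it did not see
                        random_state=1)
bag.fit(X, y)
# oob_score_ is an accuracy estimate without a separate validation split.
print('OOB accuracy: %.3f' % bag.oob_score_)
```

This can be a convenient alternative to a hold-out set when training data is scarce.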
# Compare the decision regions of the decision tree and the bagging classifier:
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
f, axarr = plt.subplots(nrows=1, ncols=2,
                        sharex='col',
                        sharey='row',
                        figsize=(8, 3))
for idx, clf, tt in zip([0, 1],
                        [tree, bag],
                        ['Decision tree', 'Bagging']):
    clf.fit(X_train, y_train)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train == 0, 0],
                       X_train[y_train == 0, 1],
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train == 1, 0],
                       X_train[y_train == 1, 1],
                       c='green', marker='o')
    axarr[idx].set_title(tt)
axarr[0].set_ylabel('Alcohol', fontsize=12)
plt.text(10.2, -0.5,
         s='OD280/OD315 of diluted wines',
         ha='center', va='center', fontsize=12)
plt.tight_layout()
#plt.savefig('images/07_08.png', dpi=300, bbox_inches='tight')
plt.show()
Output:
Decision tree train/test accuracies 1.000/0.833
Bagging train/test accuracies 1.000/0.917
Result figure: decision regions of the single decision tree (left) and the bagging ensemble (right).
Note: in practice, more complex classification tasks and high-dimensional datasets can easily make a single decision tree overfit, and this is where bagging really plays to its strengths. Finally, note that while bagging effectively reduces model variance, it is ineffective at reducing model bias.
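The closing remark can be illustrated with a small sketch (the make_moons dataset and the depth-1 "stump" base learner are illustrative choices, not part of the original example): bagging a high-bias learner averages away variance but leaves its underfitting largely intact.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

# A depth-1 tree ("stump") is a high-bias base learner: it underfits the moons.
stump = DecisionTreeClassifier(max_depth=1, random_state=1)
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=1, random_state=1),
                        n_estimators=500, random_state=1)

stump.fit(X_tr, y_tr)
bag.fit(X_tr, y_tr)
# Both accuracies stay near the stump's ceiling: bagging reduces variance,
# not the bias baked into the base learner.
print('Stump  test accuracy: %.3f' % stump.score(X_te, y_te))
print('Bagged test accuracy: %.3f' % bag.score(X_te, y_te))
```

This is why bagging pairs best with low-bias, high-variance learners such as the unpruned decision tree used above, whereas boosting methods are the usual tool when bias is the problem.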