Ensemble Learning 3: Boosting, Principles and a Case Study

Earlier posts in this series: 1. Voting; 2. Bagging

The principle of boosting

[Figure: a toy binary classification problem; panels 1-3 show successive decision stumps trained on reweighted samples, panel 4 their combined vote]

As the figure above shows, the task is a binary classification problem. To solve it, we train a single-level decision tree (a stump) of depth 1.

  • Panel 1:
    • On the original distribution, minimizing a cost function (impurity, for example) yields a first decision boundary. Two circles are misclassified, so their weights are increased and the weights of the correctly classified samples are decreased, producing the distribution in panel 2.
  • Panel 2:
    • Because the two circles misclassified in the previous round now carry larger weights, a new decision boundary is produced.
    • The weights of the two correctly classified large circles and of all the triangles are lowered further, while the three misclassified circles in the upper right are up-weighted, producing the distribution in panel 3.
  • Panel 3:
    • A new decision boundary is produced.
  • Panel 4:
    • A majority vote over the classifiers from panels 1, 2, and 3 gives the combined result shown in panel 4.

So, simply put, boosting repeatedly increases the weights of misclassified samples and decreases the weights of correctly classified ones, generating a sequence of different weak learners, and then combines them, for example by majority vote, into the final classifier. The sketch below makes this loop concrete.
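To illustrate the reweight-and-vote loop, here is a minimal from-scratch sketch of binary AdaBoost for labels in {-1, +1}. The function names (adaboost_fit, adaboost_predict) and the choice of sklearn stumps as weak learners are illustrative assumptions, not code from the original post.

# Minimal binary AdaBoost sketch; assumes y contains only -1 and +1
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))          # weighted training error (w sums to 1)
        if err >= 0.5:                         # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / err)  # this learner's weight in the vote
        w *= np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
        w /= w.sum()                           # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # final prediction = sign of the weighted vote of all weak learners
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

Note how alpha grows as the weighted error shrinks, so the more accurate learners get a larger say in the final vote.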

A boosting case study

Loading the data

# Import the usual data-science toolkit:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import seaborn as sns
# Load the training data:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols','Flavanoids', 'Nonflavanoid phenols', 
                'Proanthocyanins','Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline']
# Inspect the data:
print("Class labels",np.unique(wine["Class label"]))
wine.head()
Class labels [1 2 3]
|   | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
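If the UCI download URL is ever unreachable, the same Wine dataset ships with scikit-learn. This alternative loader is an addition for convenience, not part of the original post; note that sklearn codes the classes as 0/1/2 rather than 1/2/3.

# Alternative loader (addition): the Wine dataset bundled with scikit-learn
from sklearn.datasets import load_wine
wine_sk = load_wine(as_frame=True)
df = wine_sk.frame              # 13 feature columns plus a 'target' column
print(df['target'].unique())    # classes are coded 0/1/2 here, not 1/2/3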

Splitting the data

y = wine['Class label'].values
X = wine[['Alcohol','OD280/OD315 of diluted wines']].values

# Split into training and test sets at a 7:3 ratio
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)  # stratify=y keeps the class proportions equal across the splits
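As a quick sanity check (an addition, not in the original post), you can confirm that the stratified split preserved the class proportions:

# Class proportions should be nearly identical across the full set and both splits
for name, arr in [('full', y), ('train', y_train), ('test', y_test)]:
    labels, counts = np.unique(arr, return_counts=True)
    print(name, dict(zip(labels, np.round(counts / counts.sum(), 3))))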

The weak learner: a decision stump

# Fit a single decision stump as a baseline
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth = 1)
from sklearn.metrics import accuracy_score
tree = tree.fit(X_train,y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test,y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train,tree_test))
Decision tree train/test accuracies 0.597/0.611
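The stump's low accuracy is expected: it makes exactly one split on one feature, which cannot separate three classes. You can inspect that split with sklearn's export_text helper (this peek is an addition to the original post):

# Print the single split the fitted stump learned
from sklearn.tree import export_text
print(export_text(tree, feature_names=['Alcohol', 'OD280/OD315 of diluted wines']))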

AdaBoost

# AdaBoost via sklearn, with a decision tree as the base classifier
'''
Relevant AdaBoostClassifier parameters:
base_estimator: the base classifier, DecisionTreeClassifier(max_depth=1) by default
n_estimators: the maximum number of boosting iterations
learning_rate: the learning rate, which shrinks each classifier's contribution
algorithm: the boosting variant, {'SAMME','SAMME.R'}, default='SAMME.R'
random_state: the random seed
'''
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(base_estimator=tree,n_estimators=300,learning_rate=0.01,random_state=1)
ada = ada.fit(X_train,y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train,y_train_pred)
ada_test = accuracy_score(y_test,y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train,ada_test))
Adaboost train/test accuracies 0.855/0.852
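To watch how the ensemble improves as weak learners are added, AdaBoostClassifier exposes staged_predict, which yields the ensemble's prediction after each boosting round. This diagnostic is an addition, reusing the ada model and imports from above.

# Test accuracy after each boosting round
test_scores = [accuracy_score(y_test, y_hat) for y_hat in ada.staged_predict(X_test)]
plt.plot(range(1, len(test_scores) + 1), test_scores)
plt.xlabel('Number of weak learners')
plt.ylabel('Test accuracy')
plt.show()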

Comparing the results

  1. The AdaBoost model built on depth-1 decision stumps clearly outperforms the single stump on both the training set and the test set (0.855/0.852 versus 0.597/0.611).
  2. The decision-boundary plots below show that the AdaBoost model is more complex: its boundary is much more jagged, which can lead to overfitting.
# Plot the decision boundaries of the decision stump and of AdaBoost:
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),np.arange(y_min, y_max, 0.1))
f, axarr = plt.subplots(nrows=1, ncols=2,sharex='col',sharey='row',figsize=(12, 6))
for idx, clf, tt in zip([0, 1],[tree, ada],['Decision tree', 'Adaboost']):
    clf.fit(X_train, y_train)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==1, 0],X_train[y_train==1, 1],c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==2, 0],X_train[y_train==2, 1],c='red', marker='o')
    axarr[idx].scatter(X_train[y_train==3, 0],X_train[y_train==3, 1],c='green', marker='x')
    axarr[idx].set_title(tt)
axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)  # y-axis is column 1 of X
plt.tight_layout()
plt.text(0, -0.2,s='Alcohol',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)  # x-axis is column 0 of X
plt.show()

[Figure: decision boundaries of the single decision stump (left) and of AdaBoost (right) on the two wine features]

When the weak learner is made stronger:

def compare_models(max_depth=2):
    # Refit the base tree and AdaBoost at the given depth, print their
    # train/test accuracies, and plot the two decision boundaries.
    tree = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=max_depth)
    ada = AdaBoostClassifier(base_estimator=tree,n_estimators=300,learning_rate=0.01,random_state=1)
    tree = tree.fit(X_train,y_train)
    y_train_pred = tree.predict(X_train)
    y_test_pred = tree.predict(X_test)
    tree_train = accuracy_score(y_train,y_train_pred)
    tree_test = accuracy_score(y_test,y_test_pred)
    print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train,tree_test))

    ada = ada.fit(X_train,y_train)
    y_train_pred = ada.predict(X_train)
    y_test_pred = ada.predict(X_test)
    ada_train = accuracy_score(y_train,y_train_pred)
    ada_test = accuracy_score(y_test,y_test_pred)
    print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train,ada_test))

    x_min = X_train[:, 0].min() - 1
    x_max = X_train[:, 0].max() + 1
    y_min = X_train[:, 1].min() - 1
    y_max = X_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),np.arange(y_min, y_max, 0.1))
    f, axarr = plt.subplots(nrows=1, ncols=2,sharex='col',sharey='row',figsize=(12, 6))
    for idx, clf, tt in zip([0, 1],[tree, ada],['Decision tree', 'Adaboost']):
        clf.fit(X_train, y_train)
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        axarr[idx].contourf(xx, yy, Z, alpha=0.3)
        axarr[idx].scatter(X_train[y_train==1, 0],X_train[y_train==1, 1],c='blue', marker='^')
        axarr[idx].scatter(X_train[y_train==2, 0],X_train[y_train==2, 1],c='red', marker='o')
        axarr[idx].scatter(X_train[y_train==3, 0],X_train[y_train==3, 1],c='green', marker='x')
        axarr[idx].set_title(tt)
    axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)
    plt.tight_layout()
    plt.text(0, -0.2,s='Alcohol',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)
    plt.show()

compare_models(2)
Decision tree train/test accuracies 0.903/0.852
Adaboost train/test accuracies 0.960/0.870

[Figure: decision boundaries of the depth-2 decision tree (left) and of AdaBoost on depth-2 trees (right)]

compare_models(3)
Decision tree train/test accuracies 0.927/0.926
Adaboost train/test accuracies 1.000/0.852

[Figure: decision boundaries of the depth-3 decision tree (left) and of AdaBoost on depth-3 trees (right)]

The results show that as the base learner grows stronger, the boosting model overfits more and more: with depth-3 trees the training accuracy reaches 1.000 while the test accuracy drops back to 0.852, so generalization barely changes. The short sweep below makes the trend explicit.
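A compact way to reproduce the trend (an addition, reusing the classes and data defined above) is to sweep the base-tree depth and print the train/test accuracies side by side:

# Sweep the base-tree depth and watch the train/test gap widen
for depth in [1, 2, 3, 4]:
    base = DecisionTreeClassifier(criterion='entropy', random_state=1, max_depth=depth)
    model = AdaBoostClassifier(base_estimator=base, n_estimators=300,
                               learning_rate=0.01, random_state=1).fit(X_train, y_train)
    print('depth=%d train/test accuracies %.3f/%.3f' % (
        depth,
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test))))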

Characteristics of boosting

  1. Boosting can reduce a model's bias.
  2. When the weak learners used in boosting are themselves strong classifiers, the model may overfit, hurting its ability to generalize.
  3. It is computationally expensive, since many weak learners must be trained.
  4. It cannot be parallelized: each iteration depends on the result of the previous one.