A Beginner's Path Through Ensemble Learning: Boosting

Boosting(1)

Preface: this post is a brief summary from my own study of ensemble learning. Since my understanding is still limited, it contains essentially no derivations or formulas, and some of the content may reflect misunderstandings; please bear with me.

1. Overview

Boosting is a family of algorithms that can turn weak learners into strong learners. Its theoretical basis is the classic question posed by Kearns and Valiant (1989): are the notions of "strongly learnable" and "weakly learnable" equivalent? Schapire (1990) answered the question in the affirmative with a constructive proof, namely the first Boosting method. From this we get the following conclusion: any weak learner has the potential to be boosted into a strong learner.

2. The Boosting algorithm

First, the general framework:
[Figure: the general Boosting framework]
Boosting trains a sequence of classifiers serially, so that samples misclassified by earlier base classifiers receive more attention in later rounds, and then combines these classifiers to obtain a strong classifier with better performance.

Next, the classic AdaBoost algorithm:

[Figure: the AdaBoost algorithm]
First, the initial weight distribution over the training data is taken to be uniform, so that in the first round, with no prior information, every sample plays an equal role in learning the base classifier.
Then each iteration produces a base classifier $h_t(x)$, and $\epsilon_t$ is the sum of the weights of the samples that $h_t(x)$ misclassifies; this shows directly that the weight distribution $D_t$ and the error rate $\epsilon_t$ of $h_t(x)$ are closely tied. At the same time, the coefficient of $h_t(x)$ is computed as $\alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, which measures how important $h_t(x)$ is in the final classifier: when $\epsilon_t \le \frac{1}{2}$ we have $\alpha_t \ge 0$, and $\alpha_t$ increases as $\epsilon_t$ decreases, so base classifiers with smaller error rates carry more weight in the final classifier.
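As a quick numerical illustration of this relationship (a small sketch added here, not part of the algorithm itself), the value of $\alpha_t$ for a few error rates:

import numpy as np

# alpha_t = 0.5 * ln((1 - eps) / eps): the smaller the error, the larger the vote
for eps in [0.4, 0.3, 0.2, 0.1]:
    print("eps_t = %.1f -> alpha_t = %.4f" % (eps, 0.5 * np.log((1 - eps) / eps)))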
Next comes the key point: the sample weight update (written out case by case):
$$
w_{t+1, i}=\left\{\begin{array}{ll}
\frac{w_{t i}}{Z_{t}} e^{-\alpha_{t}}, & G_{t}\left(x_{i}\right)=y_{i} \\
\frac{w_{t i}}{Z_{t}} e^{\alpha_{t}}, & G_{t}\left(x_{i}\right) \neq y_{i}
\end{array}\right.
$$
Finally, the linear combination of the $h_t(x)$ realizes a weighted vote of the T base classifiers. The coefficient $\alpha_t$ marks the importance of base classifier $h_t(x)$; note that the $\alpha_t$ do not sum to 1. The sign of the combination determines which class a sample x belongs to.
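For completeness, the normalization factor $Z_t$ used in the update above and the final weighted-vote classifier can be written as (standard AdaBoost notation, consistent with the worked example below):
$$
Z_{t}=\sum_{i=1}^{N} w_{t i} \exp \left(-\alpha_{t} y_{i} G_{t}\left(x_{i}\right)\right), \qquad G(x)=\operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} G_{t}(x)\right)
$$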

The following figure gives a more intuitive picture:
[Figure: illustration of AdaBoost's iterative re-weighting and weighted combination]
A complete Python implementation of AdaBoost:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def my_adaboost_clf(Y_train, X_train, Y_test, X_test, M=20, weak_clf=DecisionTreeClassifier(max_depth=1)):
    n_train, n_test = len(X_train), len(X_test)
    # Initialize weights
    w = np.ones(n_train) / n_train
    pred_train, pred_test = [np.zeros(n_train), np.zeros(n_test)]

    for i in range(M):
        # Fit a classifier with the specific weights
        weak_clf.fit(X_train, Y_train, sample_weight = w)
        pred_train_i = weak_clf.predict(X_train)
        pred_test_i = weak_clf.predict(X_test)

        # Indicator function
        miss = [int(x) for x in (pred_train_i != Y_train)]
        print("weak_clf_%02d train acc: %.4f"
         % (i + 1, 1 - sum(miss) / n_train))

        # Error
        err_m = np.dot(w, miss)
        # Alpha
        alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))
        # New weights
        miss2 = [x if x==1 else -1 for x in miss] # -1 * y_i * G(x_i): 1 / -1
        w = np.multiply(w, np.exp([float(x) * alpha_m for x in miss2]))
        w = w / sum(w)

        # Add to prediction
        pred_train_i = [1 if x == 1 else -1 for x in pred_train_i]
        pred_test_i = [1 if x == 1 else -1 for x in pred_test_i]
        pred_train = pred_train + np.multiply(alpha_m, pred_train_i)
        pred_test = pred_test + np.multiply(alpha_m, pred_test_i)

    pred_train = (pred_train > 0) * 1
    pred_test = (pred_test > 0) * 1

    print("My AdaBoost clf train accuracy: %.4f" % (sum(pred_train == Y_train) / n_train))
    print("My AdaBoost clf test accuracy: %.4f" % (sum(pred_test == Y_test) / n_test

If the code and figure above are still not quite clear,
let's walk through an example (I think once you can follow this example, you have a basic grasp of AdaBoost):
The training data are given in the table below. Assume each base classifier is a decision stump of the form x < v or x > v, where the threshold v is chosen so that the classification error rate $e_m$ on the training set is minimized (see Example 8.1, p. 59 of 《统计学习方法》).
$$
\begin{array}{ccccccccccc}
\hline
\text{index} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\
\hline
x & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
y & 1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 \\
\hline
\end{array}
$$
Solution:
Initialize the sample weight distribution:
$$
\begin{aligned}
D_{1} &=\left(w_{11}, w_{12}, \cdots, w_{1,10}\right) \\
w_{1 i} &=0.1, \quad i=1,2, \cdots, 10
\end{aligned}
$$
For m = 1:
• On the training set with weight distribution $D_1$, enumerate the candidate split points and compute the error rate $e_m$; the threshold v = 2.5 gives the lowest error, so the base classifier is
$$
G_{1}(x)=\left\{\begin{array}{ll}
1, & x<2.5 \\
-1, & x>2.5
\end{array}\right.
$$

• The error rate of $G_1(x)$ on the training set is $e_{1}=P\left(G_{1}\left(x_{i}\right) \neq y_{i}\right)=0.3$
• The coefficient of $G_1(x)$ is $\alpha_{1}=\frac{1}{2} \ln \frac{1-e_{1}}{e_{1}}=0.4236$
• Update the weight distribution of the training data:
$$
\begin{aligned}
D_{2}&=\left(w_{21}, \cdots, w_{2 i}, \cdots, w_{2,10}\right) \\
w_{2 i}&=\frac{w_{1 i}}{Z_{1}} \exp \left(-\alpha_{1} y_{i} G_{1}\left(x_{i}\right)\right), \quad i=1,2, \cdots, 10 \\
D_{2}&=(0.07143, 0.07143, 0.07143, 0.07143, 0.07143, 0.07143, 0.16667, 0.16667, 0.16667, 0.07143) \\
f_{1}(x) &=0.4236\, G_{1}(x)
\end{aligned}
$$
For m = 2:
• On the training set with weight distribution $D_2$, enumerate the candidate split points and compute the error rate $e_m$; the threshold v = 8.5 gives the lowest error, so the base classifier is
$$
G_{2}(x)=\left\{\begin{array}{ll}
1, & x<8.5 \\
-1, & x>8.5
\end{array}\right.
$$
• The error rate of $G_2(x)$ on the training set is $e_2 = 0.2143$
• The coefficient of $G_2(x)$: $\alpha_2 = 0.6496$
• Update the weight distribution of the training data:
$$
\begin{aligned}
D_{3}&=(0.0455, 0.0455, 0.0455, 0.1667, 0.1667, 0.1667, 0.1060, 0.1060, 0.1060, 0.0455) \\
f_{2}(x) &=0.4236\, G_{1}(x)+0.6496\, G_{2}(x)
\end{aligned}
$$
For m = 3:
• On the training set with weight distribution $D_3$, enumerate the candidate split points and compute the error rate $e_m$; the threshold v = 5.5 gives the lowest error, so the base classifier is
$$
G_{3}(x)=\left\{\begin{array}{ll}
1, & x>5.5 \\
-1, & x<5.5
\end{array}\right.
$$
• The error rate of $G_3(x)$ on the training set is $e_3 = 0.1820$
• The coefficient of $G_3(x)$: $\alpha_3 = 0.7514$
• Update the weight distribution of the training data:
$$
D_{4}=(0.125, 0.125, 0.125, 0.102, 0.102, 0.102, 0.065, 0.065, 0.065, 0.125)
$$
This gives $f_{3}(x)=0.4236\, G_{1}(x)+0.6496\, G_{2}(x)+0.7514\, G_{3}(x)$.
The classifier $\operatorname{sign}\left[f_{3}(x)\right]$ misclassifies 0 points on the training set, so the final classifier is
$$
G(x)=\operatorname{sign}\left[f_{3}(x)\right]=\operatorname{sign}\left[0.4236\, G_{1}(x)+0.6496\, G_{2}(x)+0.7514\, G_{3}(x)\right]
$$
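The hand computation above is easy to check numerically. The sketch below (my own addition, with the three stumps hard-coded from the solution) reproduces $e_m$, $\alpha_m$, and the final training error; the third-round values come out as 0.1818 / 0.7520 rather than the printed 0.1820 / 0.7514 only because the intermediate weights are rounded in the book:

import numpy as np

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

# The three decision stumps found in the worked example
stumps = [lambda x: np.where(x < 2.5, 1, -1),
          lambda x: np.where(x < 8.5, 1, -1),
          lambda x: np.where(x > 5.5, 1, -1)]

w = np.full(10, 0.1)   # D_1: uniform initial weights
f = np.zeros(10)       # running weighted sum f_m(x)
for G in stumps:
    pred = G(x)
    err = np.sum(w[pred != y])              # weighted error rate e_m
    alpha = 0.5 * np.log((1 - err) / err)   # classifier coefficient alpha_m
    w = w * np.exp(-alpha * y * pred)       # re-weight the samples
    w = w / w.sum()                         # normalize by Z_m
    f += alpha * pred
    print("e_m = %.4f, alpha_m = %.4f" % (err, alpha))

print("misclassified by sign(f_3):", int(np.sum(np.sign(f) != y)))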

Modeling AdaBoost with sklearn:
In this case study we use an open dataset from the UCI machine learning repository: the wine dataset, available at ( https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data ). It contains 178 samples and 13 features describing different chemical properties, and our task is to predict which class a wine belongs to. (Case from Python Machine Learning, 2nd edition.)

# Import the usual data-science toolkits:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import seaborn as sns
# Load the training data:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols','Flavanoids', 'Nonflavanoid phenols', 
                'Proanthocyanins','Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline']
# Inspect the data:
print("Class labels",np.unique(wine["Class label"]))
wine.head()
# Preprocessing
# Keep only wine classes 2 and 3, dropping class 1
wine = wine[wine['Class label'] != 1]
y = wine['Class label'].values
X = wine[['Alcohol','OD280/OD315 of diluted wines']].values

# Encode the class labels as binary values:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

# Split into training and test sets with an 80:20 ratio
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)  # stratify=y keeps the class proportions the same in both splits

# Fit a single decision tree (a depth-1 stump) as a baseline
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=1)
from sklearn.metrics import accuracy_score
tree = tree.fit(X_train,y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test,y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train,tree_test))

# Implement AdaBoost with sklearn (decision tree as the base classifier)
'''
Relevant AdaBoostClassifier parameters:
base_estimator: the base classifier, default DecisionTreeClassifier(max_depth=1)
n_estimators: the maximum number of boosting iterations
learning_rate: the learning rate
algorithm: the boosting variant, {'SAMME', 'SAMME.R'}, default 'SAMME.R'
random_state: random seed
'''
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(base_estimator=tree,n_estimators=500,learning_rate=0.1,random_state=1)
ada = ada.fit(X_train,y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train,y_train_pred)
ada_test = accuracy_score(y_test,y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train,ada_test))

Decision tree train/test accuracies 0.916/0.875
Adaboost train/test accuracies 1.000/0.917
The AdaBoost result is somewhat better than the single stump, but looking more closely, the gap between training and test accuracy has widened, which suggests that AdaBoost has started to overfit.
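One way to see where this gap opens up is to track accuracy after every boosting round with staged_predict (a small diagnostic sketch added here, reusing ada, X_train, y_train, X_test, y_test from the code above):

from sklearn.metrics import accuracy_score

# Ensemble accuracy after each boosting round
train_curve = [accuracy_score(y_train, p) for p in ada.staged_predict(X_train)]
test_curve = [accuracy_score(y_test, p) for p in ada.staged_predict(X_test)]
print("best test accuracy %.3f after %d rounds" % (max(test_curve), int(np.argmax(test_curve)) + 1))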

# Plot the decision boundaries of the single decision stump and of AdaBoost:
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),np.arange(y_min, y_max, 0.1))
f, axarr = plt.subplots(nrows=1, ncols=2,sharex='col',sharey='row',figsize=(12, 6))
for idx, clf, tt in zip([0, 1],[tree, ada],['Decision tree', 'Adaboost']):
    clf.fit(X_train, y_train)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==0, 0],X_train[y_train==0, 1],c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==1, 0],X_train[y_train==1, 1],c='red', marker='o')
    axarr[idx].set_title(tt)
axarr[0].set_ylabel('Alcohol', fontsize=12)
plt.tight_layout()
plt.text(0, -0.2,s='OD280/OD315 of diluted wines',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)
plt.show()

[Figure: decision boundaries of the decision stump (left) and AdaBoost (right)]
From the decision-boundary plots above we can see that AdaBoost's decision boundary is much more complex than that of the single decision stump. In other words, AdaBoost tries to reduce the total error by increasing model complexity to lower bias, but in doing so it introduces variance and may overfit, which is why there is a larger gap between training and test performance.
Note: compared with a single classifier, Boosting models such as AdaBoost increase the computational cost, so in practice one has to weigh whether the relative improvement in predictive performance is worth the extra computation. Moreover, Boosting cannot easily exploit the now-popular parallel training schemes, because each iteration depends on the base classifier from the previous one.

Recommended blog posts:

  1. https://www.python-course.eu/Boosting.php
  2. https://link.medium.com/udXDrLndAfb

Recommended videos:

  1. https://www.bilibili.com/video/BV1Cs411c7Zt?t=4&p=2

References:

[1]. https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning
[2]. 周志华, 集成学习(基础与算法)
[3]. https://zhuanlan.zhihu.com/p/59121403
