Boosting(1)
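Note: this article is a brief summary I wrote while studying ensemble learning. Because my own understanding is still limited, it contains almost no derivations or formula work, and some parts may reflect misunderstandings; please bear with me.
1. Overview:
Boosting is a family of algorithms that can turn weak learners into strong learners. Its theoretical basis is a classic question posed by Kearns and Valiant (1989): are "strong learnability" and "weak learnability" equivalent? Schapire (1990) proved that the answer is yes by construction, and that construction was the first Boosting method. The conclusion is that any weak learner has the potential to be boosted into a strong learner.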
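2. Boosting algorithms
First, the general method:
Boosting trains a sequence of classifiers one after another, so that samples misclassified by earlier base classifiers receive more attention in later rounds, and then combines these classifiers into a stronger classifier.
Next comes the classic AdaBoost algorithm:
First, the weights over the training data are initialized to a uniform distribution, so that in the first round, with no prior information, every sample plays the same role in training the base classifier.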
Then, each iteration $t$ produces a base classifier $h_t(x)$. $\epsilon_t$ is the total weight of the samples that $h_t(x)$ misclassifies, which shows that the classification error rate $\epsilon_t$ of $h_t(x)$ depends directly on the weight distribution $D_t$. The coefficient of $h_t(x)$ is computed as $\alpha_t=\frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}$; it expresses how important $h_t(x)$ is in the final classifier. When $\epsilon_t \le \frac{1}{2}$ we have $\alpha_t \ge 0$, and $\alpha_t$ increases as $\epsilon_t$ decreases, so base classifiers with a smaller error rate play a larger role in the final classifier.
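To make the relationship between $\epsilon_t$ and $\alpha_t$ concrete, here is a minimal sketch that evaluates the coefficient formula for a few error rates. The function name `classifier_weight` is only an illustrative choice, and the block uses plain Python to stay dependency-free.

```python
import math

def classifier_weight(eps):
    """alpha_t = 0.5 * ln((1 - eps) / eps) for a weighted error eps in (0, 1)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - eps) / eps)

# alpha_t is 0 at eps = 0.5 and grows as the weighted error shrinks
for eps in (0.5, 0.4, 0.3, 0.1, 0.01):
    print(f"eps = {eps:.2f} -> alpha = {classifier_weight(eps):.4f}")
```

A base classifier that is no better than random guessing (error 0.5) gets no say in the vote, while a nearly perfect one gets a very large coefficient.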
Now the key part: the sample weights are updated as follows (written out case by case, where $e$ is the base of the natural logarithm and $Z_t$ is the normalization factor that makes the new weights sum to 1):
$$w_{t+1, i}=\left\{\begin{array}{ll} \frac{w_{t i}}{Z_{t}} e^{-\alpha_{t}}, & h_{t}\left(x_{i}\right)=y_{i} \\ \frac{w_{t i}}{Z_{t}} e^{\alpha_{t}}, & h_{t}\left(x_{i}\right) \neq y_{i} \end{array}\right.$$
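The weight update can be sketched in a few lines of plain Python. This is only an illustration under the assumption that labels and predictions are coded as +1 / -1, so that both cases collapse into the single factor $e^{-\alpha_t y_i h_t(x_i)}$; the function name `update_weights` and the four-sample toy data are made up for the example.

```python
import math

def update_weights(weights, y_true, y_pred, alpha):
    """Return the normalized weights w_{t+1} and the normalizer Z_t.

    Labels and predictions are assumed to be +1 / -1, so y * h equals
    +1 for a correct prediction and -1 for a mistake.
    """
    unnormalized = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(weights, y_true, y_pred)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]  # the second sample is misclassified
eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
alpha = 0.5 * math.log((1 - eps) / eps)
new_weights, z = update_weights(weights, y_true, y_pred, alpha)
print(new_weights, z)
```

With this choice of the coefficient, the misclassified samples together carry half of the total weight after the update, which is why the next base classifier is pushed to handle them.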
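Finally, the linear combination $f(x)=\sum_{t=1}^{T} \alpha_t h_t(x)$ carries out a weighted vote over the $T$ base classifiers. The coefficient $\alpha_t$ indicates the importance of the base classifier $h_t(x)$; note that the $\alpha_t$ are not normalized, so they do not sum to 1. The sign of $f(x)$ determines which class the sample $x$ belongs to.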
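The weighted vote can be sketched as follows. The three threshold classifiers and their coefficients are the ones from the worked example later in this article; the helper name `ensemble_predict` and the convention of returning -1 when the vote is exactly 0 are assumptions made for this illustration.

```python
def ensemble_predict(x, classifiers, alphas):
    """Sign of the weighted vote f(x) = sum_t alpha_t * h_t(x)."""
    f = sum(a * h(x) for h, a in zip(classifiers, alphas))
    return 1 if f > 0 else -1  # a tie (f == 0) is sent to -1 by assumption

# Base classifiers on a scalar input, each returning +1 or -1
classifiers = [
    lambda x: 1 if x < 2.5 else -1,
    lambda x: 1 if x < 8.5 else -1,
    lambda x: 1 if x > 5.5 else -1,
]
alphas = [0.4236, 0.6496, 0.7514]  # these sum to about 1.82, not 1

for x in (0, 4, 7, 9):
    print(x, ensemble_predict(x, classifiers, alphas))
```

No single base classifier gets all four inputs right, but the weighted vote does.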
The figure below gives a more intuitive picture:
A complete Python implementation of AdaBoost
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def my_adaboost_clf(Y_train, X_train, Y_test, X_test, M=20, weak_clf=DecisionTreeClassifier(max_depth = 1)):
    # Labels are assumed to be NumPy arrays coded as 0/1
    n_train, n_test = len(X_train), len(X_test)
    # Initialize weights
    w = np.ones(n_train) / n_train
    pred_train, pred_test = [np.zeros(n_train), np.zeros(n_test)]
    for i in range(M):
        # Fit a classifier with the specific weights
        weak_clf.fit(X_train, Y_train, sample_weight = w)
        pred_train_i = weak_clf.predict(X_train)
        pred_test_i = weak_clf.predict(X_test)
        # Indicator function
        miss = [int(x) for x in (pred_train_i != Y_train)]
        print("weak_clf_%02d train acc: %.4f"
              % (i + 1, 1 - sum(miss) / n_train))
        # Error
        err_m = np.dot(w, miss)
        # Guard against division by zero or log(0) when err_m is exactly 0 or 1
        err_m = np.clip(err_m, 1e-10, 1 - 1e-10)
        # Alpha
        alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))
        # New weights
        miss2 = [x if x==1 else -1 for x in miss] # -1 * y_i * G(x_i): 1 / -1
        w = np.multiply(w, np.exp([float(x) * alpha_m for x in miss2]))
        w = w / sum(w)
        # Add to prediction (map 0/1 predictions to -1/+1 votes)
        pred_train_i = [1 if x == 1 else -1 for x in pred_train_i]
        pred_test_i = [1 if x == 1 else -1 for x in pred_test_i]
        pred_train = pred_train + np.multiply(alpha_m, pred_train_i)
        pred_test = pred_test + np.multiply(alpha_m, pred_test_i)
    # Map the sign of the weighted vote back to 0/1 labels
    pred_train = (pred_train > 0) * 1
    pred_test = (pred_test > 0) * 1
    print("My AdaBoost clf train accuracy: %.4f" % (sum(pred_train == Y_train) / n_train))
    print("My AdaBoost clf test accuracy: %.4f" % (sum(pred_test == Y_test) / n_test))
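If the code and figure above still have not made things clear, then...
let us walk through an example (I think that once you understand this example you will have a first grasp of the AdaBoost algorithm):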
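The training data are given in the table below. Assume each base classifier is a single split of the form $x<v$ or $x>v$, where the threshold $v$ is chosen to give the lowest classification error rate $e_m$ on the training set (see Example 8.1 on p. 59 of 统计学习方法, Statistical Learning Methods).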
$$\begin{array}{ccccccccccc} \hline \text{Index} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline x & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ y & 1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 \\ \hline \end{array}$$
Solution:
Initialize the sample weight distribution:
$$\begin{aligned} D_{1} &=\left(w_{11}, w_{12}, \cdots, w_{1,10}\right) \\ w_{1 i} &=0.1, \quad i=1,2, \cdots, 10 \end{aligned}$$
For m=1:
On the training set with weight distribution $D_1$, try every candidate split point and compute the classification error rate $e_m$. The error rate is lowest at threshold v=2.5, so the base classifier is:
$$G_{1}(x)=\left\{\begin{array}{ll} 1, & x<2.5 \\ -1, & x>2.5 \end{array}\right.$$
- The error rate of $G_1(x)$ on the training set is $e_{1}=P\left(G_{1}\left(x_{i}\right) \neq y_{i}\right)=0.3$.
- Compute the coefficient of $G_1(x)$: $\alpha_{1}=\frac{1}{2} \log \frac{1-e_{1}}{e_{1}}=0.4236$
- Update the weight distribution of the training data:
$$\begin{aligned} D_{2}=&\left(w_{21}, \cdots, w_{2 i}, \cdots, w_{2,10}\right) \\ w_{2 i}=& \frac{w_{1 i}}{Z_{1}} \exp \left(-\alpha_{1} y_{i} G_{1}\left(x_{i}\right)\right), \quad i=1,2, \cdots, 10 \\ D_{2}=&(0.07143,0.07143,0.07143,0.07143,0.07143,0.07143,\\ &0.16667,0.16667,0.16667,0.07143) \\ f_{1}(x) &=0.4236 G_{1}(x) \end{aligned}$$
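The numbers for the first round can be checked with a short script. This is a sketch in plain Python that hard-codes $G_1$ with threshold 2.5 from the text; the variable names are illustrative only.

```python
import math

x = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
d1 = [0.1] * 10  # uniform initial weights

def g1(v):
    """First base classifier: +1 below the threshold 2.5, -1 above it."""
    return 1 if v < 2.5 else -1

pred = [g1(v) for v in x]
# Weighted error: total weight of the misclassified samples (x = 6, 7, 8)
e1 = sum(w for w, yi, p in zip(d1, y, pred) if yi != p)
alpha1 = 0.5 * math.log((1 - e1) / e1)
# Reweight with exp(-alpha * y * G(x)) and normalize by Z_1
unnorm = [w * math.exp(-alpha1 * yi * p) for w, yi, p in zip(d1, y, pred)]
z1 = sum(unnorm)
d2 = [u / z1 for u in unnorm]

print(round(e1, 4), round(alpha1, 4), round(z1, 4))
print([round(w, 5) for w in d2])
```

The three samples that $G_1$ gets wrong end up with weight 0.16667 each, and the other seven with 0.07143 each, matching $D_2$.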
For m=2:
- On the training set with weight distribution $D_2$, try every candidate split point and compute the classification error rate $e_m$. The error rate is lowest at threshold v=8.5, so the base classifier is:
$$G_{2}(x)=\left\{\begin{array}{ll} 1, & x<8.5 \\ -1, & x>8.5 \end{array}\right.$$
- The error rate of $G_2(x)$ on the training set is $e_2 = 0.2143$
- Compute the coefficient of $G_2(x)$: $\alpha_2 = 0.6496$
- Update the weight distribution of the training data:
$$\begin{aligned} D_{3}=&(0.0455,0.0455,0.0455,0.1667,0.1667,0.1667,\\ &0.1060,0.1060,0.1060,0.0455) \\ f_{2}(x) &=0.4236 G_{1}(x)+0.6496 G_{2}(x) \end{aligned}$$
For m=3:
- On the training set with weight distribution $D_3$, try every candidate split point and compute the classification error rate $e_m$. The error rate is lowest at threshold $v=5.5$, so the base classifier is:
$$G_{3}(x)=\left\{\begin{array}{ll} 1, & x>5.5 \\ -1, & x<5.5 \end{array}\right.$$
- The error rate of $G_3(x)$ on the training set is $e_3 = 0.1820$
- Compute the coefficient of $G_3(x)$: $\alpha_3 = 0.7514$
- Update the weight distribution of the training data:
$$D_{4}=(0.125,0.125,0.125,0.102,0.102,0.102,0.065,0.065,0.065,0.125)$$
This gives: $f_{3}(x)=0.4236 G_{1}(x)+0.6496 G_{2}(x)+0.7514 G_{3}(x)$
The classifier $\operatorname{sign}\left[f_{3}(x)\right]$ misclassifies 0 points on the training set.
The final classifier is therefore:
$$G(x)=\operatorname{sign}\left[f_{3}(x)\right]=\operatorname{sign}\left[0.4236 G_{1}(x)+0.6496 G_{2}(x)+0.7514 G_{3}(x)\right]$$
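The whole three-round example can be reproduced with a small script. This is a sketch under a few assumptions of my own: candidate thresholds are the midpoints 0.5, 1.5, ..., 8.5, both orientations of each split are tried, and ties are broken in favor of the first candidate found. The names `best_stump`, `model` and `predict` are illustrative only.

```python
import math

x = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def best_stump(weights):
    """Return (error, threshold, left_label) of the best weighted stump.

    A stump predicts left_label for x < threshold and -left_label otherwise.
    """
    best = None
    for k in range(9):
        v = k + 0.5
        for left in (1, -1):
            err = sum(w for xi, yi, w in zip(x, y, weights)
                      if (left if xi < v else -left) != yi)
            # Strict improvement only, so the first of several tied stumps wins
            if best is None or err < best[0] - 1e-12:
                best = (err, v, left)
    return best

weights = [0.1] * 10
model = []  # list of (alpha, threshold, left_label)
for m in range(3):
    err, v, left = best_stump(weights)
    alpha = 0.5 * math.log((1 - err) / err)
    model.append((alpha, v, left))
    preds = [left if xi < v else -left for xi in x]
    unnorm = [w * math.exp(-alpha * yi * p) for w, yi, p in zip(weights, y, preds)]
    z = sum(unnorm)
    weights = [u / z for u in unnorm]
    print(f"m={m + 1}: v={v}, left label={left:+d}, e={err:.4f}, alpha={alpha:.4f}")

def predict(xi):
    """Sign of the weighted vote of the three stumps."""
    f = sum(a * (left if xi < v else -left) for a, v, left in model)
    return 1 if f > 0 else -1

errors = sum(predict(xi) != yi for xi, yi in zip(x, y))
print("training errors:", errors)
```

Run without intermediate rounding, the script gives $e_3 \approx 0.1818$ and $\alpha_3 \approx 0.7520$; the values 0.1820 and 0.7514 in the text come from working with weights rounded to four decimals. The thresholds, the orientation of each split and the zero training error agree with the worked example.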
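Modeling with AdaBoost in Sklearn:
Example. In this case study we use an open dataset from the UCI Machine Learning Repository: the Wine dataset, available at ( https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data ). The dataset contains 178 samples and 13 features that describe different chemical properties from different angles. Our task is to predict which class a wine belongs to from these data. (Case study from Python Machine Learning, 2nd edition.)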
# Import the data science toolkits:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import seaborn as sns
# Load the training data:
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)
wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols','Flavanoids', 'Nonflavanoid phenols',
                'Proanthocyanins','Color intensity', 'Hue','OD280/OD315 of diluted wines','Proline']
# Inspect the data:
print("Class labels",np.unique(wine["Class label"]))
wine.head()
# Data preprocessing
# Keep only wine classes 2 and 3; drop class 1
wine = wine[wine['Class label'] != 1]
y = wine['Class label'].values
X = wine[['Alcohol','OD280/OD315 of diluted wines']].values
# Encode the class labels as binary 0/1:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
# Split into training and test sets at 8:2
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y) # stratify=y samples each class in proportion to its share of y
# Model with a single decision tree
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy',random_state=1,max_depth=1)
from sklearn.metrics import accuracy_score
tree = tree.fit(X_train,y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
tree_train = accuracy_score(y_train,y_train_pred)
tree_test = accuracy_score(y_test,y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f' % (tree_train,tree_test))
# AdaBoost with sklearn (the base classifier is a decision tree)
'''
Parameters of AdaBoostClassifier:
base_estimator: the base classifier, DecisionTreeClassifier(max_depth=1) by default
n_estimators: the number of boosting iterations at which training stops
learning_rate: the learning rate
algorithm: the boosting algorithm, {'SAMME','SAMME.R'}, default='SAMME.R'
random_state: the random seed
'''
from sklearn.ensemble import AdaBoostClassifier
# Note: scikit-learn 1.2 renamed base_estimator to estimator; on recent releases pass estimator=tree instead
ada = AdaBoostClassifier(base_estimator=tree,n_estimators=500,learning_rate=0.1,random_state=1)
ada = ada.fit(X_train,y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
ada_train = accuracy_score(y_train,y_train_pred)
ada_test = accuracy_score(y_test,y_test_pred)
print('Adaboost train/test accuracies %.3f/%.3f' % (ada_train,ada_test))
Decision tree train/test accuracies 0.916/0.875
Adaboost train/test accuracies 1.000/0.917
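The second result (AdaBoost) is somewhat better than the first (the single decision tree), but a closer look shows that the gap between the training and test scores has widened, which suggests that AdaBoost is overfitting here.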
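The widening gap can be read directly off the two printed lines. The snippet below is just that arithmetic, with the accuracies copied by hand from the output above; the dictionary names are illustrative.

```python
# Train/test accuracies copied from the printed output above
results = {
    "Decision tree": (0.916, 0.875),
    "AdaBoost": (1.000, 0.917),
}

# Gap between training and test accuracy for each model
gaps = {name: round(train - test, 3) for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train - test gap = {gap:.3f}")
```

The gap roughly doubles, from about 0.04 for the single tree to about 0.08 for AdaBoost, even though the test accuracy itself improves.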
# Plot the decision boundaries of the decision stump and AdaBoost:
x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),np.arange(y_min, y_max, 0.1))
f, axarr = plt.subplots(nrows=1, ncols=2,sharex='col',sharey='row',figsize=(12, 6))
for idx, clf, tt in zip([0, 1],[tree, ada],['Decision tree', 'Adaboost']):
    clf.fit(X_train, y_train)
    # Predict on every grid point and draw the resulting regions
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==0, 0],X_train[y_train==0, 1],c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==1, 0],X_train[y_train==1, 1],c='red', marker='o')
    axarr[idx].set_title(tt)
# Column 0 of X (horizontal axis) is Alcohol; column 1 (vertical axis) is OD280/OD315
axarr[0].set_ylabel('OD280/OD315 of diluted wines', fontsize=12)
plt.tight_layout()
plt.text(0, -0.2,s='Alcohol',ha='center',va='center',fontsize=12,transform=axarr[1].transAxes)
plt.show()
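The decision boundary plot above shows that the AdaBoost decision boundary is much more complex than that of the decision stump. In other words, AdaBoost tries to reduce the total error by increasing model complexity to lower the bias, but doing so introduces variance and may lead to overfitting, which is why there is a larger gap between training and test performance.
Note: compared with a single classifier, Boosting models such as AdaBoost increase the computational cost, so in practice you need to weigh whether the relative gain in predictive performance is worth it. In addition, Boosting cannot be trained in the parallel fashion that is popular today, because each iteration depends on the base classifier from the previous step.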
Recommended blogs:
- https://www.python-course.eu/Boosting.php
- https://link.medium.com/udXDrLndAfb
Recommended video:
- https://www.bilibili.com/video/BV1Cs411c7Zt?t=4&p=2
References:
[1] https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning
[2] Zhou Zhihua, 集成学习:基础与算法 (Ensemble Methods: Foundations and Algorithms)
[3] https://zhuanlan.zhihu.com/p/59121403