AdaBoost Algorithm (Part 2): Theoretical Derivation
Posts in the ensemble learning series:
- Ensemble learning basics
- Random forest
- AdaBoost Algorithm (Part 1): Fundamentals
- AdaBoost Algorithm (Part 2): Theoretical Derivation
The previous post, AdaBoost Algorithm (Part 1): Fundamentals, covered AdaBoost's basic ideas and principles in detail. If you only want to understand how AdaBoost works, that post is enough; likewise, if formulas make your head spin, you can safely skip this one. The theoretical derivation here is admittedly dry, and it matters little if you just want to use AdaBoost rather than study or improve it, so reading only Part 1 is perfectly fine.
This post, which is based on Li Hang's *Statistical Learning Methods*, covers AdaBoost's theory from two angles:
- Training error analysis of the AdaBoost algorithm
- Interpreting AdaBoost via the additive model
I. Training Error Analysis of the AdaBoost Algorithm
Let us first analyze AdaBoost's training error. Most of the formulas below are taken directly from Li Hang's *Statistical Learning Methods*; I have only added some intermediate steps and explanations. As stated in the previous post, AdaBoost's training error is $\frac{1}{N}\sum_{i=1}^{N}I(G(x_i) \neq y_i)$. Does this error have an upper bound? The book gives one, with a proof: the training error of AdaBoost's final classifier is bounded by
$$\frac{1}{N}\sum_{i=1}^{N}I(G(x_i) \neq y_i) \leq \frac{1}{N}\sum_{i}\exp(-y_i f(x_i)) = \prod_{m}Z_m \tag{1}$$
Here $G(x)$ denotes the final classifier, $G_m(x)$ the base classifier at iteration $m$, and $f(x)$ the linear combination of base classifiers, i.e. $f(x)=\sum_{m=1}^{M}\alpha_m G_m(x)$. $Z_m$ is the normalizing factor from the previous post, $Z_m=\sum_{i=1}^{N}w_{mi}\exp(-\alpha_m y_i G_m(x_i))$. We now prove formula (1).
When $G(x_i) \neq y_i$, we have $y_i f(x_i) < 0$ (since $G(x)=\mathrm{sign}(f(x))$), and therefore $\exp(-y_i f(x_i)) \geq 1$; when $G(x_i) = y_i$, the indicator is $0$ while the exponential term is still positive. So the bound holds term by term, which directly gives $\frac{1}{N}\sum_{i=1}^{N}I(G(x_i) \neq y_i) \leq \frac{1}{N}\sum_{i}\exp(-y_i f(x_i))$.
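A tiny throwaway numeric check of this termwise bound (the margin values below are arbitrary):

```python
import numpy as np

# exp(-y*f) dominates the 0/1 indicator I(sign(f) != y) at every point,
# which is all the first inequality needs.
f_vals = np.linspace(-3, 3, 13)
for y in (1, -1):
    indicator = (np.sign(f_vals) != y).astype(float)
    assert np.all(np.exp(-y * f_vals) >= indicator)
```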
Next we prove $\frac{1}{N}\sum_{i}\exp(-y_i f(x_i)) = \prod_{m}Z_m$. From the weight-update formula in AdaBoost Algorithm (Part 1): Fundamentals, we know that
$$w_{mi}\exp(-\alpha_m y_i G_m(x_i)) = Z_m w_{m+1,i}$$
Proof (note that the initial weights satisfy $w_{1i}=1/N$, which is what lets the $\frac{1}{N}$ be absorbed in the second step):
$$\begin{aligned} \frac{1}{N}\sum_{i}\exp(-y_i f(x_i)) &=\frac{1}{N}\sum_{i}\exp\Big(-\sum_{m=1}^{M}\alpha_m y_i G_m(x_i)\Big) \\ &= \sum_{i}w_{1i} \cdot \exp[-\alpha_1 y_i G_1(x_i) - \alpha_2 y_i G_2(x_i) - \cdots - \alpha_M y_i G_M(x_i)] \\ &= \sum_{i}w_{1i} \prod_{m=1}^{M}\exp(-\alpha_m y_i G_m(x_i))\\ &= Z_1\sum_{i}w_{2i} \prod_{m=2}^{M}\exp(-\alpha_m y_i G_m(x_i)) \\ &= Z_1 Z_2\sum_{i}w_{3i} \prod_{m=3}^{M}\exp(-\alpha_m y_i G_m(x_i)) \\ &= \cdots\\ &= Z_1 Z_2 \cdots Z_{M-1}\sum_{i}w_{Mi}\exp(-\alpha_M y_i G_M(x_i)) \\ &= \prod_{m}Z_m \end{aligned} \tag{2}$$
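To make (1) and (2) concrete, here is a minimal numeric sketch: it runs a toy AdaBoost loop with decision stumps $\mathrm{sign}(s(x-t))$ as base classifiers (the dataset, stump family, and round count are all made up for illustration) and checks that $\frac{1}{N}\sum_i \exp(-y_i f(x_i))$ equals $\prod_m Z_m$ and upper-bounds the training error:

```python
import numpy as np

# Toy AdaBoost with decision stumps sign(s * (x - t)); dataset, stump family,
# and round count are made up purely to illustrate the bound.
rng = np.random.default_rng(0)
N = 30
X = rng.uniform(0, 10, N)
y = np.where(X > 5, 1, -1)
y[rng.choice(N, 5, replace=False)] *= -1        # label noise so that e_m > 0

def stump(t, s):                                # a base classifier G(x)
    return np.where(s * (X - t) >= 0, 1, -1)

w = np.full(N, 1.0 / N)                         # initial weights w_1i = 1/N
f = np.zeros(N)                                 # f(x_i), the weighted vote
prod_Z = 1.0
for m in range(10):
    # choose the stump with the smallest weighted error e_m
    t_m, s_m = min(((t, s) for t in X for s in (1, -1)),
                   key=lambda p: np.sum(w[stump(*p) != y]))
    G = stump(t_m, s_m)
    e_m = np.sum(w[G != y])
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)
    Z_m = np.sum(w * np.exp(-alpha_m * y * G))  # normalizing factor Z_m
    w = w * np.exp(-alpha_m * y * G) / Z_m      # AdaBoost weight update
    f += alpha_m * G
    prod_Z *= Z_m

train_err = np.mean(np.where(f >= 0, 1, -1) != y)  # error of G = sign(f)
exp_loss = np.mean(np.exp(-y * f))                 # (1/N) sum_i exp(-y_i f(x_i))
assert np.isclose(exp_loss, prod_Z)                # the identity in (2)
assert train_err <= exp_loss + 1e-12               # the bound in (1)
print(train_err, exp_loss, prod_Z)
```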
The theorem above says that if at each round we choose a $G_m$ that makes $Z_m$ as small as possible, AdaBoost's training error decreases fastest.
In particular, for binary classification, AdaBoost's training error drops at an exponential rate:
$$\prod_{m=1}^{M}Z_m=\prod_{m=1}^{M}\left[2\sqrt{e_m(1-e_m)}\right] = \prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \exp\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big) \tag{3}$$
where $\gamma_m=\frac{1}{2}-e_m$.
Now we prove (3). First, the equalities:
$$\begin{aligned} Z_m &= \sum_{i=1}^{N}w_{mi}\exp(-\alpha_m y_i G_m(x_i)) \\ &=\sum_{y_i=G_m(x_i)}w_{mi}e^{-\alpha_m} + \sum_{y_i \neq G_m(x_i)}w_{mi}e^{\alpha_m} \\ &=e^{-\alpha_m}\sum_{y_i=G_m(x_i)}w_{mi} + e^{\alpha_m}\sum_{y_i \neq G_m(x_i)}w_{mi}\\ &=e^{-\alpha_m}(1-e_m) + e^{\alpha_m}e_m \\ &= (1-e_m)e^{-\frac{1}{2}\log\frac{1-e_m}{e_m}} + e_m e^{\frac{1}{2}\log\frac{1-e_m}{e_m}} \\ &= 2\sqrt{e_m(1-e_m)} \\ &= \sqrt{1-4\gamma_m^2} \end{aligned} \tag{4}$$
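A quick sanity check of this chain of equalities, over a grid of hypothetical error rates $e_m$:

```python
import numpy as np

# With alpha_m = (1/2) log((1 - e_m)/e_m), the split form of Z_m collapses to
# 2 sqrt(e_m (1 - e_m)) = sqrt(1 - 4 gamma_m^2); e_m values are hypothetical.
e = np.linspace(0.01, 0.49, 25)
alpha = 0.5 * np.log((1 - e) / e)
Z = np.exp(-alpha) * (1 - e) + np.exp(alpha) * e
gamma = 0.5 - e
assert np.allclose(Z, 2 * np.sqrt(e * (1 - e)))
assert np.allclose(Z, np.sqrt(1 - 4 * gamma**2))
```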
It remains to prove the inequality
$$\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \exp\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big)$$
Since
$$\exp\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big) = \exp(-2\gamma_1^2)\cdot \exp(-2\gamma_2^2)\cdots \exp(-2\gamma_M^2) = \prod_{m=1}^{M}\exp(-2\gamma_m^2) \tag{5}$$
it suffices to prove
$$\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \prod_{m=1}^{M}\exp(-2\gamma_m^2)$$
This follows by Taylor-expanding $e^{-2r^2}$ and $\sqrt{1-4r^2}$ around $x=0$ (writing $r$ for $\gamma_m$). The Taylor series we need are:
$$e^x = 1+x + \frac{1}{2!}x^2 + \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \cdots + \frac{1}{n!}x^n + \cdots$$
$$(1+x)^\alpha = 1 + \alpha x + \frac{\alpha(\alpha-1)}{2!}x^2 + \cdots + \frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}x^n + \cdots$$
Therefore the Taylor expansion of $\sqrt{1-4r^2}$ is:
$$\sqrt{1+(-4r^2)} = 1 - 2r^2 - 2r^4 - \cdots$$
And the Taylor expansion of $e^{-2r^2}$ is:
$$e^{-2r^2} = 1 - 2r^2 + 2r^4 - \cdots$$
Comparing the two expansions, they agree up to the $r^2$ term, after which the series for $\sqrt{1-4r^2}$ falls below that of $e^{-2r^2}$; hence $\sqrt{1-4r^2} \leq e^{-2r^2}$, and therefore $\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2} \leq \prod_{m=1}^{M}\exp(-2\gamma_m^2)$, which completes the proof. This shows that AdaBoost's training error decreases at an exponential rate.
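Before moving on, a quick numeric confirmation of that Taylor-based inequality (a throwaway check over a grid):

```python
import numpy as np

# sqrt(1 - 4r^2) <= exp(-2r^2) on a grid covering the relevant range |r| < 1/2
r = np.linspace(-0.499, 0.499, 1001)
assert np.all(np.sqrt(1 - 4 * r**2) <= np.exp(-2 * r**2))
```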
II. Interpreting AdaBoost via the Additive Model
This part explains AdaBoost using the additive model and the exponential loss function. As *Statistical Learning Methods* puts it, the forward stagewise algorithm learns an additive model, and when the base functions are base classifiers, that additive model is exactly AdaBoost's final classifier:
$$f(x) = \sum_{m=1}^{M}\alpha_m G_m(x)$$
The forward stagewise algorithm learns the base functions one by one from front to back, which matches how AdaBoost learns its base classifiers one at a time. Moreover, when the loss function of the forward stagewise algorithm is the exponential loss $L(y, f(x)) = \exp[-yf(x)]$, its learning steps are equivalent to AdaBoost. Below is the proof that the forward stagewise algorithm on the additive model recovers AdaBoost; if formulas give you a headache, feel free to skip this part. Even if you do read it, I still recommend the corresponding section of *Statistical Learning Methods*; I have merely fleshed out some steps to make the formulas easier to follow.
Suppose that after $m-1$ iterations the forward stagewise algorithm has obtained $f_{m-1}(x)$:
$$f_{m-1}(x) = f_{m-2}(x) + \alpha_{m-1}G_{m-1}(x) = \alpha_{1}G_{1}(x) + \alpha_{2}G_{2}(x) + \cdots + \alpha_{m-1}G_{m-1}(x)$$
Then at the $m$-th iteration:
$$f_{m}(x) = f_{m-1}(x) + \alpha_{m}G_{m}(x)$$
The goal is for the forward stagewise algorithm to find the $\alpha_m$ and $G_m(x)$ that minimize the exponential loss of $f_m(x)$ on the training set:
$$(\alpha_m, G_m(x)) = \arg\min_{\alpha, G}\sum_{i=1}^{N}\exp[-y_i(f_{m-1}(x_i) + \alpha G(x_i))] \tag{6}$$
Letting $\bar{w}_{mi}=\exp[-y_i f_{m-1}(x_i)]$, formula (6) becomes:
$$(\alpha_m, G_m(x)) = \arg\min_{\alpha, G}\sum_{i=1}^{N}\bar{w}_{mi}\exp[-y_i\alpha G(x_i)] \tag{7}$$
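The substitution that turns (6) into (7) is easy to verify numerically (all values below are arbitrary):

```python
import numpy as np

# exp[-y_i (f_{m-1}(x_i) + alpha*G(x_i))] factors into
# w_bar_mi * exp(-y_i * alpha * G(x_i)); values are arbitrary stand-ins.
rng = np.random.default_rng(3)
y = rng.choice([-1, 1], size=10)
f_prev = rng.normal(size=10)            # stand-in for f_{m-1}(x_i)
G = rng.choice([-1, 1], size=10)        # stand-in for a candidate G(x_i)
alpha = 0.8
w_bar = np.exp(-y * f_prev)             # the substitution for w_bar_mi
assert np.allclose(np.exp(-y * (f_prev + alpha * G)),
                   w_bar * np.exp(-y * alpha * G))
```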
Since $\bar{w}_{mi}$ depends on neither $\alpha$ nor $G$, it can be treated as a constant in the minimization (it does depend on $f_{m-1}(x)$, though, so it changes from round to round).
Now let us find the $\alpha^*_m$ and $G^*_m(x)$ that minimize (7). First, $G^*_m(x)$: for any fixed $\alpha > 0$, each term of (7) equals $\bar{w}_{mi}e^{-\alpha}$ when $G$ classifies $x_i$ correctly and $\bar{w}_{mi}e^{\alpha}$ when it does not, so minimizing (7) over $G$ amounts to minimizing the weighted misclassification error. Therefore:
$$G^*_m(x) = \arg\min_G\sum_{i=1}^{N}\bar{w}_{mi}I(y_i \neq G(x_i)) \tag{8}$$
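A small sketch of this reduction (the data, weights, and stump candidates are all made up): for a fixed $\alpha > 0$, the candidate $G$ that minimizes the exponential objective in (7) also attains the minimal weighted error in (8):

```python
import numpy as np

# Compare the exponential objective of (7) with the weighted error of (8)
# over a made-up family of decision stumps and arbitrary positive weights.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 20)
y = np.where(X > 4, 1, -1)
w_bar = rng.uniform(0.1, 1.0, 20)       # stand-in for the w_bar_mi weights
alpha = 0.7                             # any fixed alpha > 0

def stump(t, s):
    return np.where(s * (X - t) >= 0, 1, -1)

cands = [(t, s) for t in X for s in (1, -1)]
exp_obj = [np.sum(w_bar * np.exp(-alpha * y * stump(t, s))) for t, s in cands]
werr = [np.sum(w_bar[stump(t, s) != y]) for t, s in cands]
best = cands[int(np.argmin(exp_obj))]   # minimizer of the objective in (7)
# ...and it attains the minimal weighted error of (8) as well
assert np.isclose(np.sum(w_bar[stump(*best) != y]), min(werr))
```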
This classifier $G^*_m(x)$ is exactly AdaBoost's base classifier $G_m(x)$, since it is the base classifier that minimizes the weighted training error at the $m$-th round.
Next, $\alpha^*_m$. Splitting the objective by whether each sample is classified correctly:
$$\begin{aligned} \sum_{i=1}^{N}\bar{w}_{mi}\exp(-y_i \alpha G(x_i)) &=\sum_{y_i = G_m(x_i)} \bar{w}_{mi} e^{-\alpha} + \sum_{y_i \neq G_m(x_i)} \bar{w}_{mi} e^{\alpha}\\ &=(e^\alpha - e^{-\alpha})\sum_{i=1}^{N}\bar{w}_{mi}I(y_i \neq G(x_i)) + e^{-\alpha}\sum_{i=1}^{N}\bar{w}_{mi} \end{aligned} \tag{9}$$
Differentiating (9) with respect to $\alpha$ and setting the derivative to zero, i.e. $(e^\alpha + e^{-\alpha})\sum_{i}\bar{w}_{mi}I(y_i \neq G(x_i)) - e^{-\alpha}\sum_{i}\bar{w}_{mi} = 0$, we solve and obtain:
$$\alpha^{*}_m = \frac{1}{2}\log\frac{1-e_m}{e_m} \tag{10}$$
where
$$e_m = \frac{\sum_{i=1}^{N}\bar{w}_{mi}I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N}\bar{w}_{mi}} \tag{11}$$
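As a check of the closed form (10), we can minimize the objective (9) by brute force over a grid of $\alpha$ values and compare (the $e_m$ value is hypothetical):

```python
import numpy as np

# Brute-force check of (10): minimize (9) over a grid of alpha values and
# compare with the closed form; e_m here is a hypothetical normalized error.
e_m = 0.3
W = 1.0                                  # stand-in for sum_i w_bar_mi
alphas = np.linspace(0.01, 3.0, 100_000)
obj = (np.exp(alphas) - np.exp(-alphas)) * e_m * W + np.exp(-alphas) * W
alpha_star = 0.5 * np.log((1 - e_m) / e_m)
assert abs(alphas[np.argmin(obj)] - alpha_star) < 1e-3
```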
As for why (11) also equals $\sum_{i=1}^{N}w_{mi}I(y_i \neq G_m(x_i))$: $\bar{w}_{mi}$ differs from AdaBoost's normalized weight $w_{mi}$ only by a factor that does not depend on $i$ (by induction on the weight update, $w_{mi} = \bar{w}_{mi}/(N\prod_{k=1}^{m-1}Z_k)$), so that factor cancels in the ratio (11); and since $\sum_i w_{mi} = 1$, the ratio reduces to $\sum_{i=1}^{N}w_{mi}I(y_i \neq G_m(x_i))$.
We can see that this $\alpha^*_m$ is exactly the same as AdaBoost's $\alpha_m$.
Finally, consider the sample-weight update. Since $\bar{w}_{mi}=\exp[-y_i f_{m-1}(x_i)]$, we have:
$$\begin{aligned} \bar{w}_{m+1,i}&=\exp[-y_i f_{m}(x_i)] \\ &=\exp[-y_i(f_{m-1}(x_i) + \alpha_m G_m(x_i))] \\ &=\exp(-y_i f_{m-1}(x_i))\cdot \exp(-y_i\alpha_m G_m(x_i)) \\ &=\bar{w}_{mi}\exp(-y_i\alpha_m G_m(x_i)) \end{aligned}$$
This matches AdaBoost's sample-weight update up to the normalizing factor $Z_m$, which does not affect the algorithm.
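To tie Part II together, here is a sketch (with made-up data and stumps again) that runs the forward stagewise procedure with exponential loss and checks at every round that the normalized $\bar{w}_{mi}$ coincide with AdaBoost's weights $w_{mi}$:

```python
import numpy as np

# Forward stagewise fitting with exponential loss; data and stumps are made up.
rng = np.random.default_rng(2)
N = 25
X = rng.uniform(0, 10, N)
y = np.where(X > 5, 1, -1)
y[rng.choice(N, 4, replace=False)] *= -1

def stump(t, s):
    return np.where(s * (X - t) >= 0, 1, -1)

f = np.zeros(N)                 # the additive model f_{m-1} on the data
w = np.full(N, 1.0 / N)         # AdaBoost's normalized weights, for comparison
for m in range(8):
    w_bar = np.exp(-y * f)                     # w_bar_mi = exp(-y_i f_{m-1}(x_i))
    assert np.allclose(w, w_bar / w_bar.sum()) # same weights after normalizing
    # G_m minimizes the weighted error (8); w_bar or w gives the same stump
    t_m, s_m = min(((t, s) for t in X for s in (1, -1)),
                   key=lambda p: np.sum(w_bar[stump(*p) != y]))
    G = stump(t_m, s_m)
    e_m = np.sum(w[G != y])                    # normalized weighted error (11)
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)    # alpha*_m from (10)
    f += alpha_m * G                           # forward stagewise step
    Z_m = np.sum(w * np.exp(-alpha_m * y * G))
    w = w * np.exp(-alpha_m * y * G) / Z_m     # AdaBoost's update, for comparison
print("normalized w_bar_mi matched AdaBoost's w_mi at every round")
```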
That wraps up the theoretical derivation of AdaBoost~
References
[1] Li Hang. *Statistical Learning Methods* (统计学习方法).