Preface
We introduce AdaBoost, an algorithm that combines weak classifiers into a strong classifier, and prove that the algorithm is effective.
This boosting method rests on two ideas:
- At each iteration, increase the weights of misclassified points, so that the weak classifier produced in the next round is more likely to classify those points correctly
- At each iteration, assign a weight to the new weak classifier: weak classifiers with lower misclassification rates receive larger weights in the final combination
1. Algorithm
- Input: training set $T=\{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\}$, where $x_i\in\mathbb{R}^n$ and $y_i\in\{-1, +1\}$; a weak-learning algorithm
- Output: strong classifier $G(x)$
(1) Initialize the data weights $w_{1}=\{w_{1, 1}, w_{1, 2}, ..., w_{1, N}\}$, where $w_{1, i}=\frac{1}{N}$;
(2) For $m=1, 2, ..., M$, let the data weights in round $m$ be $w_m=\{w_{m, 1}, w_{m, 2}, ..., w_{m, N}\}$; then
- Fit a weak classifier $G_m(x)$ to the weighted data using the weak-learning algorithm (for example, by minimizing the weighted misclassification rate)
- Compute the weighted misclassification rate $e_m=\sum_{i=1}^N w_{m, i}I(y_i\neq G_m(x_i))$
- Compute the classifier weight $\alpha_m=\frac{1}{2}\log\frac{1-e_m}{e_m}$ (half the log-odds of correct classification)
- Update the data weights for round $m+1$: $w_{m+1, i}=\frac{w_{m, i}e^{-\alpha_m y_i G_m(x_i)}}{Z_m}$, where $Z_m=\sum_{i=1}^N w_{m, i}e^{-\alpha_m y_i G_m(x_i)}$ is the normalizer
(3) Form the linear combination of the weak classifiers
$$f(x)=\sum_{m=1}^M\alpha_m G_m(x)$$
Then the strong classifier is
$$G(x)=\mathrm{sign}(f(x))$$
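To make steps (1)-(3) concrete, here is a minimal sketch in Python, assuming NumPy and one-level decision stumps as the weak classifiers; the helper names (`fit_stump`, `stump_predict`, `adaboost`) and the exhaustive stump search are illustrative choices, not part of the algorithm's specification.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak learner: search one-feature threshold stumps and return the
    one minimizing the weighted misclassification rate e_m."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):                      # orientation of the stump
                pred = np.where(X[:, j] <= thr, s, -s)
                err = np.sum(w * (pred != y))      # weighted error
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    return best

def stump_predict(stump, X):
    j, thr, s = stump
    return np.where(X[:, j] <= thr, s, -s)

def adaboost(X, y, M=20):
    """Return a strong classifier G built from M weighted stumps."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # (1) w_{1,i} = 1/N
    stumps, alphas = [], []
    for _ in range(M):                             # (2) rounds m = 1..M
        stump = fit_stump(X, y, w)                 # fit G_m on weighted data
        pred = stump_predict(stump, X)
        e = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - e) / e)          # alpha_m
        w *= np.exp(-alpha * y * pred)             # reweight the data...
        w /= w.sum()                               # ...and normalize by Z_m
        stumps.append(stump)
        alphas.append(alpha)
    def G(X_new):                                  # (3) G(x) = sign(f(x))
        f = sum(a * stump_predict(s, X_new) for a, s in zip(alphas, stumps))
        return np.sign(f)
    return G
```

For example, `G = adaboost(X_train, y_train, M=50)` returns a callable that predicts labels in $\{-1, +1\}$ via `G(X_test)` (with 0 possible only in the rare case $f(x)=0$).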
2. Mathematical Proof that the Algorithm Works
It may seem surprising that a linear combination of weak classifiers yields a strong classifier.
Below we prove this result mathematically, showing that the combination of weak classifiers can fit the data well and thus act as a strong classifier.
2.1 The training error is bounded above
The training error is $\frac{1}{N}\sum_{i=1}^N I(y_i\neq G(x_i))$. We now show that it admits an upper bound.
First, we have
$$\begin{array}{lll} &&\frac{1}{N}\sum_{i=1}^N I(y_i\neq G(x_i))\\ &=&\frac{1}{N}\sum_{i=1}^N I(y_i\neq \mathrm{sign}(f(x_i)))\\ &=&\frac{1}{N}\sum_{i=1}^N I(-y_i f(x_i)\ge 0)\\ &\le&\frac{1}{N}\sum_{i=1}^N e^{-y_i f(x_i)} \end{array}$$
where the final inequality holds because $e^{-y_i f(x_i)}\ge e^0=1$ whenever $-y_i f(x_i)\ge 0$, and the exponential is positive in any case.
Next, we show that
$$\frac{1}{N}\sum_{i=1}^N e^{-y_i f(x_i)}=\prod_{m=1}^M Z_m$$
Specifically, we have
$$\begin{array}{lll} &&\frac{1}{N}\sum_{i=1}^N e^{-y_i f(x_i)}\\ &=&\frac{1}{N}\sum_{i=1}^N e^{-\sum_{m=1}^M\alpha_m y_i G_m(x_i)}\\ &=&\frac{1}{N}\sum_{i=1}^N\prod_{m=1}^M e^{-\alpha_m y_i G_m(x_i)}\\ &=&\sum_{i=1}^N\frac{1}{N}\prod_{m=1}^M e^{-\alpha_m y_i G_m(x_i)}\\ &=&\sum_{i=1}^N w_{1, i}\prod_{m=1}^M e^{-\alpha_m y_i G_m(x_i)}\\ &=&\sum_{i=1}^N w_{1, i}e^{-\alpha_1 y_i G_1(x_i)}\prod_{m=2}^M e^{-\alpha_m y_i G_m(x_i)}\\ &=&\sum_{i=1}^N Z_1 w_{2, i}\prod_{m=2}^M e^{-\alpha_m y_i G_m(x_i)}\\ &=&Z_1\sum_{i=1}^N w_{2, i}\prod_{m=2}^M e^{-\alpha_m y_i G_m(x_i)}\\ &=&\cdots\\ &=&Z_1 Z_2\cdots Z_{M-1}\sum_{i=1}^N w_{M,i}e^{-\alpha_M y_i G_M(x_i)}\\ &=&Z_1 Z_2\cdots Z_{M-1}Z_M\\ &=&\prod_{m=1}^M Z_m \end{array}$$
Each telescoping step applies the weight-update rule from the algorithm, which gives $w_{m, i}\,e^{-\alpha_m y_i G_m(x_i)}=Z_m w_{m+1, i}$.
Finally, we conclude that the training error is bounded above:
$$\frac{1}{N}\sum_{i=1}^N I(y_i\neq G(x_i))\le\prod_{m=1}^M Z_m$$
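As a numerical sanity check of both the identity $\frac{1}{N}\sum_i e^{-y_i f(x_i)}=\prod_m Z_m$ and the bound, here is a small sketch that substitutes arbitrary $\pm 1$ outputs for the weak classifiers and applies the update rule verbatim; the random data and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 10
y = rng.choice([-1, 1], size=N)           # true labels
G = rng.choice([-1, 1], size=(M, N))      # stand-ins for G_m(x_i)

w = np.full(N, 1.0 / N)                   # w_{1,i} = 1/N
alphas, Zs = [], []
for m in range(M):
    e = np.sum(w * (G[m] != y))           # weighted error e_m
    alpha = 0.5 * np.log((1 - e) / e)     # alpha_m
    u = w * np.exp(-alpha * y * G[m])     # unnormalized new weights
    Z = u.sum()                           # normalizer Z_m
    w = u / Z
    alphas.append(alpha)
    Zs.append(Z)

f = np.array(alphas) @ G                  # f(x_i) = sum_m alpha_m G_m(x_i)
train_err = np.mean(np.sign(f) != y)
exp_loss = np.mean(np.exp(-y * f))

# Expect: train_err <= exp_loss, and exp_loss == prod(Z_m) up to rounding
print(train_err, exp_loss, np.prod(Zs))
```

Because the stand-in classifiers are random, $e_m\approx\frac{1}{2}$ and the bound stays close to 1 here; the point is only that the three printed quantities satisfy the relations derived above.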
2.2 Rewriting the training-error bound
The bound $\prod_{m=1}^M Z_m$ is not very illuminating, so we rewrite it in a more explicit form.
Note that $y_i$ and $G_m(x_i)$ both take values in $\{-1, +1\}$: if $y_iG_m(x_i)=-1$, then $G_m$ misclassifies $x_i$; if $y_iG_m(x_i)=+1$, then $G_m$ classifies $x_i$ correctly.
For $Z_m$, we have
$$\begin{array}{lll} Z_m&=&\sum_{i=1}^N w_{m, i}e^{-\alpha_m y_i G_m(x_i)}\\ &=&\sum_{i=1}^N w_{m, i}e^{-\alpha_m}I(y_i=G_m(x_i))+\sum_{i=1}^N w_{m, i}e^{\alpha_m}I(y_i\neq G_m(x_i))\\ &=&e^{-\alpha_m}\sum_{i=1}^N w_{m, i}I(y_i=G_m(x_i))+e^{\alpha_m}\sum_{i=1}^N w_{m, i}I(y_i\neq G_m(x_i))\\ &=&e^{-\alpha_m}(1-e_m)+e^{\alpha_m}e_m\\ &=&e^{-\frac{1}{2}\log\frac{1-e_m}{e_m}}(1-e_m)+e^{\frac{1}{2}\log\frac{1-e_m}{e_m}}e_m\\ &=&2\sqrt{e_m(1-e_m)} \end{array}$$
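As a quick check of this identity at an illustrative value (our choice of $e_m$, not from the derivation), take $e_m=0.2$: then $\alpha_m=\frac{1}{2}\log 4=\log 2$, so $e^{-\alpha_m}=\frac{1}{2}$ and $e^{\alpha_m}=2$, and
$$Z_m=\frac{1}{2}\cdot 0.8+2\cdot 0.2=0.8=2\sqrt{0.2\cdot 0.8}$$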
For the expression $2\sqrt{e_m(1-e_m)}$, the AM-GM inequality gives $\sqrt{e_m(1-e_m)}\le\frac{e_m+(1-e_m)}{2}=\frac{1}{2}$, so $2\sqrt{e_m(1-e_m)}\le 1$, with equality exactly at $e_m=\frac{1}{2}$. In practice, each weak classifier trained after reweighting should do slightly better than random guessing, i.e., its misclassification rate satisfies $e_m<\frac{1}{2}$. Therefore $2\sqrt{e_m(1-e_m)}<1$, which means $Z_m<1$.
Over the $M$ iterations, let $\bar Z=\max\{Z_1, Z_2, ..., Z_M\}$; we still have $\bar Z<1$, and therefore $\prod_{m=1}^M Z_m\le\bar Z^M$.
Clearly the bound $\bar Z^M$, and with it the training-error bound, decays exponentially in $M$. Hence, as the number of iterations grows, we can always obtain a strong classifier whose training error is as small as we require.
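For a concrete sense of the rate (with an error level chosen by us for illustration), suppose every round achieves $e_m\le 0.4$. Then $Z_m\le 2\sqrt{0.4\cdot 0.6}\approx 0.980$, so after $M$ rounds the training error is at most about
$$0.980^M,\qquad\text{e.g., }0.980^{100}\approx 0.13,\quad 0.980^{300}\approx 0.002$$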