Boosting
Boosting is another major family of ensemble learning methods. Unlike Bagging, which trains its learners in parallel, Boosting trains classifiers sequentially, each one trying to correct the mistakes of its predecessors. The most representative algorithms are AdaBoost (Adaptive Boosting) and Gradient Boosting.
Any Boosting method must answer two key questions:
1. How should the weights (or probability distribution) of the training data be changed in each round, and by what rule?
2. How should the weak classifiers be combined into a strong classifier?
AdaBoost
AdaBoost answers these two questions as follows:
1. Increase the weights of the samples misclassified by the previous weak classifier and decrease the weights of the correctly classified ones. The misclassified samples thus receive more attention in the next round, so the classification problem is "divided and conquered" by a sequence of weak classifiers.
2. Combine the weak classifiers by weighted majority voting: weak classifiers with small error rates get large weights and play a bigger role in the vote, while those with large error rates get small weights and play a smaller role.
Algorithm
1. Initialize the weight distribution of the data
Assume the training data starts with a uniform weight distribution (uniform only at initialization; afterwards the weights are updated according to the error rate):
$$D_1=(w_{11},w_{12},...,w_{1N}),\quad w_{1i}=\frac{1}{N},\quad i=1,2,...,N$$
2. Learn a base classifier from the weight distribution $D_m$:
$$G_m(x):x\rightarrow\{-1,1\}$$
3. Compute the classification error rate
The classification error rate of $G_m(x)$ on the training set is (note that the data weights must not be forgotten):
$$e_m=\sum_{i=1}^N P(G_m(x_i)\neq y_i)=\sum_{i=1}^N w_{mi}I(G_m(x_i)\neq y_i)$$
4. Compute the coefficient of $G_m(x)$
$$a_m=\frac{1}{2}\log\frac{1-e_m}{e_m}$$
5. Update the weight distribution of the data
$$D_{m+1}=(w_{m+1,1},w_{m+1,2},...,w_{m+1,N})$$
$$w_{m+1,i}=\frac{w_{mi}}{Z_m}\exp(-a_m y_i G_m(x_i)),\quad i=1,2,...,N$$
where $Z_m$ is a normalization factor that keeps the weights summing to 1:
$$Z_m=\sum_{i=1}^N w_{mi}\exp(-a_m y_i G_m(x_i))$$
The final classifier is:
$$G(x)=\mathrm{sign}(f(x))=\mathrm{sign}\left(\sum_{m=1}^M a_m G_m(x)\right)$$
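The five steps above can be sketched as a short NumPy implementation. This is a minimal illustration rather than a library implementation: the base classifiers are assumed to be decision stumps of the form $x<v$ or $x>v$ on 1-D data, the threshold grid is an arbitrary choice, and ties in the weighted error are broken by first occurrence.

```python
import numpy as np

def train_adaboost(X, y, M):
    """Train AdaBoost with decision stumps on 1-D data X, labels y in {-1, +1}.

    Assumes each round's best stump has error 0 < e_m < 1.
    """
    N = len(X)
    w = np.full(N, 1.0 / N)        # step 1: uniform initial weights
    classifiers = []               # list of (threshold v, direction, coefficient a_m)
    for m in range(M):
        # step 2: pick the stump (v, direction) with the lowest weighted error
        best = None
        for v in np.arange(X.min() - 0.5, X.max() + 1.0, 1.0):
            for direction in (1, -1):   # +1: predict +1 when x < v; -1: predict +1 when x > v
                pred = np.where(X < v, direction, -direction)
                e = np.sum(w[pred != y])            # step 3: weighted error rate e_m
                if best is None or e < best[0]:
                    best = (e, v, direction, pred)
        e_m, v, direction, pred = best
        a_m = 0.5 * np.log((1 - e_m) / e_m)         # step 4: classifier coefficient
        w = w * np.exp(-a_m * y * pred)             # step 5: reweight the samples
        w /= w.sum()                                # normalize by Z_m
        classifiers.append((v, direction, a_m))
    return classifiers

def predict(classifiers, X):
    """Final classifier G(x) = sign(sum_m a_m G_m(x))."""
    f = sum(a * np.where(X < v, d, -d) for v, d, a in classifiers)
    return np.sign(f)
```

Running `train_adaboost` on the data from the example below with `M=3` recovers the same three stumps (thresholds 2.5, 8.5, 5.5) and classifies the whole training set correctly.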
Worked Example
Let us work through an example from Li Hang's Statistical Learning Methods (《统计学习方法》):
Suppose each weak classifier has the form $x<v$ or $x>v$, where the threshold $v$ is chosen to minimize the classification error rate on the training set. Use AdaBoost to learn a strong classifier. Note that $y=1$ marks a positive example and $y=-1$ a negative one.
Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
y | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 |
Round 1, m=1:
1. Initialize the weight distribution of the data
$$D_1=(w_{11},w_{12},...,w_{1,10}),\quad w_{1i}=0.1,\quad i=1,2,...,10$$
2. Compute the classification error rate
Searching over thresholds shows that $G_1(x)$ attains its lowest training error at $v=2.5$ (samples 7, 8, 9 are misclassified; the rest are correct), so the error rate is
$$e_1=P(G_1(x_i)\neq y_i)=0.3$$
So the base classifier is:
$$G_1(x)=\begin{cases}1, & x<2.5 \\ -1, & x>2.5\end{cases}$$
3. Compute the coefficient of $G_1(x)$
$$a_1=\frac{1}{2}\log\frac{1-e_1}{e_1}=0.4236$$
4. Update the weight distribution of the data
$$\begin{aligned}
& D_2=(w_{21},w_{22},...,w_{2,10}) \\
& Z_1=7\times 0.1\times\exp(-0.4236)+3\times 0.1\times\exp(0.4236)=0.91651 \\
& w_{2i}=\frac{w_{1i}}{Z_1}\exp(-a_1 y_i G_1(x_i)),\quad i=1,2,...,10 \\
& D_2=(0.07143,0.07143,0.07143,0.07143,0.07143,0.07143,0.16667,0.16667,0.16667,0.07143) \\
& f_1(x)=0.4236\,G_1(x)
\end{aligned}$$
The classifier $\mathrm{sign}[f_1(x)]$ misclassifies 3 points on the training set; the change from $D_1$ to $D_2$ shows that the weights of the misclassified points have been increased.
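The round-1 numbers can be reproduced directly from the update formula. This is just a numeric sanity check; the variable names are ad hoc:

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
G1 = np.where(np.arange(10) < 2.5, 1, -1)   # round-1 stump: +1 if x < 2.5
w1 = np.full(10, 0.1)                       # uniform initial weights
a1 = 0.5 * np.log((1 - 0.3) / 0.3)          # coefficient, ≈ 0.4236

unnorm = w1 * np.exp(-a1 * y * G1)          # unnormalized new weights
Z1 = unnorm.sum()                           # normalization factor, ≈ 0.91651
D2 = unnorm / Z1                            # new distribution, sums to 1
```

The entries of `D2` come out as 0.07143 for the seven correctly classified samples and 0.16667 for the three misclassified ones, matching the values above.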
Round 2, m=2:
1. Weight distribution of the data
$$D_2=(0.07143,0.07143,0.07143,0.07143,0.07143,0.07143,0.16667,0.16667,0.16667,0.07143)$$
2. Compute the classification error rate
Searching over thresholds shows that $G_2(x)$ attains its lowest training error at $v=8.5$, so the error rate is
$$e_2=P(G_2(x_i)\neq y_i)=0.2143$$
So the base classifier is:
$$G_2(x)=\begin{cases}1, & x<8.5 \\ -1, & x>8.5\end{cases}$$
3. Compute the coefficient of $G_2(x)$
$$a_2=\frac{1}{2}\log\frac{1-e_2}{e_2}=0.6496$$
4. Update the weight distribution of the data
$$\begin{aligned}
& D_3=(0.0455,0.0455,0.0455,0.1667,0.1667,0.1667,0.1060,0.1060,0.1060,0.0455) \\
& f_2(x)=0.4236\,G_1(x)+0.6496\,G_2(x)
\end{aligned}$$
The classifier $\mathrm{sign}[f_2(x)]$ misclassifies 3 points on the training set.
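Applying the same update rule to $D_2$ reproduces $D_3$ (again just a sanity check): the entries for samples 4-6, which $G_2$ misclassifies, grow to about 0.1667, while the others shrink.

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
D2 = np.array([0.07143] * 6 + [0.16667] * 3 + [0.07143])
G2 = np.where(np.arange(10) < 8.5, 1, -1)   # round-2 stump: +1 if x < 8.5
a2 = 0.5 * np.log((1 - 0.2143) / 0.2143)    # coefficient, ≈ 0.6496

unnorm = D2 * np.exp(-a2 * y * G2)
D3 = unnorm / unnorm.sum()                  # ≈ (0.0455, ..., 0.1667, ..., 0.1060, ..., 0.0455)
```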
Round 3, m=3:
1. Weight distribution of the data
$$D_3=(0.0455,0.0455,0.0455,0.1667,0.1667,0.1667,0.1060,0.1060,0.1060,0.0455)$$
2. Compute the classification error rate
$G_3(x)$ attains its lowest training error at the threshold $v=5.5$, so the error rate is
$$e_3=P(G_3(x_i)\neq y_i)=0.1820$$
So the base classifier is:
$$G_3(x)=\begin{cases}1, & x>5.5 \\ -1, & x<5.5\end{cases}$$
3. Compute the coefficient of $G_3(x)$
$$a_3=\frac{1}{2}\log\frac{1-e_3}{e_3}=0.7514$$
4. Update the weight distribution of the data
$$\begin{aligned}
& D_4=(0.125,0.125,0.125,0.102,0.102,0.102,0.065,0.065,0.065,0.125) \\
& f_3(x)=0.4236\,G_1(x)+0.6496\,G_2(x)+0.7514\,G_3(x)
\end{aligned}$$
The classifier $\mathrm{sign}[f_3(x)]$ has zero misclassified points on the training set.
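Plugging the three stumps and their coefficients into $f_3$ confirms that $\mathrm{sign}[f_3(x)]$ matches every label in the table:

```python
import numpy as np

X = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

# the three base classifiers found in rounds 1-3
G1 = np.where(X < 2.5, 1, -1)
G2 = np.where(X < 8.5, 1, -1)
G3 = np.where(X > 5.5, 1, -1)

# weighted vote with the coefficients a_1, a_2, a_3
f3 = 0.4236 * G1 + 0.6496 * G2 + 0.7514 * G3
print((np.sign(f3) == y).all())   # True: zero training errors
```

Note that no single stump classifies all ten points correctly, yet the weighted combination does, which is exactly the point of Boosting.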
Gradient Boosting
Note
1. How does AdaBoost amplify the weights of misclassified samples while shrinking the weights of correctly classified ones?
$$w_{m+1,i}=\frac{w_{mi}}{Z_m}\exp(-a_m y_i G_m(x_i)),\quad i=1,2,...,N$$
This can be rewritten as:
$$w_{m+1,i}=\begin{cases}\dfrac{w_{mi}}{Z_m}\exp(-a_m), & G_m(x_i)=y_i \\[2ex] \dfrac{w_{mi}}{Z_m}\exp(a_m), & G_m(x_i)\neq y_i\end{cases}$$
Since $e_m<\frac{1}{2}$ implies $a_m>0$, a correctly classified sample has its weight multiplied by $e^{-a_m}<1$ and therefore shrinks, while a misclassified sample has its weight multiplied by $e^{a_m}>1$ and grows.
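For example, in round 1 above ($e_1=0.3$) the two multipliers work out to:

```python
import numpy as np

e1 = 0.3
a1 = 0.5 * np.log((1 - e1) / e1)

grow = np.exp(a1)     # ≈ 1.528: factor applied to misclassified samples
shrink = np.exp(-a1)  # ≈ 0.655: factor applied to correctly classified samples
# the relative boost of a wrong sample over a right one is (1 - e1) / e1
print(grow / shrink)  # ≈ 2.333
```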
2. How are the weak classifiers combined into a strong classifier?
The weak classifiers are combined through the coefficients $\alpha_m$. Each $\alpha_m$ measures the importance of its classifier in the final vote, and it increases as $e_m$ decreases.
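This monotone relationship is easy to tabulate ($\alpha_m$ is positive only while the weak classifier beats random guessing, i.e. $e_m<0.5$):

```python
import numpy as np

for e in (0.1, 0.2, 0.3, 0.4, 0.5):
    alpha = 0.5 * np.log((1 - e) / e)   # classifier coefficient
    print(f"e_m = {e:.1f}  ->  alpha_m = {alpha:.4f}")
```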
2019.9.2: Note section added.