AdaBoost can be derived from the forward stagewise algorithm. The relationship is stated as a theorem:
Theorem:
The AdaBoost algorithm is a special case of the forward stagewise additive algorithm, in which the model is an additive model composed of basic classifiers and the loss function is the exponential loss.
Proof:
The forward stagewise algorithm learns an additive model. When the basis functions are basic classifiers, this additive model is exactly AdaBoost's final classifier:
$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$
which is composed of the basic classifiers $G_m(x)$ and their coefficients $\alpha_m$, $m = 1, 2, \ldots, M$. The forward stagewise algorithm learns the basis functions one at a time, exactly as AdaBoost learns its basic classifiers one at a time. We now show that when the loss function of the forward stagewise algorithm is the exponential loss function
$$L(y, f(x)) = \exp[-y f(x)],$$

its concrete learning steps are equivalent to those of the AdaBoost algorithm.
Suppose that after $m-1$ rounds of iteration the forward stagewise algorithm has obtained $f_{m-1}(x)$:
$$\begin{aligned} f_{m-1}(x) &= f_{m-2}(x) + \alpha_{m-1} G_{m-1}(x) \\ &= \alpha_1 G_1(x) + \cdots + \alpha_{m-1} G_{m-1}(x) \end{aligned}$$
In the $m$-th round of iteration we obtain $\alpha_m$, $G_m(x)$, and $f_m(x)$:
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$
The goal is for the $\alpha_m$ and $G_m(x)$ obtained by the forward stagewise algorithm to minimize the exponential loss of $f_m(x)$ on the training set $T$, that is:
$$\begin{aligned} (\alpha_m, G_m(x)) &= \arg\min_{\alpha, G} \sum_{i=1}^N L(y_i, f_m(x_i)) \\ &= \arg\min_{\alpha, G} \sum_{i=1}^N \exp[-y_i (f_{m-1}(x_i) + \alpha G(x_i))] \end{aligned} \tag{1}$$
Note:
For regression problems, the loss function of the forward stagewise algorithm can be the squared loss,

$$L(y_i, f(x)) = (y_i - f(x))^2,$$

so that

$$\begin{aligned} L(y_i, f_{m-1}(x_i) + \alpha G(x_i)) &= (y_i - f_{m-1}(x_i) - \alpha G(x_i))^2 \\ &= (r_{mi} - \alpha G(x_i))^2 \end{aligned}$$

where $r_{mi} = y_i - f_{m-1}(x_i)$ is the residual of the current model. To obtain $\alpha G(x_i)$, we simply fit it to the current model's residuals.

AdaBoost, however, is a classifier, and for classification the squared loss is not well suited; this is why the exponential loss is introduced.
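As a quick numerical illustration of this note, here is a minimal sketch (the toy data and the fixed candidate $G(x) = x$ are invented for the example): under squared loss, the best coefficient for a fixed $G$ is exactly a least-squares fit to the current residuals.

```python
import numpy as np

# Toy regression data, made up for this illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 2 * x + 0.1 * rng.normal(size=20)

f_prev = np.zeros_like(y)   # f_{m-1}(x): the current model (zero to start)
r = y - f_prev              # r_{mi}: residuals of the current model

# For a fixed candidate G(x) = x, the alpha minimizing
# sum_i (r_i - alpha * G(x_i))^2 is the least-squares fit onto G.
G = x
alpha = (r @ G) / (G @ G)
print(alpha)  # close to 2: the increment alpha * G fits the residuals
```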
Rewriting (1) gives:
$$(\alpha_m, G_m) = \arg\min_{\alpha, G} \sum_{i=1}^N \overline{w}_{mi} \exp(-\alpha y_i G(x_i)) \tag{2}$$
where $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$. Since $\overline{w}_{mi}$ depends on neither $\alpha$ nor $G$, it is irrelevant to the minimization. It does, however, depend on $f_{m-1}(x)$, and therefore changes with every iteration.
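The step from (1) to (2) is nothing more than factoring $\exp(-y_i f_{m-1}(x_i))$ out of the exponential. A short numerical check, with all quantities synthetic, confirms the two objectives agree:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
y = rng.choice([-1, 1], size=N)   # labels in {-1, +1}
f_prev = rng.normal(size=N)       # f_{m-1}(x_i): arbitrary synthetic values
G = rng.choice([-1, 1], size=N)   # outputs of some candidate classifier
alpha = 0.7

w_bar = np.exp(-y * f_prev)       # \bar{w}_{mi} = exp(-y_i f_{m-1}(x_i))

lhs = np.sum(np.exp(-y * (f_prev + alpha * G)))  # objective in (1)
rhs = np.sum(w_bar * np.exp(-alpha * y * G))     # objective in (2)
assert np.isclose(lhs, rhs)
```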
We now show that the $\alpha_m^*$ and $G_m^*(x)$ that minimize (2) are exactly the $\alpha_m$ and $G_m(x)$ produced by the AdaBoost algorithm. Solving (2) proceeds in two steps.
Step 1: find $G_m^*(x)$.
Because $y_i \in \{-1, 1\}$, and $y_i$ is either equal to $G(x_i)$ or not, the sum can be split into two parts (the $\arg\min$ is omitted for the moment):
$$\begin{aligned} \sum_{i=1}^N \overline{w}_{mi} \exp(-\alpha y_i G(x_i)) &= e^{-\alpha} \sum_{y_i = G_m(x_i)} \overline{w}_{mi} + e^{\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} \\ &= e^{-\alpha} \sum_{y_i = G_m(x_i)} \overline{w}_{mi} + e^{\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} + e^{-\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} - e^{-\alpha} \sum_{y_i \neq G_m(x_i)} \overline{w}_{mi} \end{aligned}$$
Merging the terms above gives:
$$(e^{\alpha} - e^{-\alpha}) \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G_m(x_i)) + e^{-\alpha} \sum_{i=1}^N \overline{w}_{mi} \tag{3}$$
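The add-and-subtract manipulation above is easy to get wrong, so here is a sanity check that the objective in (2) equals the expression in (3), again on synthetic values:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
y = rng.choice([-1, 1], size=N)
G = rng.choice([-1, 1], size=N)
w_bar = rng.uniform(0.1, 2.0, size=N)
alpha = 0.9

obj2 = np.sum(w_bar * np.exp(-alpha * y * G))              # objective (2)
miss = (y != G)                                            # I(y_i != G(x_i))
obj3 = ((np.exp(alpha) - np.exp(-alpha)) * np.sum(w_bar * miss)
        + np.exp(-alpha) * np.sum(w_bar))                  # expression (3)
assert np.isclose(obj2, obj3)
```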
In the $m$-th iteration, regard $\alpha$ as a constant; then the second term above, as well as the factor $(e^{\alpha} - e^{-\alpha})$, can be treated as constants, so minimizing the loss only requires minimizing $\sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G_m(x_i))$. Hence:
$$G_m^*(x) = \arg\min_G \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i))$$

This classifier $G_m^*(x)$ is precisely AdaBoost's basic classifier $G_m(x)$: it is the basic classifier with the smallest classification error rate on the $m$-th round's weighted training data.
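In code, Step 1 amounts to a weighted-error minimization over a pool of candidate base classifiers. A sketch, assuming threshold stumps as the candidate pool:

```python
import numpy as np

def best_stump(x, y, w_bar):
    """Return the threshold stump minimizing sum_i w_bar_i * I(y_i != G(x_i))."""
    best_err, best_pred = np.inf, None
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = sign * np.where(x <= thr, 1, -1)
            err = np.sum(w_bar * (pred != y))
            if err < best_err:
                best_err, best_pred = err, pred
    return best_err, best_pred

# Synthetic usage: the stump with the smallest weighted error plays G*_m.
rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = np.where(x > 0, 1, -1)
w_bar = rng.uniform(0.5, 1.5, size=30)
err, pred = best_stump(x, y, w_bar)
```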
Step 2: find $\alpha_m^*$.
Now suppose $G_m$ is known and return to formula (3). The variable is now $\alpha$. To minimize the loss, first take the partial derivative with respect to $\alpha$:
$$\frac{\partial L}{\partial \alpha} = e^{\alpha} \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i)) + e^{-\alpha} \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i)) - e^{-\alpha} \sum_{i=1}^N \overline{w}_{mi}$$
Then set $\frac{\partial L}{\partial \alpha} = 0$, which gives:
$$e^{\alpha} \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i)) = \left[ \sum_{i=1}^N \overline{w}_{mi} - \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i)) \right] e^{-\alpha}$$
Taking the logarithm of both sides gives:
$$\log \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i)) + \log e^{\alpha} = \log \left[ \sum_{i=1}^N \overline{w}_{mi} - \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i)) \right] + \log e^{-\alpha}$$
And since $\log e^{-\alpha} = -\log e^{\alpha}$, we have:
$$\log e^{\alpha} = \frac{1}{2} \log \frac{\sum_{i=1}^N \overline{w}_{mi} - \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i))}{\sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i))}$$
Since $\log e^{\alpha} = \alpha$, this solves to:
$$\alpha_m = \frac{1}{2} \log \frac{\sum_{i=1}^N \overline{w}_{mi} - \sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i))}{\sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i))}$$
Furthermore, since the weighted error rate is
$$e_m = \frac{\sum_{i=1}^N \overline{w}_{mi} I(y_i \neq G(x_i))}{\sum_{i=1}^N \overline{w}_{mi}}$$
$\alpha_m$ can be written as:
$$\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$$
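This closed form can be checked against a direct numerical minimization of expression (3); the sketch below uses scipy.optimize.minimize_scalar, with random weights and error indicators standing in for a real round:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
N = 100
w_bar = rng.uniform(0.1, 1.0, size=N)   # synthetic unnormalized weights
miss = rng.random(N) < 0.3              # synthetic I(y_i != G(x_i))

W_err, W_all = np.sum(w_bar[miss]), np.sum(w_bar)
e_m = W_err / W_all                     # weighted error rate

# Minimize (3) over alpha and compare with the closed form.
loss = lambda a: (np.exp(a) - np.exp(-a)) * W_err + np.exp(-a) * W_all
a_star = minimize_scalar(loss).x
assert np.isclose(a_star, 0.5 * np.log((1 - e_m) / e_m), atol=1e-6)
```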
Having found $G_m(x)$ and $\alpha_m$, we can write the update formula for $f(x)$:
$$f_m(x) = f_{m-1}(x) + \alpha_m G_m(x)$$
From $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$ we obtain the update formula for $w$:
$$\begin{aligned} \overline{w}_{m+1,i} &= \exp(-y_i f_m(x_i)) \\ &= \exp(-y_i (f_{m-1}(x_i) + \alpha_m G_m(x_i))) \\ &= \overline{w}_{m,i} \exp(-\alpha_m y_i G_m(x_i)) \end{aligned}$$
This differs from AdaBoost's sample-weight update only by a normalization factor, and is therefore equivalent to it.
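A small numerical check of the "only a normalization factor" claim, on synthetic values: after normalizing, both updates give the same weight distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20
y = rng.choice([-1, 1], size=N)
f_prev = rng.normal(size=N)
G = rng.choice([-1, 1], size=N)
alpha = 0.4

w_bar_m = np.exp(-y * f_prev)                   # \bar{w}_{m,i}
w_bar_next = w_bar_m * np.exp(-alpha * y * G)   # update derived above

# AdaBoost's update divides by the normalizing constant Z_m.
w_ada = (w_bar_m / w_bar_m.sum()) * np.exp(-alpha * y * G)
w_ada /= w_ada.sum()

assert np.allclose(w_bar_next / w_bar_next.sum(), w_ada)
```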
This completes the derivation: when the loss function of the forward stagewise algorithm is the exponential loss, the forward stagewise algorithm is AdaBoost.
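Putting it all together, the sketch below (synthetic data, threshold stumps as the assumed base classifiers) runs AdaBoost's normalized-weight loop and asserts, at every round, that its weight distribution matches the forward-stagewise weights $\overline{w}_{mi} = \exp(-y_i f_{m-1}(x_i))$ up to normalization:

```python
import numpy as np

def best_stump(x, y, w):
    """Threshold stump minimizing the weighted misclassification error."""
    best = (np.inf, None)
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = sign * np.where(x <= thr, 1, -1)
            err = np.sum(w * (pred != y))
            if err < best[0]:
                best = (err, pred)
    return best

rng = np.random.default_rng(6)
x = rng.normal(size=40)
y = np.where(x > 0, 1, -1)
y[rng.choice(40, size=6, replace=False)] *= -1   # label noise: no perfect stump

D = np.full(40, 1 / 40)          # AdaBoost's normalized weight distribution
f = np.zeros(40)                 # forward-stagewise additive model
for m in range(5):
    w_bar = np.exp(-y * f)                      # \bar{w}_{mi} from f_{m-1}
    assert np.allclose(w_bar / w_bar.sum(), D)  # same distribution every round
    err, pred = best_stump(x, y, D)             # Step 1: pick G_m
    alpha = 0.5 * np.log((1 - err) / err)       # Step 2: alpha_m (D sums to 1)
    f += alpha * pred                           # f_m = f_{m-1} + alpha_m G_m
    D *= np.exp(-alpha * y * pred)              # AdaBoost weight update ...
    D /= D.sum()                                # ... with normalization Z_m
```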
References:
李航, 《统计学习方法》 (Li Hang, Statistical Learning Methods)
从前向分步算法推导出AdaBoost (Deriving AdaBoost from the forward stagewise algorithm)