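Ensemble Learning
1. Individual Learners and Ensembles
"Three cobblers with their wits combined equal Zhuge Liang": ensemble learning combines the outputs of multiple individual learners to improve predictive accuracy and generalization.
"Gentlemen seek harmony, not uniformity": each individual learner must be somewhat better than random guessing, and the learners' predictions must also be diverse.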
Learner | Sample a | Sample b | Sample c | Sample a | Sample b | Sample c | Sample a | Sample b | Sample c
---|---|---|---|---|---|---|---|---|---
Learner 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0
Learner 2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0
Learner 3 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0
Ensemble | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0
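The ensemble row in the table is produced by voting, i.e. the majority wins; an entry of 1 means the sample is classified correctly and 0 means it is misclassified. In the first three columns (samples a, b, c) the learners are individually 67% accurate and disagree with one another, so the ensemble outperforms every single learner. In the middle three columns each learner is too inaccurate (below 50%), so the ensemble is worse than any single learner. In the last three columns the learners lack diversity, so the ensemble is no better than a single learner.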
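The voting in the table can be reproduced with a few lines of Python. This is a minimal sketch: the group names are made up for illustration, and it assumes binary classification, so that "the majority of learners is correct" is the same as "the majority vote is correct".

```python
# 1 = the learner classifies the sample correctly, 0 = it does not.
# Values are copied from the table; the group names are illustrative only.
groups = {
    "accurate and diverse": [[1, 1, 0], [1, 0, 1], [0, 1, 1]],
    "inaccurate": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "no diversity": [[1, 1, 0], [1, 1, 0], [1, 1, 0]],
}


def majority_vote(rows):
    """Per sample, return 1 if more than half of the learners are correct."""
    n_learners = len(rows)
    return [int(2 * sum(column) > n_learners) for column in zip(*rows)]


def accuracy(row):
    """Fraction of samples marked as correct."""
    return sum(row) / len(row)


results = {name: majority_vote(rows) for name, rows in groups.items()}

for name, rows in groups.items():
    individual = [round(accuracy(r), 2) for r in rows]
    print(name, "| individual:", individual, "| ensemble:", round(accuracy(results[name]), 2))
```

Only the first group gains from voting: its learners are each better than chance and make their mistakes on different samples.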
Convergence guarantee for an ensemble of individual learners:
$$
\begin{aligned}
P(H(x)\ne f(x)) &= \sum_{k=0}^{\lfloor T/2 \rfloor}\binom{T}{k}(1-\epsilon)^k\epsilon^{T-k}\\
&\le \exp\Big(-\frac{1}{2}T(1-2\epsilon)^2\Big)
\end{aligned}
$$
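Here $\epsilon$ is the error rate of each individual learner, $H(x)$ is the combined output of the learners, $f(x)$ is the true label of the sample, and $P(\cdot)$ is the error rate of the ensemble. Under voting, the ensemble outputs the label that more than half of the learners agree on. Each term of the sum is the probability that exactly $k$ learners predict correctly (with $k\le \lfloor T/2\rfloor$), so the sum is the probability that the ensemble predicts incorrectly.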
The inequality is proved as follows.
[Derivation] Assume the base classifiers are mutually independent, and let the random variable $X$ be the number of the $T$ base classifiers that classify correctly, so that $X \sim B(T,1-\epsilon)$. Let $x_i$ indicate whether the $i$-th classifier is correct, so that $x_i\sim B(1,1-\epsilon)$ for $i=1,2,\dots,T$. Then
$$
X=\sum_{i=1}^Tx_i,\qquad \mathbb{E}(X)=\sum_{i=1}^T\mathbb{E}(x_i)=(1-\epsilon)T
$$
Therefore
$$
\begin{aligned}
P(H(x)\ne f(x)) &=P(X\le\lfloor T/2\rfloor)\\
&\le P(X\le T/2)\\
&=P\bigg[X-(1-\epsilon)T\le\frac{T}{2}-(1-\epsilon)T\bigg]\\
&=P\bigg[X-(1-\epsilon)T\le-\frac{T}{2}(1-2\epsilon)\bigg]\\
&=P\bigg[\sum_{i=1}^Tx_i-\sum_{i=1}^T\mathbb{E}(x_i)\le-\frac{T}{2}(1-2\epsilon)\bigg]\\
&=P\bigg[\frac{1}{T}\sum_{i=1}^Tx_i-\frac{1}{T}\sum_{i=1}^T\mathbb{E}(x_i)\le-\frac{1}{2}(1-2\epsilon)\bigg]
\end{aligned}
$$
By Hoeffding's inequality, for any $\delta>0$,
$$
P\bigg(\frac{1}{m}\sum_{i=1}^mx_i-\frac{1}{m}\sum_{i=1}^m\mathbb{E}(x_i)\le-\delta\bigg)\le \exp(-2m\delta^2)
$$
Setting $\delta=\frac{1-2\epsilon}{2}$ (positive when $\epsilon<\frac{1}{2}$) and $m=T$ gives
$$
\begin{aligned}
P(H(x)\ne f(x)) &= \sum_{k=0}^{\lfloor T/2 \rfloor}\binom{T}{k}(1-\epsilon)^k\epsilon^{T-k}\\
&\le \exp\Big(-\frac{1}{2}T(1-2\epsilon)^2\Big)
\end{aligned}
$$
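Two basic conclusions:
- The bound on the ensemble error shrinks exponentially as the number of individual learners $T$ grows.
- Individual learners with $\epsilon = 0.5$ contribute nothing to convergence: the bound stays at 1 for every $T$.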
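Both conclusions can be checked numerically. The sketch below evaluates the exact binomial sum and the Hoeffding bound side by side; the function names and the choices of $T$ and $\epsilon$ are illustrative, and the independence of the learners is assumed exactly as in the derivation.

```python
import math


def ensemble_error(T, eps):
    """Exact probability that at most floor(T/2) of T independent learners are correct."""
    return sum(
        math.comb(T, k) * (1 - eps) ** k * eps ** (T - k)
        for k in range(T // 2 + 1)
    )


def hoeffding_bound(T, eps):
    """Upper bound exp(-T * (1 - 2*eps)^2 / 2) from the derivation."""
    return math.exp(-0.5 * T * (1 - 2 * eps) ** 2)


for T in (5, 25, 101):
    print(T, ensemble_error(T, 0.3), hoeffding_bound(T, 0.3))

# With eps = 0.5 the bound is 1 regardless of T, so adding learners does not help.
print(hoeffding_bound(1000, 0.5))
```

The bound is loose for small $T$, but both the exact error and the bound fall quickly as $T$ grows when $\epsilon<0.5$.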
2. AdaBoost
Learn $T$ individual learners $h_t$ and their weights $\alpha_t$ such that the weighted sum
$$
H(x)=\sum_{t=1}^T\alpha_th_t(x)
$$
minimizes the exponential loss
$$
\ell_{\exp}(H\mid\mathcal{D})=\mathbb{E}_{x\sim\mathcal{D}}\big[e^{-f(x)H(x)}\big]
$$
The loss is designed so that wrong predictions cost more. Taking $H(x)\in\{-1,+1\}$ for illustration: when the prediction is correct, $f(x)=H(x)$, so $f(x)H(x)=1$ and the loss is $e^{-1}$; when the prediction is wrong, $f(x)H(x)=-1$ and the loss is $e > e^{-1}$.
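A small numeric illustration of the exponential loss follows. It is only a sketch: the function names are hypothetical, and the empirical mean assumes a uniform distribution over the samples.

```python
import math


def exp_loss(f_x, H_x):
    """Exponential loss of one sample: exp(-f(x) * H(x))."""
    return math.exp(-f_x * H_x)


def mean_exp_loss(labels, scores):
    """Empirical counterpart of the expectation, assuming uniform sample weights."""
    return sum(exp_loss(y, s) for y, s in zip(labels, scores)) / len(labels)


correct = exp_loss(+1, +1)            # f(x) * H(x) = 1
wrong = exp_loss(+1, -1)              # f(x) * H(x) = -1
confident_wrong = exp_loss(+1, -2.0)  # a real-valued H(x) with a larger wrong margin

print(correct, wrong, confident_wrong)
```

Because $H(x)$ is a weighted sum, it is real-valued in practice, and the loss grows exponentially with the size of a wrong margin.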
**Forward stagewise algorithm:** each round learns only one learner $h_t$ and its weight $\alpha_t$. The optimization objective in round $t$ is
$$
(\alpha_t,h_t)=\underset{\alpha,h}{\arg\min}\ \ell_{\exp}(H_{t-1}+\alpha h\mid\mathcal{D})
$$
By the definition of the exponential loss (Eq. 8.5),
$$
\begin{aligned}
\ell_{\exp}(H_{t-1}+\alpha h \mid \mathcal{D})&=\mathbb{E}_{x\sim\mathcal{D}}\big[e^{-f(x)(H_{t-1}(x)+\alpha h(x))}\big]\\
&=\sum_{i=1}^{|D|}\mathcal{D}(x_i)e^{-f(x_i)(H_{t-1}(x_i)+\alpha h(x_i))}\\
&=\sum_{i=1}^{|D|}\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}e^{-f(x_i)\alpha h(x_i)}
\end{aligned}
$$
Because $f(x_i)$ and $h(x_i)$ take values only in $\{-1,1\}$, it follows that
$$
\begin{aligned}
e^{-f(x_i)\alpha h(x_i)}&=e^{-\alpha f(x_i)h(x_i)}\\
&=\begin{cases}
e^{-\alpha}, & f(x_i)h(x_i)=1\\
e^{\alpha}, & f(x_i)h(x_i)=-1
\end{cases}\\
&=e^{-\alpha}+(e^{\alpha}-e^{-\alpha})I(f(x_i)\ne h(x_i))
\end{aligned}
$$
Substituting this into the loss gives
$$
\begin{aligned}
\ell_{\exp}(H_{t-1}+\alpha h \mid \mathcal{D})
&=\sum_{i=1}^{|D|}\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}\big(e^{-\alpha}+(e^\alpha-e^{-\alpha})I(f(x_i)\ne h(x_i))\big)\\
&=\sum_{i=1}^{|D|}\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}e^{-\alpha}+\sum_{i=1}^{|D|}\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}(e^\alpha-e^{-\alpha})I(f(x_i)\ne h(x_i))
\end{aligned}
$$
Introduce the shorthand $\mathcal{D}'_t(x_i)=\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}$, and note that $e^{-\alpha}$ and $e^\alpha-e^{-\alpha}$ do not depend on the summation index, so they can be pulled out of the sums:
$$
\ell_{\exp}(H_{t-1}+\alpha h \mid \mathcal{D})=e^{-\alpha}\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)+(e^\alpha-e^{-\alpha})\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)I(f(x_i)\ne h(x_i))
$$
The goal is to find the $h_t$ that minimizes $\ell_{\exp}$, so terms that do not depend on $h$ can be dropped, and the objective becomes
$$
h_t=\underset{h}{\arg\min}\ (e^\alpha-e^{-\alpha})\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)I(f(x_i)\ne h(x_i))
$$
Furthermore, since $\alpha>0$ (the case for any base learner that beats random guessing), it is easy to show that $e^{\alpha}-e^{-\alpha}>0$ always holds, so the objective reduces to
$$
h_t=\underset{h}{\arg\min}\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)I(f(x_i)\ne h(x_i))
$$
where
$$
\mathcal{D}'_t(x_i)=\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}
$$
The form of $\mathcal{D}'_t(x_i)$ shows that it depends only on the learners from round $t-1$ and earlier, so when solving for $h_t$ it is already fixed for every sample $i$. If $\mathcal{D}'_t(x_i)$ is viewed as the weight distribution over samples in round $t$, then $h_t$ is the learner that solves the minimization above under these weights.
To make sure $\mathcal{D}'_t(x_i)$ is a proper distribution, it is usually normalized before being used as the sample weights for the next learner, i.e. $\mathcal{D}_t(x_i)=\frac{\mathcal{D}_t'(x_i)}{\sum_{j=1}^{|D|}\mathcal{D}_t'(x_j)}$. The denominator is a constant, so this transformation does not affect the minimization above.
Interestingly, the sample weights of one round can be computed from those of the previous round instead of from scratch. Taking round $t+1$ as an example, the definition gives:
$$
\begin{aligned}
\mathcal{D}'_{t+1}(x_i)&=\mathcal{D}(x_i)e^{-f(x_i)H_t(x_i)}\\
&=\mathcal{D}(x_i)e^{-f(x_i)(H_{t-1}(x_i)+\alpha_th_t(x_i))}\\
&=\mathcal{D}(x_i)e^{-f(x_i)H_{t-1}(x_i)}e^{-f(x_i)\alpha_th_t(x_i)}\\
&=\mathcal{D}'_t(x_i)e^{-f(x_i)\alpha_th_t(x_i)}
\end{aligned}
$$
After normalization, $\mathcal{D}_{t+1}(x_i)\propto\mathcal{D}_t(x_i)e^{-f(x_i)\alpha_th_t(x_i)}$.
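The recursion can be verified numerically: updating the weights round by round gives the same distribution as recomputing them from the accumulated $H_t$. In this sketch the labels, the two rounds of predictions, and the values of $\alpha_1,\alpha_2$ are all hypothetical.

```python
import math


def normalize(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def update_weights(weights, labels, preds, alpha):
    """One re-weighting step: D_{t+1}(x_i) is proportional to D_t(x_i) * exp(-alpha * f(x_i) * h_t(x_i))."""
    return normalize(
        [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, labels, preds)]
    )


labels = [1, 1, -1, -1]     # f(x_i)
h1 = [1, -1, -1, -1]        # hypothetical round-1 predictions (sample 1 is wrong)
h2 = [1, 1, 1, -1]          # hypothetical round-2 predictions (sample 2 is wrong)
alpha1, alpha2 = 0.5, 0.3   # hypothetical learner weights

d1 = [0.25] * 4             # initial uniform distribution D
d2 = update_weights(d1, labels, h1, alpha1)
d3 = update_weights(d2, labels, h2, alpha2)

# From scratch: D(x_i) * exp(-f(x_i) * H_2(x_i)), then normalize.
H2 = [alpha1 * a + alpha2 * b for a, b in zip(h1, h2)]
d3_direct = normalize([d * math.exp(-y * H) for d, y, H in zip(d1, labels, H2)])

print(d3)
print(d3_direct)
```

Misclassified samples receive a larger share of the weight in the next round, which is what forces later learners to focus on them.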
Next, solve for the weight $\alpha_t$. Differentiating the loss $\ell_{\exp}(H_{t-1}+\alpha h_t\mid\mathcal{D})$ with respect to $\alpha$ gives:
$$
\begin{aligned}
\frac{\partial\ell_{\exp}(H_{t-1}+\alpha h_t\mid\mathcal{D})}{\partial\alpha}&=\frac{\partial\Big(e^{-\alpha}\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)+(e^{\alpha}-e^{-\alpha})\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)I(f(x_i)\ne h_t(x_i))\Big)}{\partial\alpha}\\
&=-e^{-\alpha}\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)+(e^{\alpha}+e^{-\alpha})\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)I(f(x_i)\ne h_t(x_i))
\end{aligned}
$$
Setting the derivative to zero and rearranging gives:
$$
\begin{aligned}
\frac{e^{-\alpha}}{e^\alpha+e^{-\alpha}}&=\frac{\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)I(f(x_i)\ne h_t(x_i))}{\sum^{|D|}_{i=1}\mathcal{D}'_t(x_i)}=\sum_{i=1}^{|D|}\frac{\mathcal{D}'_t(x_i)}{Z_t}I(f(x_i)\ne h_t(x_i))\\
&=\sum_{i=1}^{|D|}\mathcal{D}_t(x_i)I(f(x_i)\ne h_t(x_i))=\mathbb{E}_{x\sim\mathcal{D}_t}[I(f(x)\ne h_t(x))]=\epsilon_t
\end{aligned}
$$
where $Z_t=\sum_{i=1}^{|D|}\mathcal{D}'_t(x_i)$ is the normalization factor. Solving for $\alpha$:
$$
\frac{e^{-\alpha}}{e^\alpha+e^{-\alpha}}=\frac{1}{e^{2\alpha}+1}\Rightarrow e^{2\alpha}+1=\frac{1}{\epsilon_t}\Rightarrow e^{2\alpha}=\frac{1-\epsilon_t}{\epsilon_t}\Rightarrow 2\alpha=\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big)\Rightarrow \alpha_t = \frac{1}{2}\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big)
$$
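This expression is monotonically decreasing in $\epsilon_t$ and positive for $\epsilon_t<\frac{1}{2}$, so a learner with a higher error rate is assigned a smaller weight.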
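Putting the pieces together (weighted-error minimization for $h_t$, the closed form for $\alpha_t$, and the weight update), the whole procedure can be sketched as below. This is a minimal illustration, not a reference implementation: the base learner is assumed to be a threshold stump on a single scalar feature, and all function names and the toy dataset are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: predicts `polarity` if x > threshold, otherwise -polarity."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the stump with the smallest weighted error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds):
    """Train up to `rounds` stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n                      # D_1: uniform
    model = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        if err >= 0.5:                           # no better than random guessing: stop
            break
        err = max(err, 1e-12)                    # guard against log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)  # alpha_t = 1/2 * ln((1 - eps_t) / eps_t)
        model.append((alpha, thr, pol))
        # D_{t+1}(x_i) proportional to D_t(x_i) * exp(-alpha_t * f(x_i) * h_t(x_i))
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the weighted sum H(x) = sum_t alpha_t * h_t(x)."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in model)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
print(model)
print([predict(model, x) for x in xs])
```

On this toy set every individual stump misclassifies some samples, yet the weighted combination of three stumps fits all ten points.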
3. Gradient Boosting
Generalizing AdaBoost, so that the loss is no longer restricted to the exponential function and the task is no longer restricted to binary classification, gives the more general form of Boosting:
$$
\begin{aligned}
\ell(H_t\mid\mathcal{D})&=\mathbb{E}_{x\sim\mathcal{D}}[\mathrm{err}(H_t(x),f(x))] \\
&=\mathbb{E}_{x\sim\mathcal{D}}[\mathrm{err}(H_{t-1}(x)+\alpha_th_t(x),f(x))]
\end{aligned}
$$
For example, in a regression problem $f(x)\in\mathbb{R}$ and the loss can be the squared loss $\mathrm{err}(H_t(x),f(x))=(H_t(x)-f(x))^2$.
As in AdaBoost, round $t$ produces $\alpha_t$ and $h_t(x)$. First take a first-order Taylor expansion of the loss at $H_{t-1}(x)$:
$$
\begin{aligned}
\ell(H_t\mid\mathcal{D})&\approx\mathbb{E}_{x\sim\mathcal{D}}\bigg[\mathrm{err}(H_{t-1}(x),f(x))+\frac{\partial\, \mathrm{err}(H_t(x),f(x))}{\partial H_t(x)}\bigg|_{H_t(x)=H_{t-1}(x)}(H_t(x)-H_{t-1}(x))\bigg]\\
&=\mathbb{E}_{x\sim\mathcal{D}}\bigg[\mathrm{err}(H_{t-1}(x),f(x))+\frac{\partial\, \mathrm{err}(H_t(x),f(x))}{\partial H_t(x)}\bigg|_{H_t(x)=H_{t-1}(x)}\alpha_t h_t(x)\bigg]\\
&=\mathbb{E}_{x\sim\mathcal{D}}[\mathrm{err}(H_{t-1}(x),f(x))]+\mathbb{E}_{x\sim\mathcal{D}}\bigg[\frac{\partial\, \mathrm{err}(H_t(x),f(x))}{\partial H_t(x)}\bigg|_{H_t(x)=H_{t-1}(x)}\alpha_t h_t(x)\bigg]
\end{aligned}
$$
The first term of the expansion is the constant $\ell(H_{t-1}\mid\mathcal{D})$, which depends only on earlier rounds, so minimizing $\ell(H_t\mid\mathcal{D})$ only requires minimizing the second term. Setting $\alpha_t$ aside for now, $h_t(x)$ is obtained by solving:
$$
\begin{aligned}
h_t(x)=\ &\underset{h}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}}\bigg[\frac{\partial\, \mathrm{err}(H_t(x),f(x))}{\partial H_t(x)}\bigg|_{H_t(x)=H_{t-1}(x)} h(x)\bigg]\\
&\text{s.t. constraints for } h(x)
\end{aligned}
$$
Once $h_t(x)$ is found, the weight $\alpha_t$ is obtained by solving:
$$
\alpha_t=\underset{\alpha}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}}[\mathrm{err}(H_{t-1}(x)+\alpha h_t(x),f(x))]
$$
This is the theoretical framework of Gradient Boosting: in each round, a gradient-descent step is used to boost weak individual learners into a strong learner. AdaBoost is a special case of it.
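For the squared loss, the two steps of each round can be sketched as follows. Several choices here are assumptions rather than part of the framework: $h_t$ is obtained by fitting a regression stump to the negative gradient in the least-squares sense (a common practical stand-in for the constrained arg min), $H_0(x)=0$, and all names and the toy data are hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump: one threshold, one constant per side.
    Assumes xs contains at least two distinct values."""
    best = None
    values = sorted(set(xs))
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def gradient_boost(xs, ys, rounds):
    """Return a list of (alpha_t, h_t) for the squared loss (H - f)^2."""
    learners = []
    preds = [0.0] * len(xs)                      # H_0(x) = 0 (assumption)
    for _ in range(rounds):
        # Negative gradient of (H - f)^2 with respect to H, evaluated at H_{t-1}.
        neg_grad = [2 * (y - p) for y, p in zip(ys, preds)]
        h = fit_stump(xs, neg_grad)
        hx = [h(x) for x in xs]
        denom = sum(v * v for v in hx)
        if denom == 0:
            break
        # Exact line search for the squared loss: alpha = sum(r * h) / sum(h^2), r = f - H_{t-1}.
        alpha = sum((y - p) * v for y, p, v in zip(ys, preds, hx)) / denom
        learners.append((alpha, h))
        preds = [p + alpha * v for p, v in zip(preds, hx)]
    return learners


def predict(learners, x):
    """H_t(x) = sum of alpha_t * h_t(x)."""
    return sum(alpha * h(x) for alpha, h in learners)


def mse(learners, xs, ys):
    return sum((predict(learners, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(x) for x in range(10)]
ys = [0.1 * x * x for x in xs]                   # toy regression target
for rounds in (0, 1, 10):
    print(rounds, mse(gradient_boost(xs, ys, rounds), xs, ys))
```

With this loss the negative gradient is twice the residual $f(x)-H_{t-1}(x)$, so each round fits the current residuals and the line search returns $\alpha_t=\frac{1}{2}$.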
AdaBoost rederived
With the exponential loss $\mathrm{err}(H_t(x),f(x))=e^{-f(x)H_t(x)}$:
$$
\begin{aligned}
h_t(x)&= \underset{h}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}}\bigg[\frac{\partial\, \mathrm{err}(H_t(x),f(x))}{\partial H_t(x)}\bigg|_{H_t(x)=H_{t-1}(x)} h(x)\bigg]\\
&=\underset{h}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}}\bigg[\frac{\partial e^{-f(x)H_t(x)}}{\partial H_t(x)}\bigg|_{H_t(x)=H_{t-1}(x)} h(x)\bigg]\\
&=\underset{h}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}}\Big[-f(x)e^{-f(x)H_{t-1}(x)}h(x)\Big]\\
&=\underset{h}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}_t}[-f(x)h(x)]
\end{aligned}
$$
The last step absorbs the factor $e^{-f(x)H_{t-1}(x)}$ into the reweighted distribution $\mathcal{D}_t$; the normalization constant does not affect the arg min.
Since $f(x),h(x)\in\{-1,1\}$, we have
$$
f(x)h(x)=1-2I(f(x)\ne h(x))
$$
Substituting this into the previous expression and simplifying gives
$$
h_t(x)=\underset{h}{\arg\min}\ \mathbb{E}_{x\sim\mathcal{D}_t}[I(f(x)\ne h(x))]
$$
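The simplification rests on the fact that, under a normalized distribution, $\mathbb{E}[-f(x)h(x)] = 2\,\mathbb{E}[I(f(x)\ne h(x))]-1$, so both objectives share the same minimizer. A quick numeric check, with hypothetical labels, predictions, and weights:

```python
labels = [1, -1, 1, 1, -1]               # f(x_i)
preds = [1, 1, 1, -1, -1]                # h(x_i), hypothetical
weights = [0.1, 0.3, 0.2, 0.25, 0.15]    # D_t(x_i), sums to 1

# E_{x ~ D_t}[-f(x) h(x)]
neg_correlation = -sum(w * y * p for w, y, p in zip(weights, labels, preds))

# E_{x ~ D_t}[I(f(x) != h(x))], i.e. the weighted error rate
weighted_error = sum(w for w, y, p in zip(weights, labels, preds) if y != p)

print(neg_correlation, 2 * weighted_error - 1)
```

The two printed values agree, which is why minimizing $\mathbb{E}_{x\sim\mathcal{D}_t}[-f(x)h(x)]$ is the same as minimizing the weighted error rate.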
4. GBDT and XGBoost
GBDT uses Gradient Boosting as its basic framework and CART (a decision-tree variant) as the individual learner.
- For regression, GBDT uses the squared loss: $\mathrm{err}(H_t(x),f(x))=(H_t(x)-f(x))^2$
- For binary classification, GBDT uses the log-likelihood (logistic) loss: $\mathrm{err}(H_t(x),f(x))=\log(1+\exp(-H_t(x)f(x)))$
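In the Gradient Boosting framework, each round fits the new tree to the negative gradient of the chosen loss with respect to the current prediction. The sketch below writes out that negative gradient for the two losses above and checks it against a finite difference; the function names are hypothetical, and labels for the logistic loss are assumed to be in $\{-1,+1\}$.

```python
import math


def squared_loss(H, f):
    return (H - f) ** 2


def squared_neg_grad(H, f):
    """-d/dH (H - f)^2 = 2 * (f - H), i.e. twice the residual."""
    return 2 * (f - H)


def log_loss(H, f):
    return math.log(1 + math.exp(-H * f))


def log_neg_grad(H, f):
    """-d/dH log(1 + exp(-H * f)) = f / (1 + exp(H * f))."""
    return f / (1 + math.exp(H * f))


def numeric_neg_grad(loss, H, f, step=1e-6):
    """Central finite-difference approximation of the negative gradient."""
    return -(loss(H + step, f) - loss(H - step, f)) / (2 * step)


for H, f in [(0.3, 1), (-1.2, 1), (0.8, -1)]:
    print(squared_neg_grad(H, f), numeric_neg_grad(squared_loss, H, f))
    print(log_neg_grad(H, f), numeric_neg_grad(log_loss, H, f))
```

For the squared loss the fitting target is the residual (up to a factor of 2); for the logistic loss it is the label scaled by how far the current prediction is from being confidently correct.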
XGBoost is short for eXtreme Gradient Boosting. Its relationship to GBDT is analogous to that between LIBSVM and SVM: XGBoost is an efficient implementation and refinement of GBDT.
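5. Bagging and Random Forests
Bagging is the representative parallel ensemble method. We draw $T$ sample sets, each containing $m$ training samples, train one base learner on each sample set, and then combine the learners for prediction.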
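Bootstrap sampling:
Suppose $n$ samples are drawn with replacement from a set of $n$ samples. After the $n$ draws, some samples have been drawn more than once and some have never been drawn. The samples that were never drawn can serve as a validation set, and their share is approximately:
$$
\lim_{n\rightarrow\infty}\bigg(1-\frac{1}{n}\bigg)^n=\frac{1}{e}\approx36.8\%
$$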
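The limit can be checked with a short simulation. This is a sketch with a hypothetical function name; the sample size and the random seed are arbitrary.

```python
import math
import random


def out_of_bag_fraction(n, rng):
    """Draw n indices with replacement from range(n); return the share never drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


rng = random.Random(0)                       # fixed seed for reproducibility
estimate = out_of_bag_fraction(100_000, rng)

print("simulated:", estimate)
print("(1 - 1/n)^n for n = 1000:", (1 - 1 / 1000) ** 1000)
print("1/e:", 1 / math.e)
```

Even for moderate $n$ the out-of-bag share is already close to $1/e$, so roughly a third of the data is left out of each bootstrap sample.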
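Random Forest is an extension of Bagging: on top of a Bagging ensemble that uses decision trees as base learners, it further introduces random attribute selection into the training of each tree.
Suppose each sample has $d$ attributes. At every node of a base decision tree, first randomly select a subset of $k$ ($k\le d$) attributes from the node's attribute set, then choose the optimal split from within that subset.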
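Random forests usually train more efficiently than Bagging, because each node split only considers a subset of the attributes. Their generalization error is also usually lower than Bagging's, because the attribute perturbation makes each base decision tree more robust (less prone to overfitting the training set).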
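The per-node attribute sampling described above can be sketched as follows. The split criterion (Gini impurity on categorical attributes), the function names, and the toy data are assumptions made for illustration; the text only specifies that $k$ of the $d$ attributes are sampled before the best split is chosen.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_gini(rows, labels, attr):
    """Weighted Gini impurity after splitting on the categorical attribute `attr`."""
    total = 0.0
    for value in set(row[attr] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        total += len(subset) / len(labels) * gini(subset)
    return total


def choose_split(rows, labels, k, rng):
    """Sample k of the d attributes at random, then pick the best split among them."""
    d = len(rows[0])
    candidates = rng.sample(range(d), k)
    best = min(candidates, key=lambda a: split_gini(rows, labels, a))
    return best, candidates


# Toy data with d = 3 attributes; attribute 1 determines the label exactly.
rows = [(0, 0, 1), (1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
labels = [0, 0, 1, 1, 0, 1]

rng = random.Random(0)
print(choose_split(rows, labels, k=3, rng=rng))   # k = d: an ordinary decision-tree split
print(choose_split(rows, labels, k=1, rng=rng))   # k = 1: the split attribute is purely random
```

With $k=d$ the node behaves like an ordinary decision tree, and with $k=1$ the attribute is chosen purely at random; intermediate values trade individual tree strength for diversity across trees.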