8.1 Adaboost
Boosting
Basic idea
- Boosting promotes a weakly learnable algorithm into a strongly learnable one.
- Boosting is one kind of ensemble learning.
- Ensemble learning has two main categories:
- Sequential methods
- Parallel methods
The AdaBoost algorithm
Solves binary classification problems with labels $y \in\{-1,+1\}$.
Train a model on the training data, then examine how it classifies the data set as a whole and each individual sample:
If the model classifies the whole data set well, it receives a larger weight in the final combined model, and vice versa.
If an individual sample is classified well, its weight is decreased when training the next model.
Iterate this process for M rounds until the final classification reaches the desired target, then combine all the models to obtain a strongly learnable model.
Input: training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i} \in X \subseteq R^{n}$, $y_{i} \in Y=\{-1,+1\}$; a weak learning algorithm;
Output: final classifier $G(x)$
(1) Initialize the weight distribution over the training data:
$$D_{1}=\left(\omega_{11}, \cdots, \omega_{1 i}, \cdots, \omega_{1 N}\right), \quad \omega_{1 i}=\frac{1}{N}, \quad i=1,2, \cdots, N$$
(2) For $m=1,2, \cdots, M$:
(2.1) Learn from the training data with weight distribution $D_{m}$ to obtain a base classifier
$$G_{m}(x): X \rightarrow\{-1,+1\}$$
(2.2) Compute the classification error rate $e_{m}$ of $G_{m}(x)$ on the training data:
$$e_{m}=\sum_{i=1}^{N} P\left(G_{m}\left(x_{i}\right) \neq y_{i}\right)=\sum_{i=1}^{N} \omega_{m i} I\left(G_{m}\left(x_{i}\right) \neq y_{i}\right)$$
(2.3) Compute the coefficient of $G_{m}(x)$:
$$\alpha_{m}=\frac{1}{2} \log \frac{1-e_{m}}{e_{m}}$$
where the logarithm is the natural logarithm.
(2.4) Update the weight distribution over the training data:
$$D_{m+1}=\left(\omega_{m+1,1}, \cdots, \omega_{m+1, i}, \cdots, \omega_{m+1, N}\right)$$
$$\omega_{m+1, i}=\frac{\omega_{m i}}{Z_{m}} \exp \left(-\alpha_{m} y_{i} G_{m}\left(x_{i}\right)\right), \quad i=1,2, \cdots, N$$
Here $Z_{m}$ is the normalization factor
$$Z_{m}=\sum_{i=1}^{N} \omega_{m i} \exp \left(-\alpha_{m} y_{i} G_{m}\left(x_{i}\right)\right)$$
which makes $D_{m+1}$ a probability distribution.
(3) Build the linear combination of the base classifiers:
$$f(x)=\sum_{m=1}^{M} \alpha_{m} G_{m}(x)$$
to obtain the final classifier
$$G(x)=\operatorname{sign}(f(x))=\operatorname{sign}\left(\sum_{m=1}^{M} \alpha_{m} G_{m}(x)\right)$$
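The algorithm above can be sketched in Python using one-dimensional decision stumps as the base classifiers $G_m(x)$. This is a minimal illustration rather than the notes' own code; the toy data set, function names, and exhaustive stump search are all assumptions.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump on a 1-D feature: +polarity if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity

def best_stump(X, y, w):
    """Exhaustively pick the stump minimizing the weighted error
    e_m = sum_i w_i * I(G(x_i) != y_i)."""
    best = None
    for threshold in sorted(set(X)):
        for polarity in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, threshold, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(X, y, M):
    """Steps (1)-(2.4): returns a list of (alpha_m, threshold, polarity)."""
    N = len(X)
    w = [1.0 / N] * N                                   # (1) D_1 is uniform
    classifiers = []
    for m in range(M):
        e_m, threshold, polarity = best_stump(X, y, w)  # (2.1)-(2.2)
        alpha = 0.5 * math.log((1 - e_m) / max(e_m, 1e-12))   # (2.3)
        # (2.4) reweight: misclassified samples gain weight
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(X, y, w)]
        Z = sum(w)                                      # normalization factor Z_m
        w = [wi / Z for wi in w]
        classifiers.append((alpha, threshold, polarity))
    return classifiers

def predict(classifiers, x):
    """(3) G(x) = sign(sum_m alpha_m * G_m(x))."""
    s = sum(a * stump_predict(x, t, p) for a, t, p in classifiers)
    return 1 if s >= 0 else -1

# Toy 1-D data (illustrative): three stumps suffice for zero training error.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
clfs = adaboost(X, y, M=3)
print([predict(clfs, xi) for xi in X])
```

On this data the three rounds give weighted errors of about 0.30, 0.21, and 0.18, and the combined classifier separates the data perfectly even though no single stump can.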
Boosting tree
Base learner: a classification tree or a regression tree.
Boosting tree model:
$$f_{M}(x)=\sum_{m=1}^{M} T\left(x ; \Theta_{m}\right)$$
Forward stagewise algorithm:
$$\begin{aligned} &f_{m}(x)=f_{m-1}(x)+T\left(x ; \Theta_{m}\right) \\ &\widehat{\Theta}_{m}=\arg \min \sum_{i=1}^{N} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+T\left(x_{i} ; \Theta_{m}\right)\right) \\ &f_{0}(x)=0 \\ &f_{1}(x)=f_{0}(x)+T\left(x ; \Theta_{1}\right) \\ &\frac{1}{N} \sum_{i=1}^{N} L\left(y_{i}, f_{1}\left(x_{i}\right)\right) \end{aligned}$$
For regression problems the loss is typically the squared error
$$L(y, f(x))=\frac{1}{2}(y-f(x))^{2}$$
Regression problem (squared error loss)
Algorithm 8.3 (boosting tree for regression)
Input: training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i} \in X \subseteq R^{n}$, $y_{i} \in Y \subseteq R$
Output: boosting tree $f_{M}(x)$
(1) Initialize $f_{0}(x)=0$
(2) For $m=1,2, \cdots, M$:
(2.1) Compute the residuals:
$$r_{m i}=y_{i}-f_{m-1}\left(x_{i}\right), \quad i=1,2, \cdots, N$$
(2.2) Fit a regression tree to the residuals $r_{m i}$, obtaining $T\left(x ; \Theta_{m}\right)$
(2.3) Update
$$f_{m}(x)=f_{m-1}(x)+T\left(x ; \Theta_{m}\right)$$
(3) Obtain the boosting tree for regression:
$$f_{M}(x)=\sum_{m=1}^{M} T\left(x ; \Theta_{m}\right)$$
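A minimal Python sketch of Algorithm 8.3, assuming the simplest possible regression tree for $T(x;\Theta_m)$: a single split with a constant prediction on each side. The toy data and helper names are illustrative assumptions, not from the notes.

```python
def fit_stump(X, r):
    """Fit a one-split regression tree T(x; Theta) to targets r:
    split at s and predict the mean of r on each side (optimal for squared error)."""
    best = None
    for s in sorted(set(X))[:-1]:
        left = [ri for xi, ri in zip(X, r) if xi <= s]
        right = [ri for xi, ri in zip(X, r) if xi > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - c1) ** 2 for ri in left) + sum((ri - c2) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, c1, c2)
    _, s, c1, c2 = best
    return lambda x: c1 if x <= s else c2

def boosting_tree(X, y, M):
    """Algorithm 8.3: start from f_0 = 0 and fit each new tree to the residuals."""
    trees = []
    residuals = list(y)                      # r_1i = y_i - f_0(x_i) = y_i
    for m in range(M):
        T = fit_stump(X, residuals)          # (2.2) fit a tree to r_mi
        trees.append(T)                      # (2.3) f_m = f_{m-1} + T
        residuals = [ri - T(xi) for xi, ri in zip(X, residuals)]   # (2.1)
    return lambda x: sum(T(x) for T in trees)    # (3) f_M(x)

# Toy 1-D regression data (illustrative):
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]
f6 = boosting_tree(X, y, M=6)
print(round(sum((yi - f6(xi)) ** 2 for xi, yi in zip(X, y)), 3))
```

Because each tree is fitted to the current residuals, the squared training loss decreases monotonically as M grows.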
Why fit the residuals: for any sample value $y$ and fitted value $f(x)$, the squared loss is
$$L[y, f(x)]=[y-f(x)]^{2}$$
In the forward stagewise algorithm,
$$\begin{aligned} L\left(y, f_{m}(x)\right) &=\left[y-f_{m-1}(x)-T\left(x ; \Theta_{m}\right)\right]^{2} \\ &=\left[r_{m-1}-T\left(x ; \Theta_{m}\right)\right]^{2} \\ &=L\left(r_{m-1}, T\left(x ; \Theta_{m}\right)\right) \end{aligned}$$
so minimizing the loss of $f_{m}$ amounts to fitting the new tree to the residual $r_{m-1}$.
Gradient boosting for regression:
Algorithm 8.4 (gradient boosting)
Input: training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i} \in X \subseteq R^{n}$, $y_{i} \in Y \subseteq R$; a loss function $L(y, f(x))$
Output: regression tree $\hat{f}(x)$
(1) Initialize
$$f_{0}(x)=\arg \min _{c} \sum_{i=1}^{N} L\left(y_{i}, c\right)$$
(2) For $m=1,2, \cdots, M$:
(2.1) For $i=1,2, \cdots, N$, compute
$$r_{m i}=-\left[\frac{\partial L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f(x)=f_{m-1}(x)}$$
(2.2) Fit a regression tree to $r_{m i}$, obtaining the leaf regions of the $m$-th tree, $R_{m j}, j=1,2, \cdots, J$
(2.3) For $j=1,2, \cdots, J$, compute
$$c_{m j}=\arg \min _{c} \sum_{x_{i} \in R_{m j}} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+c\right)$$
(2.4) Update
$$f_{m}(x)=f_{m-1}(x)+\sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right)$$
(3) Obtain the regression tree
$$\hat{f}(x)=f_{M}(x)=\sum_{m=1}^{M} \sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right)$$
Summary
- Boosting combines multiple weak models to achieve the effect of a single strong model.
- In AdaBoost, each newly trained classifier focuses on the samples that the previous classifiers still misclassify.
- A boosting method whose base functions are decision trees is called a boosting tree.
8.2 Forward Stagewise Algorithm
Forward stagewise algorithm
Relationship between the forward stagewise algorithm applied to an additive model with exponential loss and AdaBoost. Conclusion: the two are equivalent.
$$f(x)=\sum_{m=1}^{M} \alpha_{m} G_{m}(x), \quad G_{m}(x) \in\{-1,+1\}$$
Loss function:
$$L(y, f(x))=\exp (-y f(x))$$
Suppose that after $m-1$ iterations the forward stagewise algorithm has obtained
$$f_{m-1}(x)=\sum_{j=1}^{m-1} \alpha_{j} G_{j}(x)$$
Then
$$\begin{aligned} f_{m}(x) &=f_{m-1}(x)+\alpha_{m} G_{m}(x) \\ \left(\alpha_{m}, G_{m}(x)\right) &=\arg \min _{\alpha, G} \sum_{i=1}^{N} \exp \left[-y_{i}\left(f_{m-1}\left(x_{i}\right)+\alpha G\left(x_{i}\right)\right)\right] \\ &=\arg \min _{\alpha, G} \sum_{i=1}^{N} \overline{\omega_{m i}} \exp \left[-y_{i} \alpha G\left(x_{i}\right)\right] \\ &=\arg \min _{\alpha, G}\left[\sum_{i \in M_{1}} \overline{\omega_{m i}} \exp (-\alpha)+\sum_{i \in M_{2}} \overline{\omega_{m i}} \exp (\alpha)\right] \end{aligned}$$
where $M_{1}$ is the set of correctly classified samples and $M_{2}$ the set of misclassified samples.
Rewriting the objective,
$$\begin{aligned} \sum_{i \in M_{1}} \overline{\omega_{m i}} \exp (-\alpha)+\sum_{i \in M_{2}} \overline{\omega_{m i}} \exp (\alpha) &=\sum_{i \in M_{1}} \overline{\omega_{m i}} \exp (-\alpha)+\sum_{i \in M_{2}} \overline{\omega_{m i}} \exp (-\alpha)+\sum_{i \in M_{2}} \overline{\omega_{m i}}(\exp (\alpha)-\exp (-\alpha)) \\ &=\exp (-\alpha) \sum_{i} \overline{\omega_{m i}}+[\exp (\alpha)-\exp (-\alpha)] \sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right) \end{aligned}$$
Since $e^{\alpha}-e^{-\alpha}>0$ for $\alpha>0$, the optimal $G$ is
$$G_{m}^{*}=\arg \min _{G} \sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)$$
Next, solve for the optimal $\alpha$:
$$\alpha_{m}=\arg \min _{\alpha} \sum_{i} \overline{\omega_{m i}} \exp \left(-\alpha y_{i} G_{m}^{*}\left(x_{i}\right)\right)$$
where the objective equals
$$\sum_{i \in M_{1}} \overline{\omega_{m i}} \exp (-\alpha)+\sum_{i \in M_{2}} \overline{\omega_{m i}} \exp (\alpha)=\left(e^{\alpha}-e^{-\alpha}\right) \sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)+e^{-\alpha} \sum_{i} \overline{\omega_{m i}}$$
Differentiate with respect to $\alpha$ and set the derivative to zero:
$$\left(e^{\alpha}+e^{-\alpha}\right) \sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)-e^{-\alpha} \sum_{i} \overline{\omega_{m i}}=0$$
$$\begin{gathered} e^{2 \alpha}=\frac{\sum_{i} \overline{\omega_{m i}}}{\sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)}-1 \\ \alpha=\frac{1}{2} \ln \frac{\sum_{i} \overline{\omega_{m i}}-\sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)}{\sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)}=\frac{1}{2} \ln \frac{1-e_{m}}{e_{m}}, \quad e_{m}=\frac{\sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)}{\sum_{i} \overline{\omega_{m i}}} \end{gathered}$$
Here
$$\overline{\omega_{m i}}=\exp \left(-y_{i} f_{m-1}\left(x_{i}\right)\right)=\exp \left(-y_{i} \sum_{j=1}^{m-1} \alpha_{j} G_{j}\left(x_{i}\right)\right)=\prod_{j=1}^{m-1} \exp \left(-y_{i} \alpha_{j} G_{j}\left(x_{i}\right)\right)$$
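As a numerical sanity check of the derivation (the values below are assumptions chosen for illustration), the closed form $\alpha=\frac{1}{2} \ln \frac{1-e_{m}}{e_{m}}$ should minimize the objective $\left(e^{\alpha}-e^{-\alpha}\right) \sum_{i} \overline{\omega_{m i}} I\left(y_{i} \neq G\left(x_{i}\right)\right)+e^{-\alpha} \sum_{i} \overline{\omega_{m i}}$. With the weights normalized so that $\sum_{i} \overline{\omega_{m i}}=1$ and an assumed weighted error $e_m=0.25$:

```python
import math

e_m = 0.25                                    # assumed weighted error rate
alpha_star = 0.5 * math.log((1 - e_m) / e_m)  # closed-form solution

def objective(alpha):
    # (e^a - e^-a) * sum(w * I(misclassified)) + e^-a * sum(w),
    # with the two sums equal to e_m and 1 respectively
    return (math.exp(alpha) - math.exp(-alpha)) * e_m + math.exp(-alpha)

# A brute-force grid search over alpha confirms the closed form is optimal.
grid_min = min(objective(a / 1000) for a in range(-3000, 3001))
print(objective(alpha_star) <= grid_min)
```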
Summary
- One interpretation of AdaBoost is that the algorithm is an instance of the forward stagewise algorithm.
- For AdaBoost, the model is an additive model, the loss function is the exponential loss, and the algorithm is the forward stagewise algorithm.