In boosting, the ensemble consists of many very simple member classifiers whose performance is only slightly better than random guessing (rough rules of thumb); these are called weak learners. A typical weak learner is a decision stump (a one-level decision tree).
AdaBoost trains each weak learner on the full training set, but reweights the training samples at every iteration so that each new learner focuses on the mistakes of its predecessors, gradually building a stronger classifier.
The AdaBoost weighted classifier:
$$F(\mathbf{x})=\sum_{t=1}^{T}\alpha_{t}h(\mathbf{x};\theta_{t}),\qquad h(\mathbf{x};\theta_{t})\in\{-1,1\}$$
Final decision:
$$f(\mathbf{x})=\operatorname{sign}\{F(\mathbf{x})\}$$
Loss function:
$$\arg\min_{\alpha_{t};\theta_{t},\,t=1,2,\dots,T}\sum_{i=1}^{N}\exp(-y_{i}F(\mathbf{x}_{i}))$$
For a misclassified sample, $y_{i}F(\mathbf{x}_{i})<0$; for a correctly classified sample, $y_{i}F(\mathbf{x}_{i})>0$.
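As a quick numerical check (a minimal sketch with arbitrary values), the exponential loss grows rapidly for misclassified samples ($y\,F(\mathbf{x})<0$) and shrinks toward zero for confidently correct ones:

```python
import math

def exp_loss(y, F_x):
    """Exponential loss exp(-y * F(x)) for a single sample."""
    return math.exp(-y * F_x)

# correctly classified (y * F(x) > 0): small loss
print(exp_loss(+1, 2.0))   # exp(-2) ≈ 0.135
# misclassified (y * F(x) < 0): large loss
print(exp_loss(+1, -2.0))  # exp(2) ≈ 7.389
```

This asymmetry is what pushes AdaBoost to concentrate weight on the samples the current ensemble gets wrong.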
Derivation
At iteration $m$, the objective for the first $m$ classifiers is:
$$J(\alpha,\theta)=\sum_{i=1}^{N}\exp\bigl(-y_{i}(F_{m-1}(\mathbf{x}_{i})+\alpha h(\mathbf{x}_{i};\theta))\bigr)$$
$$(\alpha_{m},\theta_{m})=\arg\min_{\alpha,\theta}J(\alpha,\theta)$$
Simplifying the objective:
$$J(\alpha,\theta)=\sum_{i=1}^{N}\exp(-y_{i}F_{m-1}(\mathbf{x}_{i}))\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta))=\sum_{i=1}^{N}w_{i}^{m}\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta))$$
Solve for
$$\theta_{m}=\arg\min_{\theta}\sum_{i=1}^{N}w_{i}^{m}\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta))$$
Ignoring $\alpha$ for the moment, $\theta_{m}$ minimizes the total weight of the misclassified samples:
$$\theta_{m}=\arg\min_{\theta}\Bigl\{P_{m}=\sum_{i=1}^{N}w_{i}^{m}\,I(1-y_{i}h(\mathbf{x}_{i};\theta))\Bigr\},\qquad I(u)=\begin{cases}0,&u=0\\1,&\text{otherwise}\end{cases}$$
Here $P_{m}$ is the total weight of the samples misclassified by the $m$-th classifier; a usable weak learner must satisfy $P_{m}<0.5$.
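A minimal sketch of computing the weighted error $P_{m}$ for a hypothetical weight vector and set of stump predictions (the numbers are made up for illustration):

```python
import numpy as np

def weighted_error(w, y, h):
    """P_m = sum of the weights of samples where the stump disagrees with the label;
    (y != h) plays the role of the indicator I(1 - y*h)."""
    w, y, h = map(np.asarray, (w, y, h))
    return float(np.sum(w * (y != h)))

w = np.array([0.1, 0.2, 0.3, 0.4])   # normalized sample weights
y = np.array([+1, -1, +1, -1])       # true labels
h = np.array([+1, +1, +1, -1])       # stump predictions (sample 2 is wrong)
print(weighted_error(w, y, h))       # → 0.2
```

Only the second sample (weight 0.2) is misclassified, so $P_{m}=0.2<0.5$.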
Solving for the classifier weight $\alpha_{m}$:
From
$$J(\alpha,\theta_{m})=\sum_{i=1}^{N}w_{i}^{m}\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta_{m}))$$
Since the weights are normalized, the total weight of the misclassified samples (those with $y_{i}h(\mathbf{x}_{i};\theta_{m})=-1$) is $\sum_{y_{i}h(\mathbf{x}_{i};\theta_{m})<0}w_{i}^{m}=P_{m}$, while the total weight of the correctly classified samples (those with $y_{i}h(\mathbf{x}_{i};\theta_{m})=1$) is $\sum_{y_{i}h(\mathbf{x}_{i};\theta_{m})>0}w_{i}^{m}=1-P_{m}$.
It follows that:
$$\alpha_{m}=\arg\min_{\alpha}\bigl[\exp(-\alpha)(1-P_{m})+\exp(\alpha)P_{m}\bigr]$$
Setting the derivative with respect to $\alpha$ to zero:
$$\bigl[\exp(-\alpha)(1-P_{m})+\exp(\alpha)P_{m}\bigr]'=0$$
which gives:
$$\alpha_{m}=\frac{1}{2}\ln\frac{1-P_{m}}{P_{m}}$$
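The closed form above can be checked numerically (a small sketch): $\alpha_{m}$ is positive exactly when $P_{m}<0.5$, grows as $P_{m}$ shrinks, and vanishes at $P_{m}=0.5$.

```python
import math

def alpha(P_m):
    """Classifier weight alpha_m = 0.5 * ln((1 - P_m) / P_m)."""
    return 0.5 * math.log((1 - P_m) / P_m)

print(alpha(0.1))   # ≈ 1.099: an accurate stump gets a large vote
print(alpha(0.4))   # ≈ 0.203: a barely-useful stump gets a small vote
print(alpha(0.5))   # 0.0: no better than chance, zero vote
```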
Sample weight update
$$w_{i}^{(m+1)}=\exp(-y_{i}F_{m}(\mathbf{x}_{i}))=w_{i}^{(m)}\exp(-y_{i}\alpha_{m}h(\mathbf{x}_{i};\theta_{m}))$$
Normalization:
$$w_{i}^{(m+1)}=w_{i}^{(m+1)}\Big/\sum_{i=1}^{N}w_{i}^{(m+1)}$$
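The update and normalization steps together can be sketched as follows (the weight vector and predictions are hypothetical toy values):

```python
import numpy as np

def update_weights(w, alpha_m, y, h):
    """w_i <- w_i * exp(-alpha_m * y_i * h_i), then renormalize to sum to 1."""
    w = np.asarray(w, dtype=float) * np.exp(-alpha_m * np.asarray(y) * np.asarray(h))
    return w / w.sum()

w = np.full(4, 0.25)             # uniform initial weights
y = np.array([+1, -1, +1, -1])
h = np.array([+1, +1, +1, -1])   # sample 2 misclassified
w_new = update_weights(w, 0.5, y, h)
print(w_new)                     # the misclassified sample's weight grows above 0.25
```

Misclassified samples are multiplied by $e^{\alpha_m}>1$ and correct ones by $e^{-\alpha_m}<1$, so after renormalization the next weak learner is forced to pay more attention to the hard samples.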
The algorithm in pseudocode:
Input: training data $D=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{N}$, $y_{i}\in\{-1,1\}$
Output: classification model $F(\mathbf{x})$
1. Initialize the sample weight vector $P^{1}=\{\frac{1}{N},\frac{1}{N},\dots,\frac{1}{N}\}$, so that $\sum_{i=1}^{N}P_{i}=1$
2. $\textbf{for }t=1\textbf{ to }T$
3. -------- Train a weak classifier $h_{t}(\mathbf{x})$ using the weights $P^{t}$
4. -------- Compute its weighted error rate $\epsilon_{t}=\sum_{i=1}^{N}P^{t}_{i}[[h_{t}(\mathbf{x}_{i})\neq y_{i}]]$
5. -------- Compute the classifier weight $\alpha_{t}=\frac{1}{2}\ln\frac{1-\epsilon_{t}}{\epsilon_{t}}$; the smaller $\epsilon_{t}$ is, the larger $\alpha_{t}$ becomes ($\epsilon_{t}$ is assumed to stay below 0.5)
6. -------- Update the sample weight vector $P^{t+1}_{i}=P^{t}_{i}\exp(-\alpha_{t}y_{i}h_{t}(\mathbf{x}_{i}))$; the weights of misclassified samples increase while those of correctly classified samples decrease
7. -------- Normalize the weight vector $P^{t+1}_{i}=P^{t+1}_{i}\big/\sum_{i=1}^{N}P^{t+1}_{i}$
8. $\textbf{end for}$
9. Final decision: $F(\mathbf{x})=\operatorname{sgn}[f_{T}(\mathbf{x})]=\operatorname{sgn}\bigl(\sum_{t=1}^{T}\alpha_{t}h_{t}(\mathbf{x})\bigr)$
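The nine steps above can be sketched end to end in NumPy with a threshold stump as the weak learner (a minimal illustration with exhaustive stump search, not an optimized implementation; the toy data is made up so that no single stump can separate it):

```python
import numpy as np

def train_stump(X, y, P):
    """Step 3: pick the (feature, threshold, polarity) with minimal weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(X[:, j] <= thr, pol, -pol)
                err = np.sum(P[pred != y])           # step 4: weighted error
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost_fit(X, y, T=10):
    N = len(y)
    P = np.full(N, 1.0 / N)                          # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(T):                               # step 2
        err, j, thr, pol = train_stump(X, y, P)
        err = max(err, 1e-10)                        # guard against log(0)
        a = 0.5 * np.log((1 - err) / err)            # step 5: classifier weight
        pred = np.where(X[:, j] <= thr, pol, -pol)
        P = P * np.exp(-a * y * pred)                # step 6: reweight samples
        P = P / P.sum()                              # step 7: normalize
        stumps.append((j, thr, pol))
        alphas.append(a)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    F = np.zeros(len(X))
    for (j, thr, pol), a in zip(stumps, alphas):
        F += a * np.where(X[:, j] <= thr, pol, -pol)
    return np.sign(F)                                # step 9: sign of the weighted sum

# toy 1-D data with an interval pattern a single stump cannot fit
X = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y = np.array([+1, +1, -1, -1, +1, +1])
stumps, alphas = adaboost_fit(X, y, T=3)
print(adaboost_predict(X, stumps, alphas))           # → [ 1.  1. -1. -1.  1.  1.]
```

Three stumps suffice here: each round the misclassified samples gain weight, and the weighted vote of the three thresholds reproduces the +/−/+ interval pattern exactly.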
Initially, all '+' and '-' samples carry equal weight. After the first classifier, the '+' samples on the right are misclassified, so the weights of the correctly classified samples are decreased and those of the misclassified samples are increased. The second classifier then handles the heavily weighted '+' samples correctly, but misclassifies the '-' samples on the left, so their weights are increased while the weights of the correct samples shrink. Finally, all classifiers are combined according to their classifier weights.
As shown in the figure, the three classifiers divide the decision region into six parts, numbered 1 to 6 from left to right and top to bottom. In regions 1 and 6 the three classifiers agree. Region 2 is blue because the first two classifiers vote blue while the third votes non-blue; comparing the weights, $0.42+0.65=1.07>0.92$, so the region is classified blue. The remaining regions follow the same reasoning.
The following trains an AdaBoost classifier on the Wine dataset, using a decision stump as each member classifier. The original post does not show how X and y are prepared; the loading step below is an assumption (two of the three classes and two features are kept for a binary problem, so the reported accuracies depend on this choice).

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# assumed data preparation: UCI Wine, classes 2 and 3, two features
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data', header=None)
df_wine = df_wine[df_wine[0] != 1]   # drop class 1
y = df_wine[0].values                # class labels (2 and 3)
X = df_wine[[1, 12]].values          # e.g. Alcohol and OD280/OD315

le = LabelEncoder()
y = le.fit_transform(y)              # map labels to {0, 1}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# decision stump (one-level tree) as the weak learner
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1)
ada = AdaBoostClassifier(base_estimator=tree,   # renamed to `estimator` in scikit-learn >= 1.2
                         n_estimators=500,
                         learning_rate=0.1,
                         random_state=0)

tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
print('Decision tree train/test accuracies: %.3f/%.3f'
      % (accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)))

ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
print('Adaboost tree train/test accuracies: %.3f/%.3f'
      % (accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)))
Decision tree train/test accuracies: 0.604/0.563
Adaboost tree train/test accuracies: 0.887/0.845