From Forward Stagewise Additive Modeling to AdaBoost
What is the relationship between forward stagewise additive modeling and AdaBoost? Beyond both being Boosting models, AdaBoost is in fact the special case of forward stagewise additive modeling whose loss function is the exponential loss. This post walks through the derivation.
Forward Stagewise Additive Modeling
- Initialize $f_0(x) = 0$
- For $m = 1, 2, \dots, M$:

(a)

$$(\beta_m,\gamma_m) = \arg\min_{\beta,\gamma} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i;\gamma))$$

(b)

$$f_m(x) = f_{m-1}(x) + \beta_m b(x;\gamma_m)$$
Those are the steps of the forward stagewise algorithm; I actually think "forward stagewise additive model" is the better name, because the final decision function $f(x)$ is an additive combination of the basis functions $b(x;\gamma_m)$.
For regression problems, the loss function of the forward stagewise algorithm can be the squared loss, i.e.
$$L(y_i, f(x)) = (y_i - f(x))^2$$
So we have
$$L(y_i, f_{m-1}(x_i) + \beta b(x_i;\gamma)) = (y_i - f_{m-1}(x_i) - \beta b(x_i;\gamma))^2 = (r_{im} - \beta b(x_i;\gamma))^2$$
where $r_{im} = y_i - f_{m-1}(x_i)$; that is, each new basis function is fit to the residual of the current model.
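To make the residual-fitting view concrete, here is a minimal runnable sketch of forward stagewise regression with squared loss. The decision-stump basis function, the function names, and the tiny 1-D dataset are all illustrative choices, not anything prescribed by the derivation.

```python
# Forward stagewise regression with squared loss: each round fits a basis
# function (here a decision stump, chosen just for illustration) to the
# residuals r_im = y_i - f_{m-1}(x_i) of the current model.

def fit_stump(xs, rs):
    """Return (split, left_value, right_value) minimizing squared error
    against the residuals rs."""
    best = None
    for s in xs:
        left = [r for x, r in zip(xs, rs) if x <= s]
        right = [r for x, r in zip(xs, rs) if x > s]
        cl = sum(left) / len(left) if left else 0.0
        cr = sum(right) / len(right) if right else 0.0
        err = sum((r - (cl if x <= s else cr)) ** 2 for x, r in zip(xs, rs))
        if best is None or err < best[0]:
            best = (err, s, cl, cr)
    return best[1], best[2], best[3]

def forward_stagewise(xs, ys, M=10):
    f = [0.0] * len(xs)                        # f_0(x) = 0
    for _ in range(M):
        rs = [y - fi for y, fi in zip(ys, f)]  # residuals of current model
        s, cl, cr = fit_stump(xs, rs)          # basis fit to the residuals
        f = [fi + (cl if x <= s else cr) for x, fi in zip(xs, f)]
    return f

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2]
f = forward_stagewise(xs, ys, M=10)
mse = sum((y - fi) ** 2 for y, fi in zip(ys, f)) / len(ys)
```

Because each stump is fit by least squares to the current residuals, the training error can never increase from round to round; on this toy data it drops well below the variance of $y$ within a few rounds.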
AdaBoost, however, is a classifier, and for classification problems the squared loss is not well suited, so the exponential loss is used instead:
$$L(y, f(x)) = \exp(-y f(x))$$
Basic AdaBoost is a binary classifier, so take its basis function to be $b(x;\gamma) = G(x)$ with $G(x) \in \{-1, +1\}$.
Under the exponential loss, the problem to solve at each step becomes
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N \exp[-y_i(f_{m-1}(x_i) + \beta G(x_i))]$$
Let $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$; the objective can then be written as
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N w_i^{(m)} \exp(-\beta y_i G(x_i))$$
Since $y_i \in \{-1, +1\}$ and $G(x_i) \in \{-1, +1\}$, the sum splits into
$$e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)}$$
Adding and subtracting the same term gives
$$e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)} - e^{-\beta} \sum_{y_i \ne G(x_i)} w_i^{(m)}$$
Combining terms, we get
$$(e^{\beta} - e^{-\beta}) \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)) + e^{-\beta} \sum_{i=1}^N w_i^{(m)} \tag{1}$$
For the $m$-th iteration, since $e^{\beta} - e^{-\beta} > 0$ for any $\beta > 0$, the $G$ that minimizes (1) is simply the one that minimizes the weighted misclassification sum. Therefore
$$G_m = \arg\min_G \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))$$
As for $\beta_m$: substitute $G_m$ into (1) and take the partial derivative with respect to $\beta$, which gives
$$\frac{\partial L}{\partial \beta} = e^{\beta} \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)) + e^{-\beta} \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)) - e^{-\beta} \sum_{i=1}^N w_i^{(m)}$$
Setting $\frac{\partial L}{\partial \beta} = 0$, we get
$$e^{\beta} \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)) = \left[\sum_{i=1}^N w_i^{(m)} - \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))\right] e^{-\beta}$$
Taking $\log$ of both sides gives
$$\log \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i)) + \log e^{\beta} = \log\left[\sum_{i=1}^N w_i^{(m)} - \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))\right] + \log e^{-\beta}$$
Since $\log e^{-\beta} = -\log e^{\beta}$, we have
$$\log e^{\beta} = \frac{1}{2} \log \frac{\sum_{i=1}^N w_i^{(m)} - \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))}{\sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))}$$
Solving, we obtain
$$\beta_m = \frac{1}{2} \log \frac{\sum_{i=1}^N w_i^{(m)} - \sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))}{\sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))}$$
Defining the weighted error rate
$$err_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \ne G(x_i))}{\sum_{i=1}^N w_i^{(m)}}$$
$\beta_m$ can be written as
$$\beta_m = \frac{1}{2} \log \frac{1 - err_m}{err_m}$$
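As a quick numeric sanity check, the two expressions for $\beta_m$ — the one written with raw weight sums and the $err_m$ form — agree; the weights and indicator values below are made up for illustration.

```python
# Check that beta_m written with raw weight sums equals
# beta_m = (1/2) * log((1 - err_m) / err_m).
# Weights and indicators are arbitrary made-up values.
import math

w = [0.2, 0.4, 0.6, 0.8]      # hypothetical weights w_i^{(m)} (unnormalized)
miss = [1, 0, 0, 0]           # I(y_i != G(x_i)) per sample

wrong = sum(wi for wi, mi in zip(w, miss) if mi)
total = sum(w)
err_m = wrong / total

beta_from_sums = 0.5 * math.log((total - wrong) / wrong)
beta_from_err = 0.5 * math.log((1 - err_m) / err_m)
```

The two forms coincide because dividing numerator and denominator of the weight-sum ratio by $\sum_i w_i^{(m)}$ turns it into exactly $(1 - err_m)/err_m$.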
Having obtained $G_m(x)$ and $\beta_m$, we get the update formula for $f_m(x)$:
$$f_m(x) = f_{m-1}(x) + \beta_m G_m(x)$$
From the definition $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$, the weights update as
$$w_i^{(m+1)} = \exp(-y_i f_m(x_i)) = \exp(-y_i(f_{m-1}(x_i) + \beta_m G_m(x_i))) = w_i^{(m)} \exp(-\beta_m y_i G_m(x_i))$$
Since $-y_i G_m(x_i) = 2I(y_i \ne G_m(x_i)) - 1$, substituting into the formula above gives
$$w_i^{(m+1)} = \exp(-y_i f_m(x_i)) = w_i^{(m)} \cdot e^{2\beta_m I(y_i \ne G_m(x_i))} \cdot e^{-\beta_m}$$
Now let $\alpha_m = 2\beta_m$. The factor $e^{-\beta_m}$ is the same for every sample, so it does not change the relative weights and can be dropped. This yields
$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{\alpha_m I(y_i \ne G_m(x_i))}$$
This is exactly AdaBoost's sample weight update formula.
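A small numeric check (with made-up weights and $\beta_m$) that dropping the common factor $e^{-\beta_m}$ is harmless: after normalization, the full update and the reduced update give identical weights.

```python
# The full update w * exp(2*beta*I) * exp(-beta) and the reduced update
# w * exp(alpha*I) with alpha = 2*beta differ per sample only by the
# constant factor e^{-beta}, so normalizing makes them identical.
import math

w = [0.25, 0.25, 0.25, 0.25]   # made-up current weights w_i^{(m)}
miss = [1, 0, 0, 1]            # made-up I(y_i != G_m(x_i)) per sample
beta = 0.3                     # made-up beta_m
alpha = 2 * beta

full = [wi * math.exp(2 * beta * mi) * math.exp(-beta)
        for wi, mi in zip(w, miss)]
reduced = [wi * math.exp(alpha * mi) for wi, mi in zip(w, miss)]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

full_n = normalize(full)
reduced_n = normalize(reduced)
```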
And $\alpha_m = 2\beta_m = \log \frac{1 - err_m}{err_m}$ matches AdaBoost's weak classifier coefficient as well.
This completes the derivation: when the loss function of the forward stagewise algorithm is chosen to be the exponential loss, the forward stagewise algorithm is exactly AdaBoost.
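Putting the pieces together, here is a self-contained sketch of AdaBoost exactly as derived: each round picks the $G_m$ minimizing the weighted error, sets $\alpha_m = \log\frac{1-err_m}{err_m}$, and reweights with $w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m I(y_i \ne G_m(x_i))}$. The decision stumps on 1-D data are an illustrative choice of base classifier, and the toy dataset is the classic ten-point example.

```python
# AdaBoost as derived above: pick the weighted-error-minimizing base
# classifier G_m, set alpha_m = log((1 - err_m) / err_m), and reweight
# w_i <- w_i * exp(alpha_m * I(y_i != G_m(x_i))), then normalize.
import math

def stump_predict(x, s, sign):
    """G(x) in {-1, +1}: predict `sign` for x <= s, else -sign."""
    return sign if x <= s else -sign

def best_stump(xs, ys, w):
    """Return (err_m, s, sign) minimizing the weighted error rate."""
    best = None
    total = sum(w)
    for s in xs:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(xi, s, sign) != yi) / total
            if best is None or err < best[0]:
                best = (err, s, sign)
    return best

def adaboost(xs, ys, M=3):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []                                  # (alpha_m, s, sign)
    for _ in range(M):
        err, s, sign = best_stump(xs, ys, w)
        alpha = math.log((1 - err) / err)          # alpha_m = 2 * beta_m
        ensemble.append((alpha, s, sign))
        w = [wi * math.exp(alpha * (stump_predict(xi, s, sign) != yi))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]               # normalize
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, s, sign) for a, s, sign in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ens = adaboost(xs, ys, M=3)
train_err = sum(predict(ens, x) != y for x, y in zip(xs, ys)) / len(xs)
```

Note that $\alpha_m > 0$ requires $err_m < 0.5$: each base classifier must beat random guessing for its vote to carry positive weight.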