The Nature of Statistical Learning Theory, Notes 5: Methods of Pattern Recognition, Part 2 (5.6-5.10)

5 Methods of Pattern Recognition

5.6 Support Vector Machines

The support vector machine (SVM, also called the SV machine) implements the following scheme: it maps the input vector $x$ into a vector $z$ in a high-dimensional space $Z$ through some mapping chosen a priori, and constructs the optimal separating hyperplane in $Z$.

5.6.1 Generalization in High-Dimensional Space

Theorem 5.2
Suppose a training set of $l$ samples is separated without error by the maximal-margin hyperplane. Let $P_{error}$ be the probability of test error, $m$ the number of support vectors, $R$ the radius of the smallest sphere containing all training vectors, $\Delta$ the margin, and $n$ the dimension of the input space. Then

$$E[P_{error}] \le E\left[\min\left\{\frac{m}{l},\ \frac{[R^2\Delta^{-2}]}{l},\ \frac{n}{l}\right\}\right]$$

# In general $R(\alpha) = f(R_{emp}(\alpha), \Phi(\zeta))$, and with perfect separation $R_{emp}(\alpha) = 0$. In Case 3 of Section 3.4, $R(\alpha) = \dfrac{R_{emp}(\alpha)}{(1-a(p)\tau\sqrt{\zeta})_+}$; does the confidence interval still play any role then? If it is Case 2, $R(\alpha) = R_{emp}(\alpha) + \Phi(\zeta)$, and then the theorem above seems to say directly that $R(\alpha) \lesssim h/l$. Isn't that result a bit too good?

5.6.2 Convolution of the Inner Product

Consider a general expression for the inner product in a Hilbert space (Courant and Hilbert, 1953):

$$z_i \cdot z_j = K(x_i, x_j), \qquad K(x_i, x_j) = \sum_{k=1}^\infty a_k \psi_k(x_i)\psi_k(x_j), \qquad a_k > 0$$
Theorem 5.3
(Mercer) For a symmetric function $K(u,v)$ in $L_2$ to admit an expansion $K(u, v) = \sum\limits_{k=1}^\infty a_k \psi_k(u)\psi_k(v)$ with positive coefficients $a_k > 0$, it is necessary and sufficient that

$$\iint K(u,v)\,g(u)\,g(v)\,du\,dv > 0$$

for every $g \ne 0$ with $\int g^2(u)\,du < \infty$.
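Mercer's condition is hard to check analytically for an arbitrary kernel, but on any finite sample it implies that the Gram matrix must be positive semidefinite, which is easy to test numerically. A minimal sketch in Python (the kernels, data, and tolerance are illustrative choices):

```python
import numpy as np

def gram_is_psd(kernel, X, tol=1e-10):
    # Necessary finite-sample consequence of Mercer's condition:
    # the Gram matrix K_ij = K(x_i, x_j) must be positive semidefinite.
    K = np.array([[kernel(u, v) for v in X] for u in X])
    return np.linalg.eigvalsh(K).min() >= -tol

X = np.random.randn(50, 3)
print(gram_is_psd(lambda u, v: (u @ v + 1.0) ** 3, X))    # polynomial: True
print(gram_is_psd(lambda u, v: np.tanh(u @ v + 1.0), X))  # sigmoid: may fail
```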

5.6.3 Constructing SV Machines

$$f(x) = \operatorname{sgn}\Big\{ \sum_{\alpha_i \ne 0} y_i \alpha_i K(x_i, x) - b \Big\}$$

The solution proceeds exactly as in the preceding steps. The dual problem is to maximize the functional

$$W(\alpha) = \sum_{i=1}^l \alpha_i - \frac{1}{2} \sum_{i,j=1}^l \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to the constraints

$$\sum_{i=1}^l \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \qquad \alpha_i\Big(1 - y_i\Big(\sum_{j=1}^l y_j \alpha_j K(x_j, x_i) - b\Big)\Big) = 0$$
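For concreteness, here is a minimal sketch that solves this dual numerically with SciPy's SLSQP solver for the hard-margin case; the RBF kernel, its parameter, and the support-vector selection are illustrative choices, and production implementations use specialized QP solvers instead:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X1, X2, gamma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                 # K(u,v) = exp{-gamma ||u-v||^2}

def fit_hard_margin(X, y, gamma=0.5):
    l = len(y)
    K = rbf(X, X, gamma)
    Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K(x_i, x_j)
    # maximize W(a) = sum a_i - 0.5 a'Qa  <=>  minimize its negative
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   np.zeros(l),
                   jac=lambda a: Q @ a - np.ones(l),
                   bounds=[(0.0, None)] * l,
                   constraints={'type': 'eq', 'fun': lambda a: a @ y})
    a = res.x
    i = int(np.argmax(a))                      # a support vector (alpha_i > 0)
    b = (a * y) @ K[:, i] - y[i]               # from the KKT condition above
    return a, b

def decide(X, y, a, b, Xnew, gamma=0.5):
    # f(x) = sgn{ sum_i y_i alpha_i K(x_i, x) - b }
    return np.sign((a * y) @ rbf(X, Xnew, gamma) - b)
```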

5.6.4 Examples of SV Machines

To bound the test error of a particular SV machine we need to estimate the VC dimension $h \approx R^2\|w\|^2$, where

$$\|w\|^2 = \sum_{i,j=1}^l \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

Summing the Kuhn-Tucker conditions gives

$$0 = \sum_{i=1}^l \alpha_i\Big(1 - y_i\Big(\sum_{j=1}^l y_j \alpha_j K(x_j, x_i) - b\Big)\Big) = \sum_{i=1}^l \alpha_i - \sum_{i,j=1}^l \alpha_i \alpha_j y_i y_j K(x_i, x_j) + b\sum_{i=1}^l \alpha_i y_i$$

so that

$$\|w\|^2 = \sum_{i=1}^l \alpha_i$$

The radius $R$ is found from

$$R^2 = R^2(K) = \min_{a}\max_{x_i}\,\big[K(x_i, x_i) - 2K(x_i, a) + K(a, a)\big]$$

where $a$ is the center of the smallest enclosing sphere.
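A rough numerical sketch of these estimates, reusing the rbf and fit_hard_margin helpers from the sketch above; Nelder-Mead is a crude but simple choice for the inner minimax search:

```python
import numpy as np
from scipy.optimize import minimize

def radius2(X, gamma=0.5):
    # R^2(K) = min_a max_i [K(x_i,x_i) - 2 K(x_i,a) + K(a,a)]
    def worst(a):
        A = a[None, :]
        return (rbf(X, X, gamma).diagonal()
                - 2.0 * rbf(X, A, gamma).ravel()
                + rbf(A, A, gamma)[0, 0]).max()
    return minimize(worst, X.mean(axis=0), method='Nelder-Mead').fun

# alpha, b = fit_hard_margin(X, y)
# h_est = radius2(X) * alpha.sum()   # h ~ R^2 ||w||^2, with ||w||^2 = sum alpha_i
```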

  1. Polynomial learning machines:
    $$K(u,v) = [(u \cdot v) + 1]^d$$
  2. Radial basis function (RBF) machines:
    $$K(u,v) = K_\gamma(\|u - v\|)$$
    where, for fixed $\gamma$, $K_\gamma$ is a nonnegative monotonically decreasing function tending to 0, e.g.
    $$K(u,v) = \exp\{-\gamma\|u - v\|^2\}$$
  3. Two-layer neural networks:
    $$K(x, x_i) = S[v(x \cdot x_i) + c]$$
    where the parameters $v, c$ satisfy the Mercer condition of Theorem 5.3 only for certain values. (All three kernels are sketched in code after this list.)
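The three kernels referenced above might be written as follows (all parameter values are illustrative):

```python
import numpy as np

def k_poly(u, v, d=3):
    return (u @ v + 1.0) ** d                     # [(u.v) + 1]^d

def k_rbf(u, v, gamma=0.5):
    return np.exp(-gamma * ((u - v) ** 2).sum())  # exp{-gamma ||u-v||^2}

def k_sigmoid(u, v, v0=0.5, c=-1.0):
    # S[v0 (u.v) + c] with S = tanh; satisfies Mercer only for some (v0, c)
    return np.tanh(v0 * (u @ v) + c)
```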

5.7 Experiments with SV Machines

5.8 Remarks on SV Machines

5.9 SVMs and Logistic Regression

5.9.1 Logistic Regression

For the pattern recognition problem we can construct the Bayes (optimal) decision rule

$$r(x) = \operatorname{sgn}\Big\{ \ln \frac{P\{y=1|x\}}{1 - P\{y=1|x\}} \Big\} = \operatorname{sgn}\{ f(x, w_0)\}$$

from which we obtain the following model (called logistic regression):

$$P\{y=1|x\} = \frac{e^{f(x, w_0)}}{1+e^{f(x, w_0)}}$$

Our goal is to estimate the parameters of the logistic regression from a given sample set, using the risk functional

$$R_x(w) = E_{y|x}\ln\big(1+e^{-yf(x,w)}\big)$$

We need to confirm that this risk functional indeed attains its extremum at $w_0$:

$$\frac{\partial R_x(w)}{\partial w} = P\{y=1|x\}\frac{\partial \ln(1+e^{-f(x,w)})}{\partial w} + P\{y=-1|x\}\frac{\partial \ln(1+e^{f(x,w)})}{\partial w} = \frac{\big(e^{f(x,w)} - e^{f(x,w_0)}\big)f'(x,w)}{\big(1+e^{f(x,w)}\big)\big(1+e^{f(x,w_0)}\big)}$$

which vanishes at $w = w_0$.
Assume that the logistic regression we seek is a linear function, i.e.

$$f(x,w) = x \cdot w + b, \qquad R_x(w) = E_{y|x}\ln\big(1+e^{-y[x \cdot w + b]}\big)$$

Following the SRM approach, we define a structure by $w \cdot w \le r$ and, adopting the soft-margin idea, minimize the functional

$$\Phi(w,b) = C \sum_{i=1}^l \ln \big(1+e^{-y_i[x_i \cdot w + b]}\big) + \frac{1}{2} w \cdot w$$

Setting the gradient of $\Phi(w,b)$ to zero gives

$$\frac{\partial \Phi(w, b)}{\partial b} = 0 \;\to\; \sum_{i=1}^l y_i \frac{e^{-y_i[x_i \cdot w + b]}}{1+e^{-y_i[x_i \cdot w + b]}} = 0$$

$$\frac{\partial \Phi(w, b)}{\partial w} = 0 \;\to\; C\sum_{i=1}^l y_i x_i \frac{e^{-y_i[x_i \cdot w + b]}}{1+e^{-y_i[x_i \cdot w + b]}} = w$$

Introducing the variables

$$\alpha_i = \frac{e^{-y_i[x_i \cdot w + b]}}{1+e^{-y_i[x_i \cdot w + b]}}$$

we obtain

$$w = C\sum_{i=1}^l \alpha_i y_i x_i, \qquad \sum_{i=1}^l \alpha_i y_i = 0, \qquad 0 < \alpha_i < 1$$

$$P\{y=1|x\} = \frac{\exp\Big(C\sum\limits_{i=1}^l \alpha_i y_i (x_i \cdot x) + b\Big)}{1+\exp\Big(C\sum\limits_{i=1}^l \alpha_i y_i (x_i \cdot x) + b\Big)}$$

$$\Phi = \frac{C^2}{2}\sum_{i,j=1}^l \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + C \sum_{i=1}^l \ln \Big(1+\exp\Big(-y_i\Big[C\sum_{j=1}^l \alpha_j y_j (x_j \cdot x_i) + b\Big]\Big)\Big)$$
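A minimal sketch of this estimation by plain gradient descent on $\Phi(w,b)$, which also recovers the variables $\alpha_i$ introduced above (the learning rate and step count are arbitrary choices):

```python
import numpy as np

def fit_logistic(X, y, C=1.0, lr=0.01, steps=20000):
    # Gradient descent on Phi(w,b) = C sum ln(1+exp(-y_i(x_i.w+b))) + 0.5 w.w
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        alpha = 1.0 / (1.0 + np.exp(y * (X @ w + b)))  # the alpha_i above
        w -= lr * (w - C * (alpha * y) @ X)            # dPhi/dw
        b -= lr * (-C * (alpha * y).sum())             # dPhi/db
    return w, b

# At the optimum: w = C sum alpha_i y_i x_i, sum alpha_i y_i = 0, 0 < alpha_i < 1.
```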

# The original text says the minimum of the above $\Phi$ can be found by gradient descent, but how does one decide here whether, with respect to $\alpha$, a maximum or a minimum should be sought?

5.9.2 The Risk Function of the SVM

We approximate the loss function of logistic regression by the loss

$$Q(x, w) = c_1 \big(1- y(w \cdot x + b)\big)_+$$

and introduce slack variables (which yield the constraints)

$$\xi_i = \big(1- y_i(w \cdot x_i + b)\big)_+ \;\to\; y_i(w \cdot x_i + b)\ge 1- \xi_i, \qquad \xi_i \ge 0$$

Then the earlier objective functional $\Phi(w,b)$ becomes

$$\Phi(w,b) = C \sum_{i=1}^l \xi_i + \frac{1}{2} w \cdot w$$

a problem we have already discussed in Section 5.5.
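To see how a single hinge tracks the logistic loss, a quick numeric comparison, taking $c_1 = 1$ as an illustrative choice:

```python
import numpy as np

t = np.linspace(-3.0, 3.0, 7)        # t = y(w.x + b)
logistic = np.log1p(np.exp(-t))      # ln(1 + e^{-t})
hinge = np.maximum(0.0, 1.0 - t)     # (1 - t)_+ with c_1 = 1
print(np.column_stack([t, logistic, hinge]).round(3))
```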

5.9.3 The SVMn Approximation to Logistic Regression

We approximate the loss function of logistic regression by the loss (a spline approximation)

$$Q(x, w) = \sum_{k=1}^n c_k \big(d_k- y(w \cdot x + b)\big)_+$$

and introduce slack variables (which yield the constraints)

$$\xi_{k,i} = \big(d_k - y_i(w \cdot x_i + b)\big)_+ \;\to\; y_i(w \cdot x_i + b)\ge d_k- \xi_{k,i}, \qquad \xi_{k,i} \ge 0$$

Then the objective functional $\Phi(w,b)$ becomes

$$\Phi(w,b) = C \sum_{i=1}^l \sum_{k=1}^n c_k \xi_{k,i} + \frac{1}{2} w \cdot w$$

The Lagrangian is

$$L(w, b, \alpha, \beta) = \frac{1}{2} w \cdot w + C\sum_{i=1}^l\sum_{k=1}^n c_k \xi_{k,i} + \sum_{i=1}^l\sum_{k=1}^n \alpha_{k,i} \big(d_k- \xi_{k,i} - y_i(w \cdot x_i +b)\big) - \sum_{i=1}^l\sum_{k=1}^n \beta_{k,i} \xi_{k,i}, \qquad \alpha_{k,i} \ge 0, \quad \beta_{k,i} \ge 0$$

The goal is $\max\limits_{\alpha, \beta}\min\limits_{w, b, \xi}L$. Setting the gradient of the Lagrangian to zero gives

$$\frac{\partial L}{\partial w} = 0 \;\to\; w = \sum_{i=1}^l \Big(\sum_{k=1}^n \alpha_{k,i}\Big) y_i x_i$$

$$\frac{\partial L}{\partial b} = 0 \;\to\; \sum_{i=1}^l \Big(\sum_{k=1}^n \alpha_{k,i}\Big) y_i = 0$$

$$\frac{\partial L}{\partial \xi_{k,i}} = 0 \;\to\; \alpha_{k,i} + \beta_{k,i} = Cc_k$$

Substituting back into the Lagrangian yields the dual problem: maximize the functional

$$W(\alpha) = \sum_{i=1}^l\sum_{k=1}^n \alpha_{k,i}d_k - \frac{1}{2}\sum_{i,j=1}^l \Big(\sum_{k=1}^n\alpha_{k,i}\Big) \Big(\sum_{k=1}^n\alpha_{k,j}\Big) y_i y_j (x_i \cdot x_j)$$

subject to the constraints

$$\sum_{i=1}^l \Big(\sum_{k=1}^n\alpha_{k,i}\Big) y_i = 0, \qquad 0 \le \alpha_{k,i} \le Cc_k, \qquad \beta_{k,i} = Cc_k- \alpha_{k,i}$$

together with the Kuhn-Tucker conditions

$$\alpha_{k,i} \big(d_k- \xi_{k,i} - y_i(w \cdot x_i +b)\big) = 0, \qquad \beta_{k,i} \xi_{k,i} = 0$$

The logistic regression approximation is then

$$f(x,w) = \sum_{i=1}^l \Big(\sum_{k=1}^n\alpha_{k,i}\Big) y_i (x_i \cdot x) + b, \qquad P\{y=1|x\} = \frac{e^{f(x, w)}}{1+e^{f(x, w)}}$$

or, more generally, with a kernel satisfying the Mercer condition,

$$f(x,w) = \sum_{i=1}^l \Big(\sum_{k=1}^n\alpha_{k,i}\Big) y_i K(x_i, x) + b$$

$$W(\alpha) = \sum_{i=1}^l\sum_{k=1}^n \alpha_{k,i}d_k - \frac{1}{2}\sum_{i,j=1}^l \Big(\sum_{k=1}^n\alpha_{k,i}\Big) \Big(\sum_{k=1}^n\alpha_{k,j}\Big) y_i y_j K(x_i, x_j)$$

$$\sum_{i=1}^l \Big(\sum_{k=1}^n\alpha_{k,i}\Big) y_i = 0, \qquad 0 \le \alpha_{k,i} \le Cc_k, \qquad \beta_{k,i} = Cc_k- \alpha_{k,i}$$

$$\alpha_{k,i} \Big(d_k- \xi_{k,i} - y_i\Big(\sum_{j=1}^l \Big(\sum_{k'=1}^n \alpha_{k',j}\Big) y_j K(x_j, x_i) +b\Big)\Big) = 0, \qquad \beta_{k,i} \xi_{k,i} = 0$$
Vapnik reports, however, that their experiments did not show SVMn to have any significant advantage over the plain SVM as an approximation to logistic regression.
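As a quick check of how well a sum of hinges can track the logistic loss, the sketch below places the knots $d_k$ by hand and fits the $c_k$ by unconstrained least squares; the construction in the text would additionally require $c_k > 0$, so treat this only as an optimistic estimate:

```python
import numpy as np

t = np.linspace(-5.0, 5.0, 401)                  # t = y(w.x + b)
target = np.log1p(np.exp(-t))                    # logistic loss
d = np.array([-4.0, -2.0, -1.0, 0.0, 1.0, 2.0])  # knots d_k (hand-picked)
H = np.maximum(0.0, d[None, :] - t[:, None])     # columns: (d_k - t)_+
c, *_ = np.linalg.lstsq(H, target, rcond=None)   # fit c_k
print('max |error|:', np.abs(H @ c - target).max())
```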

5.10 Combinations of SVMs

5.10.1 The AdaBoost Method

Consider again the risk functional (no longer using logistic regression)

$$R(\alpha) = E_{y|x}e^{-yf(x,\alpha)}$$

Suppose the function set contains the function

$$f(x, \alpha_0) = \frac{1}{2}\ln\frac{P(y=1|x)}{P(y=-1|x)}$$

that is,

$$P(y=1|x) = \frac{e^{f(x,\alpha_0)}}{e^{f(x,\alpha_0)} + e^{-f(x,\alpha_0)}}, \qquad P(y=-1|x) = \frac{e^{-f(x,\alpha_0)}}{e^{f(x,\alpha_0)} + e^{-f(x,\alpha_0)}}$$

It is easy to verify that this function minimizes $R(\alpha)$, and that $\operatorname{sgn}\{f(x, \alpha_0)\}$ classifies correctly. Replacing the risk by the empirical risk, we minimize

$$R_{emp}(\alpha) = \frac{1}{l}\sum_{i=1}^l e^{-y_i f(x_i, \alpha)}$$
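A quick numeric confirmation that $f(x,\alpha_0)$ minimizes the conditional risk $p\,e^{-f} + (1-p)\,e^{f}$, with $p = 0.8$ as an arbitrary test value:

```python
import numpy as np
from scipy.optimize import minimize_scalar

p = 0.8                                  # P(y = 1 | x)
res = minimize_scalar(lambda f: p * np.exp(-f) + (1 - p) * np.exp(f))
print(res.x, 0.5 * np.log(p / (1 - p)))  # both ~ 0.693
```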
The greedy optimization procedure
At the $k$-th iteration the method produces $f(x, \beta_k) = \sum\limits_{r=1}^k d_r \phi_r(x)$ with $d_1 = 1$, where the $\phi_r(x)$ belong to a given set of indicator functions (presumably $\phi(x) = \operatorname{sgn}\{K(x)\}$ here).

  1. At the first iteration we choose the function $\phi_1(x)$ that minimizes the empirical risk functional (dropping the constant factor $1/l$):
    $$R_{emp}(\beta_1) = \sum_{i=1}^l e^{-y_i \phi_1(x_i)}$$
    As the derivation below shows, this is simply
    $$R_{emp}(\beta_{k+1})\big|_{c^{k+1}_i = 1,\ d_{k+1} = 1}$$
    so some of the conclusions derived later also apply to the first iteration.
  2. At the $k$-th step we have the empirical risk value
    $$R_{emp}(\beta_k) = \sum_{i=1}^l e^{-y_i f(x_i, \beta_k)}$$
    and, with $f(x, \beta_k)$ already fixed, we wish to minimize
    $$R_{emp}(\beta_{k+1}) = \sum_{i=1}^l e^{-y_i f(x_i, \beta_{k+1})} = \sum_{i=1}^l c^{k+1}_i e^{- y_i d_{k+1}\phi_{k+1}(x_i)}, \qquad c^{k+1}_i = e^{-y_i f(x_i, \beta_k)}$$
    Introducing the quantities
    $$c^{k+1}_+ = \sum_{\{i:\, y_i \phi_{k+1}(x_i) = 1\}} c^{k+1}_i, \qquad c^{k+1}_- = \sum_{\{i:\, y_i \phi_{k+1}(x_i) = -1\}} c^{k+1}_i$$
    we obtain
    $$\frac{\partial R_{emp}(\beta_{k+1})}{\partial d_{k+1}} = 0 \;\to\; \sum_{i=1}^l y_i \phi_{k+1}(x_i)\, c^{k+1}_i e^{- d_{k+1} y_i \phi_{k+1}(x_i)} = 0$$
    $$\to\; \sum_{\{i:\, y_i \phi_{k+1}(x_i) = 1\}} c^{k+1}_i e^{- d_{k+1}} = \sum_{\{i:\, y_i \phi_{k+1}(x_i) = -1\}} c^{k+1}_i e^{d_{k+1}}, \qquad \text{i.e.}\quad \sum_{i=1}^l c^{k+2}_i y_i \phi_{k+1}(x_i) = 0$$
    $$\to\; d_{k+1} = \frac{1}{2} \ln \frac{c^{k+1}_+}{c^{k+1}_-}, \qquad \text{and likewise}\quad \sum_{i=1}^l c^{k+1}_i y_i \phi_k(x_i) = 0$$

In fact, letting $z_i$ track the value of $e^{-d y_i \phi(x_i)}$,

$$\min_{\phi} \Big\{\sum_{i=1}^l c_i e^{- d y_i \phi(x_i)} \Big\} \Leftrightarrow \min_{z_i} \Big\{\sum_{i=1}^l c_i z_i : z_i \in \{e^{-d}, e^d\} \Big\} \Leftrightarrow \min_{z_i} \Big\{\sum_{i=1}^l c_i z_i : z_i \in \Big\{\frac{e^{-d} - e^d}{2}, \frac{ e^d - e^{-d}}{2}\Big\} \Big\} \Leftrightarrow \min_{z_i} \Big\{\sum_{i=1}^l -c_i z_i : z_i \in \{-1, 1\}\Big\}$$

(the second step subtracts the constant midpoint $(e^d + e^{-d})/2$; the third rescales by $2/(e^d - e^{-d}) > 0$ and switches to $z_i = y_i\phi(x_i)$, which flips the sign, since the smaller value $e^{-d}$ corresponds to correct classification $y_i\phi(x_i) = 1$). Hence the original text gives (without derivation) the rule for choosing $\phi_{k+1}(x)$: minimize the functional

$$R(\phi) = -\sum_{i=1}^l c^{k+1}_i y_i \phi_{k+1}(x_i)$$

The recursion above yields the decision rule (called the AdaBoost decision rule)

$$\operatorname{sgn}\{f(x, \alpha_N)\} = \operatorname{sgn}\Big\{\sum_{r=1}^N d_r \phi_r(x)\Big\}$$
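A minimal sketch of the whole recursion, using one-dimensional threshold stumps as the set of indicator functions; the stump family and the numeric guard are illustrative choices:

```python
import numpy as np

def adaboost(X, y, N=10):
    # Greedy minimization of R_emp = sum_i exp(-y_i sum_r d_r phi_r(x_i))
    # over threshold stumps phi(x) = s * sgn(x_j - t).
    c = np.ones(len(y))                            # c_i^1 = 1
    rules = []
    for _ in range(N):
        best = None
        for j in range(X.shape[1]):
            for t in X[:, j]:
                for s in (1.0, -1.0):
                    phi = s * np.where(X[:, j] > t, 1.0, -1.0)
                    r = -(c * y * phi).sum()       # R(phi) = -sum c_i y_i phi(x_i)
                    if best is None or r < best[0]:
                        best = (r, j, t, s)
        _, j, t, s = best
        phi = s * np.where(X[:, j] > t, 1.0, -1.0)
        cp = c[y * phi == 1.0].sum()               # c_+^{k+1}
        cm = c[y * phi == -1.0].sum()              # c_-^{k+1}
        d = 0.5 * np.log(cp / max(cm, 1e-12))      # d_{k+1} = (1/2) ln(c_+/c_-)
        rules.append((d, j, t, s))
        c *= np.exp(-y * d * phi)                  # update the weights c_i
    return rules

def adaboost_predict(rules, X):
    f = sum(d * s * np.where(X[:, j] > t, 1.0, -1.0) for d, j, t, s in rules)
    return np.sign(f)                              # sgn{ sum_r d_r phi_r(x) }
```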
5.10.2 The Combination of SVMs

# The interpretation below may not match the original text (which directly states the objective functional with $\xi$ and the constraints), and the conclusion derived here also differs slightly from the original; it is open to question.

Following the AdaBoost idea above, we construct a combination of SVMs: find $N$ soft-margin optimal hyperplanes that greedily minimize the functional

$$R(w, b) = \sum_{i=1}^l \exp\Big\{ -y_i\sum_{k=1}^N d_k \operatorname{sgn}\{w_k \cdot x_i + b_k\} \Big\}$$

By Section 5.10.1, the $k$-th iteration (including the first) is equivalent to minimizing the functional

$$R(w_k, b_k) = -\sum_{i=1}^l c^k_i y_i \operatorname{sgn}\{w_k \cdot x_i + b_k\}, \qquad c^1_i = 1$$

Noting that, for correct classification (which also minimizes the expression above),

$$y_i \operatorname{sgn}\{w_k \cdot x_i + b_k\} = 1 \;\Leftrightarrow\; \exists\, w_k, b_k:\ y_i [w_k \cdot x_i + b_k] \ge 1$$
we adopt the new (approximately equivalent) loss function

$$Q = c^k_i \big(1- y_i [w_k \cdot x_i + b_k]\big)_+ = c^k_i \xi_{k,i}$$

and minimize, under the constraint $w_k \cdot w_k \le \Delta^{-2}$, the functional

$$R(w_k, b_k) = \sum_{i=1}^l c^k_i \xi_{k,i}$$

Using the soft-margin optimal hyperplane instead, we minimize the functional

$$R(w_k, b_k) = \frac{1}{2} w_k \cdot w_k + C \sum_{i=1}^l c^k_i \xi_{k,i}$$

subject to the constraints

$$\xi_{k,i}\ge 0, \qquad 1- \xi_{k,i} - y_i [w_k \cdot x_i + b_k] \le 0$$
At the $k$-th iteration (the index $k$ is omitted below) the Lagrangian is

$$L(w, b, \alpha, \beta) = \frac{1}{2} w \cdot w + C\sum_{i=1}^l c_i \xi_i + \sum_{i=1}^l \alpha_i \big(1- \xi_i - y_i(w \cdot x_i +b)\big) - \sum_{i=1}^l \beta_i \xi_i, \qquad \alpha_i \ge 0, \quad \beta_i \ge 0$$

The goal is $\max\limits_{\alpha, \beta}\min\limits_{w, b, \xi}L$. Setting the gradient of the Lagrangian to zero gives

$$\frac{\partial L}{\partial w} = 0 \;\to\; w = \sum_{i=1}^l \alpha_i y_i x_i$$

$$\frac{\partial L}{\partial b} = 0 \;\to\; \sum_{i=1}^l \alpha_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = 0 \;\to\; \alpha_i + \beta_i = Cc_i$$

Substituting back into the Lagrangian yields the dual problem: maximize the functional

$$W(\alpha) = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i,j=1}^l \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

subject to the constraints

$$\sum_{i=1}^l \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le Cc_i, \qquad \beta_i = Cc_i- \alpha_i$$

together with the Kuhn-Tucker conditions

$$\alpha_i \big(1- \xi_i - y_i(w \cdot x_i +b)\big) = 0, \qquad \beta_i \xi_i = 0$$

Recalling Section 5.10.1 for the values of $c^k_i$ and $d_k$, we can now construct the soft-margin optimal separating hyperplanes and hence the decision rule of the SVM combination (with the inner $\operatorname{sgn}$ as in the functional above, and a consistent index $k$):

$$\Phi(x) = \operatorname{sgn}\Big\{\sum_{k=1}^N d_k \operatorname{sgn}\{w_k \cdot x + b_k\}\Big\} = \operatorname{sgn}\Big\{\sum_{k=1}^N d_k \operatorname{sgn}\Big\{\sum_{i=1}^l \alpha_{k,i} y_i (x_i \cdot x) + b_k\Big\}\Big\}$$

With a kernel satisfying the Mercer condition, one simply replaces $x_i \cdot x_j$ by $K(x_i, x_j)$.
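One way to realize this numerically is to let the weights $c^k_i$ enter each soft-margin problem as per-sample costs; scikit-learn's sample_weight scales each sample's $C$, which matches the box constraint $0 \le \alpha_i \le Cc_i$ above. A sketch, with the kernel and $C$ as illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

def boost_svms(X, y, N=5, C=1.0):
    # AdaBoost-style combination of soft-margin SVMs: each round solves the
    # weighted problem above, then updates the weights c_i as in Section 5.10.1.
    c = np.ones(len(y))                          # c_i^1 = 1
    ensemble = []
    for _ in range(N):
        clf = SVC(kernel='linear', C=C).fit(X, y, sample_weight=c)
        phi = np.sign(clf.decision_function(X))  # sgn{w_k.x_i + b_k}
        cp = c[y * phi == 1.0].sum()             # c_+
        cm = c[y * phi == -1.0].sum()            # c_-
        d = 0.5 * np.log(cp / max(cm, 1e-12))    # d_k = (1/2) ln(c_+/c_-)
        ensemble.append((d, clf))
        c = c * np.exp(-y * d * phi)             # weight update
    return ensemble

def ensemble_predict(ensemble, X):
    # Phi(x) = sgn{ sum_k d_k sgn{w_k.x + b_k} }
    F = sum(d * np.sign(clf.decision_function(X)) for d, clf in ensemble)
    return np.sign(F)
```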

5.11 The Art of Engineering Versus Formal Inference

5.12 Wisdom of Statistical Models

5.13 What Can One Learn from Digit Recognition Experiments?
