Pattern Recognition | PRML Chapter 7 Sparse Kernel Machines

PRML Chapter 7 Sparse Kernel Machines

7.1 Maximum Margin Classifiers

We begin by considering the two-class classification problem, using linear models of the form:

$$y(x) = w^{T}\phi(x) + b$$

The maximum margin solution is found by solving:

$$\arg\max_{w, b} \left\{ \frac{1}{\|w\|}\min_{n}\left[ t_{n}(w^{T}\phi(x_{n}) + b) \right] \right\}$$

where, assuming the training data are linearly separable, every point satisfies $t_{n}y(x_{n}) > 0$.

The direct solution of this optimization problem would be very complex, so we convert it into the equivalent problem of minimizing:

$$\arg\min_{w, b} \frac{1}{2}\|w\|^{2}$$

subject to the constraints:

$$t_{n}(w^{T}\phi(x_{n}) + b) \geq 1, \quad n = 1, \dots, N$$

To solve this problem, we introduce Lagrange multipliers $a_{n} \geq 0$, giving the Lagrangian function:

$$L(w, b, a) = \frac{1}{2}\|w\|^{2} - \sum_{n=1}^{N}a_{n}\left\{ t_{n}(w^{T}\phi(x_{n}) + b) - 1 \right\}$$

Setting the derivatives of $L(w, b, a)$ with respect to $w$ and $b$ equal to zero, we obtain:

$$w = \sum_{n=1}^{N}a_{n}t_{n}\phi(x_{n})$$

$$0 = \sum_{n=1}^{N}a_{n}t_{n}$$

Eliminating $w$ and $b$ from $L(w, b, a)$ using these conditions gives the dual representation of the maximum margin problem, in which we maximize:

$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$

with respect to $a$, subject to the constraints:

$$a_{n} \geq 0, \quad n = 1, \dots, N$$

$$\sum_{n=1}^{N}a_{n}t_{n} = 0$$

Substituting for $w$, the model can be expressed in terms of the parameters $\{a_{n}\}$ and the kernel function:

$$y(x) = \sum_{n=1}^{N}a_{n}t_{n}k(x, x_{n}) + b$$

A constrained optimization of this form satisfies the KKT conditions:

$$a_{n} \geq 0$$

$$t_{n}y(x_{n}) - 1 \geq 0$$

$$a_{n}\{ t_{n}y(x_{n}) - 1 \} = 0$$

Thus for every data point, either $a_{n} = 0$ or $t_{n}y(x_{n}) = 1$; the points with $a_{n} > 0$ are the support vectors, forming the set $S$. Averaging over the support vectors and solving for $b$ gives:

$$b = \frac{1}{N_{S}}\sum_{n\in S}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
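To make the dual problem concrete, here is a minimal sketch (not code from PRML) that hands the dual $\tilde{L}(a)$ to a generic constrained optimizer and then recovers $b$ and the decision function from the support vectors. It assumes linearly separable data with labels $t_n \in \{-1, +1\}$; the RBF kernel and all function names are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_hard_margin_svm(X, t, gamma=1.0):
    """Maximize the dual L~(a) subject to a_n >= 0 and sum_n a_n t_n = 0."""
    N = len(t)
    K = rbf_kernel(X, X, gamma)
    Q = (t[:, None] * t[None, :]) * K              # Q_nm = t_n t_m k(x_n, x_m)

    def neg_dual(a):                               # minimize the negative dual
        return 0.5 * a @ Q @ a - a.sum()

    res = minimize(neg_dual, np.zeros(N), method="SLSQP",
                   bounds=[(0.0, None)] * N,
                   constraints={"type": "eq", "fun": lambda a: a @ t})
    a = res.x
    S = a > 1e-6                                   # support vectors have a_n > 0
    # b = (1/N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    b = np.mean(t[S] - K[np.ix_(S, S)] @ (a[S] * t[S]))
    return a, b, S

def decision_function(X_new, X, t, a, b, gamma=1.0):
    # y(x) = sum_n a_n t_n k(x, x_n) + b
    return rbf_kernel(X_new, X, gamma) @ (a * t) + b
```

In practice the dual is solved with specialized quadratic programming or SMO routines rather than a general-purpose optimizer; the sketch above only illustrates the structure of the problem.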

7.1.1 Overlapping class distributions

To maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary, we minimize:

$$C\sum_{n=1}^{N}\xi_{n} + \frac{1}{2}\|w\|^{2}$$

The constraints will be replaced with:

$$t_{n}y(x_{n}) \geq 1 - \xi_{n}, \quad n = 1, \dots, N$$

$$\xi_{n} \geq 0$$

The corresponding Lagrangian is given by:

$$L(w, b, \xi, a, \mu) = \frac{1}{2}\|w\|^{2} + C\sum_{n=1}^{N}\xi_{n} - \sum_{n=1}^{N}a_{n}\{t_{n}y(x_{n}) - 1 + \xi_{n}\} - \sum_{n=1}^{N}\mu_{n}\xi_{n}$$

The corresponding set of KKT conditions is given by:

$$a_{n} \geq 0$$

$$t_{n}y(x_{n}) - 1 + \xi_{n} \geq 0$$

$$a_{n}(t_{n}y(x_{n}) - 1 + \xi_{n}) = 0$$

$$\mu_{n} \geq 0$$

$$\xi_{n} \geq 0$$

$$\mu_{n}\xi_{n} = 0$$

After eliminating $w$, $b$ and $\xi_{n}$ from the Lagrangian, we obtain the dual Lagrangian in the form:

$$\tilde{L}(a) = \sum_{n=1}^{N}a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_{n}a_{m}t_{n}t_{m}k(x_{n}, x_{m})$$

subject to the constraints:

$$0 \leq a_{n} \leq C$$

$$\sum_{n=1}^{N} a_{n}t_{n} = 0$$

Similarly, a numerically stable solution for $b$ is obtained by averaging over the set $M$ of points having $0 < a_{n} < C$:

$$b = \frac{1}{N_{M}}\sum_{n\in M}\left( t_{n} - \sum_{m\in S}a_{m}t_{m}k(x_{n}, x_{m}) \right)$$
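For the soft-margin case, the box-constrained dual is usually handled by a dedicated solver. The sketch below uses scikit-learn's SVC as one assumed tool (not part of PRML) to show where the quantities above appear; the toy data and parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with overlapping classes; labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])

# C controls the trade-off between the slack penalty and the margin term.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, t)

# dual_coef_ holds a_n * t_n for the support vectors (0 < a_n <= C),
# support_ gives their indices, and intercept_ corresponds to b.
print("number of support vectors:", len(clf.support_))
print("b =", clf.intercept_[0])
print("predictions:", clf.predict(X[:5]))
```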

7.1.2 Relation to logistic regression

For the support vector machine, the objective function can be written in terms of the hinge error function $E_{SV}(yt) = [1 - yt]_{+}$ in the form:

$$\sum_{n=1}^{N}E_{SV}(y_{n}t_{n}) + \lambda\|w\|^{2}$$

For logistic regression, we can construct an error function $E_{LR}(yt) = \ln(1 + \exp(-yt))$ by taking the negative logarithm of the likelihood function, which, with a quadratic regularizer, takes the form:

$$\sum_{n=1}^{N}E_{LR}(y_{n}t_{n}) + \lambda\|w\|^{2}$$
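As a quick numerical comparison of the two error functions, the following sketch evaluates the hinge error $E_{SV}(yt) = [1 - yt]_{+}$ and the logistic error $E_{LR}(yt) = \ln(1 + e^{-yt})$ on a grid of margin values; dividing $E_{LR}$ by $\ln 2$ so that it passes through $(0, 1)$ is only a presentational choice.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)                 # margin values z = y_n * t_n

E_SV = np.maximum(0.0, 1.0 - z)               # hinge error [1 - z]_+
E_LR = np.log1p(np.exp(-z))                   # logistic error ln(1 + e^{-z})

# Rescaling E_LR by 1/ln(2) makes it pass through (0, 1), which eases a
# visual comparison with the hinge error.
for zi, es, el in zip(z, E_SV, E_LR / np.log(2.0)):
    print(f"z = {zi:+.1f}   E_SV = {es:.3f}   E_LR/ln2 = {el:.3f}")
```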

7.1.3 Multiclass SVMs

7.1.4 SVMs for regression

To obtain sparse solutions, we therefore minimize a regularized error function based on the $\epsilon$-insensitive error function $E_{\epsilon}$:

$$C\sum_{n=1}^{N}E_{\epsilon}(y(x_{n}) - t_{n}) + \frac{1}{2}\|w\|^{2}$$

Introducing slack variables, the error function for support vector regression can then be written as:

$$C\sum_{n=1}^{N}(\xi_{n} + \hat{\xi}_{n}) + \frac{1}{2}\|w\|^{2}$$

subject to the conditions:

$$y_{n} - \epsilon - \hat{\xi}_{n} \leq t_{n} \leq y_{n} + \epsilon + \xi_{n}$$

$$\xi_{n} \geq 0, \quad \hat{\xi}_{n} \geq 0$$

The dual problem involves maximizing:

$$\tilde{L}(a, \hat{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_{n} - \hat{a}_{n})(a_{m} - \hat{a}_{m})k(x_{n}, x_{m}) - \epsilon\sum_{n=1}^{N}(a_{n} + \hat{a}_{n}) + \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})t_{n}$$

subject to the box constraints:

$$0 \leq a_{n} \leq C$$

$$0 \leq \hat{a}_{n} \leq C$$

Solving the dual problem, predictions for new inputs can be made using:

$$y(x) = \sum_{n=1}^{N}(a_{n} - \hat{a}_{n})k(x, x_{n}) + b$$

which is again expressed in terms of the kernel function. The corresponding KKT conditions are:

$$a_{n}(\epsilon + \xi_{n} + y_{n} - t_{n}) = 0$$

$$\hat{a}_{n}(\epsilon + \hat{\xi}_{n} - y_{n} + t_{n}) = 0$$

$$(C - a_{n})\xi_{n} = 0$$

$$(C - \hat{a}_{n})\hat{\xi}_{n} = 0$$

Considering a data point for which $0 < a_{n} < C$, so that $\xi_{n} = 0$ and $\epsilon + y_{n} - t_{n} = 0$, and solving for $b$, we obtain:

$$b = t_{n} - \epsilon - \sum_{m=1}^{N}(a_{m} - \hat{a}_{m})k(x_{n}, x_{m})$$
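As with classification, a dedicated solver is used in practice. Below is a minimal sketch of $\epsilon$-insensitive regression using scikit-learn's SVR (an assumed tool, with arbitrary toy data and parameter values):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sinusoid as toy regression data.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(60, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

# epsilon sets the width of the insensitive tube, C the penalty on the slacks.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, t)

# dual_coef_ stores (a_n - a^_n) for the support vectors, i.e. the
# coefficients in y(x) = sum_n (a_n - a^_n) k(x, x_n) + b.
print("support vectors:", len(reg.support_))
print("b =", reg.intercept_[0])
print("prediction at x = pi:", reg.predict([[np.pi]])[0])
```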

7.1.5 Computational learning theory

7.2 Relevance Vector Machines

7.2.1 RVM for regression

The relevance vector machine for regression is a linear model of the form discussed in Chapter 3, but with a modified prior that results in sparse solutions. The conditional distribution is:

$$p(t | \mathbf{x}, \mathbf{w}, \beta) = N(t | y(x), \beta^{-1})$$

where $\beta = \sigma^{-2}$ is the noise precision and the mean is given by a linear model of the form:

$$y(x) = \sum_{i=1}^{M}w_{i}\phi_{i}(x) = w^{T}\phi(x)$$

When the basis functions are given by kernels, with one kernel centred on each training data point, this general expression takes the SVM-like form:

$$y(x) = \sum_{n=1}^{N}w_{n}k(x, x_{n}) + b$$

Given $N$ observations, the likelihood function is:

$$p(\mathbf{t} | \mathbf{X}, w, \beta) = \prod_{n=1}^{N}p(t_{n} | x_{n}, w, \beta)$$

The prior distribution over the parameter vector $w$ introduces a separate precision hyperparameter $\alpha_{i}$ for each weight:

$$p(w | \alpha) = \prod_{i=1}^{M}N(w_{i} | 0, \alpha_{i}^{-1})$$

Since both the likelihood and the prior are Gaussian, the posterior distribution for $w$ takes the form:

$$p(w | \mathbf{t}, \mathbf{X}, \alpha, \beta) = N(w | m, \Sigma)$$

where $m = \beta\Sigma\Phi^{T}\mathbf{t}$ and $\Sigma = (\mathbf{A} + \beta\Phi^{T}\Phi)^{-1}$, with $\mathbf{A} = \mathrm{diag}(\alpha_{i})$.
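In code, the posterior statistics amount to a few lines of linear algebra. A minimal sketch, assuming a design matrix `Phi` (e.g. kernel columns plus a bias) and current values of `alpha` and `beta`; the function name is illustrative:

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior over w: Sigma = (A + beta Phi^T Phi)^{-1}, m = beta Sigma Phi^T t."""
    A = np.diag(alpha)                          # A = diag(alpha_i)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma
```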

The values of $\alpha$ and $\beta$ are determined by maximizing the marginal likelihood function:

$$p(\mathbf{t} | \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} | \mathbf{X}, w, \beta)\,p(w | \alpha)\,dw$$

The log marginal likelihood takes the form:

$$\ln p(\mathbf{t} | \mathbf{X}, \alpha, \beta) = \ln N(\mathbf{t} | \mathbf{0}, \mathbf{C})$$

where

$$\mathbf{C} = \beta^{-1}\mathbf{I} + \Phi\mathbf{A}^{-1}\Phi^{T}$$

Having found values $\alpha^{*}$ and $\beta^{*}$ that maximize the marginal likelihood, the predictive distribution for a new input $x$ is:

$$p(t | x, \mathbf{X}, \mathbf{t}, \alpha^{*}, \beta^{*}) = N(t | m^{T}\phi(x), \sigma^{2}(x))$$

where $\sigma^{2}(x) = (\beta^{*})^{-1} + \phi(x)^{T}\Sigma\phi(x)$.
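A minimal sketch of the full evidence-maximization loop, using the standard re-estimation equations $\gamma_{i} = 1 - \alpha_{i}\Sigma_{ii}$, $\alpha_{i}^{\text{new}} = \gamma_{i}/m_{i}^{2}$ and $(\beta^{\text{new}})^{-1} = \|\mathbf{t} - \Phi m\|^{2}/(N - \sum_{i}\gamma_{i})$; the iteration count, jitter and pruning threshold are arbitrary choices:

```python
import numpy as np

def fit_rvm_regression(Phi, t, n_iter=200):
    """Iterate the RVM re-estimation equations for alpha and beta."""
    N, M = Phi.shape
    alpha = np.ones(M)          # one precision hyperparameter per weight
    beta = 1.0                  # noise precision
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        gamma = 1.0 - alpha * np.diag(Sigma)      # well-determined parameters
        alpha = gamma / (m ** 2 + 1e-12)          # re-estimate each alpha_i
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    relevance = alpha < 1e8     # basis functions whose weights are not pruned away
    return m, Sigma, alpha, beta, relevance

def rvm_predict(Phi_new, m, Sigma, beta):
    """Predictive mean m^T phi(x) and variance 1/beta + phi(x)^T Sigma phi(x)."""
    mean = Phi_new @ m
    var = 1.0 / beta + np.sum((Phi_new @ Sigma) * Phi_new, axis=1)
    return mean, var
```

During the iterations many of the $\alpha_{i}$ grow very large, so the corresponding weights are effectively pruned; the surviving basis functions are the relevance vectors.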

7.2.2 Analysis of sparsity

7.2.3 RVM for classification

We can extend the relevance vector machine framework to classification problems by applying the ARD prior over the weights to a probabilistic linear classification model of the kind studied in Chapter 4.
