
PRML Chapter 6 Kernel Methods

For models which are based on a fixed nonlinear feature space mapping $\phi(x)$, the kernel function is given by the relation:

$$k(x, x') = \phi(x)^T\phi(x')$$

Forms of kernel functions:

  • linear kernels: $k(x, x') = x^T x'$
  • stationary kernels: $k(x, x') = k(x - x')$
  • homogeneous kernels (radial basis functions): $k(x, x') = k(\|x - x'\|)$

6.1 Dual Representations

Consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function given by:

$$J(w) = \frac{1}{2}\sum_{n=1}^{N}\{w^T\phi(x_n) - t_n\}^2 + \frac{\lambda}{2}w^T w$$

Setting the gradient of $J(w)$ with respect to $w$ to zero, we obtain the solution:

$$w = -\frac{1}{\lambda}\sum_{n=1}^{N}\{w^T\phi(x_n) - t_n\}\phi(x_n) = \sum_{n=1}^{N} a_n\phi(x_n) = \Phi^T a$$

where

$$a_n = -\frac{1}{\lambda}\{w^T\phi(x_n) - t_n\}$$

We now define the Gram matrix $K = \Phi\Phi^T$, which is an $N \times N$ symmetric matrix with elements:

$$K_{nm} = \phi(x_n)^T\phi(x_m) = k(x_n, x_m)$$

In terms of the Gram matrix, the sum-of-squares error function can be written as:

$$J(a) = \frac{1}{2}a^T K K a - a^T K\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}a^T K a$$

Setting the gradient of $J(a)$ with respect to $a$ to zero, we obtain the following solution:

$$a = (K + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$$

If we substitute this back into the linear regression model, we obtain the following prediction for a new input $x$:

$$y(x) = w^T\phi(x) = a^T\Phi\phi(x) = \mathbf{k}(x)^T(K + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$$

where we have defined the vector $\mathbf{k}(x)$ with elements $k_n(x) = k(x_n, x)$.
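As a minimal sketch of this dual formulation (plain NumPy, with a linear kernel and a toy data set chosen purely for illustration, not the book's example):

```python
import numpy as np

def linear_kernel(X1, X2):
    """k(x, x') = x^T x' evaluated for all pairs of rows of X1 and X2."""
    return X1 @ X2.T

def fit_dual(X, t, kernel, lam=1.0):
    """Solve a = (K + lambda I_N)^{-1} t for the dual variables."""
    K = kernel(X, X)                                    # N x N Gram matrix
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def predict_dual(X_train, a, x_new, kernel):
    """y(x) = k(x)^T a with k_n(x) = k(x_n, x)."""
    k_x = kernel(X_train, x_new.reshape(1, -1)).ravel()
    return k_x @ a

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
a = fit_dual(X, t, linear_kernel, lam=0.1)
print(predict_dual(X, a, X[0], linear_kernel))
```

Everything is expressed through the Gram matrix and the kernel, so swapping `linear_kernel` for any other valid kernel requires no change to the fitting or prediction code.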

6.2 Constructing Kernels

Here the kernel function is defined for a one-dimensional input space by

$$k(x, x') = \phi(x)^T\phi(x') = \sum_{i=1}^{M}\phi_i(x)\phi_i(x')$$

A necessary and sufficient condition for a function $k(x, x')$ to be a valid kernel is that the Gram matrix $K$ should be positive semidefinite for all possible choices of the set $\{x_n\}$.

Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following new kernels will also be valid:

$$
\begin{aligned}
k(x, x') &= ck_1(x, x') \\
k(x, x') &= f(x)\,k_1(x, x')\,f(x') \\
k(x, x') &= q(k_1(x, x')) \\
k(x, x') &= \exp(k_1(x, x')) \\
k(x, x') &= k_1(x, x') + k_2(x, x') \\
k(x, x') &= k_1(x, x')\,k_2(x, x') \\
k(x, x') &= k_3(\phi(x), \phi(x')) \\
k(x, x') &= x^T A x' \\
k(x, x') &= k_a(x_a, x_a') + k_b(x_b, x_b') \\
k(x, x') &= k_a(x_a, x_a')\,k_b(x_b, x_b')
\end{aligned}
$$

where $c > 0$ is a constant, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\phi(x)$ maps $x$ to $\mathbb{R}^M$ with $k_3$ a valid kernel on $\mathbb{R}^M$, $A$ is a symmetric positive semidefinite matrix, $x_a$ and $x_b$ are variables (not necessarily disjoint) with $x = (x_a, x_b)$, and $k_a$, $k_b$ are valid kernels over their respective spaces.

Another commonly used kernel takes the form:

$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

which is known as the Gaussian kernel.
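A small sketch of the Gaussian kernel as a Gram-matrix routine, with an empirical positive-semidefiniteness check on a random sample (the sample size and $\sigma$ are arbitrary illustrative choices):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dist = (np.sum(X1**2, axis=1)[:, None]
               + np.sum(X2**2, axis=1)[None, :]
               - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dist / (2.0 * sigma**2))

# Sanity check: for any finite sample {x_n} the Gram matrix of a valid kernel
# must be positive semidefinite (all eigenvalues >= 0 up to round-off).
X = np.random.default_rng(1).normal(size=(10, 2))
print(np.linalg.eigvalsh(gaussian_kernel(X, X)).min())
```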

One powerful approach to the construction of kernels starts from a probabilistic generative model, which allows us to apply generative models in a discriminative setting. Given a generative model $p(x)$ we can define a kernel by:

$$k(x, x') = p(x)\,p(x')$$

We can extend this class of kernels by considering sums over products of different probability distributions:

$$k(x, x') = \sum_i p(x \mid i)\,p(x' \mid i)\,p(i)$$

Taking the limit of an infinite sum, we can also consider kernels of the form:

$$k(x, x') = \int p(x \mid z)\,p(x' \mid z)\,p(z)\,dz$$

Also, by extending the mixture representation, we can define a kernel function measuring the similarity of two sequences $X$ and $X'$:

$$k(X, X') = \sum_{Z} p(X \mid Z)\,p(X' \mid Z)\,p(Z)$$

Another model of this kind is the Fisher kernel, defined by:

$$k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x')$$

where the Fisher score is $g(\theta, x) = \nabla_\theta \ln p(x \mid \theta)$ and $F$ is the Fisher information matrix, given by $F = \mathbb{E}_x\left[g(\theta, x)\,g(\theta, x)^T\right]$.
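For a concrete but deliberately simple example, consider a univariate Gaussian generative model with unknown mean, where the Fisher score and information are available in closed form. This is only an illustrative sketch, not the sequence models for which the Fisher kernel is usually employed:

```python
# Minimal sketch: Fisher kernel for a univariate Gaussian N(x | theta, sigma^2)
# with unknown mean theta and known variance sigma^2 (an illustrative model).
# Fisher score: g(theta, x) = (x - theta) / sigma^2
# Fisher information: F = E_x[g^2] = 1 / sigma^2

def fisher_kernel_gaussian_mean(x, x_prime, theta=0.0, sigma=1.0):
    g  = (x - theta) / sigma**2
    gp = (x_prime - theta) / sigma**2
    F  = 1.0 / sigma**2
    return g * (1.0 / F) * gp          # g(theta, x)^T F^{-1} g(theta, x')

print(fisher_kernel_gaussian_mean(1.2, -0.5))   # (1.2)(-0.5)/1 = -0.6
```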

6.3 Radial Basis Function Networks

Radial basis functions have the property that each basis function depends only on the radial distance (typically Euclidean) from a centre $\mu_j$, so that

$$\phi_j(x) = h(\|x - \mu_j\|)$$

6.3.1 Nadaraya-Watson model

We start from kernel density estimation. Suppose we have a training set $\{x_n, t_n\}$ and we use a Parzen density estimator to model the joint distribution $p(x, t)$, so that

$$p(x, t) = \frac{1}{N}\sum_{n=1}^{N} f(x - x_n,\, t - t_n)$$

We now find an expression for the regression function $y(x)$, corresponding to the conditional average of the target variable conditioned on the input variable, which is given by:

$$
\begin{aligned}
y(x) &= \mathbb{E}[t \mid x] = \int_{-\infty}^{\infty} t\,p(t \mid x)\,dt \\
&= \frac{\int t\,p(x, t)\,dt}{\int p(x, t)\,dt} \\
&= \frac{\sum_n \int t\,f(x - x_n,\, t - t_n)\,dt}{\sum_m \int f(x - x_m,\, t - t_m)\,dt}
\end{aligned}
$$

We now assume for simplicity that the component density functions have zero mean so that:

$$\int_{-\infty}^{\infty} f(x, t)\,t\,dt = 0$$

for all values of $x$. Using a simple change of variables, we then obtain:

$$y(x) = \sum_n k(x, x_n)\,t_n, \qquad \text{where} \quad k(x, x_n) = \frac{g(x - x_n)}{\sum_m g(x - x_m)}$$

and we have defined:

$$g(x) = \int_{-\infty}^{\infty} f(x, t)\,dt$$

The result is known as the Nadaraya-Watson model or kernel regression.
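A minimal sketch of kernel regression, assuming an isotropic Gaussian component density $f$ and a hand-picked bandwidth (both illustrative choices rather than anything prescribed by the chapter):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, t_train, h=0.3):
    """Kernel regression with a Gaussian component density; h is the bandwidth."""
    g = np.exp(-0.5 * ((x_query - x_train) / h) ** 2)   # g(x - x_n)
    k = g / g.sum()                                      # k(x, x_n), normalised weights
    return k @ t_train                                   # y(x) = sum_n k(x, x_n) t_n

# toy usage on a noisy sinusoid
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=30)
print(nadaraya_watson(0.25, x_train, t_train))
```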

6.4 Gaussian Processes

6.4.1 Linear regression revisited

Consider a model defined in terms of a linear combination of fixed basis functions:

$$y(x) = w^T\phi(x)$$

The prior distribution over $w$:

$$p(w) = \mathcal{N}(w \mid 0, \alpha^{-1}\mathbf{I})$$

Evaluating the function at the training points $x_1, \dots, x_N$ gives the vector $\mathbf{y}$ with elements $y_n = y(x_n)$, which can be written as:

$$\mathbf{y} = \Phi w$$

The mean and covariance of $\mathbf{y}$ are:

$$\mathbb{E}[\mathbf{y}] = \Phi\,\mathbb{E}[w] = 0$$

$$\mathrm{cov}[\mathbf{y}] = \mathbb{E}[\mathbf{y}\mathbf{y}^T] = \frac{1}{\alpha}\Phi\Phi^T = \mathbf{K}$$

where $\mathbf{K}$ is the Gram matrix with elements:

$$K_{nm} = k(x_n, x_m) = \frac{1}{\alpha}\phi(x_n)^T\phi(x_m)$$
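As a small numerical sketch (Gaussian basis functions, a one-dimensional grid, and $\alpha = 2$ are all illustrative choices), one can draw samples $\mathbf{y} = \Phi w$ and check that their empirical covariance agrees with this Gram matrix:

```python
import numpy as np

# Sample functions y = Phi w with w ~ N(0, alpha^{-1} I) and verify that the
# empirical covariance matches K = (1/alpha) Phi Phi^T.
alpha = 2.0
x = np.linspace(-1.0, 1.0, 50)
centres = np.linspace(-1.0, 1.0, 12)
Phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / 0.1 ** 2)

K = Phi @ Phi.T / alpha                                   # implied Gram matrix
w = np.random.default_rng(0).normal(scale=alpha ** -0.5,
                                    size=(centres.size, 5000))
Y = Phi @ w                                               # one sampled function per column
print(np.allclose(np.cov(Y), K, atol=0.1))                # empirical covariance ~ K
```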

6.4.2 Gaussian processes for regression

In order to apply Gaussian process models to the problem of regression, we need to take account of the noise on the observed target values:

$$t_n = y_n + \epsilon_n$$

Consider noise processes that have a Gaussian distribution, so that:

$$p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$$

The conditional distribution is given by an isotropic Gaussian of the form:

$$p(\mathbf{t} \mid \mathbf{y}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}, \beta^{-1}\mathbf{I}_N)$$

From the definition, the marginal distribution $p(\mathbf{y})$ is given by a Gaussian:

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid 0, \mathbf{K})$$

We see that the marginal distribution of $\mathbf{t}$ is given by:

$$p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\,p(\mathbf{y})\,d\mathbf{y} = \mathcal{N}(\mathbf{t} \mid 0, \mathbf{C})$$

where the covariance matrix $\mathbf{C}$ has elements:

$$C(x_n, x_m) = k(x_n, x_m) + \beta^{-1}\delta_{nm}$$

One widely used kernel function for Gaussian process regression is given by the exponential of a quadratic form, with the addition of constant and linear terms to give:

$$k(x_n, x_m) = \theta_0\exp\left\{-\frac{\theta_1}{2}\|x_n - x_m\|^2\right\} + \theta_2 + \theta_3\, x_n^T x_m$$
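A sketch of this kernel as a function over two sets of input points (the hyperparameter defaults are arbitrary illustrative values); it is reused in the prediction sketch further below:

```python
import numpy as np

def exp_quad_kernel(X1, X2, theta=(1.0, 4.0, 0.0, 0.0)):
    """theta_0 exp(-theta_1/2 ||x_n - x_m||^2) + theta_2 + theta_3 x_n^T x_m."""
    t0, t1, t2, t3 = theta
    sq_dist = (np.sum(X1**2, axis=1)[:, None]
               + np.sum(X2**2, axis=1)[None, :]
               - 2.0 * X1 @ X2.T)
    return t0 * np.exp(-0.5 * t1 * sq_dist) + t2 + t3 * X1 @ X2.T
```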

The joint distribution over $t_1, \dots, t_{N+1}$ will be given by:

$$p(\mathbf{t}_{N+1}) = \mathcal{N}(\mathbf{t}_{N+1} \mid 0, \mathbf{C}_{N+1})$$

The conditional distribution $p(t_{N+1} \mid \mathbf{t})$ is then a Gaussian with mean and variance given by:

$$m(x_{N+1}) = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}$$

$$\sigma^2(x_{N+1}) = c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}$$

where the vector $\mathbf{k}$ has elements $k(x_n, x_{N+1})$ for $n = 1, \dots, N$, and the scalar $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$.
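A minimal sketch of these predictive equations (toy one-dimensional data, an assumed noise precision $\beta$, and the exponential-quadratic kernel sketched above; all of these are illustrative choices):

```python
import numpy as np

def gp_predict(X, t, x_new, kernel, beta=25.0):
    """Predictive mean m = k^T C_N^{-1} t and variance sigma^2 = c - k^T C_N^{-1} k."""
    C_N = kernel(X, X) + np.eye(len(X)) / beta            # C_N = K + beta^{-1} I
    k = kernel(X, x_new.reshape(1, -1)).ravel()           # k_n = k(x_n, x_{N+1})
    c = kernel(x_new.reshape(1, -1), x_new.reshape(1, -1))[0, 0] + 1.0 / beta
    C_inv_t = np.linalg.solve(C_N, t)
    C_inv_k = np.linalg.solve(C_N, k)
    return k @ C_inv_t, c - k @ C_inv_k                    # mean, variance

# toy usage with the exponential-quadratic kernel sketched above
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(15, 1))
t = np.sin(3.0 * X).ravel() + 0.2 * rng.normal(size=15)
print(gp_predict(X, t, np.array([0.3]), exp_quad_kernel))
```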

6.4.3 Learning the hyperparameters

The hyperparameters $\theta$ of the kernel can be learned by evaluating the likelihood function $p(\mathbf{t} \mid \theta)$ and maximizing the log likelihood, typically using gradient-based optimization.

6.4.4 Automatic relevance determination

Consider, as an example, a Gaussian process with a two-dimensional input space $x = (x_1, x_2)$ and a kernel function of the form:

$$k(x, x') = \theta_0\exp\left\{-\frac{1}{2}\sum_{i=1}^{2}\eta_i(x_i - x_i')^2\right\}$$

As a particular $\eta_i$ becomes small, the function becomes relatively insensitive to the corresponding input $x_i$; adapting these parameters to the data by maximum likelihood therefore allows the relative importance of different inputs to be inferred. The ARD framework is easily incorporated into the exponential-quadratic kernel to give the following form of kernel function:

$$k(x_n, x_m) = \theta_0\exp\left\{-\frac{1}{2}\sum_{i=1}^{D}\eta_i(x_{ni} - x_{mi})^2\right\} + \theta_2 + \theta_3\sum_{i=1}^{D} x_{ni}x_{mi}$$
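A sketch of the ARD kernel, with a quick check (using illustrative values) that setting an $\eta_i$ to zero removes the influence of the corresponding input entirely:

```python
import numpy as np

def ard_kernel(X1, X2, theta0=1.0, eta=None, theta2=0.0, theta3=0.0):
    """Exponential-quadratic kernel with a separate scale eta_i per input dimension."""
    D = X1.shape[1]
    eta = np.ones(D) if eta is None else np.asarray(eta, dtype=float)
    diff = X1[:, None, :] - X2[None, :, :]                 # pairwise differences
    quad = np.einsum('nmd,d->nm', diff ** 2, eta)          # sum_i eta_i (x_ni - x_mi)^2
    return theta0 * np.exp(-0.5 * quad) + theta2 + theta3 * X1 @ X2.T

# With eta = [1, 0] the second input dimension has no influence at all,
# which is what driving eta_i towards zero achieves during learning.
X = np.random.default_rng(0).normal(size=(5, 2))
print(np.allclose(ard_kernel(X, X, eta=[1.0, 0.0]),
                  ard_kernel(X[:, :1], X[:, :1], eta=[1.0])))
```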

6.4.5 Gaussian processes for classification

For the two-class problem, if we define a Gaussian process over a function $a(x)$ and then transform it using a logistic sigmoid $y = \sigma(a)$, we obtain a non-Gaussian stochastic process over functions $y(x)$, where $y \in (0, 1)$. The distribution of the target variable $t$ is then given by the Bernoulli distribution:

$$p(t \mid a) = \sigma(a)^t(1 - \sigma(a))^{1 - t}$$

Our goal is to determine the predictive distribution $p(t_{N+1} \mid \mathbf{t})$, so we introduce a Gaussian process prior over the vector $\mathbf{a}_{N+1} = (a(x_1), \dots, a(x_{N+1}))^T$:

$$p(\mathbf{a}_{N+1}) = \mathcal{N}(\mathbf{a}_{N+1} \mid 0, \mathbf{C}_{N+1})$$

For numerical reasons it is convenient to introduce a noise-like term governed by a parameter $\nu$ that ensures the covariance matrix is positive definite. Thus the covariance matrix $\mathbf{C}_{N+1}$ has elements given by:

$$C(x_n, x_m) = k(x_n, x_m) + \nu\delta_{nm}$$

The required predictive distribution is given by:

$$p(t_{N+1} = 1 \mid \mathbf{t}_N) = \int p(t_{N+1} = 1 \mid a_{N+1})\,p(a_{N+1} \mid \mathbf{t}_N)\,da_{N+1}$$

where $p(t_{N+1} = 1 \mid a_{N+1}) = \sigma(a_{N+1})$.
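Once a Gaussian approximation $\mathcal{N}(a_{N+1} \mid \mu, \sigma^2)$ to $p(a_{N+1} \mid \mathbf{t}_N)$ is available (for instance from the Laplace approximation of the next subsection), the remaining one-dimensional integral can be estimated by sampling or evaluated with the sigmoid-probit approximation $\sigma(\kappa(\sigma^2)\mu)$, $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$, from Chapter 4. A small sketch comparing the two (the values of $\mu$ and $\sigma^2$ are arbitrary):

```python
import numpy as np
from scipy.special import expit                 # logistic sigmoid

def predictive_mc(mu, var, n_samples=200_000, seed=0):
    """Monte Carlo estimate of  int sigma(a) N(a | mu, var) da."""
    a = np.random.default_rng(seed).normal(mu, np.sqrt(var), size=n_samples)
    return expit(a).mean()

def predictive_probit_approx(mu, var):
    """Approximation sigma(kappa(var) mu) with kappa = (1 + pi var / 8)^{-1/2}."""
    return expit(mu / np.sqrt(1.0 + np.pi * var / 8.0))

# the two estimates agree closely for these illustrative values
print(predictive_mc(0.8, 2.0), predictive_probit_approx(0.8, 2.0))
```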

6.4.6 Laplace approximation

In order to evaluate the predictive distribution, we seek a Gaussian approximation to the posterior distribution over $a_{N+1}$:

$$p(a_{N+1} \mid \mathbf{t}_N) = \int p(a_{N+1}, \mathbf{a}_N \mid \mathbf{t}_N)\,d\mathbf{a}_N = \int p(a_{N+1} \mid \mathbf{a}_N)\,p(\mathbf{a}_N \mid \mathbf{t}_N)\,d\mathbf{a}_N$$

We also need to determine the parameters $\theta$ of the covariance function. The likelihood function is defined by:

$$p(\mathbf{t}_N \mid \theta) = \int p(\mathbf{t}_N \mid \mathbf{a}_N)\,p(\mathbf{a}_N \mid \theta)\,d\mathbf{a}_N$$

6.4.7 Connection to neural networks

In the limit of an infinitely wide hidden layer, and with suitable priors over the weights, the distribution of functions generated by a Bayesian neural network converges to a Gaussian process (Neal, 1996), although in that limit the network loses the ability to share hidden features between different outputs.
