PRML Chapter 6 Kernel Methods
For models which are based on a fixed nonlinear feature space mapping ϕ ( x ) \phi(x) ϕ(x), the kernel function is given by the relation:
k ( x , x ′ ) = ϕ ( x ) T ϕ ( x ′ ) k(x, x^{'}) = \phi(x)^{T}\phi(x^{'}) k(x,x′)=ϕ(x)Tϕ(x′)
Forms of kernel functions:
- linear kernels: k ( x , x ′ ) = x T x k(x,x')=x^T x k(x,x′)=xTx
- stationary kernels: k ( x , x ′ ) = k ( x − x ′ ) k(x,x')=k(x-x') k(x,x′)=k(x−x′)
- homogeneous kernels: k ( x , x ′ ) = k ( ∣ ∣ x − x ′ ∣ ∣ ) k(x,x')=k(||x-x'||) k(x,x′)=k(∣∣x−x′∣∣)
6.1 Dual Representations
Consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error funciton given by:
J ( w ) = 1 2 ∑ n = 1 N { w T ϕ ( x n ) − t n } 2 + λ 2 w T w J(w) = \frac{1}{2}\sum_{n=1}^{N}\{ w^{T}\phi(x_{n}) - t_{n} \}^{2} + \frac{\lambda}{2}w^{T}w J(w)=21n=1∑N{wTϕ(xn)−tn}2+2λwTw
Set the gradient of J ( w ) J(w) J(w) with respect to w w w equals to zero, we can get the solution:
w = − 1 λ ∑ n = 1 N { w T ϕ ( x n ) − t n } ϕ ( x n ) = ∑ n = 1 N a n ϕ ( x n ) = Φ T a w = -\frac{1}{\lambda}\sum_{n=1}^N \{ w^{T}\phi(x_{n}) - t_{n} \}\phi(x_n)=\sum_{n=1}^N a_n\phi(x_n)= \Phi^{T}a w=−λ1n=1∑N{wTϕ(xn)−tn}ϕ(xn)=n=1∑Nanϕ(xn)=ΦTa
a n = − 1 λ { w T ϕ ( x n ) − t n } a_{n} = -\frac{1}{\lambda}\{ w^{T}\phi(x_{n}) - t_{n} \} an=−λ1{wTϕ(xn)−tn}
We now define the Gram matrix K = Φ Φ T K=\Phi\Phi^T K=ΦΦT, which is an N ∗ N N*N N∗N symmetric matrix with elements:
K n m = ϕ ( x n ) T ϕ ( x m ) = k ( x n , x m ) K_{nm} = \phi(x_{n})^{T}\phi(x_{m}) = k(x_{n}, x_{m}) Knm=ϕ(xn)Tϕ(xm)=k(xn,xm)
In terms of the Gram matrix, the sum-of-squares error function can be written as:
J ( a ) = 1 2 a T K K a − a T K t + 1 2 t T t + λ 2 a T K a J(a) = \frac{1}{2}a^{T}\mathbf{KK}a - a^{T}\mathbf{Kt} + \frac{1}{2}\mathbf{t^{T}t} + \frac{\lambda}{2}a^{T}\mathbf{K}a J(a)=21aTKKa−aTKt+21tTt+2λaTKa
Setting the gradient J ( a ) J(a) J(a) with respect to a a a to zero, we obtain the following solution:
a = ( K + λ I N ) − 1 t a = (K + \lambda\mathbf{I}_{N})^{-1}\mathbf{t} a=(K+λIN)−1t
If we substitute this back into the linear regression model, we obtain the following prediction for a new x x x.
y ( x ) = w T ϕ ( x ) = a T Φ ϕ ( x ) = k ( x ) T ( K + λ I N ) − 1 t y(x) = w^T\phi(x)=a^T\Phi\phi(x)=k(x)^{T}(K + \lambda\mathbf{I}_{N})^{-1}\mathbf{t} y(x)=wTϕ(x)=aTΦϕ(x)=k(x)T(K+λIN)−1t
where we have defined the vector k ( x ) k(x) k(x) with elements: k n ( x ) = k ( x n , x ) k_{n}(x) = k(x_{n}, x) kn(x)=k(xn,x).
6.2 Constructing Kernels
Here the kernel function is defined for a one-dimensional input space by
k ( x , x ′ ) = ϕ ( x ) T ϕ ( x ′ ) = ∑ i = 1 M ϕ i ( x ) ϕ i ( x ′ ) k(x,x')=\phi(x)^T\phi(x')=\sum_{i=1}^M\phi_i(x)\phi_i(x') k(x,x′)=ϕ(x)Tϕ(x′)=i=1∑Mϕi(x)ϕi(x′)
A necessary and sufficient condition for a function k ( x , x ′ ) k(x,x') k(x,x′) to be a valid kernel is that the Gram matrix K K K should be positive semidefintie for all possible choices of the set { x n } \{x_n\} {xn}.
Given valid kernels k 1 ( x , x ′ ) k_1(x,x') k1(x,x′) and k 2 ( x , x ′ ) k_2(x,x') k2(x,x′), the following new kernels will also be valid:
{ k ( x , x ′ ) = c k 1 ( x , x ′ ) k ( x , x ′ ) = f ( x ) k 1 ( x , x ′ ) f ( x ′ ) k ( x , x ′ ) = q ( k 1 ( x , x ′ ) ) k ( x , x ′ ) = e x p ( k 1 ( x , x ′ ) ) k ( x , x ′ ) = k 1 ( x , x ′ ) + k 2 ( x , x ′ ) k ( x , x ′ ) = k 1 ( x , x ′ ) k 2 ( x , x ′ ) k ( x , x ′ ) = k 3 ( ϕ ( x ) , ϕ ( x ′ ) ) k ( x , x ′ ) = x T A x ′ k ( x , x ′ ) = k a ( x a , x a ′ ) + k b ( x b , x b ′ ) k ( x , x ′ ) = k a ( x a , x a ′ ) k b ( x b , x b ′ ) \left\{ \begin{aligned} k(x, x^{'}) & = & ck_{1}(x, x^{'}) \\ k(x, x^{'}) & = & f(x)k_{1}(x, x^{'})f(x^{'}) \\ k(x, x^{'}) & = & q(k_{1}(x, x^{'})) \\ k(x, x^{'}) & = & exp(k_{1}(x, x^{'})) \\ k(x, x^{'}) & = & k_{1}(x, x^{'}) + k_{2}(x, x^{'})\\ k(x, x^{'}) & = & k_{1}(x, x^{'})k_{2}(x, x^{'}) \\ k(x, x^{'}) & = & k_{3}(\phi(x), \phi(x^{'}))\\ k(x, x^{'}) & = & x^{T}Ax^{'} \\ k(x, x^{'}) & = & k_{a}(x_{a}, x_{a}^{'}) + k_{b}(x_{b}, x_{b}^{'}) \\ k(x, x^{'}) & = & k_{a}(x_{a}, x_{a}^{'})k_{b}(x_{b}, x_{b}^{'}) \end{aligned} \right. ⎩⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎧k(x,x′)k(x,x′)k(x,x′)k(x,x′)k(x,x′)k(x,x′)k(x,x′)k(x,x′)k(x,x′)k(x,x′)==========ck1(x,x′)f(x)k1(x,x′)f(x′)q(k1(x,x′))exp(k1(x,x′))k1(x,x′)+k2(x,x′)k1(x,x′)k2(x,x′)k3(ϕ(x),ϕ(x′))xTAx′ka(xa,xa′)+kb(xb,xb′)ka(xa,xa′)kb(xb,xb′)
Another commonly used kernel takes the form:
k ( x , x ′ ) = e x p ( − ∣ ∣ x − x ′ ∣ ∣ 2 2 σ 2 ) k(x, x^{'}) = exp\left( -\frac{||x - x^{'}||^2}{2\sigma^{2}} \right) k(x,x′)=exp(−2σ2∣∣x−x′∣∣2)
which is called Gaussian kernel.
One powerful approach to the construction of kernels starts from a probabilistic generative model, which allows us to apply generative models in a discriminative setting. Given a generative model p ( x ) p(x) p(x) we can define a kernel by:
k ( x , x ′ ) = p ( x ) p ( x ′ ) k(x,x')=p(x)p(x') k(x,x′)=p(x)p(x′)
To extend this class of kernels by considering sums over products of different probability distributions:
k ( x , x ′ ) = ∑ i p ( x ∣ i ) p ( x ′ ∣ i ) p ( i ) k(x,x')=\sum_i p(x|i)p(x'|i)p(i) k(x,x′)=i∑p(x∣i)p(x′∣i)p(i)
Taking the limit of an infinite sum, we can also consider kernels of the form:
k ( x , x ′ ) = ∫ p ( x ∣ z ) p ( x ′ ∣ z ) p ( z ) d z k(x,x')=\int p(x|z)p(x'|z)p(z)dz k(x,x′)=∫p(x∣z)p(x′∣z)p(z)dz
Also, by extending the mixture representation we can define a kernel function measuring the similarity of two sequences X X X and X ′ X' X′:
k ( X , X ′ ) = ∑ Z p ( X ∣ Z ) p ( X ′ ∣ Z ) p ( Z ) k(X,X')=\sum_Z p(X|Z)p(X'|Z)p(Z) k(X,X′)=Z∑p(X∣Z)p(X′∣Z)p(Z)
Another models is known as Fisher kernel defined by:
k ( x , x ′ ) = g ( θ , x ) T F − 1 g ( θ , x ′ ) k(x,x')=g(\theta,x)^T F^{-1} g(\theta,x') k(x,x′)=g(θ,x)TF−1g(θ,x′)
here the Fisher score is g ( θ , x ) = ∇ θ l n p ( x ∥ θ ) g(\theta,x)=\nabla_{\theta}lnp(x\|\theta) g(θ,x)=∇θlnp(x∥θ), F F F is the Fisher information matrix, given by F = E x [ g ( θ , x ) g ( θ , x ) T ] F=E_x[g(\theta,x)g(\theta,x)^T] F=Ex[g(θ,x)g(θ,x)T].
6.3 Radial Basis Function Networks
Radial basis functions have the property that each basis function depends only on the radial distance (typically Euclidean) from a centre μ j \mu_j μj, so that
ϕ j ( x ) = h ( ∣ ∣ x − μ j ∣ ∣ ) \phi_j(x)=h(||x-\mu_j||) ϕj(x)=h(∣∣x−μj∣∣)
6.3.1 Nadaraya-Watson model
Starting with kernel density estimation. Suppose we have a training set { x n , t n } \{x_n,t_n\} {xn,tn} and we use a Parzen density estimator to model the joint distribution p ( x , t ) p(x,t) p(x,t), so that
p ( x , t ) = 1 N ∑ n = 1 N f ( x − x n , t − t n ) p(x,t)=\frac{1}{N}\sum_{n=1}^N f(x-x_n,t-t_n) p(x,t)=N1n=1∑Nf(x−xn,t−tn)
We now find an expression for the regression function y ( x ) y(x) y(x), corresponding to the conditional average of the target variable conditioned on the input variable, which is given by:
y ( x ) = E [ t ∣ x ] = ∫ − ∞ ∞ t p ( t ∣ x ) = ∫ t p ( x , t ) d t ∫ p ( x , t ) d t = ∑ n ∫ t f ( x − x n , t − t n ) d t ∑ m ∫ f ( x − x m , t − t m ) d t \begin{aligned} y(x) &= E[t | x] =\int_{-\infin}^{\infin}tp(t|x)\\ &=\frac{\int tp(x,t)dt}{\int p(x,t)dt}\\ &= \frac{\sum_{n}\int t f(x - x_{n}, t - t_{n}) dt} { \sum_{m}\int f(x - x_{m}, t - t_{m}) dt } \end{aligned} y(x)=E[t∣x]=∫−∞∞tp(t∣x)=∫p(x,t)dt∫tp(x,t)dt=∑m∫f(x−xm,t−tm)dt∑n∫tf(x−xn,t−tn)dt
We now assume for simplicity that the component density functions have zero mean so that:
∫ − ∞ ∞ f ( x , t ) t d t = 0 \int_{-\infty}^{\infty} f(x, t)t dt = 0 ∫−∞∞f(x,t)tdt=0
for all values of x x x, using a simple change of variable, we then obtain:
y ( x ) = k ( x , x n ) = g ( x − x n ) t n ∑ m g ( x − x m ) = ∑ n k ( x , x n ) t n y(x) = k(x, x_{n}) = \frac{g(x - x_{n})t_{n}}{\sum_{m}g(x - x_{m})} =\sum_n k(x,x_n) t_n y(x)=k(x,xn)=∑mg(x−xm)g(x−xn)tn=n∑k(x,xn)tn
and we have defined:
g ( x ) = ∫ − ∞ ∞ f ( x , t ) d t g(x) = \int_{-\infty}^{\infty} f(x, t) dt g(x)=∫−∞∞f(x,t)dt
The result is known as the Nadaraya-Watson model or kernel regression.
6.4 Gaussian Processes
6.4.1 Linear regression revisited
Consider a model defined in terms of a linear combination of fixed basis functions:
y ( x ) = w T ϕ ( x ) y(x) = w^{T}\phi(x) y(x)=wTϕ(x)
The prior distribution over w w w:
p ( w ) = N ( w ∣ 0 , α − 1 I ) p(w) = N(w | 0, \alpha^{-1}\mathbf{I}) p(w)=N(w∣0,α−1I)
Evaluate the function at specific values of x x x, the vector is given by:
y = Φ w \mathbf{y} = \Phi w y=Φw
And the mean and covariance:
E [ y ] = Φ E [ w ] = 0 E[\mathbf{y}] = \Phi E[w] = 0 E[y]=ΦE[w]=0
c o v [ y ] = E [ y y T ] = K cov[\mathbf{y}] = E[\mathbf{y}\mathbf{y}^{T}] = \mathbf{K} cov[y]=E[yyT]=K
where K K K is the Gram matrix with elements:
K n m = k ( x n , x m ) = 1 α ϕ ( x n ) T ϕ ( x m ) K_{nm} = k(x_{n}, x_{m}) = \frac{1}{\alpha}\phi(x_{n})^{T}\phi(x_{m}) Knm=k(xn,xm)=α1ϕ(xn)Tϕ(xm)
6.4.2 Gaussian processes for regression
In order to apply Gaussian process models to the problem of regression, we need to take account of the noise on the observed target values:
t n = y n + ϵ n t_{n} = y_{n} + \epsilon_{n} tn=yn+ϵn
Consider noise processes that have a Gaussian distribution, so that:
p ( t n ∣ y n ) = N ( t n ∣ y n , β − 1 ) p(t_{n} | y_{n}) = N(t_{n}| y_{n}, \beta^{-1}) p(tn∣yn)=N(tn∣yn,β−1)
The conditional distribution is given by an isotropic Gaussian of the form:
p ( t ∣ y ) = N ( t ∣ y , β − 1 I N ) p(\mathbf{t} | \mathbf{y}) = N(\mathbf{t}| \mathbf{y}, \beta^{-1}\mathbf{I}_{N}) p(t∣y)=N(t∣y,β−1IN)
From the definition, the marginal distribution p ( y ) p(y) p(y) is given by a Gaussian:
p ( y ) = N ( y ∣ 0 , K ) p(\mathbf{y}) = N(\mathbf{y} | 0, \mathbf{K}) p(y)=N(y∣0,K)
We see that the marginal distribution of t \textbf{t} t is given by:
p ( t ) = ∫ p ( t ∣ y ) p ( y ) d y = N ( t ∣ 0 , C ) p(\mathbf{t}) = \int p(\mathbf{t} | \mathbf{y})p(\mathbf{y}) d\mathbf{y} = N(\mathbf{t} | 0, \mathbf{C}) p(t)=∫p(t∣y)p(y)dy=N(t∣0,C)
where the covariance matrix C C C has elements:
C ( x n , x m ) = k ( x n , x m ) + β − 1 δ n m C(x_{n}, x_{m}) = k(x_{n}, x_{m}) + \beta^{-1}\delta_{nm} C(xn,xm)=k(xn,xm)+β−1δnm
One widely used kernel function for Gaussian process regression is given by the exponential of a quadratic form, with the addition of constant and linear terms to give:
k
(
x
n
,
x
m
)
=
θ
0
e
x
p
{
−
θ
1
2
∣
∣
x
n
−
x
m
∣
∣
2
}
+
θ
2
+
θ
3
x
n
T
x
m
k(x_{n}, x_{m}) = \theta_{0}exp\left\{ -\frac{\theta_{1}}{2}||x_{n} - x_{m}||^{2} \right\} + \theta_{2} + \theta_{3}x_{n}^{T}x_{m}
k(xn,xm)=θ0exp{−2θ1∣∣xn−xm∣∣2}+θ2+θ3xnTxm
The joint distribution over t 1 , . . . , t N + 1 t_1,...,t_{N+1} t1,...,tN+1 will be given by:
p ( t N + 1 ) = N ( t N + 1 ∣ 0 , C N + 1 ) p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1}) p(tN+1)=N(tN+1∣0,CN+1)
So the conditional distribution p ( t N + 1 ∥ t ) p(t_{N+1}\|\textbf{t}) p(tN+1∥t) is a Gaussian distribution with mean and covariance given by:
m ( x N + 1 ) = K t c N − 1 t m(x_{N+1}) = K^{t}c_{N}^{-1}\mathbf{t} m(xN+1)=KtcN−1t
σ 2 ( x N + 1 ) = c − k T C N − 1 k \sigma^{2}(x_{N+1}) = c - k^{T}C_{N}^{-1}\mathbf{k} σ2(xN+1)=c−kTCN−1k
6.4.3 Learning the hyperparameters
Techniques are based on the evaluation of the likelihood function and maximize the log likelihood.
6.4.4 Automatic relevance determination
The example of a Gaussian process with a two-dimensional input space x = ( x 1 , x 2 ) x=(x_1,x_2) x=(x1,x2), having a kernel function of the form:
k ( x , x ′ ) = θ 0 e x p { − 1 2 ∑ i = 1 2 η i ( x i − x i ′ ) 2 } k(x,x')=\theta_0 exp\{-\frac{1}{2}\sum_{i=1}^2 \eta_i(x_i-x_i')^2\} k(x,x′)=θ0exp{−21i=1∑2ηi(xi−xi′)2}
The ARD framework is easily incorporated into the exponential-quadratic kernel to give the following form of kernel function:
k ( x n , x m ) = θ 0 e x p { − 1 2 ∑ i = 1 D η i ( x n i − x m i ) 2 } + θ 2 + θ 3 ∑ i = 1 D x n i x m i k(x_n,x_m)=\theta_0 exp\{-\frac{1}{2}\sum_{i=1}^D\eta_i(x_{ni}-x_{mi})^2\}+\theta_2+\theta_3\sum_{i=1}^D x_{ni}x_{mi} k(xn,xm)=θ0exp{−21i=1∑Dηi(xni−xmi)2}+θ2+θ3i=1∑Dxnixmi
6.4.5 Gaussian processes for classification
For the two-class problem if we define a Gaussian process over a function a ( x ) a(x) a(x) and then transform the function using a logistic sigmoid y = σ ( a ) y=\sigma(a) y=σ(a), then we obtain a non-Gaussian stochastic process over functions y ( x ) y(x) y(x) where y ∈ ( 0 , 1 ) y\in (0,1) y∈(0,1). The distribution of target variable t t t is given by the Bernoulli distribution:
p ( t ∣ a ) = σ ( a ) t ( 1 − σ ( a ) ) 1 − t p(t|a) = \sigma(a)^{t}(1-\sigma(a))^{1-t} p(t∣a)=σ(a)t(1−σ(a))1−t
Our goal is to determine the predictive distribution p ( t N + 1 ∥ t ) p(t_{N+1}\|\textbf{t}) p(tN+1∥t), so we introduce a Gaussian process prior over the vector a N + 1 a_{N+1} aN+1:
p ( a N + 1 ) = N ( a N + 1 ∣ 0 , C N + 1 ) p(a_{N+1}) = N(a_{N+1} | 0, C_{N+1}) p(aN+1)=N(aN+1∣0,CN+1)
For numerical reasons it is convenient to introduce a noise-like term governed by a parameter v v v that ensures the covariance matrix is positive definite. Thus the covariance matrix C N + 1 C_{N+1} CN+1 has elements given by:
C ( x n , x m ) = k ( x n , x m ) + ν δ n m C(x_{n}, x_{m}) = k(x_{n}, x_{m}) + \nu\delta_{nm} C(xn,xm)=k(xn,xm)+νδnm
The required predictive distribution is given by:
p ( t N + 1 = 1 ∣ t N ) = ∫ p ( t N + 1 = 1 ∣ a N + 1 ) p ( a N + 1 ∣ t N ) d a N + 1 p(t_{N+1} = 1 | t_{N}) = \int p(t_{N+1} = 1 | a_{N+1})p(a_{N+1} | t_{N}) da_{N+1} p(tN+1=1∣tN)=∫p(tN+1=1∣aN+1)p(aN+1∣tN)daN+1
where p ( t N + 1 = 1 ∣ a N + 1 ) = σ ( a N + 1 ) p(t_{N+1}=1|a_{N+1})=\sigma(a_{N+1}) p(tN+1=1∣aN+1)=σ(aN+1)
6.4.6 Laplace approximation
In order to evaluate the predictive distribution, we seek a Gaussian approximation to the posterior distribution over a N + 1 a_{N+1} aN+1:
p ( a N + 1 ∣ t N ) = ∫ p ( a N + 1 , a N ∣ t N ) d a N = ∫ p ( a N + 1 ∣ a N ) p ( a N ∣ t N ) d a N p(a_{N+1}|\textbf{t}_N)=\int p(a_{N+1},a_N|\textbf{t}_N)da_N=\int p(a_{N+1}|a_N)p(a_N|\textbf{t}_N)da_N p(aN+1∣tN)=∫p(aN+1,aN∣tN)daN=∫p(aN+1∣aN)p(aN∣tN)daN
We also need to determine the parameters θ \theta θ of the covariance function. The likelihood funciton is defined by:
p ( t N ∣ θ ) = ∫ p ( t N ∣ a N ) p ( a N ∣ θ ) d a N p(\textbf{t}_N|\theta)=\int p(\textbf{t}_N|a_N)p(a_N|\theta)da_N p(tN∣θ)=∫p(tN∣aN)p(aN∣θ)daN