
PRML Chapter 1 Introduction

1.1 Example: Polynomial Curve Fitting

For a simple regression problem, our goal is to use the training set to predict the value $\hat{t}$ of the target variable for a new input $\hat{x}$. The simplest approach is to fit the data with a polynomial:

$$y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$

To determine the coefficients, we minimize an error function. A common choice is the sum-of-squares error:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2$$

To keep the coefficients from growing to huge values, we use regularization. Adding a penalty term helps avoid overfitting:

$$\hat{E}(w) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2}\|w\|^2$$

where:

$$\|w\|^2 = w^T w = w_0^2 + w_1^2 + \dots + w_M^2$$
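As a quick illustration, here is a minimal numpy sketch of regularized polynomial least squares (the function name and data are illustrative, not from the book). Minimizing $\hat{E}(w)$ leads to the linear system $(\Phi^T\Phi + \lambda I)w = \Phi^T t$, where $\Phi$ is the design matrix of powers of $x$:

```python
import numpy as np

# Minimal sketch of (regularized) polynomial least squares.
def fit_polynomial(x, t, M, lam=0.0):
    Phi = np.vander(x, M + 1, increasing=True)  # Phi[n, j] = x_n ** j
    A = Phi.T @ Phi + lam * np.eye(M + 1)       # normal equations
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

w = fit_polynomial(x, t, M=9, lam=np.exp(-18))  # ln(lambda) = -18, as in the book's example
print(w)
```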

1.2 Probability Theory

Two important rules in probability theory:

  • sum rule: $p(X) = \sum_{Y} p(X, Y)$
  • product rule: $p(X, Y) = p(Y|X)p(X)$

Based on these rules we can derive Bayes' theorem:

$$p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)} = \frac{p(X|Y)p(Y)}{\sum_{Y} p(X|Y)p(Y)}$$
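As a quick check, here is a small Python sketch of these rules in the spirit of the book's fruit-and-box example (the numbers follow PRML's example: $p(\text{red}) = 0.4$, and apples make up 1/4 of the red box and 3/4 of the blue box):

```python
# Sum rule, product rule, and Bayes' theorem on PRML's fruit-box example.
p_box = {"red": 0.4, "blue": 0.6}                # prior p(B)
p_apple_given_box = {"red": 0.25, "blue": 0.75}  # likelihood p(F = apple | B)

# sum + product rules: p(F = apple) = sum_B p(F = apple | B) p(B)
p_apple = sum(p_apple_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem: p(B = red | F = apple)
posterior_red = p_apple_given_box["red"] * p_box["red"] / p_apple
print(p_apple, posterior_red)  # 0.55, ~0.18
```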

1.2.1 Probability densities

For continuous variables, we define the probability density $p(x)$ such that the probability of $x$ falling in an interval $(a, b)$ is:

$$p(x \in (a, b)) = \int_{a}^{b} p(x)\,dx$$

We also have a transformation rule between variables: under a change of variables $x = g(y)$,

$$p_y(y) = p_x(g(y))\,|g'(y)|$$

1.2.2 Expectations and covariances

Here we define the expectation, variance, and covariance:

$$E[f] = \sum_{x} p(x) f(x)$$

$$\mathrm{var}[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2$$

$$\mathrm{cov}[x, y] = E_{x,y}[\{x - E[x]\}\{y - E[y]\}] = E_{x,y}[xy] - E[x]E[y]$$
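These identities are easy to sanity-check numerically; a small sketch (sample sizes and seeds are arbitrary):

```python
import numpy as np

# Check var[f] = E[f^2] - E[f]^2 and cov[x, y] = E[xy] - E[x]E[y] on samples.
rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # y is correlated with x

f = x ** 2
print(np.var(f), np.mean(f ** 2) - np.mean(f) ** 2)  # nearly equal
print(np.cov(x, y, bias=True)[0, 1],
      np.mean(x * y) - np.mean(x) * np.mean(y))      # nearly equal
```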

1.2.3 Bayesian probabilities

Here we take the Bayesian perspective to describe uncertainty:

$$p(w|D) = \frac{p(D|w)p(w)}{p(D)} = \frac{p(D|w)p(w)}{\int p(D|w)p(w)\,dw}$$

which allows us to evaluate the uncertainty in $w$ after we observe the dataset $D$.

1.2.4 The Gaussian distribution

The Gaussian distribution is defined as:

$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$$

For an independent and identically distributed dataset, the likelihood function can be written as:

$$p(x|\mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \sigma^2)$$

It can also be written in log form:

$$\ln p(x|\mu, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)$$

Setting the partial derivatives to zero gives the maximum likelihood solutions:

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$$

$$\sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2$$

The estimate of the mean is unbiased, but the variance is systematically underestimated:

$$E[\mu_{ML}] = \mu$$

$$E[\sigma_{ML}^2] = \frac{N-1}{N}\sigma^2$$
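A quick simulation makes the bias visible (a sketch; the parameter values are arbitrary):

```python
import numpy as np

# Show empirically that sigma^2_ML underestimates the true variance
# by a factor of (N - 1) / N.
rng = np.random.default_rng(2)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)
sigma2_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)

print(sigma2_ml.mean())      # ~3.2, not 4.0
print((N - 1) / N * sigma2)  # 3.2
```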

1.2.5 Curve fitting re-visited

We assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with mean equal to $y(x, w)$. Thus we have

$$p(t|x, w, \beta) = \mathcal{N}(t \,|\, y(x, w), \beta^{-1})$$

The likelihood function and log-likelihood function are:

$$p(t|x, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \,|\, y(x_n, w), \beta^{-1})$$

$$\ln p(t|x, w, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, w) - t_n\}^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$

When we solve for $w_{ML}$, we can see that maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function.

Similarly, for the precision parameter $\beta$, we have:

$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{y(x_n, w_{ML}) - t_n\}^2$$

To make predictions, we substitute the maximum likelihood solutions into the model to obtain the predictive distribution:

$$p(t|x, w_{ML}, \beta_{ML}) = \mathcal{N}(t \,|\, y(x, w_{ML}), \beta_{ML}^{-1})$$

Now introduce a prior distribution over $w$ (assumed Gaussian):

$$p(w|\alpha) = \mathcal{N}(w|0, \alpha^{-1}I) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2} w^T w\right\}$$

We can apply Bayes' theorem and maximize the posterior (the MAP technique):

$$p(w|x, t, \alpha, \beta) \propto p(t|x, w, \beta)\, p(w|\alpha)$$

Maximizing this posterior is equivalent to minimizing the regularized sum-of-squares error function, with regularization parameter $\lambda = \alpha/\beta$:

$$\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, w) - t_n\}^2 + \frac{\alpha}{2} w^T w$$
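In closed form, the MAP weights solve $(\alpha I + \beta\Phi^T\Phi)w = \beta\Phi^T t$, i.e. ridge regression with $\lambda = \alpha/\beta$. A minimal sketch (the $\alpha$ and $\beta$ values follow the book's running example; the rest is illustrative):

```python
import numpy as np

# MAP weights for polynomial regression with a Gaussian prior:
# (alpha * I + beta * Phi^T Phi) w = beta * Phi^T t
def map_weights(x, t, M, alpha, beta):
    Phi = np.vander(x, M + 1, increasing=True)
    A = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    return np.linalg.solve(A, beta * Phi.T @ t)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

w_map = map_weights(x, t, M=9, alpha=5e-3, beta=11.1)  # values from the book
print(w_map)
```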

1.2.6 Bayesian curve fitting

The pure Bayesian treatment simply corresponds to a consistent application of the sum and product rules.

$$p(t|x, X, t) = \int p(t|x, w)\, p(w|X, t)\, dw$$

1.3 Model Selection

When data for training and validation is limited, we can use cross-validation to make better use of it. However, it can be computationally costly.
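A minimal sketch of S-fold cross-validation in numpy, reusing the hypothetical `fit_polynomial` helper from the earlier sketch; each of the $S$ runs holds out one fold for validation and trains on the remaining $S-1$:

```python
import numpy as np

# S-fold cross-validation (fit_polynomial is the hypothetical helper
# from the earlier polynomial-fitting sketch).
def cross_validate(x, t, M, lam, S=5):
    idx = np.random.default_rng(4).permutation(x.size)
    folds = np.array_split(idx, S)
    errors = []
    for s in range(S):
        val = folds[s]
        train = np.concatenate([folds[i] for i in range(S) if i != s])
        w = fit_polynomial(x[train], t[train], M, lam)
        Phi_val = np.vander(x[val], M + 1, increasing=True)
        errors.append(np.mean((Phi_val @ w - t[val]) ** 2))
    return np.mean(errors)  # average validation error over the S runs
```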

1.4 The Curse of Dimensionality

In a high-dimensional space the data become very sparse. Consider a sphere of radius $r$ in $D$ dimensions, and look at the fraction of its volume lying in a thin shell near the surface (note what happens as $D$ grows, even for small $\epsilon$):

$$V_D(r) = K_D r^D$$

$$\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D$$
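The formula is easy to evaluate; even a thin shell ($\epsilon = 0.01$) contains almost all of the volume once $D$ is large:

```python
# Fraction of a unit sphere's volume in a shell of thickness epsilon.
eps = 0.01
for D in (1, 2, 20, 200, 1000):
    print(D, 1 - (1 - eps) ** D)
# D=1: 0.01, D=20: ~0.18, D=200: ~0.87, D=1000: ~0.99996
```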

1.5 Decision Theory

When we predict the category of a new data point $x$, we can use Bayes' theorem:

$$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)}$$

Intuitively we would choose the class having the higher posterior probability.

1.5.1 Minimizing the misclassification rate

We should minimize the term:

$$p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1)$$

More generally, for $K$ classes it is easier to maximize the probability of being correct:

$$p(\text{correct}) = \sum_{k=1}^{K} p(x \in R_k, C_k) = \sum_{k=1}^{K}\int_{R_k} p(x, C_k)\,dx$$

1.5.2 Minimizing the expected loss

The uncertainty in the true category is captured by the joint distribution $p(x, C_k)$, so we minimize the average (expected) loss:

$$E[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\,dx$$
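For a new input $x$ this reduces to choosing the class $j$ that minimizes $\sum_k L_{kj}\, p(C_k|x)$. A small sketch using a loss matrix in the spirit of the book's cancer-screening example (the posterior values are hypothetical):

```python
import numpy as np

# Pick the decision j minimizing sum_k L[k, j] * p(C_k | x).
L = np.array([[0, 1000],   # true cancer: missing it is very costly
              [1, 0]])     # true normal: a false alarm costs little
posterior = np.array([0.05, 0.95])  # hypothetical p(C_k | x)

expected_loss = posterior @ L       # expected loss of each decision
print(expected_loss, expected_loss.argmin())
# -> picks "cancer" despite its low posterior, because misses cost 1000x
```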

1.5.3 The reject option

Set a threshold $\theta$ and reject those inputs $x$ for which the largest posterior probability $p(C_k|x)$ is lower than $\theta$.

1.5.4 Inference and decision

  • generative models: model the joint distribution of inputs and outputs $p(x, C_k)$.
  • discriminative models: model the posterior probabilities $p(C_k|x)$ directly.

1.5.5 Loss functions for regression

The expected loss is given by:

$$E[L] = \iint L(t, y(x))\, p(x, t)\,dx\,dt$$

For example, when we use the squared loss:

$$E[L] = \iint \{y(x) - t\}^2 p(x, t)\,dx\,dt$$

Setting $\frac{\partial E[L]}{\partial y(x)} = 0$, we obtain the regression function; the optimal solution is the conditional average:

$$y(x) = \frac{\int t\, p(x, t)\,dt}{p(x)} = \int t\, p(t|x)\,dt = E_t[t|x]$$

From another point of view, the expected loss can be decomposed as follows (the second term is the intrinsic noise):

$$E[L] = \int \{y(x) - E[t|x]\}^2 p(x)\,dx + \int \mathrm{var}[t|x]\, p(x)\,dx$$
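A quick simulation illustrates both facts: the conditional mean achieves the minimum, and that minimum equals the noise term (the data-generating model here is illustrative):

```python
import numpy as np

# t = sin(2*pi*x) + noise, so E[t | x] = sin(2*pi*x) and var[t | x] = 0.09.
rng = np.random.default_rng(5)
x = rng.uniform(size=200_000)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

cond_mean = np.sin(2 * np.pi * x)    # y(x) = E[t | x]
other = np.sin(2 * np.pi * x) + 0.2  # any other predictor does worse
print(np.mean((cond_mean - t) ** 2))  # ~0.09, the noise term
print(np.mean((other - t) ** 2))      # ~0.13 = 0.09 + 0.2^2
```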

1.6 Information Theory

The amount of information can be viewed as the 'degree of surprise' on learning the value of $x$, and it should be non-negative:

$$h(x) = -\log_2 p(x)$$

And the entropy can be seen as the average amount of information:

$$H[x] = -\sum_{x} p(x)\log_2 p(x)$$
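A small sketch in Python (the second distribution is the 8-state example from the book, whose entropy is exactly 2 bits):

```python
import numpy as np

# Entropy of a discrete distribution, in bits.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log 0 is taken as 0
    return -(p * np.log2(p)).sum()

print(entropy(np.full(8, 1 / 8)))                             # 3.0 (uniform)
print(entropy([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])) # 2.0 (skewed)
```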

1.6.1 Relative entropy and mutual information

Suppose that we are modeling an unknown distribution $p(x)$ using an approximating distribution $q(x)$. Then the average additional amount of information required to specify $x$ as a result of using $q(x)$ instead of the true $p(x)$ is given by:

$$KL(p\,\|\,q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx$$

which is called the relative entropy or KL divergence.

If we have observed a finite set of training points (drawn from $p(x)$), then the expectation with respect to $p(x)$ can be approximated by a finite sum, so that:

$$KL(p\,\|\,q) \simeq \frac{1}{N}\sum_{n=1}^{N}\{-\ln q(x_n|\theta) + \ln p(x_n)\}$$

which shows that minimizing KL divergence is equivalent to maximizing the likelihood function.

We can also use the mutual information to judge whether two variables are close to independent:

$$I[x, y] = KL(p(x, y)\,\|\,p(x)p(y)) = -\iint p(x, y)\ln\left(\frac{p(x)p(y)}{p(x, y)}\right)dx\,dy$$

The relationship between mutual information and conditional entropy:

$$I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]$$
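A sketch verifying this identity on a hypothetical 2x2 joint table:

```python
import numpy as np

# Mutual information of a discrete joint distribution, and a check of
# I[x, y] = H[x] - H[x | y].
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)  # marginals

I = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
        for i in range(2) for j in range(2))

H_x = -(p_x * np.log2(p_x)).sum()
H_x_given_y = -sum(p_xy[i, j] * np.log2(p_xy[i, j] / p_y[j])
                   for i in range(2) for j in range(2))
print(I, H_x - H_x_given_y)  # equal (~0.278 bits)
```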
