PRML Chapter 1 Introduction
1.1 Example: Polynomial Curve Fitting
For a simple regression problem, our goal is to use the training set to predict the target value $\hat{t}$ for a new input $\hat{x}$. The simplest approach is to fit the data with a polynomial of order $M$:
$$y(x,w) = w_{0} + w_{1} x + w_{2} x^{2} + \dots + w_{M}x^{M} = \sum_{j=0}^{M}w_{j}x^{j}$$
To determine the coefficients $w$, we minimize an error function that measures the misfit between $y(x,w)$ and the training targets $t_n$. A common choice is the sum-of-squares error:
$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_{n},w) - t_{n}\}^{2}$$
A flexible model can overfit, which often shows up as coefficients with very large magnitudes. Regularization counters this by adding a penalty term that discourages large coefficients:
$$\hat{E}(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_{n},w) - t_{n}\}^{2} + \frac{\lambda}{2} \|w\|^{2}$$
where
$$\|w\|^{2} = w^{T}w = w_{0}^{2} + w_{1}^{2} + \dots + w_{M}^{2}$$
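The regularized error above is quadratic in $w$, so its minimum can be found by solving a linear system. The following is a minimal sketch in plain Python, not a definitive implementation: the helper names (`solve_linear`, `fit_polynomial`, `predict`) and the toy data are assumptions made for illustration, and the normal equations $(\Phi^{T}\Phi + \lambda I)w = \Phi^{T}t$ with $\Phi_{nj} = x_n^{j}$ are the standard closed form for this error function.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so A and b are not modified in place.
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_polynomial(xs, ts, M, lam=0.0):
    """Minimize 1/2 sum_n (y(x_n, w) - t_n)^2 + lam/2 ||w||^2 for order M."""
    # Design matrix Phi[n][j] = x_n ** j.
    Phi = [[x ** j for j in range(M + 1)] for x in xs]
    # Normal equations: (Phi^T Phi + lam I) w = Phi^T t.
    A = [[sum(row[i] * row[j] for row in Phi) + (lam if i == j else 0.0)
          for j in range(M + 1)] for i in range(M + 1)]
    b = [sum(row[i] * t for row, t in zip(Phi, ts)) for i in range(M + 1)]
    return solve_linear(A, b)


def predict(w, x):
    """Evaluate y(x, w) = sum_j w_j x^j."""
    return sum(wj * x ** j for j, wj in enumerate(w))


# Hypothetical noise-free toy data generated from t = 1 + 2x - 3x^2.
xs = [i / 9 for i in range(10)]
ts = [1 + 2 * x - 3 * x ** 2 for x in xs]

w_plain = fit_polynomial(xs, ts, M=2)           # no regularization
w_ridge = fit_polynomial(xs, ts, M=2, lam=1.0)  # with the penalty term
```

With `lam=0.0` the fit should recover the generating coefficients on this toy data, while a positive `lam` shrinks $\|w\|^{2}$ at the cost of a slightly worse fit to the training points.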
1.2 Probability Theory
Two important rules in probability theory:
- sum rule: $p(X)=\sum_{Y}p(X,Y)$
- product rule: $p(X,Y)=p(Y|X)p(X)$
Based on these rules, we can derive Bayes’ theorem:
$$p(Y | X) = \frac{p(X | Y)p(Y)}{p(X)} = \frac{p(X | Y)p(Y)}{\sum_{Y}p(X|Y)p(Y)}$$
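As a quick numerical check of the sum rule, the product rule and Bayes’ theorem, the sketch below uses a small joint table. The joint probabilities and function names are hypothetical values chosen for illustration.

```python
# Hypothetical joint distribution p(X, Y) over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}


def marginal_x(x):
    # Sum rule: p(X) = sum_Y p(X, Y)
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    # Sum rule applied to the other variable: p(Y) = sum_X p(X, Y)
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_x_given_y(x, y):
    # Product rule rearranged: p(X | Y) = p(X, Y) / p(Y)
    return joint[(x, y)] / marginal_y(y)


def bayes_y_given_x(y, x):
    # Bayes' theorem, with the denominator expanded by the sum rule.
    numerator = cond_x_given_y(x, y) * marginal_y(y)
    denominator = sum(cond_x_given_y(x, yi) * marginal_y(yi) for yi in (0, 1))
    return numerator / denominator


posterior = bayes_y_given_x(1, 1)        # p(Y=1 | X=1) via Bayes' theorem
direct = joint[(1, 1)] / marginal_x(1)   # the same quantity read off the table
```

Both routes should agree, since Bayes’ theorem is only a rearrangement of the two rules.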
1.2.1 Probability densities
For continuous variables, we define the probability density over x as:
$$p(x \in (a,b)) = \int_{a}^{b} p(x)\,dx$$
Under a change of variables $x = g(y)$, the density transforms as:
$$p_y(y)=p_x(g(y))\,|g'(y)|$$
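The transformation rule can be checked numerically. The sketch below assumes, purely for illustration, that $x$ is uniform on $(0, 1)$ and that $x = g(y) = e^{-y}$; it compares the probability mass of an interval computed from $p_y$ with the mass of the corresponding interval in $x$. The function names and the midpoint-rule integrator are assumptions of this example.

```python
import math


def p_x(x):
    # Assumed density for illustration: x uniform on (0, 1).
    return 1.0 if 0.0 < x < 1.0 else 0.0


def g(y):
    # Assumed change of variables x = g(y) = exp(-y), i.e. y = -ln x.
    return math.exp(-y)


def g_prime(y):
    return -math.exp(-y)


def p_y(y):
    # Transformation rule: p_y(y) = p_x(g(y)) |g'(y)|
    return p_x(g(y)) * abs(g_prime(y))


def integrate(f, a, b, steps=10000):
    # Midpoint rule on [a, b].
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


a, b = 0.5, 2.0
mass_via_y = integrate(p_y, a, b)   # P(y in (a, b)) from the transformed density
mass_via_x = g(a) - g(b)            # P(x in (g(b), g(a))) under the uniform density
```

The two masses should match up to the integration error, which is the property the factor $|g'(y)|$ is there to preserve.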
1.2.2 Expectations and covariances
In this section we define the expectation (written here for a discrete variable), the variance and the covariance:
$$E[f] = \sum_{x}p(x)f(x)$$
$$\mathrm{var}[f] = E[(f(x) - E[f(x)])^{2}] = E[f(x)^2] - E[f(x)]^2$$
$$\mathrm{cov}[x, y] = E_{x,y}[\{x - E[x]\}\{y-E[y]\}] = E_{x,y}[xy] - E[x]E[y]$$
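Each of the variance and covariance definitions has two equivalent forms. The sketch below evaluates both forms on a small, hypothetical discrete joint distribution to confirm that they agree; the table values and the helper `expect` are assumptions of this example.

```python
# Hypothetical joint distribution p(x, y) on a few points.
joint = {(0, 1): 0.2, (1, 1): 0.3, (1, 3): 0.1, (2, 3): 0.4}


def expect(f):
    # E[f] = sum_{x, y} p(x, y) f(x, y)
    return sum(p * f(x, y) for (x, y), p in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# var[x] from the definition and from the shortcut E[x^2] - E[x]^2.
var_def = expect(lambda x, y: (x - mean_x) ** 2)
var_short = expect(lambda x, y: x ** 2) - mean_x ** 2

# cov[x, y] from the definition and from the shortcut E[xy] - E[x]E[y].
cov_def = expect(lambda x, y: (x - mean_x) * (y - mean_y))
cov_short = expect(lambda x, y: x * y) - mean_x * mean_y
```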
1.2.3 Bayesian probabilities
Here we take the Bayesian perspective and use probability to describe uncertainty in the model parameters $w$:
$$p(w | D) = \frac{p(D | w)p(w)}{p(D)} = \frac{p(D | w)p(w)}{\int p(D | w)p(w)\,dw}$$
which allows us to evaluate the uncertainty in $w$ after we have observed the dataset $D$.
1.2.4 The Gaussian distribution
The Gaussian distribution is defined as:
$$N(x | \mu,\sigma^{2}) = \frac{1}{(2\pi\sigma^{2})^{\frac{1}{2}}}\exp\left\{-\frac{1}{2\sigma^{2}}(x-\mu)^{2}\right\}$$
For a dataset $x = (x_1, \dots, x_N)$ of independent and identically distributed observations, the likelihood function can be written as:
$$p(x | \mu, \sigma^{2}) = \prod_{n=1}^{N}N(x_{n} | \mu, \sigma^{2})$$
Taking the logarithm gives the log likelihood:
$$\ln p(x| \mu, \sigma^{2}) = -\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_{n} - \mu)^{2} - \frac{N}{2}\ln \sigma^{2} - \frac{N}{2}\ln(2\pi)$$
Setting the partial derivatives with respect to $\mu$ and $\sigma^{2}$ to zero gives the maximum likelihood (ML) solution:
$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_{n}$$
$$\sigma_{ML}^{2} = \frac{1}{N}\sum_{n=1}^{N}(x_{n} - \mu_{ML})^{2}$$
The ML estimate of the mean is unbiased, but the ML estimate of the variance is biased: on average it underestimates the true variance by a factor of $(N-1)/N$.
$$E[\mu_{ML}] = \mu$$
$$E[\sigma_{ML}^{2}] = \frac{N-1}{N}\sigma^{2}$$
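The bias of the variance estimate is easy to observe in simulation. The sketch below draws many small Gaussian samples and averages the ML estimates over them; the sample size, number of trials, true parameters and random seed are arbitrary choices made for this illustration.

```python
import random


def ml_estimates(xs):
    """Return (mu_ML, sigma^2_ML) for a list of observations."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # divides by N, not N - 1
    return mu, var


random.seed(0)                       # fixed seed so the run is repeatable
N, trials = 5, 20000                 # assumed settings for the experiment
true_mu, true_sigma = 1.0, 2.0       # assumed true parameters

mu_sum = var_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(true_mu, true_sigma) for _ in range(N)]
    mu_ml, var_ml = ml_estimates(sample)
    mu_sum += mu_ml
    var_sum += var_ml

avg_mu = mu_sum / trials             # should be close to true_mu
avg_var = var_sum / trials           # should be close to expected_var, not sigma^2
expected_var = (N - 1) / N * true_sigma ** 2
```

Multiplying $\sigma_{ML}^{2}$ by $N/(N-1)$ removes the bias; for large $N$ the correction becomes negligible.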
1.2.5 Curve fitting re-visited
We assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with mean $y(x,w)$ and precision (inverse variance) $\beta$. Thus we have
$$p(t | x,w,\beta) = N(t | y(x,w), \beta^{-1})$$
The likelihood function and the log likelihood function for the training data are:
$$p(t | x, w, \beta) = \prod_{n=1}^{N}N(t_{n} | y(x_{n},w), \beta^{-1})$$
$$\ln p(t | x, w, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_{n},w) - t_{n}\}^{2} + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$$
When we solve for $w_{ML}$, only the first term depends on $w$, so maximizing the likelihood is equivalent to minimizing the sum-of-squares error function.
Similarly, maximizing with respect to the precision parameter $\beta$ gives:
$$\frac{1}{\beta_{ML}}=\frac{1}{N}\sum_{n=1}^N\{y(x_n,w_{ML})-t_n\}^2$$
Substituting the ML solution gives the predictive distribution for $t$:
$$p(t | x, w_{ML}, \beta_{ML}) = N(t | y(x, w_{ML}), \beta_{ML}^{-1})$$
We now introduce a prior distribution over $w$ (assumed to be a zero-mean Gaussian with precision $\alpha$):
$$p(w | \alpha) = N(w | 0, \alpha^{-1}I) = \left(\frac{\alpha}{2\pi}\right)^{\frac{M+1}{2}}\exp\left\{-\frac{\alpha}{2}w^{T}w \right\}$$
Applying Bayes’ theorem, the posterior is proportional to the likelihood times the prior, and we can choose $w$ by maximizing it (the maximum a posteriori, or MAP, technique):
$$p(w | x,t, \alpha, \beta) \propto p(t | x, w, \beta)p(w | \alpha)$$
Maximizing the posterior is equivalent to minimizing the regularized sum-of-squares error function:
$$\frac{\beta}{2}\sum_{n=1}^{N}\{y(x_{n},w) - t_{n} \}^{2} + \frac{\alpha}{2}w^{T}w$$
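To make the link explicit, here is a short derivation sketch that takes the negative logarithm of the posterior, using the Gaussian likelihood and prior from this section. Terms that do not depend on $w$ are collected into a constant.

$$-\ln p(w | x,t,\alpha,\beta) = -\ln p(t | x,w,\beta) - \ln p(w | \alpha) + \text{const} = \frac{\beta}{2}\sum_{n=1}^{N}\{y(x_{n},w) - t_{n}\}^{2} + \frac{\alpha}{2}w^{T}w + \text{const}$$

Dividing by $\beta$ does not change the minimizer, so this is the regularized error $\hat{E}(w)$ of Section 1.1 with $\lambda = \alpha/\beta$.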
1.2.6 Bayesian curve fitting
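A fully Bayesian treatment corresponds to a consistent application of the sum and product rules: rather than using a point estimate of $w$, we integrate over all values of $w$. Writing $X$ and $\mathbf{t}$ for the training inputs and targets, the predictive distribution for a new input $x$ is:
$$p(t | x,X,\mathbf{t}) = \int p(t | x, w)p(w | X, \mathbf{t})\,dw$$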
1.3 Model Selection
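When the data available for training and validation are limited, we can use cross-validation: split the data into $S$ groups, train on $S-1$ of them, evaluate on the held-out group, and average the scores over the $S$ runs. This lets all of the data contribute to assessing performance, but it requires $S$ training runs per candidate model, which can be computationally expensive.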
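The procedure can be sketched in a few lines of plain Python. Everything below is illustrative: the function names, the two candidate models (a constant and a straight line) and the toy data are assumptions, and the folds are contiguous for simplicity, whereas in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Split range(n) into k nearly equal contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ts, fit, predict, k):
    """Average held-out squared error over k folds."""
    total = 0.0
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_t = [t for i, t in enumerate(ts) if i not in held]
        model = fit(train_x, train_t)
        total += sum((predict(model, xs[i]) - ts[i]) ** 2 for i in held_out)
    return total / len(xs)


# Two hypothetical candidate models: a constant and a straight line.
def fit_constant(xs, ts):
    return sum(ts) / len(ts)


def predict_constant(model, x):
    return model


def fit_line(xs, ts):
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    sxt = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxt / sxx
    return (mt - slope * mx, slope)


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


# Toy data: a line t = 2x + 1 with a small alternating perturbation.
xs = [float(i) for i in range(12)]
ts = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

err_constant = cross_validate(xs, ts, fit_constant, predict_constant, k=4)
err_line = cross_validate(xs, ts, fit_line, predict_line, k=4)
```

The model with the lower cross-validated error would be preferred; on this toy data that should be the straight line.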
1.4 The Curse of Dimensionality
In a high-dimensional space, data become very sparse. As an example, take a sphere of radius $r$ in $D$ dimensions, whose volume is
$$V_D(r)=K_D r^D$$
The fraction of the unit sphere's volume that lies in the thin shell between radius $1-\epsilon$ and $1$ is
$$\frac{V_D(1)-V_D(1-\epsilon)}{V_D(1)}=1-(1-\epsilon)^D$$
For any fixed $\epsilon > 0$, however small, this fraction tends to 1 as $D$ grows, so most of the volume is concentrated near the surface.
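A few evaluations of the shell fraction make the effect concrete. The helper name `shell_fraction` and the chosen values of $\epsilon$ and $D$ are assumptions of this small sketch.

```python
def shell_fraction(eps, D):
    # Fraction of a unit D-sphere's volume lying between radius 1 - eps and 1.
    return 1.0 - (1.0 - eps) ** D


# Print the fraction for a thin shell (eps = 0.05) as the dimension grows.
for D in (1, 2, 20, 200):
    print(D, round(shell_fraction(0.05, D), 4))
```

In one dimension the shell holds only 5% of the volume, while in a few hundred dimensions it holds almost all of it.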
1.5 Decision Theory
When we predict the class of a new data point $x$, we can use Bayes’ theorem:
$$p(C_{k} | x) = \frac{p(x | C_{k})p(C_{k})}{p(x)}$$
Intuitively, we would choose the class with the highest posterior probability.
1.5.1 Minimizing the misclassification rate
For two classes with decision regions $R_1$ and $R_2$, we should minimize the probability of a mistake:
$$p(\text{mistake}) = p(x \in R_{1},C_{2}) + p(x \in R_{2},C_{1})$$
More generally, for $K$ classes, it is easier to maximize the probability of being correct:
$$p(\text{correct})=\sum_{k=1}^K p(x\in R_k,C_k)=\sum_{k=1}^K\int_{R_k}p(x,C_k)\,dx$$
1.5.2 Minimizing the expected loss
The uncertainty in the true class is represented by the joint distribution $p(x,C_k)$. Given a loss matrix with elements $L_{kj}$, the loss incurred when an input of true class $C_k$ is assigned to class $C_j$, we minimize the average loss:
$$E[L]=\sum_k\sum_j\int_{R_j}L_{kj}p(x,C_k)\,dx$$
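For a single input $x$, the expected loss is minimized by assigning $x$ to the class $j$ with the smallest value of $\sum_k L_{kj}\,p(C_k | x)$. The sketch below applies that rule; the loss matrix, the posterior values and the function name `decide` are hypothetical and chosen only to show that the decision can differ from the most probable class.

```python
def decide(posteriors, loss):
    """Return (j, expected) where j minimizes sum_k loss[k][j] * p(C_k | x)."""
    n_classes = len(posteriors)
    expected = [sum(loss[k][j] * posteriors[k] for k in range(n_classes))
                for j in range(n_classes)]
    best = min(range(n_classes), key=lambda j: expected[j])
    return best, expected


# Hypothetical loss matrix: loss[k][j] is the cost of predicting class j
# when the true class is k. Missing class 0 costs 100 times more.
loss = [[0, 100],
        [1, 0]]
zero_one = [[0, 1],
            [1, 0]]

posteriors = [0.1, 0.9]   # assumed p(C_0 | x), p(C_1 | x)

choice, expected = decide(posteriors, loss)
choice_01, _ = decide(posteriors, zero_one)
```

With the asymmetric loss the rule picks class 0 even though class 1 is nine times more probable; with the 0-1 loss it reduces to choosing the class with the highest posterior.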
1.5.3 The reject option
Set a threshold $\theta$ and reject any input $x$ whose largest posterior probability $p(C_k | x)$ is lower than $\theta$.
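A minimal sketch of the reject rule follows. The function name and the use of `None` to signal a rejection are choices made for this example.

```python
def classify_with_reject(posteriors, theta):
    """Return the most probable class index, or None if its posterior is below theta."""
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return best if posteriors[best] >= theta else None


confident = classify_with_reject([0.05, 0.90, 0.05], theta=0.8)   # accepted
ambiguous = classify_with_reject([0.40, 0.35, 0.25], theta=0.8)   # rejected
```

Setting $\theta = 1$ rejects almost everything, while any $\theta$ below $1/K$ for $K$ classes rejects nothing, because the largest posterior is never smaller than $1/K$.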
1.5.4 Inference and decision
- generative models: model the joint distribution of inputs and outputs $p(x,C_k)$.
- discriminative models: model the posterior probabilities $p(C_k | x)$ directly.
1.5.5 Loss functions for regression
The expected loss is given by:
$$E[L] = \int\int L(t,y(x))\,p(x, t)\,dx\,dt$$
For example, when we use the squared loss:
$$E[L] = \int\int \{y(x) - t\}^{2}p(x, t)\,dx\,dt$$
Setting the functional derivative $\frac{\delta E[L]}{\delta y(x)} = 2\int \{y(x) - t\}p(x,t)\,dt$ to zero gives the regression function; the optimal solution is the conditional average:
$$y(x)=\frac{\int t\,p(x,t)\,dt}{p(x)}=\int t\,p(t | x)\,dt=E_t[t | x]$$
From another point of view, the expected loss function can be written as (where the second part can be seen as noise):
$$E[L] = \int \{y(x) - E[t | x]\}^{2}p(x)\,dx + \int \mathrm{var}[t | x]\,p(x)\,dx$$
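At a single fixed input $x$, the decomposition says that the expected squared loss of a prediction $y$ equals $(y - E[t | x])^{2} + \mathrm{var}[t | x]$. The sketch below checks this on a hypothetical discrete conditional distribution and searches a grid of candidate predictions; the distribution, the grid and the function names are assumptions of this example.

```python
# Hypothetical conditional distribution p(t | x) at one fixed input x.
p_t = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}

cond_mean = sum(p * t for t, p in p_t.items())
cond_var = sum(p * (t - cond_mean) ** 2 for t, p in p_t.items())


def expected_sq_loss(y):
    # E_t[(y - t)^2 | x] computed directly from the distribution.
    return sum(p * (y - t) ** 2 for t, p in p_t.items())


def decomposed(y):
    # The same quantity written as (y - E[t|x])^2 + var[t|x].
    return (y - cond_mean) ** 2 + cond_var


# Search a grid of candidate predictions for the smallest expected loss.
candidates = [i / 10 for i in range(0, 41)]
best_y = min(candidates, key=expected_sq_loss)
```

The grid search should land on the conditional mean, and the loss remaining at that point is the noise term $\mathrm{var}[t | x]$, which no choice of $y$ can remove.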
1.6 Information Theory
The amount of information can be viewed as the ‘degree of surprise’ on learning the value of $x$. The amount should also be non-negative:
$$h(x) = -\log_{2} p(x)$$
And the entropy can be seen as the average amount of information:
$$H[x] = -\sum_{x}p(x)\log_{2}p(x)$$
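The entropy formula is straightforward to evaluate for a discrete distribution. In the sketch below the function name and the two example distributions are illustrative, and terms with $p(x) = 0$ are skipped under the usual convention that $0 \log 0 = 0$.

```python
import math


def entropy_bits(probs):
    # H[x] = -sum_x p(x) log2 p(x), with 0 * log 0 taken as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = entropy_bits([0.25] * 4)                  # four equally likely states
skewed = entropy_bits([0.5, 0.25, 0.125, 0.125])    # same states, uneven probabilities
```

The uniform distribution has the larger entropy of the two, which matches the reading of entropy as average surprise.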
1.6.1 Relative entropy and mutual information
Suppose that we are modeling an unknown distribution $p(x)$ using $q(x)$. Then the average additional amount of information required to specify $x$ as a result of using $q(x)$ instead of the true $p(x)$ is given by:
$$KL(p \| q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx$$
which is called the relative entropy or KL divergence.
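For discrete distributions the integral becomes a sum, which gives a simple way to explore the properties of the KL divergence. The sketch below uses that discrete form as an assumption of the example; the two distributions and the function name are hypothetical.

```python
import math


def kl_divergence(p, q):
    # KL(p || q) = -sum_x p(x) ln(q(x) / p(x)) for discrete distributions.
    # Terms with p(x) = 0 contribute nothing and are skipped.
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

The divergence is non-negative, it is zero when the two distributions are equal, and it is not symmetric, so `kl_pq` and `kl_qp` generally differ.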
If we have observed a finite set of training points $x_n$, $n = 1, \dots, N$, drawn from $p(x)$, and $q(x | \theta)$ is a parametric model, then the expectation with respect to $p(x)$ can be approximated by a finite sum, so that:
$$KL(p \| q) \simeq \frac{1}{N}\sum_{n=1}^{N}\{-\ln q(x_{n} | \theta) + \ln p(x_{n}) \}$$
Since the second term is independent of $\theta$, this shows that minimizing the KL divergence with respect to $\theta$ is equivalent to maximizing the likelihood function.
We can also use mutual information to judge how close two variables $x$ and $y$ are to being independent:
$$I[x, y] = KL( p(x, y) \| p(x)p(y) ) = -\int\int p(x,y)\ln \left(\frac{p(x)p(y)}{p(x,y)}\right)dx\,dy$$
Mutual information is related to conditional entropy by:
$$I[x,y] = H[x] - H[x | y] = H[y] - H[y | x]$$
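The two expressions for mutual information can be compared numerically. The sketch below uses a hypothetical joint distribution over two binary variables, sums in place of integrals, and the chain rule $H[x | y] = H[x, y] - H[y]$ to obtain the conditional entropy; all names and values are assumptions of this example.

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]
py = [sum(joint[i][j] for i in range(2)) for j in range(2)]


def entropy(probs):
    # Entropy in nats, skipping zero-probability terms.
    return -sum(p * math.log(p) for p in probs if p > 0)


# I[x, y] as the KL divergence between p(x, y) and p(x) p(y).
mi_kl = sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
            for i in range(2) for j in range(2) if joint[i][j] > 0)

# I[x, y] = H[x] - H[x | y], with H[x | y] = H[x, y] - H[y].
h_joint = entropy([p for row in joint for p in row])
h_x_given_y = h_joint - entropy(py)
mi_entropy = entropy(px) - h_x_given_y
```

Both values should agree and be strictly positive here, because the two variables in this table are dependent; for an independent table they would both be zero.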