PRML Chapter 3 Linear Models for Regression
3.1 Linear Basis Function Models
A linear basis function model for regression takes the form:
$$y(x, w) = w_{0} + \sum_{j=1}^{M-1}w_{j}\phi_{j}(x)$$
If we define the dummy basis function $\phi_{0}(x) = 1$, the bias $w_0$ is absorbed into the sum, which now starts at $j = 0$:
$$y(x, w) = \sum_{j=0}^{M-1}w_{j}\phi_{j}(x) = w^{T}\phi(x)$$
There are many possible choices for the basis functions:
- powers of $x$: $\phi_{j}(x)=x^{j}$
- Gaussian: $\phi_{j}(x) = \exp \left\{ -\frac{(x-\mu_{j})^{2}}{2s^{2}} \right\}$
- sigmoid: $\phi_{j}(x) = \sigma\left(\frac{x-\mu_{j}}{s} \right)$
- logistic sigmoid (the $\sigma$ used above): $\sigma(a) = \frac{1}{1 + \exp(-a)}$
- tanh (related to the logistic sigmoid): $\tanh(a) = 2\sigma(2a) - 1$
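The basis functions listed above can be sketched in NumPy as follows. This is a minimal illustration, not code from the book: the function names, the centres `mu`, and the width `s` are assumptions chosen for the example, and each function prepends the constant $\phi_0(x) = 1$.

```python
import numpy as np

def polynomial_basis(x, M):
    """Powers of x: phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 is the bias)."""
    return np.power(x[:, None], np.arange(M)[None, :])

def gaussian_basis(x, mu, s):
    """Gaussian bumps centred at mu_j with width s, plus a constant phi_0 = 1."""
    phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

def sigmoid_basis(x, mu, s):
    """Logistic sigmoids sigma((x - mu_j) / s), plus a constant phi_0 = 1."""
    a = (x[:, None] - mu[None, :]) / s
    return np.hstack([np.ones((x.shape[0], 1)), 1.0 / (1.0 + np.exp(-a))])

x = np.linspace(0.0, 1.0, 5)
mu = np.linspace(0.0, 1.0, 3)           # hypothetical basis centres
Phi_poly = polynomial_basis(x, 4)       # shape (5, 4)
Phi_gauss = gaussian_basis(x, mu, 0.2)  # shape (5, 4)
Phi_sig = sigmoid_basis(x, mu, 0.2)     # shape (5, 4)
```

Each row of the returned array is $\phi(x_n)^T$, so the arrays can be used directly as design matrices.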
3.1.1 Maximum likelihood and least squares
We assume that the target variable $t$ is given by a deterministic function $y(x,w)$ with additive Gaussian noise:
$$t = y(x,w) + \epsilon$$
where $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$. So, we have
$$p(t | x, w, \beta) = N(t | y(x,w), \beta^{-1})$$
For this Gaussian conditional distribution, the conditional mean is
$$E[t|x] = \int tp(t | x)dt = y(x, w)$$
Assuming that the data points are drawn independently, the likelihood function and the log likelihood function are:
$$p(t|X, w, \beta) = \prod_{n=1}^{N}N(t_{n} | w^{T}\phi(x_{n}), \beta^{-1})$$
$$\ln p(t | w, \beta) = \sum_{n=1}^{N} \ln N(t_{n} | w^{T}\phi(x_{n}), \beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_{D}(w)$$
where the sum-of-squares error is $E_{D}(w) = \frac{1}{2}\sum_{n=1}^{N}\{ t_{n} - w^{T}\phi(x_{n}) \}^{2}$.
Maximizing the log likelihood with respect to $w$ gives the MLE solution:
$$w_{ML} = (\Phi^{T}\Phi)^{-1}\Phi^{T}t$$
These are known as the normal equations for the least squares problem. $\Phi$ is the $N \times M$ design matrix:
$$\Phi = \begin{pmatrix} \phi_{0}(x_{1}) & \phi_{1}(x_{1}) & \dots & \phi_{M-1}(x_{1}) \\ \phi_{0}(x_{2}) & \phi_{1}(x_{2}) & \dots & \phi_{M-1}(x_{2}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{0}(x_{N}) & \phi_{1}(x_{N}) & \dots & \phi_{M-1}(x_{N}) \end{pmatrix}$$
and $\Phi^{\dagger} \equiv (\Phi^{T}\Phi)^{-1}\Phi^{T}$ is known as the Moore-Penrose pseudo-inverse of the matrix $\Phi$.
If we make the bias parameter explicit, then the error function above becomes:
$$E_{D}(w) = \frac{1}{2}\sum_{n=1}^{N}\{ t_{n} - w_{0} - \sum_{j=1}^{M-1}w_{j}\phi_{j}(x_{n}) \}^{2}$$
Setting the derivative with respect to $w_0$ equal to zero, we obtain:
$$w_{0} = \bar{t} - \sum_{j=1}^{M-1}w_{j}\bar{\phi}_{j}$$
where $\bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_{n}$ and $\bar{\phi}_{j} = \frac{1}{N}\sum_{n=1}^{N}\phi_{j}(x_{n})$. Thus the bias $w_0$ compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values.
Maximizing with respect to the noise precision $\beta$ gives its MLE solution:
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{ t_{n} - w_{ML}^{T}\phi(x_{n}) \}^{2}$$
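The maximum likelihood solution can be sketched numerically as below. The synthetic data set (a noisy sine curve), the cubic polynomial basis, and all variable names are assumptions made for illustration. The sketch solves the normal equations, compares the result with a least-squares solver, and evaluates $\beta_{ML}$ and the bias identity for $w_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): t = sin(2*pi*x) + Gaussian noise with std 0.2.
N, M = 50, 4
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=N)

# Design matrix with polynomial basis functions phi_j(x) = x**j.
Phi = np.vander(x, M, increasing=True)

# w_ML solves the normal equations (Phi^T Phi) w = Phi^T t.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Same solution from a least-squares solver applied to Phi directly,
# which avoids forming Phi^T Phi explicitly.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1 / beta_ML is the mean squared residual.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)

# The bias satisfies w_0 = mean(t) - sum_j w_j * mean(phi_j).
w0_check = t.mean() - Phi[:, 1:].mean(axis=0) @ w_ml[1:]
```

In practice a least-squares or pseudo-inverse routine is usually preferred over explicitly inverting $\Phi^T\Phi$ when the design matrix is close to rank deficient.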
3.1.2 Geometry of least squares
3.1.3 Sequential learning
Stochastic (sequential) gradient descent updates the parameter vector one data point at a time:
$$w^{(\tau+1)}=w^{(\tau)}-\eta\nabla E_n$$
where $\tau$ is the iteration number, $\eta$ is the learning rate, and $E_n$ is the error contribution of data point $n$. For the sum-of-squares error function this becomes:
$$w^{(\tau+1)}=w^{(\tau)}+\eta(t_n-w^{(\tau)T}\phi_n)\phi_n$$
where $\phi_n = \phi(x_n)$. This is known as the least-mean-squares (LMS) algorithm.
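A minimal sketch of the sequential update on a toy problem follows. The noiseless linear data, the learning rate `eta = 0.1`, and the number of passes are assumptions picked so that the iteration converges; they are not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noiseless data (assumption): t = 1 + 2*x, basis phi(x) = (1, x).
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x

eta = 0.1            # learning rate (hypothetical value, small enough to converge here)
w = np.zeros(2)
for epoch in range(50):
    # Visit the data points one at a time in random order.
    for n in rng.permutation(len(t)):
        phi_n = Phi[n]
        # w <- w + eta * (t_n - w^T phi_n) * phi_n
        w = w + eta * (t[n] - w @ phi_n) * phi_n
```

Because the targets here are an exact linear function of the basis functions, the weights approach the generating values `(1, 2)`. With noisy targets the iterates would instead fluctuate around the least-squares solution unless `eta` is decreased over time.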
3.1.4 Regularized least squares
The total error function to be minimized takes the form:
$$E_D(w)+\lambda E_W(w)$$
where $\lambda$ is the regularization coefficient. With the quadratic regularizer $E_W(w) = \frac{1}{2}w^{T}w$, setting the gradient with respect to $w$ to zero and solving for $w$ as before, we obtain:
$$w=(\lambda I+\Phi^{T}\Phi)^{-1}\Phi^{T}t$$
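The regularized solution can be sketched as below. The helper name `ridge_fit`, the synthetic data, and the two values of $\lambda$ are assumptions for illustration. Note that this simple version also penalizes the bias weight $w_0$.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Return w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=20)
Phi = np.vander(x, 8, increasing=True)   # degree-7 polynomial basis

w_small = ridge_fit(Phi, t, 1e-6)        # almost unregularized
w_large = ridge_fit(Phi, t, 10.0)        # heavily shrunk towards zero
```

Increasing `lam` shrinks the weight vector towards zero, which is why this form of regularization is also called weight decay.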
We can also consider a more general regularized error, where $q = 2$ recovers the quadratic regularizer and $q = 1$ is known as the lasso:
$$\frac{1}{2}\sum_{n=1}^{N}\{ t_{n} - w^{T}\phi(x_{n}) \}^{2} + \frac{\lambda}{2}\sum_{j=1}^{M}|w_{j}|^{q}$$
3.1.5 Multiple outputs
For multiple target variables, we can use the same set of basis functions to model all of the components of the target vector:
$$y(x,w) = W^{T}\phi(x)$$
Suppose the conditional distribution of the target vector is an isotropic Gaussian, $p(t | x, W, \beta) = N(t | W^{T}\phi(x), \beta^{-1}I)$. Then the MLE solution is:
$$W_{ML} = (\Phi^{T}\Phi)^{-1}\Phi^{T}T$$
where $T$ is the matrix whose $n$-th row is the target vector $t_n^{T}$.
3.2 The Bias-Variance Decomposition
The conditional expectation can be written as:
$$h(x)=E[t|x]=\int tp(t|x)dt$$
The expected squared loss from Section 1.5.5 is
$$E[L] = \int \{y(x) - h(x)\}^{2}p(x)dx + \int \{h(x)-t\}^2p(x,t)dxdt$$
For a model $y(x;D)$ fitted to a particular data set $D$, the expected squared difference between $y(x;D)$ and the regression function $h(x)$, taken over data sets, can be expressed as the sum of two terms:
$$E_{D}[\{ y(x;D) - h(x) \}^{2} ] = \{ E_{D}[y(x;D)] - h(x) \}^{2} + E_{D}[\{y(x;D) - E_{D}[y(x;D)] \}^{2} ]$$
Generalizing from a single input value $x$ to the whole input distribution $p(x)$, we have:
$$\text{expected loss}=(\text{bias})^2+\text{variance}+\text{noise}$$
where:
$$(\text{bias})^{2} = \int \{E_{D}[y(x;D)] - h(x)\}^{2}p(x) dx$$
$$\text{variance} = \int E_{D}[\{y(x;D) - E_{D}[y(x;D)]\}^{2}]p(x) dx$$
$$\text{noise} = \int\int\{ h(x) - t\}^{2}p(x,t) dxdt$$
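The decomposition can be estimated by simulation. The sketch below is an illustration under stated assumptions: the regression function is taken to be $\sin(2\pi x)$, the inputs are uniform on $[0, 1]$, the model is a regularized polynomial fit, and expectations over $D$ are replaced by averages over `L` independently drawn data sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    """Regression function h(x) = E[t | x] (assumed known for this simulation)."""
    return np.sin(2.0 * np.pi * x)

L, N, M, lam, noise_std = 100, 25, 10, 1e-3, 0.3   # hypothetical settings
x_test = np.linspace(0.0, 1.0, 200)
Phi_test = np.vander(x_test, M, increasing=True)

preds = np.empty((L, x_test.size))
for l in range(L):
    # Draw an independent data set D and fit a regularized model y(x; D).
    x = rng.uniform(0.0, 1.0, size=N)
    t = h(x) + rng.normal(0.0, noise_std, size=N)
    Phi = np.vander(x, M, increasing=True)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = Phi_test @ w

mean_pred = preds.mean(axis=0)                    # approximates E_D[y(x; D)]
bias_sq = np.mean((mean_pred - h(x_test)) ** 2)   # averaged over the x grid
variance = np.mean(preds.var(axis=0))             # averaged over the x grid
noise = noise_std ** 2                            # E[(h(x) - t)^2] for this noise model
expected_loss = bias_sq + variance + noise
```

Rerunning with a larger `lam` typically raises `bias_sq` and lowers `variance`, which is the trade-off the decomposition describes.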
3.3 Bayesian Linear Regression
3.3.1 Parameter distribution
First we introduce the conjugate prior:
$$p(w) = N(w | m_{0}, S_{0})$$
It is easy to calculate the posterior distribution:
$$p(w|t) = N(w | m_{N}, S_{N})$$
where
$$m_{N} = S_{N}(S_{0}^{-1}m_{0} + \beta\Phi^{T}t)$$
and
$$S_{N}^{-1} = S_{0}^{-1} + \beta\Phi^{T}\Phi$$
For the remainder of this chapter, we consider a zero-mean isotropic Gaussian prior in order to simplify the treatment:
$$p(w | \alpha) = N(w | 0, \alpha^{-1}I)$$
The parameters of the posterior are then $m_{N} = \beta S_{N}\Phi^{T}t$ and $S_{N}^{-1} = \alpha I + \beta \Phi^{T}\Phi$.
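The posterior update can be sketched as follows. The function name `update`, the synthetic straight-line data, and the values of `alpha` and `beta` are assumptions for illustration. The sketch also shows that feeding the data in one point at a time, with each posterior serving as the next prior, gives the same result as a single batch update.

```python
import numpy as np

def update(m0, S0, Phi, t, beta):
    """One Bayesian update: prior N(w | m0, S0) -> posterior N(w | mN, SN)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # assumed hyperparameter values
x = rng.uniform(-1.0, 1.0, size=20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=20)
Phi = np.column_stack([np.ones_like(x), x])

m0, S0 = np.zeros(2), np.eye(2) / alpha      # zero-mean isotropic prior

# Batch posterior from all 20 points at once.
m_batch, S_batch = update(m0, S0, Phi, t, beta)

# Sequential posterior: each posterior becomes the prior for the next point.
m_seq, S_seq = m0, S0
for n in range(len(t)):
    m_seq, S_seq = update(m_seq, S_seq, Phi[n:n + 1], t[n:n + 1], beta)
```

For the isotropic prior, `m_batch` equals $\beta S_N \Phi^T t$ and the inverse of `S_batch` equals $\alpha I + \beta\Phi^T\Phi$, matching the simplified expressions above.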
3.3.2 Predictive distribution
The predictive distribution for a new target $t$, given the training targets $\mathbf{t}$, is defined by
$$p(t | \mathbf{t}, \alpha, \beta) = \int p(t | w, \beta)p(w | \mathbf{t}, \alpha, \beta) dw$$
This is the convolution of two Gaussian distributions, so the predictive distribution takes the form:
$$p(t | x, \mathbf{t}, \alpha, \beta) = N(t | m_{N}^{T}\phi(x), \sigma_{N}^{2}(x))$$
where
$$\sigma_{N}^{2}(x) = \frac{1}{\beta} + \phi(x)^{T}S_{N}\phi(x)$$
The first term is the noise on the data and the second reflects the uncertainty in $w$.
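A sketch of the predictive mean and variance on a grid of new inputs follows. The Gaussian basis, its centres and width, the synthetic data, and the hyperparameter values are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                       # assumed hyperparameter values
CENTRES = np.linspace(0.0, 1.0, 9)            # hypothetical basis centres

def design(x, s=0.1):
    """Gaussian basis functions plus a constant phi_0 = 1."""
    phi = np.exp(-(x[:, None] - CENTRES[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=10)
Phi = design(x)

# Posterior for the zero-mean isotropic prior.
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(0.0, 1.0, 50)
Phi_new = design(x_new)
pred_mean = Phi_new @ m_N
# sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x), evaluated row by row.
pred_var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)
```

Since $S_N$ is positive definite, `pred_var` never falls below the noise floor `1 / beta`.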
3.3.3 Equivalent kernel
We can rewrite the predictive mean as:
$$y(x, m_{N}) = m_{N}^{T}\phi(x) = \beta\phi(x)^{T}S_{N}\Phi^{T}t = \sum_{n=1}^{N}\beta\phi(x)^{T}S_{N}\phi(x_{n})t_{n}$$
Thus the mean of the predictive distribution at a point $x$ is given by a linear combination of the training set target variables $t_n$:
$$y(x,m_N)=\sum_{n=1}^N k(x,x_n)t_n$$
where the function $k(x, x') = \beta\phi(x)^{T}S_{N}\phi(x')$ is known as the smoother matrix or the equivalent kernel.
The equivalent kernel is directly related to the covariance between the predictions at two inputs $x$ and $x'$:
$$cov[y(x),y(x')]=cov[\phi(x)^Tw,w^T\phi(x')]=\phi(x)^TS_N\phi(x')=\beta^{-1}k(x,x')$$
Also, for all values of $x$, it satisfies:
$$\sum_{n=1}^N k(x,x_n)=1$$
It can also be expressed as an inner product:
$$k(x,z)=\psi(x)^T\psi(z)$$
where $\psi(x)=\beta^{1/2}S_N^{1/2}\phi(x)$.
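The equivalent kernel can be computed explicitly, as in the sketch below. The basis, data, and hyperparameter values are assumptions for illustration. The sketch checks two of the statements above numerically: the kernel-weighted sum of targets reproduces $m_N^T\phi(x)$, and the kernel factorizes as $\psi(x)^T\psi(z)$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                       # assumed hyperparameter values
CENTRES = np.linspace(0.0, 1.0, 9)            # hypothetical basis centres

def design(x, s=0.1):
    """Gaussian basis functions plus a constant phi_0 = 1."""
    phi = np.exp(-(x[:, None] - CENTRES[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=30)
Phi = design(x)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(0.0, 1.0, 5)
Phi_new = design(x_new)

# K[i, n] = k(x_new_i, x_n) = beta * phi(x_new_i)^T S_N phi(x_n)
K = beta * Phi_new @ S_N @ Phi.T
mean_from_kernel = K @ t                      # sum_n k(x, x_n) t_n
mean_from_weights = Phi_new @ m_N             # m_N^T phi(x)

# Inner-product form: psi(x) = beta^{1/2} S_N^{1/2} phi(x), using the symmetric
# square root of S_N built from its eigendecomposition.
evals, evecs = np.linalg.eigh(S_N)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
Psi_new = np.sqrt(beta) * Phi_new @ S_half    # rows are psi(x_new_i)^T
Psi = np.sqrt(beta) * Phi @ S_half            # rows are psi(x_n)^T
```

Here `Psi_new @ Psi.T` reproduces `K`, so predictions could be made from the kernel alone without referring to the weights.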
3.4 Bayesian Model Comparison
Suppose we wish to compare a set of $L$ models $\{M_i\}$, where a model refers to a probability distribution over the observed data $D$. Given a training set $D$, we then wish to evaluate the posterior distribution:
$$p(M_{i} | D) \propto p(M_{i})p(D | M_{i})$$
Here $p(M_i)$ is the prior over models and $p(D | M_{i})$ is the model evidence, also called the marginal likelihood. The ratio of model evidences $\frac{p(D|M_i)}{p(D | M_j)}$ for two models is known as a Bayes factor.
The model evidence can be calculated through
$$p( D | M_{i} ) = \int p(D|w, M_{i})p(w|M_{i})dw$$
Consider a single parameter $w$ and assume that the posterior is sharply peaked around $w_{MAP}$ with width $\Delta w_{posterior}$, and that the prior is flat with width $\Delta w_{prior}$, so that $p(w) = \frac{1}{\Delta w_{prior}}$. Then we have:
$$p(D) = \int p(D |w)p(w)dw \simeq p(D|w_{MAP})\frac{\Delta w_{posterior}}{\Delta w_{prior}}$$
Taking the logarithm:
$$\ln p(D) \simeq \ln p(D | w_{MAP}) + \ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right)$$
For a model with $M$ parameters, assuming every parameter has the same ratio $\frac{\Delta w_{posterior}}{\Delta w_{prior}}$, we have:
$$\ln p(D) \simeq \ln p(D|w_{MAP}) + M\ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right)$$
Since $\Delta w_{posterior} < \Delta w_{prior}$, the second term is negative and penalizes model complexity.
3.5 The Evidence Approximation
If we introduce hyperpriors over $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $w,\alpha,\beta$:
$$p(t | \mathbf{t}) = \int\int\int p(t | w, \beta)p(w | \mathbf{t},\alpha,\beta)p(\alpha,\beta | \mathbf{t}) dw d\alpha d\beta$$
If the posterior distribution $p(\alpha,\beta | \mathbf{t})$ is sharply peaked around $\hat{\alpha}$ and $\hat{\beta}$, then the predictive distribution is obtained simply by marginalizing over $w$ with the hyperparameters fixed at those values:
$$p(t | \mathbf{t}) \simeq p(t | \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t|w,\hat{\beta})p(w | \mathbf{t}, \hat{\alpha}, \hat{\beta})dw$$
From Bayes' theorem, the posterior distribution for $\alpha$ and $\beta$ is given by:
$$p(\alpha, \beta | \mathbf{t}) \propto p(\mathbf{t} | \alpha, \beta)p(\alpha, \beta)$$
3.5.1 Evaluation of the evidence function
The marginal likelihood function $p(\mathbf{t} | \alpha, \beta)$ is obtained by integrating over the weight parameters $w$, so that
$$p(\mathbf{t} | \alpha, \beta) = \int p(\mathbf{t} | w, \beta)p(w| \alpha) dw$$
By completing the square in the exponent and making use of the standard form for the normalization coefficient of a Gaussian, we can get the log of the marginal likelihood in the form:
$$\ln p(\mathbf{t} | \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(m_{N}) - \frac{1}{2}\ln|A| - \frac{N}{2}\ln(2\pi)$$
where $A = \alpha I + \beta\Phi^{T}\Phi$ and $E(m_{N}) = \frac{\beta}{2}\|\mathbf{t} - \Phi m_{N}\|^{2} + \frac{\alpha}{2}m_{N}^{T}m_{N}$.
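The log evidence can be evaluated directly from this expression. In the sketch below the function name `log_evidence`, the synthetic data, and the hyperparameter values are assumptions. It takes $A = \alpha I + \beta\Phi^T\Phi$ and $E(m_N) = \frac{\beta}{2}\|\mathbf{t} - \Phi m_N\|^2 + \frac{\alpha}{2}m_N^T m_N$, and uses `slogdet` so that $\ln|A|$ is computed without overflow.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the zero-mean isotropic Gaussian prior."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * (m_N @ m_N)
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=30)

# Evidence for polynomial models of increasing order (hypothetical comparison).
alpha, beta = 5e-3, 25.0
evidence_by_M = {M: log_evidence(np.vander(x, M, increasing=True), t, alpha, beta)
                 for M in range(1, 9)}
```

Comparing the values in `evidence_by_M` is one way to choose the model order using only the training data.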
3.5.2 Maximizing the evidence function
We first define the eigenvector equation
$$(\beta\Phi^{T}\Phi)u_{i} = \lambda_{i}u_{i}$$
It then follows that $A = \alpha I + \beta\Phi^{T}\Phi$ has eigenvalues $\alpha+\lambda_i$. Now consider the derivative of the term involving $\ln|A|$ with respect to $\alpha$:
$$\frac{d}{d\alpha}\ln|A| = \frac{d}{d\alpha}\ln\prod_{i}(\lambda_{i} + \alpha) = \frac{d}{d\alpha}\sum_{i}\ln(\lambda_{i} + \alpha) = \sum_{i}\frac{1}{\lambda_{i} + \alpha}$$
Setting the derivative of $\ln p(\mathbf{t} | \alpha, \beta)$ with respect to $\alpha$ to zero gives
$$0 = \frac{M}{2\alpha} - \frac{1}{2}m_{N}^{T}m_{N} - \frac{1}{2}\sum_{i}\frac{1}{\lambda_{i} + \alpha}$$
Multiplying through by $2\alpha$ and rearranging, we have:
$$\alpha m_{N}^{T}m_{N} = M - \alpha\sum_{i}\frac{1}{\lambda_{i} + \alpha} = \gamma$$
Since the sum runs over $M$ terms, it follows that
$$\gamma = \sum_{i} \frac{\lambda_{i}}{\alpha + \lambda_{i}} \quad \text{and} \quad \alpha = \frac{\gamma}{m_{N}^{T}m_{N}}$$
This is an implicit solution, because $\gamma$ and $m_N$ both depend on $\alpha$, so it is solved by iteration. As for $\beta$, we can follow the same idea to get
$$\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\{ t_{n}-m_{N}^{T} \phi(x_{n})\}^{2}$$
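The re-estimation equations for $\alpha$ and $\beta$ can be iterated as a fixed-point scheme. The sketch below is one way to do this under stated assumptions: the synthetic data, the polynomial basis, the starting values, and the stopping rule are all illustrative choices, and convergence within the iteration limit is not guaranteed in general.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 6
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=N)
Phi = np.vander(x, M, increasing=True)

alpha, beta = 1.0, 1.0                        # arbitrary starting values
eig0 = np.linalg.eigvalsh(Phi.T @ Phi)        # eigenvalues of Phi^T Phi
for _ in range(200):
    # Posterior mean for the current hyperparameters.
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)

    # lambda_i are eigenvalues of beta * Phi^T Phi, so they scale with beta.
    lam = beta * eig0
    gamma = np.sum(lam / (alpha + lam))

    # Re-estimate alpha and beta from gamma and m_N.
    alpha_new = gamma / (m_N @ m_N)
    beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)

    converged = (abs(alpha_new - alpha) < 1e-8 * alpha
                 and abs(beta_new - beta) < 1e-8 * beta)
    alpha, beta = alpha_new, beta_new
    if converged:
        break
```

The quantity `gamma` lies between 0 and `M`, and `1 / beta` can be read as an estimate of the noise variance.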
3.5.3 Effective number of parameters
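When the number of data points is much larger than the number of parameters ($N \gg M$), all of the parameters are well determined by the data and $\gamma \simeq M$. The re-estimation equations then simplify to:
$$\alpha=\frac{M}{2E_W(m_N)}$$
$$\beta=\frac{N}{2E_D(m_N)}$$
where $E_W(m_N) = \frac{1}{2}m_N^{T}m_N$ and $E_D(m_N) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - m_N^{T}\phi(x_n)\}^{2}$.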