Pattern Recognition | PRML Chapter 3 Linear Models for Regression

PRML Chapter 3 Linear Models for Regression

3.1 Linear Basis Function Models

A linear regression model built from fixed basis functions takes the form:

$$y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(x)$$

and when we define $\phi_0(x) = 1$, we have:

$$y(x, w) = \sum_{j=0}^{M-1} w_j\phi_j(x) = w^T\phi(x)$$

There are many possible choices for the basis functions:

  • powers of $x$: $\phi_j(x) = x^j$
  • Gaussian: $\phi_j(x) = \exp\left\{ -\frac{(x-\mu_j)^2}{2s^2} \right\}$
  • sigmoidal: $\phi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right)$
  • logistic sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  • tanh: $\tanh(a) = 2\sigma(2a) - 1$
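
Below is a minimal NumPy sketch (not from the book) of how these basis functions might be implemented; the centres $\mu_j$ and width $s$ are arbitrary choices.

```python
# Sketch of the basis-function choices listed above; mu_j (centres) and s are arbitrary.
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 0, ..., M-1 (phi_0 = 1 is the bias term)."""
    return np.stack([x**j for j in range(M)], axis=-1)

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, centres, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    a = (x[:, None] - centres[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)
print(polynomial_basis(x, 3).shape)                        # (5, 3)
print(gaussian_basis(x, np.linspace(-1, 1, 4), 0.3).shape) # (5, 4)
```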

3.1.1 Maximum likelihood and least squares

We assume that the target variable $t$ is given by a deterministic function $y(x, w)$ with additive Gaussian noise:

$$t = y(x, w) + \epsilon$$

So, we have

$$p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1})$$

In the case of this Gaussian conditional distribution, the conditional mean will be

$$E[t|x] = \int t\, p(t|x)\, dt = y(x, w)$$

Assuming the data points are drawn independently, the likelihood and log-likelihood functions are:

$$p(t | X, w, \beta) = \prod_{n=1}^{N} N(t_n | w^T\phi(x_n), \beta^{-1})$$

$$\ln p(t | w, \beta) = \sum_{n=1}^{N} \ln N(t_n | w^T\phi(x_n), \beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(w)$$

where the sum-of-squares error is $E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\{ t_n - w^T\phi(x_n) \}^2$.

The MLE solution is:

$$w_{ML} = (\Phi^T\Phi)^{-1}\Phi^T t$$

which is known as the normal equations for the least squares problem. Here $\Phi$ is the design matrix:

$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix}$$

and $\Phi^{\dagger} \equiv (\Phi^T\Phi)^{-1}\Phi^T$ is known as the Moore-Penrose pseudo-inverse of the matrix $\Phi$.

If we make the bias parameter explicit, then the error function above becomes:

$$E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\Big\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j\phi_j(x_n) \Big\}^2$$

Setting the derivative with respect to $w_0$ equal to zero, we obtain:

$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\bar{\phi}_j$$

where $\bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n$ and $\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(x_n)$; thus the bias $w_0$ compensates for the difference between the averages of the target values and the weighted sum of the averages of the basis function values.

We can also maximize the log likelihood with respect to the noise precision $\beta$, giving:

$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{ t_n - w_{ML}^T\phi(x_n) \}^2$$
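
As a rough illustration (with synthetic sinusoidal data and Gaussian basis functions, both my own choices rather than anything from the text), the maximum likelihood solution and the noise precision estimate can be computed as:

```python
# Sketch: build the design matrix Phi, solve the normal equations via the
# Moore-Penrose pseudo-inverse, and estimate the noise precision beta_ML.
import numpy as np

rng = np.random.default_rng(0)
N, M, s = 25, 9, 0.25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)         # noisy targets

centres = np.linspace(0, 1, M - 1)
Phi = np.column_stack([np.ones(N),                        # phi_0(x) = 1
                       np.exp(-(x[:, None] - centres)**2 / (2 * s**2))])

w_ml = np.linalg.pinv(Phi) @ t                            # (Phi^T Phi)^{-1} Phi^T t
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)                     # 1/beta_ML = mean squared residual
print(w_ml.shape, beta_ml)
```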

3.1.2 Geometry of least squares

3.1.3 Sequential learning

The stochastic gradient descent algorithm updates the parameter vector as:

$$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_n$$

For the case of the sum-of-squares error function:

$$w^{(\tau+1)} = w^{(\tau)} + \eta\left(t_n - w^{(\tau)T}\phi_n\right)\phi_n$$
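
A minimal sketch of this update rule (least-mean-squares), assuming a hand-picked learning rate $\eta$ and a synthetic polynomial data set:

```python
# Sequential (stochastic gradient) learning for the sum-of-squares error.
import numpy as np

def lms_update(w, phi_n, t_n, eta):
    """One LMS step: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    return w + eta * (t_n - w @ phi_n) * phi_n

rng = np.random.default_rng(1)
N, M = 50, 4
x = rng.uniform(-1, 1, N)
Phi = np.column_stack([x**j for j in range(M)])            # polynomial basis, phi_0 = 1
w_true = np.array([0.5, -1.0, 2.0, 0.3])
t = Phi @ w_true + rng.normal(0, 0.1, N)

w = np.zeros(M)
for epoch in range(200):
    for n in range(N):                                     # present one data point at a time
        w = lms_update(w, Phi[n], t[n], eta=0.05)
print(w)                                                   # should approach w_true
```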

3.1.4 Regularized least squares

The total error function to be minimized takes the form:

$$E_D(w) + \lambda E_W(w)$$

Solving for $w$ as before, we obtain:

$$w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t$$

We can also consider a more general regularized error:

$$\frac{1}{2}\sum_{n=1}^{N}\{ t_n - w^T\phi(x_n) \}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$$
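
For the quadratic regularizer ($q = 2$, i.e. ridge regression) the closed-form solution above can be sketched as follows; the data set and the values of $\lambda$ are arbitrary.

```python
# Regularized least squares (q = 2): w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
import numpy as np

def ridge_fit(Phi, t, lam):
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 30)
t = np.sin(np.pi * x) + rng.normal(0, 0.1, 30)
Phi = np.column_stack([x**j for j in range(10)])           # 9th-order polynomial basis

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(Phi, t, lam)
    print(lam, np.linalg.norm(w))                          # larger lambda shrinks the weights
```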

3.1.5 Multiple outputs

For multiple target variables, we use the same set of basis functions to model all of the components of the target vector:

$$y(x, w) = W^T\phi(x)$$

Suppose the conditional distribution of the target vector is an isotropic Gaussian, $p(t | x, W, \beta) = N(t | W^T\phi(x), \beta^{-1}I)$; then the maximum likelihood solution is:

$$W_{ML} = (\Phi^T\Phi)^{-1}\Phi^T T$$
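
Since the pseudo-inverse is shared across all outputs, the multi-output solution is a single solve against the $N \times K$ target matrix $T$; a small sketch with random data:

```python
# W_ML = (Phi^T Phi)^{-1} Phi^T T for K target variables at once.
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 40, 5, 3
Phi = rng.normal(size=(N, M))                              # any design matrix
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + rng.normal(0, 0.05, size=(N, K))

W_ml = np.linalg.pinv(Phi) @ T                             # shape (M, K)
print(np.abs(W_ml - W_true).max())                         # small recovery error
```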

3.2 The Bias-Variance Decomposition

The optimal prediction under squared loss is given by the conditional expectation:

$$h(x) = E[t|x] = \int t\, p(t|x)\, dt$$

From Section 1.5.5, the expected squared loss can be written as

$$E[L] = \int \{y(x) - h(x)\}^2 p(x)\, dx + \int\int \{h(x) - t\}^2 p(x, t)\, dx\, dt$$

The expected squared difference between $y(x; D)$ and the regression function $h(x)$ can be expressed as the sum of two terms:

$$E_D[\{ y(x;D) - h(x) \}^2] = \{ E_D[y(x;D)] - h(x) \}^2 + E_D[\{ y(x;D) - E_D[y(x;D)] \}^2]$$

Integrating over the input distribution and adding back the intrinsic noise term, we obtain the decomposition:

$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}$$

where:

$$(\text{bias})^2 = \int \{E_D[y(x;D)] - h(x)\}^2 p(x)\, dx$$

$$\text{variance} = \int E_D[\{y(x;D) - E_D[y(x;D)]\}^2] p(x)\, dx$$

$$\text{noise} = \int\int \{h(x) - t\}^2 p(x, t)\, dx\, dt$$
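
The decomposition can be illustrated numerically in the spirit of Figure 3.5 of PRML: repeatedly draw data sets from $h(x) = \sin(2\pi x)$ plus noise, fit a regularized Gaussian-basis model to each, and estimate the bias and variance terms by averaging over the ensemble. The settings below (number of data sets, basis width, noise level) are my own choices.

```python
# Monte Carlo estimate of (bias)^2 and variance for different regularization strengths.
import numpy as np

rng = np.random.default_rng(4)
L, N, M, s, noise_sd = 100, 25, 24, 0.1, 0.3
x_grid = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_grid)                             # the true regression function
centres = np.linspace(0, 1, M)

def design(x):
    return np.exp(-(x[:, None] - centres)**2 / (2 * s**2))

def fit_predict(lam):
    preds = np.empty((L, x_grid.size))
    for l in range(L):                                     # L independent data sets
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = design(x_grid) @ w
    return preds

for lam in (1e-3, 1.0, 100.0):
    y = fit_predict(lam)
    bias2 = np.mean((y.mean(axis=0) - h)**2)               # (bias)^2 averaged over x
    variance = np.mean(y.var(axis=0))                      # variance averaged over x
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={variance:.3f}")
```

Increasing $\lambda$ should increase the bias term and decrease the variance term, which is exactly the trade-off the decomposition describes.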

3.3 Bayesian Linear Regression

3.3.1 Parameter distribution

First we introduce the conjugate prior:

$$p(w) = N(w | m_0, S_0)$$

Since this prior is conjugate to the Gaussian likelihood, the posterior is also Gaussian:

$$p(w|t) = N(w | m_N, S_N)$$

where

$$m_N = S_N(S_0^{-1}m_0 + \beta\Phi^T t), \qquad S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi$$

For the remainder of this chapter, we consider a zero-mean isotropic Gaussian prior in order to simplify the treatment.

$$p(w | \alpha) = N(w | 0, \alpha^{-1}I)$$

and the parameters in the posterior become $m_N = \beta S_N\Phi^T t$ and $S_N^{-1} = \alpha I + \beta\Phi^T\Phi$.
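
A short sketch of this posterior update, assuming a polynomial basis and hand-picked values of $\alpha$ and $\beta$:

```python
# Posterior over w for the zero-mean isotropic prior:
# S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t.
import numpy as np

def posterior(Phi, t, alpha, beta):
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)
Phi = np.column_stack([x**j for j in range(6)])            # polynomial basis
m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
print(m_N.shape, S_N.shape)                                # (6,), (6, 6)
```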

3.3.2 Predictive distribution

The predictive distribution is defined by

$$p(t | \mathbf{t}, \alpha, \beta) = \int p(t | w, \beta) p(w | \mathbf{t}, \alpha, \beta)\, dw$$

This is the convolution of two Gaussian distributions, so the predictive distribution takes the form:

$$p(t | x, \mathbf{t}, \alpha, \beta) = N(t | m_N^T\phi(x), \sigma_N^2(x))$$

where

$$\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$$
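
A sketch of the predictive mean and variance for new inputs, reusing the same posterior computation as above (again with arbitrary $\alpha$, $\beta$ and basis):

```python
# Predictive distribution: mean m_N^T phi(x), variance 1/beta + phi(x)^T S_N phi(x).
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    mean = phi_x @ m_N
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_x, S_N, phi_x)
    return mean, var

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)
Phi = np.column_stack([x**j for j in range(6)])
alpha, beta = 2.0, 25.0
S_N = np.linalg.inv(alpha * np.eye(6) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(0, 1, 5)
phi_new = np.column_stack([x_new**j for j in range(6)])
mean, var = predictive(phi_new, m_N, S_N, beta)
print(mean, np.sqrt(var))                                  # predictive mean and standard deviation
```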

3.3.3 Equivalent kernel

The predictive mean can be rewritten as:

$$y(x, m_N) = m_N^T\phi(x) = \beta\phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N}\beta\phi(x)^T S_N \phi(x_n) t_n$$

Thus the mean of the predictive distribution at a point $x$ is given by a linear combination of the training set target variables $t_n$:

$$y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) t_n$$

where the function $k(x, x') = \beta\phi(x)^T S_N \phi(x')$ is known as the smoother matrix or the equivalent kernel.

The equivalent kernel is also related to the covariance between $y(x)$ and $y(x')$:

$$cov[y(x), y(x')] = cov[\phi(x)^T w, w^T\phi(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')$$

Also, for all values of $x$, the kernel satisfies:

$$\sum_{n=1}^{N} k(x, x_n) = 1$$

What’s more, it can be expressed in inner product form:

$$k(x, z) = \psi(x)^T\psi(z)$$

where $\psi(x) = \beta^{1/2} S_N^{1/2}\phi(x)$.
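
A sketch of the equivalent kernel for Gaussian basis functions (with a small $\alpha$ so that the prior is nearly flat); the kernel values at a fixed $x$ should sum to approximately one over the training points:

```python
# Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x').
import numpy as np

rng = np.random.default_rng(7)
N, M, s, alpha, beta = 50, 12, 0.1, 1e-3, 25.0
x_train = rng.uniform(0, 1, N)
centres = np.linspace(0, 1, M)

def design(x):
    phi = np.exp(-(np.atleast_1d(x)[:, None] - centres)**2 / (2 * s**2))
    return np.column_stack([np.ones(phi.shape[0]), phi])   # include phi_0 = 1

Phi = design(x_train)
S_N = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def kernel(x, x_prime):
    return beta * design(x) @ S_N @ design(x_prime).T

k_row = kernel(0.5, x_train)[0]                            # k(0.5, x_n) for all training points
print(k_row.sum())                                         # approximately 1
```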

3.4 Bayesian Model Comparison

Suppose we wish to compare a set of $L$ models $\{M_i\}$, where a model refers to a probability distribution over the observed data $D$. Given a training set $D$, we then wish to evaluate the posterior distribution:

$$p(M_i | D) \propto p(M_i) p(D | M_i)$$

Here $p(D | M_i)$ is the model evidence, and the ratio of model evidences $\frac{p(D | M_i)}{p(D | M_j)}$ for two models is known as a Bayes factor.

The model evidence can be calculated through

$$p(D | M_i) = \int p(D | w, M_i) p(w | M_i)\, dw$$

Assume that the prior is flat with width $\Delta w_{prior}$, so that $p(w) = \frac{1}{\Delta w_{prior}}$, and that the posterior is sharply peaked around $w_{MAP}$ with width $\Delta w_{posterior}$. Then:

$$p(D) = \int p(D | w) p(w)\, dw \simeq p(D | w_{MAP})\frac{\Delta w_{posterior}}{\Delta w_{prior}}$$

Taking the logarithm gives:

$$\ln p(D) \simeq \ln p(D | w_{MAP}) + \ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right)$$

For a model with $M$ parameters, all assumed to have the same ratio $\frac{\Delta w_{posterior}}{\Delta w_{prior}}$, we have:

$$\ln p(D) \simeq \ln p(D | w_{MAP}) + M\ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right)$$

3.5 The Evidence Approximation

If we introduce hyperpriors over $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $w$, $\alpha$ and $\beta$:

$$p(t | \mathbf{t}) = \int\int\int p(t | w, \beta) p(w | \mathbf{t}, \alpha, \beta) p(\alpha, \beta | \mathbf{t})\, dw\, d\alpha\, d\beta$$

If the posterior distribution $p(\alpha, \beta | \mathbf{t})$ is sharply peaked around $\hat{\alpha}$ and $\hat{\beta}$, then the predictive distribution is obtained simply by marginalizing over $w$:

$$p(t | \mathbf{t}) \simeq p(t | \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t | w, \hat{\beta}) p(w | \mathbf{t}, \hat{\alpha}, \hat{\beta})\, dw$$

From Bayes' theorem, the posterior distribution for $\alpha$ and $\beta$ is given by:

$$p(\alpha, \beta | \mathbf{t}) \propto p(\mathbf{t} | \alpha, \beta) p(\alpha, \beta)$$

3.5.1 Evaluation of the evidence function

The marginal likelihood function $p(\mathbf{t} | \alpha, \beta)$ is obtained by integrating over the weight parameters $w$, so that

$$p(\mathbf{t} | \alpha, \beta) = \int p(\mathbf{t} | w, \beta) p(w | \alpha)\, dw$$

By completing the square in the exponent and making use of the standard form for the normalization coefficient of a Gaussian, we can get the log of the marginal likelihood in the form:

$$\ln p(\mathbf{t} | \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(m_N) - \frac{1}{2}\ln|A| - \frac{N}{2}\ln(2\pi)$$

where $A = \alpha I + \beta\Phi^T\Phi = S_N^{-1}$ and $E(m_N) = \frac{\beta}{2}\| \mathbf{t} - \Phi m_N \|^2 + \frac{\alpha}{2} m_N^T m_N$.
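
The log evidence can be evaluated directly from these quantities; below is a sketch that scores polynomial models of increasing order on a synthetic sinusoidal data set (the values of $\alpha$ and $\beta$ are fixed by hand), in the spirit of Figure 3.14 of PRML.

```python
# ln p(t|alpha,beta) = M/2 ln alpha + N/2 ln beta - E(m_N) - 1/2 ln|A| - N/2 ln(2*pi)
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = beta / 2 * np.sum((t - Phi @ m_N)**2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))

rng = np.random.default_rng(8)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
for M in (2, 4, 6, 8, 10):                                 # compare polynomial models
    Phi = np.column_stack([x**j for j in range(M)])
    print(M, round(log_evidence(Phi, t, alpha=5e-3, beta=25.0), 2))
```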

3.5.2 Maximizing the evidence function

Defining the eigenvector equation

$$(\beta\Phi^T\Phi)u_i = \lambda_i u_i$$

It then follows that $A$ has eigenvalues $\alpha + \lambda_i$. Now consider the derivative of the term involving $\ln|A|$ with respect to $\alpha$:

$$\frac{d}{d\alpha}\ln|A| = \frac{d}{d\alpha}\ln\prod_i(\lambda_i + \alpha) = \frac{d}{d\alpha}\sum_i\ln(\lambda_i + \alpha) = \sum_i\frac{1}{\lambda_i + \alpha}$$

Setting the derivative of the log marginal likelihood with respect to $\alpha$ to zero, we obtain:

$$\alpha m_N^T m_N = M - \alpha\sum_i\frac{1}{\lambda_i + \alpha} = \gamma$$

Rearranging, we obtain:

$$\gamma = \sum_i\frac{\lambda_i}{\alpha + \lambda_i} \qquad \text{and} \qquad \alpha = \frac{\gamma}{m_N^T m_N}$$

Following the same approach for $\beta$, we obtain:

$$\frac{1}{\beta} = \frac{1}{N - \gamma}\sum_{n=1}^{N}\{ t_n - m_N^T\phi(x_n) \}^2$$
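
These two fixed-point equations suggest an iterative procedure: start from arbitrary $\alpha$ and $\beta$, compute $m_N$ and $\gamma$, re-estimate $\alpha$ and $\beta$, and repeat until convergence. A sketch, using a Gaussian-basis model on synthetic data of my own choosing:

```python
# Iterative re-estimation of alpha and beta by evidence maximization.
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)                 # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig0                                  # eigenvalues of beta*Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))                # effective number of parameters
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N)**2)
    return alpha, beta, gamma

rng = np.random.default_rng(9)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)
centres = np.linspace(0, 1, 9)
Phi = np.column_stack([np.ones(40),
                       np.exp(-(x[:, None] - centres)**2 / (2 * 0.25**2))])
print(evidence_maximization(Phi, t))                       # (alpha, beta, gamma)
```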

3.5.3 Effective number of parameters

In the limit where the number of data points $N$ is much larger than the number of parameters $M$, all of the parameters are well determined by the data, so $\gamma \simeq M$ and the re-estimation equations reduce to:

$$\alpha = \frac{M}{2E_W(m_N)}$$

$$\beta = \frac{N}{2E_D(m_N)}$$

3.6 Limitations of Fixed Basis Functions
