Pattern Recognition | PRML Chapter 2 Probability Distributions

PRML Chapter 2 Probability Distributions

2.1 Binary Variables

  • Bernoulli distribution: $Bern(x | \mu) = \mu^{x}(1-\mu)^{1-x}$
  • Binomial distribution: $Bin(m | N,\mu) = \frac{N!}{(N-m)!m!}\mu^{m}(1-\mu)^{N-m}$ (see the sketch just below)
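
As a quick sanity check (a minimal Python sketch, not from the book; the parameter values are arbitrary), the binomial pmf above should sum to one over $m = 0, \dots, N$ and have mean $N\mu$:

```python
import math

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = N! / ((N-m)! m!) * mu^m * (1-mu)^(N-m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.3
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]
print(sum(pmf))                                       # ~1.0 (normalization)
print(sum(m * p for m, p in zip(range(N + 1), pmf)))  # ~N*mu = 3.0
```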

2.1.1 The beta distribution

To treat this problem from a Bayesian perspective, we need a prior distribution over $\mu$. The beta distribution is a common (conjugate) choice:

$$Beta(\mu | a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$$

where $E[\mu] = \frac{a}{a + b}$ and $var[\mu] = \frac{ab}{(a+b)^{2}(a + b + 1)}$. We can get the posterior distribution:

$$p(\mu | m,l,a,b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}$$

For prediction, we need to estimate the distribution of $x$ given the observed training data $D$:

$$p(x = 1 | D) = \int_{0}^{1}p(x=1 | \mu)p(\mu | D)d\mu = \int_{0}^{1}\mu p(\mu | D)d\mu = E[\mu | D]$$

So we have:

$$p(x = 1 | D) = \frac{m+a}{m+a+l+b}$$
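
The posterior update and the predictive probability above are easy to verify numerically. The sketch below (hypothetical hyperparameters and synthetic data, assuming NumPy) compares the closed form $\frac{m+a}{m+a+l+b}$ with a Monte Carlo average of $\mu$ under the posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 2.0, 2.0                        # Beta prior hyperparameters (example values)
data = rng.binomial(1, 0.7, size=50)   # synthetic coin flips
m, l = data.sum(), len(data) - data.sum()

# Posterior is Beta(m + a, l + b); the predictive p(x=1 | D) is its mean.
pred = (m + a) / (m + a + l + b)

# Monte Carlo check: average mu over posterior samples.
mu_samples = rng.beta(m + a, l + b, size=100_000)
print(pred, mu_samples.mean())         # the two numbers should agree closely
```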

On average, the more data we observe, the more the uncertainty of the posterior distribution decreases:

$$E_{\theta}[\theta] = E_{D}[E_{\theta}[\theta | D]]$$

$$var_{\theta}[\theta] = E_{D}[var_{\theta}[\theta | D]] + var_{D}[E_{\theta}[\theta | D]]$$

2.2 Multinomial Variables

If we want to describe a variable that takes one of more than two possible states using binary variables, we can use a 1-of-K representation such as $x = (0, 0, \dots, 1, \dots, 0)^{T}$. If we use $\mu_k$ to represent the probability of $x_k=1$ and we have a dataset of $N$ independent observations, the likelihood function will be:

$$p(D | \mu) = \prod_{k=1}^{K}\mu_{k}^{m_{k}},\quad m_k=\sum_n x_{nk}$$

Using a Lagrange multiplier and setting the partial derivatives to zero, we find the MLE solution for $\mu$:

$$\sum_{k=1}^{K}m_{k}\ln\mu_{k} + \lambda\left(\sum_{k=1}^{K}\mu_{k}-1\right)$$

$$\mu_{k}^{ML} = \frac{m_{k}}{N}$$

Now consider the joint distribution of the counts $m_1, \dots, m_K$ conditioned on $\mu$ and $N$; it is called the multinomial distribution (here $\sum_{k=1}^{K}m_{k} = N$):

$$Mult(m_{1}, m_{2}, \dots, m_{K} | \mu, N) = \frac{N!}{m_{1}!m_{2}!\dots m_{K}!}\prod_{k=1}^{K}\mu_{k}^{m_{k}}$$
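
A minimal numerical sketch of the 1-of-K representation and the MLE $\mu_k^{ML} = m_k/N$ (synthetic data, assuming NumPy; the true probabilities are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

mu_true = np.array([0.2, 0.5, 0.3])
N = 10_000

# Draw N one-hot (1-of-K) vectors; the counts m_k are just the column sums.
X = rng.multinomial(1, mu_true, size=N)   # shape (N, K), each row is one-hot
m = X.sum(axis=0)                         # m_k = sum_n x_nk

mu_ml = m / N                             # mu_k^ML = m_k / N
print(mu_ml)                              # close to mu_true
```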

2.2.1 The Dirichlet Distribution

The conjugate prior for the parameters $\{\mu_k\}$ is the Dirichlet distribution:

$$Dir(\mu | \alpha) = \frac{\Gamma(\alpha_{0})}{\Gamma(\alpha_{1})\dots\Gamma(\alpha_{K})} \prod_{k=1}^{K}\mu_{k}^{\alpha_{k}-1}$$

where $\alpha_{0} = \sum_{k=1}^{K}\alpha_{k}$, and it is easy to get the posterior distribution:

$$p(\mu | D,\alpha) = Dir(\mu | \alpha + m)$$
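
The Dirichlet-multinomial update is just "add the counts to the concentration parameters". A small sketch (hypothetical prior values and synthetic data, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([1.0, 1.0, 1.0])         # Dirichlet prior (example values)
mu_true = np.array([0.2, 0.5, 0.3])
m = rng.multinomial(1, mu_true, size=200).sum(axis=0)   # observed counts

alpha_post = alpha + m                    # Dir(mu | alpha + m)
post_mean = alpha_post / alpha_post.sum() # E[mu_k | D] = (alpha_k + m_k) / (alpha_0 + N)
print(post_mean)
```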

2.3 The Gaussian Distribution

The geometric form of the Gaussian distribution: the exponent is the quadratic form

$$\Delta^{2} = (x-\mu)^{T}\Sigma^{-1}(x-\mu)$$

$\Sigma$ can be taken to be symmetric, and from the eigenvector equation $\Sigma u_{i} = \lambda_{i}u_{i}$, when we choose the eigenvectors $u_i$ to be orthonormal, we have:

$$\Sigma = \sum_{i=1}^{D}\lambda_{i}u_{i}u_{i}^{T}$$

So the quadratic form can be written as:

$$\Delta^{2} = \sum_{i=1}^{D}\frac{y_{i}^{2}}{\lambda_{i}},\quad y_{i} = u_{i}^{T}(x-\mu)$$

In the new coordinate system defined by the $y_i$, the Gaussian distribution takes the form:

$$p(y) = p(x)|J| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_{j})^{1/2}}\exp\left\{-\frac{y_{j}^{2}}{2\lambda_{j}}\right\},\quad J_{ij} = \frac{\partial x_{i}}{\partial y_{j}} = U_{ij}$$
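
A short sketch of this change of variables (assuming NumPy; the covariance values are arbitrary): eigendecompose $\Sigma$, rotate the centred samples into the $y$ coordinates, and check that their covariance is diagonal with entries $\lambda_i$:

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigendecomposition Sigma = U diag(lambda) U^T (the columns of U are the u_i).
lam, U = np.linalg.eigh(Sigma)

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = (x - mu) @ U                          # y_i = u_i^T (x - mu)

# In the y coordinates the covariance should be (nearly) diagonal with entries lambda_i.
print(np.cov(y.T))
print(lam)
```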

2.3.1 Conditional Gaussian distributions

Consider a multivariate normal distribution and suppose we partition the variables as:

$$x = \begin{pmatrix} x_{a} \\ x_{b} \end{pmatrix},\quad \mu = \begin{pmatrix} \mu_{a} \\ \mu_{b} \end{pmatrix},\quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$$

and we introduce the precision matrix:

$$\Sigma^{-1} = \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$

To find an expression for the conditional distribution $p(x_a | x_b)$, we expand the quadratic form:

$$-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) = -\frac{1}{2}(x_{a} - \mu_{a})^{T}\Lambda_{aa}(x_{a} - \mu_{a}) - \frac{1}{2}(x_{a} - \mu_{a})^{T}\Lambda_{ab}(x_{b} - \mu_{b}) -\frac{1}{2}(x_{b} - \mu_{b})^{T}\Lambda_{ba}(x_{a} - \mu_{a}) - \frac{1}{2}(x_{b} - \mu_{b})^{T}\Lambda_{bb}(x_{b} - \mu_{b})$$

which is the exponent of the conditional Gaussian. Also, for a general Gaussian it is easy to see that:

$$-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu) = -\frac{1}{2}x^{T}\Sigma^{-1}x + x^{T}\Sigma^{-1}\mu + const$$

Applying this method to $p(x_a | x_b)$ with $x_b$ regarded as a constant, and comparing the second-order term $-\frac{1}{2}x_a^T\Lambda_{aa}x_a$ and the term linear in $x_a$, namely $x_a^T(\Lambda_{aa}\mu_{a} - \Lambda_{ab}(x_{b} - \mu_{b}))$, we can read off the covariance and mean:

$$\Sigma_{a|b} = \Lambda_{aa}^{-1}$$

$$\mu_{a|b} = \Sigma_{a|b}(\Lambda_{aa}\mu_{a} - \Lambda_{ab}(x_{b} - \mu_{b})) = \mu_{a} - \Lambda_{aa}^{-1}\Lambda_{ab}(x_{b} - \mu_{b})$$
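
These precision-matrix formulas can be checked against the equivalent covariance-based expressions $\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ and $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$. A minimal sketch (arbitrary example numbers, assuming NumPy):

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
a, b = [0], [1, 2]                      # partition indices: x_a = x[0], x_b = x[1:]
xb = np.array([0.5, -0.2])              # observed value of x_b

Lam = np.linalg.inv(Sigma)              # precision matrix
Laa, Lab = Lam[np.ix_(a, a)], Lam[np.ix_(a, b)]

Sigma_cond = np.linalg.inv(Laa)
mu_cond = mu[a] - Sigma_cond @ Lab @ (xb - mu[b])

# Equivalent covariance-based expressions for comparison.
Saa, Sab, Sbb = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)], Sigma[np.ix_(b, b)]
mu_cond2 = mu[a] + Sab @ np.linalg.inv(Sbb) @ (xb - mu[b])
Sigma_cond2 = Saa - Sab @ np.linalg.inv(Sbb) @ Sab.T

print(mu_cond, mu_cond2)                # should match
print(Sigma_cond, Sigma_cond2)          # should match
```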

2.3.2 Marginal Gaussian distributions

To show that the marginal distribution $p(x_{a}) = \int p(x_{a}, x_{b})dx_{b}$ is Gaussian, the method is similar to Section 2.3.1; the results are:

$$\Sigma_{a} = (\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}$$

$$E[x_{a}] = \Sigma_{a}(\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mu_{a} = \mu_{a}$$
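
Since $(\Lambda_{aa} - \Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}$ is a Schur complement, the marginal covariance should simply equal the block $\Sigma_{aa}$ of the partitioned covariance. A short numerical check (same arbitrary example numbers as above, assuming NumPy):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
a, b = [0], [1, 2]

Lam = np.linalg.inv(Sigma)
Laa, Lab, Lba, Lbb = (Lam[np.ix_(a, a)], Lam[np.ix_(a, b)],
                      Lam[np.ix_(b, a)], Lam[np.ix_(b, b)])

Sigma_a = np.linalg.inv(Laa - Lab @ np.linalg.inv(Lbb) @ Lba)
print(Sigma_a)                 # equals the block Sigma_aa of the covariance
print(Sigma[np.ix_(a, a)])
```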

2.3.3 Bayes’ theorem for Gaussian variables

Suppose we are given a Gaussian marginal distribution $p(x)$ and a Gaussian conditional distribution $p(y | x)$ whose mean is a linear function of $x$:
$$p(x) = N(x | \mu, \Lambda^{-1})$$

$$p(y | x) = N(y | Ax + b, L^{-1})$$

We wish to find the marginal distribution $p(y)$ and the conditional distribution $p(x | y)$. Consider the log of the joint distribution over $z = (x, y)^{T}$:

$$\ln p(z) = \ln p(x) + \ln p(y | x) = -\frac{1}{2}(x-\mu)^{T}\Lambda(x-\mu) - \frac{1}{2}(y-Ax-b)^{T}L(y-Ax-b) + const$$

By comparing quadratic forms we can get the mean and covariance of the joint distribution:

$$cov[z] = R^{-1} = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^{T} \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^{T} \end{pmatrix}$$

$$E[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}$$

And for the marginal distribution $p(y)$, it is easy to observe that:

$$E[y] = A\mu + b$$

$$cov[y] = L^{-1} + A\Lambda^{-1}A^{T}$$

Similarly, for the conditional distribution $p(x | y)$, we have:

$$E[x | y] = (\Lambda + A^{T}LA)^{-1}\{A^{T}L(y-b) + \Lambda\mu\}$$

$$cov[x | y] = (\Lambda + A^{T}LA)^{-1}$$
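
A sketch that checks the marginal and conditional results by Monte Carlo (arbitrary example parameters, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)

# p(x) = N(x | mu, Lambda^{-1}),  p(y | x) = N(y | A x + b, L^{-1})
mu = np.array([1.0, 0.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[1.0, -1.0]])
b = np.array([0.5])
L = np.array([[4.0]])

x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=500_000)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(1), np.linalg.inv(L), size=len(x))

# Marginal p(y): mean A mu + b, covariance L^{-1} + A Lambda^{-1} A^T
print(y.mean(axis=0), A @ mu + b)
print(np.cov(y.T), np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)

# Conditional p(x | y) at an example observation y0.
y0 = np.array([0.0])
S = np.linalg.inv(Lam + A.T @ L @ A)            # cov[x | y]
m = S @ (A.T @ L @ (y0 - b) + Lam @ mu)         # E[x | y]
print(m, S)
```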

2.3.4 Maximum likelihood for the Gaussian

In this part we only show the results:
$$\mu_{ML}=\frac{1}{N}\sum_{n=1}^N x_n$$

$$\Sigma_{ML}=\frac{1}{N}\sum_{n=1}^N(x_n-\mu_{ML})(x_n-\mu_{ML})^T$$
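
A minimal sketch of these estimators on synthetic data (assuming NumPy); note the $1/N$ (biased) normalization of $\Sigma_{ML}$:

```python
import numpy as np

rng = np.random.default_rng(5)

mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.0, 0.4], [0.4, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5_000)

mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)           # 1/N (biased) normalization
print(mu_ml)
print(Sigma_ml)
```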

2.3.5 Sequential estimation

Sequential methods allow data points to be processed one at a time and then discarded. Referring to Section 2.3.4, and separating out the contribution from the final data point $x_N$, we have:

$$\mu_{ML}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}x_{n} = \mu_{ML}^{(N-1)} + \frac{1}{N}(x_{N} - \mu_{ML}^{(N-1)})$$
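
The recursive update reproduces the batch mean exactly. A short sketch (synthetic data, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(3.0, 1.0, size=1_000)

mu = 0.0
for n, xn in enumerate(x, start=1):
    mu = mu + (xn - mu) / n                 # mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N
print(mu, x.mean())                         # identical up to floating-point error
```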

A more general formulation of sequential learning is given by the Robbins-Monro algorithm. Consider random variables $\theta$ and $z$ governed by a joint distribution $p(z,\theta)$. The conditional expectation of $z$ given $\theta$ is:

$$f(\theta) = E[z | \theta] = \int zp(z | \theta) dz$$

$f(\theta)$ is called a regression function. Our goal is to find the root $\theta^*$ such that $f(\theta^*)=0$. Suppose we observe values of $z$ and we wish to find a corresponding sequential estimation scheme for $\theta^*$. Assuming that $E[(z-f)^{2} | \theta] < \infty$, the sequence of successive estimates is given by:

$$\theta^{(N)} = \theta^{(N-1)} - \alpha_{N-1}z(\theta^{(N-1)})$$
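
A minimal Robbins-Monro sketch (my own toy setup, not from the book): take $z(\theta) = \theta - x$ with $x \sim N(2, 1)$, so the regression function $f(\theta) = \theta - 2$ has its root at $\theta^* = 2$, and use step sizes $\alpha_N = 1/N$, which satisfy the usual conditions:

```python
import numpy as np

rng = np.random.default_rng(7)

theta = 0.0
for n in range(1, 5_001):
    x = rng.normal(2.0, 1.0)
    z = theta - x                           # noisy observation of f(theta) = theta - 2
    alpha = 1.0 / n                         # sum(alpha_n) = inf, sum(alpha_n^2) < inf
    theta = theta - alpha * z               # theta^(N) = theta^(N-1) - alpha_{N-1} z(theta^(N-1))
print(theta)                                # close to 2.0
```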

2.3.6 Bayesian inference for the Gaussian

Estimate $\mu$ ($\sigma^{2}$ known):
The likelihood function of the Gaussian distribution is:

$$p(x | \mu) = \prod_{n=1}^{N}p(x_{n} | \mu) = \frac{1}{(2\pi\sigma^{2})^{N/2}}\exp\left\{-\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_{n} - \mu)^{2}\right\}$$

Choose $p(\mu)$ to be a Gaussian distribution, $p(\mu) = N(\mu | \mu_{0},\sigma_{0}^{2})$; the posterior distribution is given by:

$$p(\mu | x) \propto p(x | \mu)p(\mu)$$

Consequently, the posterior distribution will be:

$$p(\mu | x) = N(\mu | \mu_{N}, \sigma_{N}^{2})$$

where

$$\mu_{N} = \frac{\sigma^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\mu_{0} + \frac{N\sigma_{0}^{2}}{N\sigma_{0}^{2} + \sigma^{2}}\mu_{ML},\quad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_{n}$$

$$\frac{1}{\sigma_{N}^{2}} = \frac{1}{\sigma_{0}^{2}} + \frac{N}{\sigma^{2}}$$
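
A sketch of this posterior update on synthetic data (hypothetical prior values, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(8)

sigma2 = 1.0                                # known noise variance
mu0, sigma0_2 = 0.0, 10.0                   # prior N(mu | mu0, sigma0^2)
x = rng.normal(1.5, np.sqrt(sigma2), size=100)

N, mu_ml = len(x), x.mean()
mu_N = (sigma2 / (N * sigma0_2 + sigma2)) * mu0 \
     + (N * sigma0_2 / (N * sigma0_2 + sigma2)) * mu_ml
sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)
print(mu_N, sigma_N2)                       # posterior mean approaches mu_ml as N grows
```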

The Bayesian paradigm leads very naturally to a sequential view of the inference problem.

$$p(\mu | x) \propto \left[ p(\mu)\prod_{n=1}^{N-1}p(x_{n} | \mu) \right] p(x_{N} | \mu)$$

Estimate the precision $\lambda$ ($\mu$ known):
The likelihood function for $\lambda$ takes the form:

$$p(x | \lambda) = \prod_{n=1}^{N}N(x_{n} | \mu, \lambda^{-1}) \propto \lambda^{\frac{N}{2}}\exp\left\{ -\frac{\lambda}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2} \right\}$$

The conjugate prior we choose is the gamma distribution:

$$Gam(\lambda | a, b) = \frac{1}{\Gamma(a)}b^{a}\lambda^{a-1}\exp(-b\lambda)$$

The posterior distribution is given by:

$$p(\lambda | x) \propto \lambda^{a_{0} - 1}\lambda^{\frac{N}{2}}\exp\left\{ -b_{0}\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_{n}-\mu)^{2}\right\}$$

Consequently, the posterior distribution will be $Gam(\lambda | a_{N}, b_{N})$, where

$$a_{N} = a_{0} + \frac{N}{2}$$

$$b_{N} = b_{0} + \frac{1}{2}\sum_{n=1}^{N}(x_{n} - \mu)^{2} = b_{0} + \frac{N}{2}\sigma_{ML}^{2}$$
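
A sketch of the gamma posterior update for the precision (hypothetical prior values and synthetic data, assuming NumPy); the posterior mean $a_N/b_N$ should approach the true precision:

```python
import numpy as np

rng = np.random.default_rng(9)

mu = 0.0                                    # known mean
lam_true = 4.0                              # true precision
x = rng.normal(mu, 1.0 / np.sqrt(lam_true), size=500)

a0, b0 = 1.0, 1.0                           # Gam(lambda | a0, b0) prior (example values)
aN = a0 + len(x) / 2
bN = b0 + 0.5 * np.sum((x - mu) ** 2)
print(aN / bN)                              # posterior mean of lambda, close to lam_true
```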

$\mu$ and $\lambda$ both unknown:

The form of the prior distribution we choose (a normal-gamma distribution) is:

$$p(\mu, \lambda) = N(\mu | \mu_{0}, (\beta\lambda)^{-1})Gam(\lambda | a, b)$$

where $\mu_{0} = \frac{c}{\beta}$, $a = \frac{1+\beta}{2}$ and $b = d - \frac{c^{2}}{2\beta}$.

2.3.7 Student’s t-distribution

If we have a Gaussian distribution $N(x | \mu, \tau^{-1})$ and a gamma prior $Gam(\tau | a, b)$, then integrating out $\tau$ gives the marginal distribution of $x$:

$$p(x | \mu, a, b) = \int_{0}^{\infty} N(x | \mu, \tau^{-1})Gam(\tau | a, b)d\tau$$

Replacing the parameters with $\nu=2a$ and $\lambda=\frac{a}{b}$, we obtain Student's t-distribution:

$$St(x | \mu,\lambda, \nu) = \frac{\Gamma(\frac{\nu}{2} + \frac{1}{2})}{\Gamma(\frac{\nu}{2})}\left( \frac{\lambda}{\pi\nu} \right)^{\frac{1}{2}}\left[ 1 + \frac{\lambda(x-\mu)^{2}}{\nu} \right]^{-\frac{\nu}{2}-\frac{1}{2}}$$
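
The scale-mixture integral can be checked numerically against the closed-form density (a minimal sketch with arbitrary $a$, $b$, evaluation point $x$, and a plain Riemann sum over $\tau$; assuming NumPy):

```python
import math
import numpy as np

def gam_pdf(tau, a, b):
    return (b ** a) / math.gamma(a) * tau ** (a - 1) * np.exp(-b * tau)

def norm_pdf(x, mu, tau):
    return np.sqrt(tau / (2 * np.pi)) * np.exp(-0.5 * tau * (x - mu) ** 2)

def st_pdf(x, mu, lam, nu):
    c = math.gamma(nu / 2 + 0.5) / math.gamma(nu / 2) * math.sqrt(lam / (math.pi * nu))
    return c * (1 + lam * (x - mu) ** 2 / nu) ** (-nu / 2 - 0.5)

mu, a, b = 0.0, 3.0, 2.0
nu, lam = 2 * a, a / b
x = 1.3

# Riemann-sum approximation of the integral over tau of N(x | mu, tau^{-1}) Gam(tau | a, b).
tau = np.linspace(1e-6, 60.0, 200_000)
dtau = tau[1] - tau[0]
mixture = np.sum(norm_pdf(x, mu, tau) * gam_pdf(tau, a, b)) * dtau
print(mixture, st_pdf(x, mu, lam, nu))      # the two values should agree closely
```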

2.3.8 Periodic variables

2.3.9 Mixtures of Gaussians

By taking linear combinations of basic distributions such as Gaussians, almost any continuous density can be approximated to arbitrary accuracy:

$$p(x) = \sum_{k=1}^{K}\pi_{k}N(x | \mu_{k}, \Sigma_{k})$$
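
A minimal sketch of sampling from and evaluating a one-dimensional two-component mixture (arbitrary example parameters, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(10)

pi = np.array([0.3, 0.7])                   # mixing coefficients, sum to 1
mus = np.array([-2.0, 3.0])
sigmas = np.array([0.5, 1.0])

# Sampling: pick a component k with probability pi_k, then draw from N(mu_k, sigma_k^2).
k = rng.choice(len(pi), size=10_000, p=pi)
samples = rng.normal(mus[k], sigmas[k])

# Density evaluation: p(x) = sum_k pi_k N(x | mu_k, sigma_k^2).
def mixture_pdf(x):
    comps = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ pi

print(mixture_pdf(np.array([-2.0, 0.0, 3.0])))   # density at a few test points
print(samples.mean(), pi @ mus)                  # sample mean vs. mixture mean
```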

2.4 The Exponential Family

The exponential family of distributions over $x$, given parameters $\eta$, is defined as:

$$p(x | \eta) = h(x)g(\eta)\exp\{\eta^{T}u(x)\}$$

where the normalization coefficient $g(\eta)$ satisfies:

$$g(\eta)\int h(x)\exp\{\eta^{T}u(x)\} dx = 1$$
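
As a concrete example of this form (a minimal sketch, assuming NumPy), the Bernoulli distribution is in the exponential family with $u(x) = x$, $h(x) = 1$, $\eta = \ln\frac{\mu}{1-\mu}$ and $g(\eta) = \frac{1}{1+\exp(\eta)}$:

```python
import numpy as np

# Bern(x | mu) = mu^x (1-mu)^(1-x) = (1 - mu) * exp(x * ln(mu / (1 - mu)))
mu = 0.3
eta = np.log(mu / (1 - mu))
g = 1.0 / (1.0 + np.exp(eta))               # equals 1 - mu

for x in (0, 1):
    direct = mu**x * (1 - mu)**(1 - x)
    exp_family = g * np.exp(eta * x)        # h(x) = 1
    print(direct, exp_family)               # the two forms agree

# Normalization constraint: g(eta) * sum_x h(x) exp(eta * u(x)) = 1
print(g * (1 + np.exp(eta)))
```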

2.4.1 Maximum likelihood and sufficient statistics

2.4.2 Conjugate priors

For any member of the exponential family, there exists a conjugate prior that can be written in the form:

$$p(\eta | \chi, \nu) = f(\chi, \nu)g(\eta)^{\nu}\exp(\nu\eta^{T}\chi)$$

2.4.3 Noninformative priors

Two simple examples of noninformative priors:

The location parameter: $p(x | \mu) = f(x-\mu)$

The scale parameter: $p(x | \sigma) = \frac{1}{\sigma}f\left(\frac{x}{\sigma}\right)$

2.5 Nonparametric Methods
