Gaussian Processes — GPSS Summer School Notes I (English version): An Introduction to Gaussian Processes

What is Machine Learning?

$$\text{data} + \text{model} \Rightarrow \text{prediction}$$

  • $\text{data}$: observations, could be actively or passively acquired (meta-data).
  • $\text{model}$: assumptions, based on previous experience (other data! transfer learning etc.), or beliefs about the regularities of the universe. Inductive bias.
  • $\text{prediction}$: an action to be taken or a categorization or a quality score.

Two important Gaussian Properties

  • Sum of Gaussians

    The sum of independent Gaussian variables is also Gaussian:

$$y_{i} \sim \mathcal{N}(\mu_{i},\sigma_{i}^{2})$$

and the sum is distributed as

$$\sum_i y_{i} \sim \mathcal{N}\left(\sum_i \mu_{i},\ \sum_i \sigma_{i}^{2}\right)$$

Aside: as the number of terms increases, the sum of non-Gaussian, finite-variance variables also tends to a Gaussian, by the central limit theorem.

  • Scaling a Gaussian
    Scaling a Gaussian leads to a Gaussian: if $y \sim \mathcal{N}(\mu, \sigma^{2})$, then

$$\omega y \sim \mathcal{N}(\omega \mu,\ \omega^{2}\sigma^{2})$$

The central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed.
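As a quick numerical check of both properties, here is a minimal NumPy sketch (the sample size and parameter values are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

# Two independent Gaussians with different means and variances.
mu = np.array([1.0, -2.0])
sigma2 = np.array([0.5, 2.0])
y = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, 2))

# Sum of (independent) Gaussians: means and variances add.
s = y.sum(axis=1)
print(s.mean(), mu.sum())             # both ~ -1.0
print(s.var(), sigma2.sum())          # both ~ 2.5

# Scaling a Gaussian: omega * y ~ N(omega * mu, omega^2 * sigma^2).
omega = 3.0
z = omega * y[:, 0]
print(z.mean(), omega * mu[0])        # both ~ 3.0
print(z.var(), omega**2 * sigma2[0])  # both ~ 4.5
```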

Prior Distribution

  • Bayesian inference requires a prior on the parameters.
  • The prior represents your belief, before you see the data, about the likely value of the parameters.
  • For linear regression, consider a Gaussian prior on the intercept:

$$c \sim \mathcal{N}(0, \alpha_1)$$

Posterior Distribution

  • The posterior distribution is found by combining the prior with the likelihood.
  • The posterior distribution is your belief, after you see the data, about the likely value of the parameters.
  • The posterior is found through Bayes' rule:
    $$p(c|y) = \frac{p(y|c)\,p(c)}{p(y)}$$
    The likelihood $p(y|c)$ is not a density over $c$; it is a function of $c$, and $c$ enters it as a parameter of the density over $y$. The normalization step, i.e. computing $p(y)$, is usually the most difficult part.

Bayes Update

The red line shows the prior density for $c$:
$$p(c) = \mathcal{N}(c \mid 0,\alpha_{1})$$
The blue line stands for the observation, with
$$p(y|m,c,x,\sigma^{2}) = \mathcal{N}(y \mid mx+c,\sigma^{2})$$
Note that, viewed as a function of $c$, this is a likelihood, not a distribution over $c$.
Based on Bayes' rule, the posterior can be written as
$$p(c|y,m,x,\sigma^{2}) = \mathcal{N}\!\left(c \,\middle|\, \frac{y-mx}{1+\sigma^{2}/\alpha_{1}},\ (\sigma^{-2}+\alpha_{1}^{-1})^{-1}\right)$$
Math Trick:
$$p(c) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\left(-\frac{1}{2\alpha_1}c^2\right)$$
$$p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2) = \frac{1}{\left(2\pi\sigma^2\right)^{\frac{n}{2}}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - mx_i - c)^2\right)$$
$$p(c \mid \mathbf{y}, \mathbf{x}, m, \sigma^2) = \frac{p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)\,p(c)}{p(\mathbf{y}|\mathbf{x}, m, \sigma^2)}$$

$$p(c \mid \mathbf{y}, \mathbf{x}, m, \sigma^2) = \frac{p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)\,p(c)}{\int p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)\,p(c)\,\mathrm{d}c}$$
$$p(c \mid \mathbf{y}, \mathbf{x}, m, \sigma^2) \propto p(\mathbf{y}|\mathbf{x}, c, m, \sigma^2)\,p(c)$$

$$\begin{aligned} \log p(c \mid \mathbf{y}, \mathbf{x}, m, \sigma^2) =&-\frac{1}{2\sigma^2} \sum_{i=1}^n(y_i - c - mx_i)^2-\frac{1}{2\alpha_1} c^2 + \text{const}\\ = &-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-mx_i)^2 -\left(\frac{n}{2\sigma^2} + \frac{1}{2\alpha_1}\right)c^2\\ & + c\,\frac{\sum_{i=1}^n(y_i-mx_i)}{\sigma^2}, \end{aligned}$$

Complete the square of the quadratic form to obtain
$$\log p(c \mid \mathbf{y}, \mathbf{x}, m, \sigma^2) = -\frac{1}{2\tau^2}(c - \mu)^2 +\text{const}$$
where
$$\tau^2 = \left(n\sigma^{-2} +\alpha_1^{-1}\right)^{-1}$$
and
$$\mu = \frac{\tau^2}{\sigma^2} \sum_{i=1}^n(y_i-mx_i).$$

The prior comes from the model, i.e. from what we believe before seeing anything, and the likelihood comes from the data.
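The closed form above is easy to check numerically. The following sketch uses synthetic data with the slope $m$ held fixed (all variable names and values are illustrative, not from the notes) and computes the posterior mean and variance of the intercept $c$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1D regression data with known slope, intercept and noise.
n, m, c_true, sigma2, alpha1 = 50, 2.0, 0.5, 0.25, 1.0
x = rng.uniform(-1, 1, n)
y = m * x + c_true + rng.normal(0, np.sqrt(sigma2), n)

# Posterior over the intercept c with the slope m held fixed.
tau2 = 1.0 / (n / sigma2 + 1.0 / alpha1)   # tau^2 = (n sigma^-2 + alpha_1^-1)^-1
mu = tau2 / sigma2 * np.sum(y - m * x)     # mu = tau^2 / sigma^2 * sum_i (y_i - m x_i)
print(f"posterior: c | data ~ N({mu:.3f}, {tau2:.4f})")
```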

Stages to Derivation of the Posterior

  • Multiply likelihood by prior
    • they are “exponentiated quadratics”, so the answer is always also an exponentiated quadratic because
      $$\exp(a^2)\exp(b^2) = \exp(a^2 + b^2)$$
  • Complete the square to get the resulting density in the form of a Gaussian.
  • Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.

Multivariate Regression Likelihood

  • Noise-corrupted data point:
    $$y_{i} = \mathbf{w}^{\top}\mathbf{x}_{i,:}+\epsilon_{i}$$
  • Multivariate regression likelihood:
    $$p(\mathbf{y}|\mathbf{X},\mathbf{w})=\frac{1}{(2\pi\sigma^{2})^{n/2}}\exp\left(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y_{i}-\mathbf{w}^{\top}\mathbf{x}_{i,:})^2\right)$$
  • Multivariate Gaussian prior:
    $$p(\mathbf{w})=\frac{1}{(2\pi\alpha)^{p/2}}\exp\left(-\frac{1}{2\alpha}\mathbf{w}^{\top}\mathbf{w}\right)$$
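The notes stop at the likelihood and the prior here, but the same "multiply and complete the square" recipe from the previous section gives a Gaussian posterior over $\mathbf{w}$. A minimal sketch of that standard Bayesian linear-regression posterior (the function name and toy data are my own, not from the notes):

```python
import numpy as np

def posterior_w(X, y, sigma2, alpha):
    """Gaussian posterior over w for y = X w + Gaussian noise.

    Prior: w ~ N(0, alpha * I); noise variance sigma2.
    """
    p = X.shape[1]
    # Posterior covariance: (X^T X / sigma^2 + I / alpha)^{-1}
    cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
    # Posterior mean: cov @ X^T y / sigma^2
    mean = cov @ X.T @ y / sigma2
    return mean, cov

# Toy usage with synthetic data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -0.5, 2.0])
y = X @ w_true + rng.normal(0, 0.1, size=100)
mean, cov = posterior_w(X, y, sigma2=0.01, alpha=1.0)
print(mean)  # close to w_true
```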

A correlated multivariate Gaussian can be built from independent Gaussians: multiply the independent densities together and then rotate the result.

Independent Gaussians:

$$p(w,h)=\frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2}+\frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right)$$

and we can write it in linear-algebra form:
$$p(w,h)=\frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix} w\\ h \end{bmatrix}-\begin{bmatrix} \mu_1\\ \mu_2 \end{bmatrix}\right)^{\top}\begin{bmatrix} \sigma_1^2 & 0\\ 0 & \sigma_2^2 \end{bmatrix}^{-1}\left(\begin{bmatrix} w\\ h \end{bmatrix}-\begin{bmatrix} \mu_1\\ \mu_2 \end{bmatrix}\right)\right)$$

Renaming things, this is
$$p(\mathbf{y})=\frac{1}{|2\pi \mathbf{D}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top}\mathbf{D}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right)$$
where $|\mathbf{D}|$ denotes the determinant of the matrix $\mathbf{D}$.

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using a rotation matrix $\mathbf{R}$:
$$p(\mathbf{y})=\frac{1}{|2\pi \mathbf{D}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^{\top}\mathbf{y}-\mathbf{R}^{\top}\boldsymbol{\mu})^{\top}\mathbf{D}^{-1}(\mathbf{R}^{\top}\mathbf{y}-\mathbf{R}^{\top}\boldsymbol{\mu})\right)$$
$$p(\mathbf{y})=\frac{1}{|2\pi \mathbf{D}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top}\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^{\top}(\mathbf{y}-\boldsymbol{\mu})\right)$$
This gives a covariance matrix with inverse
$$\mathbf{C}^{-1}=\mathbf{R}\mathbf{D}^{-1}\mathbf{R}^{\top}$$
which, from one point of view, is what principal component analysis recovers.
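A small sketch of this rotation construction (the angle and variances are arbitrary illustrative choices): build $\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^{\top}$ (so that $\mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^{\top}$, since $\mathbf{R}$ is orthogonal) and check that rotated independent samples have that covariance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Diagonal variances of the independent Gaussians and a 2D rotation R.
D = np.diag([4.0, 0.25])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Correlated covariance C = R D R^T; its inverse is R D^{-1} R^T.
C = R @ D @ R.T

# Rotate samples of the independent Gaussian to obtain correlated samples.
z = rng.normal(0.0, np.sqrt(np.diag(D)), size=(100_000, 2))
y = z @ R.T
print(np.cov(y, rowvar=False))  # close to C
print(C)
```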

Multivariate Consequence

if
$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$$
and
$$\mathbf{y}=\mathbf{W}\mathbf{x}$$
then
$$\mathbf{y}\sim \mathcal{N}(\mathbf{W}\boldsymbol{\mu},\ \mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{\top})$$

We can view the first equation as the prior on $\mathbf{x}$, the second as the likelihood, and the last as the marginal of $\mathbf{y}$. If we set $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$, this is just PCA run in the generative (inverse) direction.
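This consequence is also easy to verify by sampling; a short sketch with an arbitrary $\mathbf{W}$ and covariance (again, the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)

# x ~ N(mu, Sigma) in 3D, mapped to 2D by y = W x.
mu = np.array([1.0, 0.0, -1.0])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)      # a valid (positive definite) covariance
W = rng.normal(size=(2, 3))

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ W.T                      # apply y = W x to every sample

print(y.mean(axis=0), W @ mu)                    # means agree
print(np.cov(y, rowvar=False), W @ Sigma @ W.T)  # covariances agree
```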

Prediction with Correlated Gaussians

  • Prediction of $\mathbf{f}_*$ from $\mathbf{f}$ requires the multivariate conditional density.

  • The multivariate conditional density is also Gaussian:
    $$p(\mathbf{f}_*|\mathbf{f}) = \mathcal{N}\left(\mathbf{f}_* \mid \mathbf{K}_{*,\mathbf{f}}\mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1}\mathbf{f},\ \mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1}\mathbf{K}_{\mathbf{f},*}\right)$$

  • Here the covariance of the joint density is given by
    $$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{\mathbf{f}, \mathbf{f}} & \mathbf{K}_{*, \mathbf{f}}\\ \mathbf{K}_{\mathbf{f}, *} & \mathbf{K}_{*, *}\end{bmatrix}$$
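A minimal sketch of this conditional for a zero-mean process, using an exponentiated-quadratic (RBF) covariance; the kernel choice, hyper-parameters, and helper names are assumptions for illustration, not fixed by the notes:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Exponentiated-quadratic covariance between two sets of 1D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_conditional(X_f, f, X_star, **kern):
    """Mean and covariance of p(f_* | f) for a zero-mean GP."""
    K_ff = rbf(X_f, X_f, **kern)
    K_sf = rbf(X_star, X_f, **kern)
    K_ss = rbf(X_star, X_star, **kern)
    K_ff_inv = np.linalg.inv(K_ff)   # fine for tiny examples; prefer Cholesky in practice
    mean = K_sf @ K_ff_inv @ f
    cov = K_ss - K_sf @ K_ff_inv @ K_sf.T
    return mean, cov

# Condition on three observed function values, predict at two new inputs.
X_f = np.array([-1.0, 0.0, 1.0])
f = np.array([0.2, -0.4, 0.3])
X_star = np.array([0.5, 2.0])
mean, cov = gp_conditional(X_f, f, X_star, lengthscale=1.0, variance=1.0)
print(mean, np.diag(cov))
```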

Prediction
Take the picture from the slides as the example. Since that example is in 1D, all the quantities are scalars: $f = f_1 = -0.4$, $K_{f,f}=1$, $K_{*,f}=0.98$, $K_{*,*}=1$, so
$$p(f_2|f_1)=\mathcal{N}(f_2 \mid 0.98 \times 1 \times (-0.4),\ 1-0.98 \times 1 \times 0.98)$$
$$p(f_2|f_1)= \mathcal{N}(f_2 \mid -0.392,\ 0.0396)$$
The variance is 0.0396, so the standard deviation is about 0.2.
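Plugging the scalars into the conditional formula gives a one-line check of these numbers:

```python
# Scalar version of the Gaussian conditional from the previous section.
K_ff, K_sf, K_ss, f1 = 1.0, 0.98, 1.0, -0.4
mean = K_sf / K_ff * f1        # -0.392
var = K_ss - K_sf**2 / K_ff    # 0.0396
print(mean, var, var**0.5)     # standard deviation ~ 0.199
```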

