Gaussian Processes - GPSS Summer School Notes II (English): An Introduction to Gaussian Processes, Part 2

Stochastic Process

A stochastic process is a collection of random variables indexed by some variable $x \in \mathcal{X}$:
$$f = \{ f(x) : x \in \mathcal{X} \}.$$
$f$ can be thought of as a function of the location $x$.

$f$ is an infinite-dimensional process. Thankfully, however, we only need to consider the finite-dimensional distributions (FDDs), i.e., for all $x_1, \ldots, x_n$ and for all $n \in \mathbb{N}$,
$$P(f(x_1) \le y_1, \ldots, f(x_n) \le y_n),$$
as these uniquely determine the law of $f$.

A Gaussian process is a stochastic process with Gaussian FDDs, i.e.
$$(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mu, \Sigma).$$

Important properties

  • Property 1: $X \sim \mathcal{N}(\mu,\Sigma)$ if and only if $AX \sim \mathcal{N}(A\mu, A\Sigma A^{\top})$ for every conformable matrix $A$.
    So sums of Gaussians are Gaussian (take $A$ to be a row vector of ones), and the marginal distributions of a multivariate Gaussian are still Gaussian (take $A$ to be a coordinate vector such as $[0\ 0\ 0\ 1]$); see the sketch after this list.
  • Property 2: Conditional distributions are still Gaussian.
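As a quick check of Property 1, here is a minimal NumPy sketch (my own illustration, not part of the original notes; all numbers are arbitrary). It draws samples of $X \sim \mathcal{N}(\mu, \Sigma)$, applies $A = [0\ 0\ 0\ 1]$, and confirms empirically that $AX$ has mean $A\mu$ and variance $A\Sigma A^{\top}$:

```python
import numpy as np

# Monte Carlo check of Property 1: if X ~ N(mu, Sigma) then AX ~ N(A mu, A Sigma A^T).
# Here A = [0 0 0 1], so AX is simply the marginal of the last coordinate of X.
rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5, 3.0])
L = rng.standard_normal((4, 4))
Sigma = L @ L.T + 1e-6 * np.eye(4)              # an arbitrary valid covariance matrix

A = np.array([[0.0, 0.0, 0.0, 1.0]])            # the marginalising matrix

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of X
AX = X @ A.T                                    # samples of AX

print("empirical mean:", AX.mean(), " vs  A mu        :", (A @ mu).item())
print("empirical var :", AX.var(),  " vs  A Sigma A^T :", (A @ Sigma @ A.T).item())
```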

Suppose:
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}(\mu, \Sigma)$$
where,
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Then, when we observe only part of the vector, here $X_1$:
$$(X_2 \mid X_1 = x_1) \sim \mathcal{N}\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right).$$

Proof. Writing $Q = \Sigma^{-1}$ for the precision matrix, partitioned conformably with $\Sigma$,
$$\begin{aligned}
\pi(x_2 \mid x_1) &= \frac{\pi(x_1, x_2)}{\pi(x_1)} \propto \pi(x_1, x_2) \\
&\propto \exp\left(-\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1}(x - \mu)\right) \\
&\propto \exp\left(-\tfrac{1}{2}\left[(x_2 - \mu_2)^{\top} Q_{22}(x_2 - \mu_2) + 2(x_2 - \mu_2)^{\top} Q_{21}(x_1 - \mu_1)\right]\right),
\end{aligned}$$
which is the exponential of a quadratic form in $x_2$, so $X_2 \mid X_1 = x_1$ is Gaussian.
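To make the conditioning formula concrete, here is a small sketch (again my own, with made-up numbers for $\mu$ and $\Sigma$) that evaluates the conditional mean and covariance directly:

```python
import numpy as np

# Evaluate (X2 | X1 = x1) ~ N(mu2 + S21 S11^{-1}(x1 - mu1), S22 - S21 S11^{-1} S12)
# for an arbitrary partitioned Gaussian (values below are illustrative only).
mu1, mu2 = np.array([0.0, 1.0]), np.array([2.0])
S11 = np.array([[2.0, 0.5],
                [0.5, 1.0]])
S12 = np.array([[0.3],
                [0.7]])
S21 = S12.T
S22 = np.array([[1.5]])

x1 = np.array([0.4, -0.2])                      # the observed value of X1

cond_mean = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)
cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
print("conditional mean:", cond_mean)
print("conditional cov :", cond_cov)
```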

Conditional Update of GP

So suppose $f$ is a Gaussian process; then
$$(f(x_1), f(x_2), \ldots, f(x_n), f(x)) \sim \mathcal{N}(\mu, \Sigma).$$
If we observe its values at $x_1, \ldots, x_n$, then
$$f(x) \mid f(x_1), \ldots, f(x_n) \sim \mathcal{N}(\mu^{*}, \sigma^{*}).$$
Proof: apply the Gaussian conditioning result (Property 2) above with $X_1 = (f(x_1), \ldots, f(x_n))$ and $X_2 = f(x)$.
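This conditional update is exactly what GP regression computes. Below is a minimal sketch of my own (it assumes a zero mean function, a squared-exponential kernel and noise-free observations, none of which is forced by the result above) that evaluates $\mu^{*}$ and $\sigma^{*}$ at a new input:

```python
import numpy as np

def k(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

x_train = np.array([-2.0, -0.5, 1.0, 2.5])      # locations where f is observed
f_train = np.sin(x_train)                       # the observed values f(x_i)
x_star = np.array([0.3])                        # the new location x

K = k(x_train, x_train) + 1e-9 * np.eye(len(x_train))    # small jitter for stability
K_star = k(x_star, x_train)

mu_star = K_star @ np.linalg.solve(K, f_train)                         # posterior mean
var_star = k(x_star, x_star) - K_star @ np.linalg.solve(K, K_star.T)   # posterior variance
print("mu* =", mu_star, " var* =", var_star)
```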

Why we use GPs

  1. The GP class of models is closed under various operations.
  2. non-parametric/kernel regression
  3. Naturalness of GP Framework
  4. Uncertainty estimates from emulators

Why GPs: Closed under various operations

  • Closed under addition: if $f_1(\cdot), f_2(\cdot) \sim GP$ then $(f_1 + f_2)(\cdot) \sim GP$ (see the sketch after this list).
  • Closed under Bayesian conditioning, i.e., if we observe
    $$D = (f(x_1), \ldots, f(x_n)),$$
    then
    $$(f \mid D) \sim GP.$$
  • Closed under any linear operation: if $f \sim GP(m(\cdot), k(\cdot, \cdot))$ and $L$ is a linear operator, then
    $$L \circ f \sim GP(L \circ m, L^{2} \circ k).$$
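Here is a quick empirical check of the closure-under-addition property (my own sketch; it assumes $f_1$ and $f_2$ are independent, which the statement above leaves implicit): the sum of independent draws from $GP(0, k_1)$ and $GP(0, k_2)$ has covariance $k_1 + k_2$.

```python
import numpy as np

# Empirical check: if f1 ~ GP(0, k1) and f2 ~ GP(0, k2) are independent,
# then f1 + f2 has covariance k1 + k2 (evaluated here on a finite grid).
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)

def se_kernel(x, lengthscale):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale ** 2)

K1, K2 = se_kernel(x, 0.2), se_kernel(x, 1.0)
jitter = 1e-8 * np.eye(len(x))

f1 = rng.multivariate_normal(np.zeros(len(x)), K1 + jitter, size=50_000)
f2 = rng.multivariate_normal(np.zeros(len(x)), K2 + jitter, size=50_000)

emp_cov = np.cov((f1 + f2).T)                   # empirical covariance of f1 + f2
print("max deviation from k1 + k2:", np.abs(emp_cov - (K1 + K2)).max())
```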

Why GPs: Non-Parametric Regression

Suppose that we are given data
$$(x_i, y_i)_{i=1}^{n}.$$
Linear regression,
$$y = x^{\top}\beta + \epsilon,$$
can be written purely in terms of inner products
$$x^{\top} x.$$

$$\hat{\beta} = \arg\min_{\beta}\, \| y - X\beta \|^{2}_{2} + \sigma^{2} \|\beta\|^{2}$$
$$\hat{\beta} = (X^{\top}X + \sigma^{2}I)^{-1}X^{\top}y$$
$$\hat{\beta} = X^{\top}(XX^{\top} + \sigma^{2}I)^{-1}y \qquad (\text{the dual form})$$

At first, the dual form looks like we’ve made the problem harder:
$$XX^{\top} \ \text{is} \ n \times n, \qquad X^{\top}X \ \text{is} \ p \times p,$$
but the dual form makes clear that linear regression only uses inner products.
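As a sanity check that the primal and dual forms really agree, here is a small NumPy sketch (my own; the dimensions and the value of $\sigma^2$ are arbitrary):

```python
import numpy as np

# The primal ridge estimate (a p x p solve) and the dual form (an n x n solve)
# give the same coefficient vector.
rng = np.random.default_rng(3)
n, p, sigma2 = 50, 5, 0.1

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + np.sqrt(sigma2) * rng.standard_normal(n)

beta_primal = np.linalg.solve(X.T @ X + sigma2 * np.eye(p), X.T @ y)
beta_dual = X.T @ np.linalg.solve(X @ X.T + sigma2 * np.eye(n), y)
assert np.allclose(beta_primal, beta_dual)
print("primal and dual estimates agree")
```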

The best prediction of $y$ at a new location $x'$ is:
$$\hat{y} = x'^{\top} \hat{\beta} = x'^{\top}X^{\top}(XX^{\top} + \sigma^{2}I)^{-1}y = k(x')(K + \sigma^{2}I)^{-1}y$$
where,
$$k(x') := (x'^{\top}x_1, \ldots, x'^{\top}x_n)$$
and
$$K_{ij} := x_i^{\top} x_j.$$
These are kernel (Gram) matrices: every element is an inner product between pairs of training points. Note the similarity to the GP conditional mean we derived before. If
$$\begin{pmatrix} y \\ y' \end{pmatrix} \sim \mathcal{N}\left( 0, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),$$
then,
$$\mathbb{E}(y' \mid y) = \Sigma_{21}\Sigma_{11}^{-1}y,$$
where, Σ 11 = K + σ 2 I \Sigma_{11}=K+\sigma^{2} I Σ11=K+σ2I,and Σ 12 = Cov ⁡ ( y , y ′ ) \Sigma_{12}=\operatorname{Cov}\left(y, y^{\prime}\right) Σ12=Cov(y,y) then we can see that linear regression and GP regression are equivalent for the kernel/covariance function k ( x , x ′ ) = x ⊤ x ′ k\left(x, x^{\prime}\right)=x^{\top} x^{\prime} k(x,x)=xx
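To see the equivalence numerically, the following sketch (mine; the data are synthetic) checks that the kernel-form prediction $k(x')(K + \sigma^2 I)^{-1} y$ with $k(x, x') = x^{\top} x'$ matches the ridge prediction $x'^{\top}\hat{\beta}$:

```python
import numpy as np

# GP regression with the linear kernel k(x, x') = x^T x' reproduces the
# ridge / linear-regression prediction at a new input x'.
rng = np.random.default_rng(4)
n, p, sigma2 = 50, 5, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + np.sqrt(sigma2) * rng.standard_normal(n)
x_new = rng.standard_normal(p)

beta_hat = np.linalg.solve(X.T @ X + sigma2 * np.eye(p), X.T @ y)
y_linear = x_new @ beta_hat                                   # x'^T beta_hat

K = X @ X.T                                                   # K_ij = x_i^T x_j
k_new = X @ x_new                                             # k(x')_i = x'^T x_i
y_gp = k_new @ np.linalg.solve(K + sigma2 * np.eye(n), y)     # Sigma_21 Sigma_11^{-1} y
assert np.allclose(y_linear, y_gp)
print("kernel-form and linear predictions agree:", y_gp)
```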

We can also replace $x$ by a feature vector in the linear regression, e.g. $\phi(x) = (1 \; x \; x^{2})$.

Then
$$K_{ij} = \phi(x_{i})^{\top} \phi(x_{j}).$$
Generally we don't think about these features explicitly; we just choose a kernel. But any kernel is implicitly choosing a set of features, and our model only includes functions that are linear combinations of this set of features (this space is called the Reproducing Kernel Hilbert Space (RKHS) of $k$).
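For example (a small sketch of my own), with $\phi(x) = (1, x, x^2)$ the Gram matrix built from explicit features is identical to evaluating the implied kernel $k(x, x') = 1 + x x' + x^2 x'^2$ directly, which is one concrete way to see that choosing a kernel is choosing a feature set:

```python
import numpy as np

# Explicit feature map phi(x) = (1, x, x^2) versus the kernel it implies.
x = np.array([-1.0, 0.0, 0.5, 2.0])

Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)   # n x 3 feature matrix
K_features = Phi @ Phi.T                               # K_ij = phi(x_i)^T phi(x_j)

K_kernel = 1 + np.outer(x, x) + np.outer(x ** 2, x ** 2)
assert np.allclose(K_features, K_kernel)
print("feature-space and kernel Gram matrices agree")
```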

Although our simulator may not lie in the RKHS defined by $k$, this space is much richer than any parametric regression model (and can be dense in some sets of continuous bounded functions), and is thus more likely to contain an element close to the true functional form than any class of models that contains only a finite number of features. This is the motivation for non-parametric methods: when we use a GP, we are implicitly working with a much richer set of features.

Example

$$\phi(x) = \left( e^{-\frac{(x - c_{1})^{2}}{2 \lambda^{2}}}, \ldots, e^{-\frac{(x - c_{N})^{2}}{2 \lambda^{2}}} \right)$$

then as $N \to \infty$,
$$\phi(x)^{\top} \phi(x') = \exp\left(-\frac{(x - x')^{2}}{2 \lambda^{2}}\right).$$
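The sketch below (mine; the centres, spacing and lengthscale are arbitrary choices) checks this numerically for a finite but dense set of centres: the normalised inner product $\phi(x)^{\top}\phi(x')$ depends only on the separation $x - x'$ and has a squared-exponential shape, with an effective lengthscale governed by $\lambda$. The effective lengthscale is estimated from the values themselves rather than assumed.

```python
import numpy as np

# With many evenly spaced RBF "bump" features, the implied kernel
# phi(x)^T phi(x') becomes (up to scale) a squared-exponential function
# of the separation x - x'.
lam = 1.0
centres = np.linspace(-10, 10, 2001)            # dense grid of centres c_i

def phi(x):
    return np.exp(-(x - centres) ** 2 / (2 * lam ** 2))

d = np.linspace(0.0, 3.0, 31)                   # separations x' - x to test
k_implicit = np.array([phi(0.0) @ phi(di) for di in d])
k_implicit /= k_implicit[0]                     # normalise so the kernel is 1 at d = 0

# Estimate the effective lengthscale implied by the features and confirm the
# squared-exponential shape (for these particular features it comes out close
# to sqrt(2) * lam).
ell = np.sqrt(-d[1:] ** 2 / (2 * np.log(k_implicit[1:]))).mean()
assert np.allclose(k_implicit, np.exp(-d ** 2 / (2 * ell ** 2)), atol=1e-3)
print("effective lengthscale:", ell, "(lam =", lam, ")")
```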

Why GPs: Naturalness of GP framework

It can be shown, using coherency arguments, geometric arguments, and so on, that the best second-order inference we can do to update our beliefs about $X$ given $Y$ is
$$\mathbb{E}(X \mid Y) = \mathbb{E}(X) + \operatorname{Cov}(X, Y) \operatorname{Var}(Y)^{-1}(Y - \mathbb{E}(Y)),$$
i.e., exactly the Gaussian process update for the posterior mean.
So GPs are in some sense second-order optimal; "second order" means we only work with means and (co)variances.

Why GPs: Uncertainty estimates from emulators

We often think of our prediction as consisting of two parts

  • point estimate
  • uncertainty in that estimate

Difficulties of Using GPs

Kernel

We do not know what the covariance function (the kernel) should look like. What we can do is pick a covariance function from a small set, usually based on differentiability considerations, possibly try a few (plus combinations of a few), and attempt to make a good choice using some sort of empirical evaluation.
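One common way to do that empirical evaluation (sketched below with scikit-learn; the data, the candidate kernels and the noise level are all placeholders) is to fit a handful of covariance functions and compare their fitted log marginal likelihoods:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

# Fit several candidate kernels to the same data and compare the
# (optimised) log marginal likelihood of each.
rng = np.random.default_rng(5)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)

for kernel in [RBF(length_scale=1.0),
               Matern(length_scale=1.0, nu=1.5),
               Matern(length_scale=1.0, nu=2.5)]:
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1 ** 2).fit(X, y)
    print(kernel, "->", gp.log_marginal_likelihood_value_)
```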

Assumptions of GPs

Assuming a GP model for your data imposes a complex structure on the data. The number of parameters in a GP is essentially infinite, and so they are not identified even asymptotically.
So the posterior can concentrate not on a point, but on some submanifold of parameter space, and the projection of the prior on this space continues to impact the posterior even as more and more data are collected.

Hyper-parameter optimization

As well as problems of identifiability, the likelihood surface being maximized is often flat and multi-modal, so the optimizer can fail to converge or get stuck in local maxima.
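A common mitigation (again a sketch rather than anything from the notes) is to restart the optimiser from several random initial hyper-parameter values and keep the best local optimum found, e.g. via scikit-learn's `n_restarts_optimizer`:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Restart the marginal-likelihood optimisation from 10 random initialisations
# and keep the best local optimum.
rng = np.random.default_rng(6)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              alpha=0.1 ** 2,
                              n_restarts_optimizer=10).fit(X, y)
print("fitted kernel:", gp.kernel_)
print("log marginal likelihood:", gp.log_marginal_likelihood_value_)
```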
