Gaussian Process (GPSS Summer School Notes 5, English version): Unsupervised Learning with Gaussian Processes

In the unsupervised setting we only observe $y$; both the mapping $f$ and its inputs $x$ are unknown, so the model marginalises over both:

$$
\begin{aligned}
p(y) &= \int p(y | f)\, p(f | x)\, p(x)\, \mathrm{d}f\, \mathrm{d}x \\
p(x | y) &= p(y | x)\, \frac{p(x)}{p(y)}
\end{aligned}
$$

  1. Priors that make sense:

    • p(f) describes our beliefs/assumptions and defines our notion of complexity in the function space
    • p(x) expresses our beliefs/assumptions and defines our notion of complexity in the latent space
  2. The priors are balanced.

    GP prior:

$$
p(f | x) \sim \mathcal{N}(0, K) \propto e^{-\frac{1}{2} f^{\mathrm{T}} K^{-1} f},
\qquad
K_{ij} = e^{-(x_i - x_j)^{\mathrm{T}} M^{\mathrm{T}} M (x_i - x_j)}
$$
    Likelihood:

$$
p(y | f) \sim \mathcal{N}(y | f, \beta) \propto e^{-\frac{1}{2\beta} \operatorname{tr}\left((y - f)^{\mathrm{T}} (y - f)\right)}
$$
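As a concrete illustration, here is a minimal NumPy sketch of sampling from this generative model: latent points from a prior $p(x)$ (taken here as standard normal, an assumption, not stated in the notes), a function from the GP prior $p(f | x)$ with the kernel above, and observations from the Gaussian likelihood. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 1          # number of points, latent dimensionality

# p(x): standard normal prior over latent locations (assumed)
x = rng.standard_normal((n, q))

# K_ij = exp(-(x_i - x_j)^T M^T M (x_i - x_j)); M defines the metric
M = np.eye(q)
d = x[:, None, :] - x[None, :, :]                # pairwise differences
K = np.exp(-np.einsum('ijk,kl,ijl->ij', d, M.T @ M, d))

# p(f | x) = N(0, K): one sample path from the GP prior
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))     # jitter for stability
f = L @ rng.standard_normal(n)

# p(y | f) = N(y | f, beta): noisy observations
beta = 0.01
y = f + np.sqrt(beta) * rng.standard_normal(n)
```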

This marginalisation is analytically intractable (a non-elementary integral), even though the integrand is infinitely differentiable. One way to avoid the integral is to take a point estimate of $x$:
$$
\begin{aligned}
\hat{x} &= \operatorname{argmax}_x \left( \int p(y | f)\, p(f | x)\, \mathrm{d}f \right) p(x) \\
&= \operatorname{argmin}_x\; \frac{1}{2} y^{\mathrm{T}} K^{-1} y + \frac{1}{2} \log |K| - \log p(x)
\end{aligned}
$$

(here $K$ absorbs the noise variance $\beta$, since $\int p(y|f)\,p(f|x)\,\mathrm{d}f = \mathcal{N}(y\,|\,0, K_f + \beta I)$).
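A sketch of this point estimate for a 1-D latent space, assuming a standard normal prior $p(x)$ and folding the noise variance into $K$ as noted above. SciPy's default quasi-Newton optimiser does the work; function and variable names are my own.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(x_flat, y, beta=0.01):
    """1/2 y^T K^-1 y + 1/2 log|K| - log p(x) for a 1-D latent space."""
    x = x_flat[:, None]
    d2 = (x - x.T) ** 2                          # squared distances, M = I
    K = np.exp(-d2) + beta * np.eye(len(y))      # kernel plus noise variance
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^-1 y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log|K|
    log_prior = -0.5 * np.sum(x_flat ** 2)       # standard normal p(x)
    return 0.5 * y @ alpha + 0.5 * log_det - log_prior

# toy data and a random initialisation; the objective is non-convex,
# so the optimum found depends on the starting point
rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, 3, 30)) + 0.1 * rng.standard_normal(30)
res = minimize(neg_log_posterior, rng.standard_normal(30), args=(y,))
x_hat = res.x
```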
Challenges with this maximum-likelihood (ML) estimation:

  • How do we initialize x? A common choice in practice is PCA (see the sketch after this list).
  • What is the dimensionality q? That is, how complex must the latent space be to represent y?
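A minimal sketch of PCA initialisation, assuming NumPy only; it gives both a starting point for x and, via the singular values, a rough guide to a sensible q. Names are illustrative.

```python
import numpy as np

def pca_init(Y, q):
    """Initialise latent positions with the top-q principal components of Y."""
    Yc = Y - Y.mean(axis=0)
    U, S, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :q] * S[:q]          # n x q principal-component scores

# toy data: 5 latent directions embedded in 10 observed dimensions
rng = np.random.default_rng(2)
Y = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 10))
X0 = pca_init(Y, q=2)                # starting point for the optimiser
```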

Variational Bayes:

$$
\begin{aligned}
\log p(\mathbf{Y}) &= \log \int p(\mathbf{Y}, \mathbf{X})\, \mathrm{d}\mathbf{X}
= \log \int p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})\, \mathrm{d}\mathbf{X} \\
&= \log \int \frac{q(\mathbf{X})}{q(\mathbf{X})}\, p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})\, \mathrm{d}\mathbf{X}
\end{aligned}
$$

For a convex function:

$$
\lambda f(x_0) + (1 - \lambda) f(x_1) \geq f(\lambda x_0 + (1 - \lambda) x_1),
\qquad x \in [x_{\min}, x_{\max}],\quad \lambda \in [0, 1]
$$
In probability, this is Jensen's inequality:

$$
\mathbb{E}[f(x)] \geq f(\mathbb{E}[x])
\quad\Longleftrightarrow\quad
\int f(x)\, p(x)\, \mathrm{d}x \geq f\!\left( \int x\, p(x)\, \mathrm{d}x \right)
$$

Since $\log$ is concave, the inequality flips:

$$
\int \log(x)\, p(x)\, \mathrm{d}x \leq \log\!\left( \int x\, p(x)\, \mathrm{d}x \right)
$$
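A quick numerical sanity check of both inequalities, with samples from an (arbitrary) exponential distribution and the convex choice $f(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)    # E[x] = 2, E[x^2] = 8

# convex f: E[f(x)] >= f(E[x])
f = lambda t: t ** 2
print(f(x).mean(), ">=", f(x.mean()))           # ~8.0 >= ~4.0

# concave log: E[log x] <= log E[x]
print(np.log(x).mean(), "<=", np.log(x.mean()))
```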
thus,

$$
\begin{aligned}
\log p(\mathbf{Y}) &= \log \int \frac{q(\mathbf{X})}{q(\mathbf{X})}\, p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})\, \mathrm{d}\mathbf{X} \\
&\geq \int q(\mathbf{X}) \log \frac{p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})}{q(\mathbf{X})}\, \mathrm{d}\mathbf{X} \\
&= \int q(\mathbf{X}) \log \frac{p(\mathbf{X} | \mathbf{Y})}{q(\mathbf{X})}\, \mathrm{d}\mathbf{X} + \log p(\mathbf{Y}) \int q(\mathbf{X})\, \mathrm{d}\mathbf{X} \\
&= -\mathrm{KL}(q(\mathbf{X})\, \|\, p(\mathbf{X} | \mathbf{Y})) + \log p(\mathbf{Y})
\end{aligned}
$$
$\mathrm{KL}$ denotes the Kullback–Leibler divergence, a measure of how one probability distribution differs from a second, reference distribution.

If $q(\mathbf{X})$ is the true posterior, the bound holds with equality; variational inference therefore tries to match $q(\mathbf{X})$ to the posterior.
$$
\begin{aligned}
\mathrm{KL}(q(\mathbf{X})\, \|\, p(\mathbf{X} | \mathbf{Y})) &= \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{p(\mathbf{X} | \mathbf{Y})}\, \mathrm{d}\mathbf{X} \\
&= \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{p(\mathbf{X}, \mathbf{Y})}\, \mathrm{d}\mathbf{X} + \log p(\mathbf{Y}) \\
&= -H(q(\mathbf{X})) - \mathbb{E}_{q(\mathbf{X})}[\log p(\mathbf{X}, \mathbf{Y})] + \log p(\mathbf{Y})
\end{aligned}
$$

where $H(q) = -\int q \log q\, \mathrm{d}\mathbf{X}$ is the entropy of $q$.
Rearranging:

$$
\begin{aligned}
\log p(\mathbf{Y}) &= \mathrm{KL}(q(\mathbf{X})\, \|\, p(\mathbf{X} | \mathbf{Y})) + \underbrace{\mathbb{E}_{q(\mathbf{X})}[\log p(\mathbf{X}, \mathbf{Y})] + H(q(\mathbf{X}))}_{\text{ELBO}} \\
&\geq \mathbb{E}_{q(\mathbf{X})}[\log p(\mathbf{X}, \mathbf{Y})] + H(q(\mathbf{X})) = \mathcal{L}(q(\mathbf{X}))
\end{aligned}
$$

(the inequality holds because $\mathrm{KL} \geq 0$).
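This identity can be checked end to end on a toy conjugate model where everything stays Gaussian and closed-form: $X \sim \mathcal{N}(0,1)$, $Y | X \sim \mathcal{N}(X, \sigma^2)$. The numbers below are arbitrary; the point is that $\log p(Y) = \mathrm{KL} + \mathrm{ELBO}$ holds exactly for any $q$.

```python
import numpy as np

# toy conjugate model: X ~ N(0,1), Y | X ~ N(X, sigma2)
sigma2, y = 0.5, 1.3

# exact marginal and posterior (all Gaussian, so closed form)
log_pY = -0.5 * np.log(2 * np.pi * (1 + sigma2)) - y**2 / (2 * (1 + sigma2))
mu_post = y / (1 + sigma2)
var_post = sigma2 / (1 + sigma2)

# an arbitrary variational q(X) = N(m, s2)
m, s2 = 0.2, 0.4

# ELBO = E_q[log p(X, Y)] + H(q)
E_logjoint = (-0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
              - 0.5 * np.log(2 * np.pi * sigma2)
              - ((y - m)**2 + s2) / (2 * sigma2))
entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
elbo = E_logjoint + entropy

# KL(q || posterior) between two Gaussians, closed form
kl = (np.log(np.sqrt(var_post / s2))
      + (s2 + (m - mu_post)**2) / (2 * var_post) - 0.5)

print(log_pY, "==", elbo + kl)   # identical up to float error
```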
Maximizing the ELBO therefore does two things at once (see the sketch after this list):

  • it finds an approximate posterior $q(\mathbf{X}) \approx p(\mathbf{X} | \mathbf{Y})$, since the KL term shrinks;
  • it gives an approximation to the log marginal likelihood $\log p(\mathbf{Y})$.
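Continuing the same toy model: maximising the ELBO over the variational parameters $(m, s^2)$ recovers the exact posterior and makes the bound tight, which is precisely the two points above. A sketch, parameterising by $\log s^2$ so the variance stays positive:

```python
import numpy as np
from scipy.optimize import minimize

sigma2, y = 0.5, 1.3

def neg_elbo(params):
    m, s2 = params[0], np.exp(params[1])     # log-variance parameterisation
    E_logjoint = (-0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
                  - 0.5 * np.log(2 * np.pi * sigma2)
                  - ((y - m)**2 + s2) / (2 * sigma2))
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return -(E_logjoint + entropy)

res = minimize(neg_elbo, np.array([0.0, 0.0]))
m_opt, s2_opt = res.x[0], np.exp(res.x[1])
print(m_opt, s2_opt)      # -> y/(1+sigma2), sigma2/(1+sigma2): the posterior
print(-res.fun)           # -> log p(Y): the bound is tight at the optimum
```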

Maximizing $p(\mathbf{Y})$ is learning; finding $p(\mathbf{X} | \mathbf{Y})$ (our updated belief over $\mathbf{X}$, which takes the place of the prior $p(\mathbf{X})$) is prediction.
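In the GP-LVM setting this is what the Bayesian GP-LVM used in the GPSS labs does in one call. A sketch assuming the GPy library's interface (argument and attribute names written from memory, so treat them as indicative rather than exact):

```python
import numpy as np
import GPy

Y = np.random.randn(100, 10)                  # stand-in for real observations

# learning: optimise kernel/noise parameters and the variational
# distribution q(X) by maximising the ELBO on log p(Y)
m = GPy.models.BayesianGPLVM(Y, input_dim=2, num_inducing=20,
                             kernel=GPy.kern.RBF(2, ARD=True))
m.optimize(messages=True)

# prediction side: the optimised q(X) approximates p(X | Y)
print(m.X.mean[:5])      # posterior means of the latent points (assumed attr)
```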
