Gaussian Process Regression from the Weight-Space and the Function-Space View

Gaussian Process Regression: The Weight-Space View

Gaussian Processes for Linear Regression (Bayesian Linear Regression)

Suppose we have training data $\mathcal{D}=\{\mathbf{x}_i, y_i\}_{i=1}^{n}$ consisting of $n$ observations, where $\mathbf{x}_i$ is the $i$-th input with dimension $d$ and $y_i$ is its corresponding output. We assume the latent relationship between inputs and outputs is the standard linear regression model with Gaussian noise
$$f(\mathbf{x})=\mathbf{x}^{T}\mathbf{w}, \quad y=f(\mathbf{x})+e \tag{1.1}$$
where $\mathbf{w}$ is the weight vector of the linear model, $f$ is the function, and $y$ is the observed output. The additive noise $e$ is i.i.d. Gaussian with zero mean and variance $\sigma_n^2$. Under these assumptions, given the inputs and the weight vector, the likelihood of the outputs is
$$\begin{aligned} p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) &=\prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) \\ &=\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{n}^{2}}} \exp\left(-\frac{(y_i-\mathbf{x}_i^{T}\mathbf{w})^{2}}{2\sigma_{n}^{2}}\right) \\ &=\frac{1}{(2\pi\sigma_{n}^{2})^{n/2}} \exp\left(-\frac{\sum_{i=1}^{n}(y_i-\mathbf{x}_i^{T}\mathbf{w})^{2}}{2\sigma_{n}^{2}}\right) \\ &=\frac{1}{(2\pi\sigma_{n}^{2})^{n/2}} \exp\left(-\frac{\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^{2}}{2\sigma_{n}^{2}}\right) \\ &=\mathcal{N}(\mathbf{X}\mathbf{w}, \sigma_{n}^{2}\mathbf{I}) \end{aligned} \tag{1.2}$$
where $\mathbf{X}=[\mathbf{x}_{1}, \ldots, \mathbf{x}_{n}]^{T} \in \mathbb{R}^{n \times d}$ and $\mathbf{y}=[y_{1}, \ldots, y_{n}]^{T}$.
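As a quick numerical sanity check (with made-up data, not from the post), the product form and the vectorized form of the likelihood in Eq. (1.2) can be evaluated side by side:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))                      # design matrix, one row per input
w = rng.normal(size=d)                           # weight vector
sigma_n = 0.1                                    # noise std (assumed value)
y = X @ w + sigma_n * rng.normal(size=n)         # noisy observations

def log_lik_product(y, X, w, sigma_n):
    """Sum of per-observation Gaussian log densities (the product form)."""
    r = y - X @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_n**2) - r**2 / (2 * sigma_n**2))

def log_lik_vector(y, X, w, sigma_n):
    """Vectorized form using the squared norm ||y - Xw||^2."""
    r = y - X @ w
    return -0.5 * len(y) * np.log(2 * np.pi * sigma_n**2) - (r @ r) / (2 * sigma_n**2)
```

Both functions compute the same log-likelihood; the vectorized form is the one used in the derivations below.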
By Bayes' rule,
$$P(A \mid B)=\frac{P(B \mid A)\, P(A)}{P(B)}$$
where

$P(A \mid B)$ is the posterior: what we want to compute, obtained by combining the prior with the evidence.

$P(B \mid A)$ is the likelihood: how probable the event $B$ (the evidence) is, given that $A$ has occurred.

$P(A)$ is the prior: how probable the event $A$ is on its own.

$P(B)$ is the evidence: how probable the event $B$ is, regardless of $A$.

Therefore
$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})=\frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \times p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})} \tag{1.3}$$
Clearly, to derive the posterior distribution of the weight vector we need a prior over it. We assume a zero-mean Gaussian prior with covariance matrix $\sigma_\mathbf{w}^2 \mathbf{I}$:
$$p(\mathbf{w})=\mathcal{N}(\mathbf{0}, \sigma_{\mathbf{w}}^{2}\mathbf{I})$$
where $\mathbf{I}$ is the $d\times d$ identity matrix. The denominator of Eq. (1.3) is called the evidence or marginal likelihood; it is independent of $\mathbf{w}$ and is given by
$$p(\mathbf{y} \mid \mathbf{X})=\int p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \times p(\mathbf{w})\, d\mathbf{w} \tag{1.4}$$
Since the marginal likelihood is just a normalizing constant, we can derive the posterior
$$\begin{aligned} p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) &\propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \times p(\mathbf{w}) \\ &\propto \exp\left(-\frac{\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^{2}}{2\sigma_{n}^{2}}\right) \exp\left(-\frac{1}{2}\mathbf{w}^{T}(\sigma_{\mathbf{w}}^{2}\mathbf{I})^{-1}\mathbf{w}\right) \\ &\propto \exp\left[-\frac{1}{2}(\mathbf{w}-\widehat{\mathbf{w}})^{T}\left(\frac{1}{\sigma_{n}^{2}}\mathbf{X}^{T}\mathbf{X}+\sigma_{\mathbf{w}}^{-2}\mathbf{I}\right)(\mathbf{w}-\widehat{\mathbf{w}})\right] \end{aligned} \tag{1.5}$$
where
$$\widehat{\mathbf{w}}=\left(\mathbf{X}^{T}\mathbf{X}+\sigma_{n}^{2}\sigma_{\mathbf{w}}^{-2}\mathbf{I}\right)^{-1}\mathbf{X}^{T}\mathbf{y} \tag{1.6}$$
Equation (1.5) shows that the posterior is Gaussian with mean $\widehat{\mathbf{w}}$ and covariance matrix $\left(\frac{1}{\sigma_{n}^{2}}\mathbf{X}^{T}\mathbf{X}+\sigma_{\mathbf{w}}^{-2}\mathbf{I}\right)^{-1}$.
To predict at a new sample $\mathbf{x}_{new}$, we have the posterior distribution of $f(\mathbf{x}_{new})$:
$$p\left(f(\mathbf{x}_{new}) \mid \mathbf{x}_{new}, \mathcal{D}\right)=\int p\left(f(\mathbf{x}_{new}) \mid \mathbf{x}_{new}, \mathbf{w}\right) p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$$

This amounts to averaging the predictions $f(\mathbf{x}_{new})$ over all possible weight vectors, weighted by the posterior $p(\mathbf{w} \mid \mathcal{D})$. The predictive posterior $p\left(f(\mathbf{x}_{new}) \mid \mathbf{x}_{new}, \mathcal{D}\right)$ is again Gaussian, with mean $\mathbf{x}_{new}^T \widehat{\mathbf{w}}$ and variance $\mathbf{x}_{new}^T\left(\frac{1}{\sigma_{n}^{2}}\mathbf{X}^{T}\mathbf{X}+\sigma_{\mathbf{w}}^{-2}\mathbf{I}\right)^{-1}\mathbf{x}_{new}$. Intuitively, this is just the fixed vector $\mathbf{x}_{new}$ multiplied by a random vector $\mathbf{w}$ distributed according to $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$.
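The whole weight-space pipeline can be sketched numerically. Below is a minimal example with synthetic data and assumed hyperparameter values ($\sigma_n$, $\sigma_\mathbf{w}$), computing the posterior of Eqs. (1.5)-(1.6) and the predictive mean and variance at a new input:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
sigma_n, sigma_w = 0.2, 1.0                      # noise std and prior std (assumed values)
X = rng.normal(size=(n, d))
w_true = np.array([0.5, -1.2])                   # ground-truth weights for the toy data
y = X @ w_true + sigma_n * rng.normal(size=n)

# Posterior precision A = X^T X / sigma_n^2 + sigma_w^{-2} I; the posterior
# over w is N(w_hat, A^{-1}), with w_hat as in Eq. (1.6).
A = X.T @ X / sigma_n**2 + np.eye(d) / sigma_w**2
w_hat = np.linalg.solve(X.T @ X + (sigma_n**2 / sigma_w**2) * np.eye(d), X.T @ y)

# Predictive posterior at a new input: Gaussian with mean x_new^T w_hat
# and variance x_new^T A^{-1} x_new.
x_new = np.array([1.0, 2.0])
pred_mean = x_new @ w_hat
pred_var = x_new @ np.linalg.solve(A, x_new)
```

With 50 observations and mild noise, `w_hat` lands close to the weights that generated the data, and `pred_var` quantifies the remaining uncertainty at `x_new`.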

Gaussian Processes for Nonlinear Regression

Map the original $\mathbf{X}$ into a higher-dimensional space to obtain $\phi(\mathbf{X})=[\phi(\mathbf{x}_{1}), \ldots, \phi(\mathbf{x}_{n})]^{T} \in \mathbb{R}^{n \times d^{\prime}}$ with $d'>d$, and let $\mathcal{D'}=\{\phi(\mathbf{x}_i), y_i\}_{i=1}^{n}$. Repeating the Bayesian linear regression derivation on $\mathcal{D'}$ gives the posterior distribution of $f(\phi(\mathbf{x}_{new}))$:
$$p\left(f(\phi(\mathbf{x}_{new})) \mid \phi(\mathbf{x}_{new}), \mathcal{D'}\right)=\int p\left(f(\phi(\mathbf{x}_{new})) \mid \phi(\mathbf{x}_{new}), \boldsymbol{\omega}\right) p(\boldsymbol{\omega} \mid \mathcal{D'})\, d\boldsymbol{\omega} \tag{2.1}$$
with mean
$$m(f(\phi(\mathbf{x}_{new})))=\phi(\mathbf{x}_{new})^T \widehat{\boldsymbol{\omega}} \tag{2.2}$$
and variance
$$\sigma^2(f(\phi(\mathbf{x}_{new})))=\phi(\mathbf{x}_{new})^T\left(\frac{1}{\sigma_{n}^{2}}\phi(\mathbf{X})^{T}\phi(\mathbf{X})+\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\phi(\mathbf{x}_{new}) \tag{2.3}$$

Using the matrix inversion lemma, (2.2) and (2.3) can be rearranged into the forms (2.4) and (2.5):
$$\begin{aligned} m\left(f(\phi(\mathbf{x}_{new}))\right) &=\phi(\mathbf{x}_{new})^{T}\widehat{\boldsymbol{\omega}} \\ &=\phi(\mathbf{x}_{new})^{T}\left(\phi(\mathbf{X})^{T}\phi(\mathbf{X})+\sigma_{n}^{2}\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\phi(\mathbf{X})^{T}\mathbf{y} \\ &=\phi(\mathbf{x}_{new})^{T}\phi(\mathbf{X})^{T}\left(\phi(\mathbf{X})\phi(\mathbf{X})^{T}+\sigma_{n}^{2}\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\mathbf{y} \end{aligned} \tag{2.4}$$

$$\begin{aligned} \sigma^{2}\left(f(\phi(\mathbf{x}_{new}))\right) &=\phi(\mathbf{x}_{new})^{T}\left(\frac{1}{\sigma_{n}^{2}}\phi(\mathbf{X})^{T}\phi(\mathbf{X})+\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\phi(\mathbf{x}_{new}) \\ &=\sigma_{\boldsymbol{\omega}}^{2}\phi(\mathbf{x}_{new})^{T}\phi(\mathbf{x}_{new})-\sigma_{\boldsymbol{\omega}}^{2}\phi(\mathbf{x}_{new})^{T}\phi(\mathbf{X})^{T}\left(\phi(\mathbf{X})\phi(\mathbf{X})^{T}+\sigma_{n}^{2}\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\phi(\mathbf{X})\phi(\mathbf{x}_{new}) \end{aligned} \tag{2.5}$$
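The equivalence of the weight-space form (2.3) and the kernel-space form (2.5) can be checked numerically on a random feature matrix (the feature map and hyperparameter values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_prime = 8, 5
sigma_n, sigma_w = 0.3, 0.7                      # assumed noise and prior stds
Phi = rng.normal(size=(n, d_prime))              # phi(X), one row per training point
phi_new = rng.normal(size=d_prime)               # phi(x_new)

# Eq. (2.3): variance via a d' x d' solve in weight space.
A = Phi.T @ Phi / sigma_n**2 + np.eye(d_prime) / sigma_w**2
var_weight_space = phi_new @ np.linalg.solve(A, phi_new)

# Eq. (2.5): the same quantity after the matrix inversion lemma, now an n x n solve.
G = Phi @ Phi.T + (sigma_n**2 / sigma_w**2) * np.eye(n)
var_kernel_space = sigma_w**2 * (phi_new @ phi_new
                   - phi_new @ Phi.T @ np.linalg.solve(G, Phi @ phi_new))
```

The two expressions agree to numerical precision; the kernel-space form is preferable when $d' \gg n$, since it only ever inverts an $n \times n$ matrix.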

Applying the kernel trick to (2.4) and (2.5), let
$$\widetilde{\mathbf{h}}_{*}^{T}=\phi(\mathbf{x}_{new})^{T}\phi(\mathbf{X})^{T}=\left[\widetilde{\kappa}(\mathbf{x}_{new}, \mathbf{x}_{1}), \cdots, \widetilde{\kappa}(\mathbf{x}_{new}, \mathbf{x}_{n})\right]$$

$$\widetilde{\mathbf{K}}=\phi(\mathbf{X})\phi(\mathbf{X})^{T}=\begin{bmatrix} \widetilde{\kappa}(\mathbf{x}_{1}, \mathbf{x}_{1}) & \cdots & \widetilde{\kappa}(\mathbf{x}_{n}, \mathbf{x}_{1}) \\ \vdots & \ddots & \vdots \\ \widetilde{\kappa}(\mathbf{x}_{1}, \mathbf{x}_{n}) & \cdots & \widetilde{\kappa}(\mathbf{x}_{n}, \mathbf{x}_{n}) \end{bmatrix}$$

$$m\left(f(\phi(\mathbf{x}_{new}))\right)=\widetilde{\mathbf{h}}_{*}^{T}\left(\widetilde{\mathbf{K}}+\sigma_{n}^{2}\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\mathbf{y} = \sigma_{\boldsymbol{\omega}}^2\widetilde{\mathbf{h}}_*^T\left(\sigma_{\boldsymbol{\omega}}^2\widetilde{\mathbf{K}} + \sigma_n^2\mathbf{I}\right)^{-1}\mathbf{y} \tag{2.6}$$

$$\sigma^{2}\left(f(\phi(\mathbf{x}_{new}))\right)=\sigma_{\boldsymbol{\omega}}^{2}\left[\widetilde{\kappa}(\mathbf{x}_{new}, \mathbf{x}_{new})-\widetilde{\mathbf{h}}_{*}^{T}\left(\widetilde{\mathbf{K}}+\sigma_{n}^{2}\sigma_{\boldsymbol{\omega}}^{-2}\mathbf{I}\right)^{-1}\widetilde{\mathbf{h}}_{*}\right] \tag{2.7}$$

Absorb the constant $\sigma_{\boldsymbol{\omega}}^{2}$ into the kernel by defining the new kernel $\kappa=\sigma_{\boldsymbol{\omega}}^{2}\widetilde{\kappa}$. In terms of the new kernel, (2.6) and (2.7) can be rewritten as:
$$m\left(f(\phi(\mathbf{x}_{new}))\right)=\mathbf{h}_{*}^{T}\left(\mathbf{K}+\sigma_{n}^{2}\mathbf{I}\right)^{-1}\mathbf{y} \tag{2.8}$$

$$\sigma^{2}\left(f(\phi(\mathbf{x}_{new}))\right)=\kappa(\mathbf{x}_{new}, \mathbf{x}_{new})-\mathbf{h}_{*}^{T}\left(\mathbf{K}+\sigma_{n}^{2}\mathbf{I}\right)^{-1}\mathbf{h}_{*} \tag{2.9}$$
where
$$\mathbf{h}_{*}^{T}=\sigma_{\boldsymbol{\omega}}^{2}\phi(\mathbf{x}_{new})^{T}\phi(\mathbf{X})^{T}=\left[\kappa(\mathbf{x}_{new}, \mathbf{x}_{1}), \cdots, \kappa(\mathbf{x}_{new}, \mathbf{x}_{n})\right]$$

$$\mathbf{K}=\sigma_{\boldsymbol{\omega}}^{2}\phi(\mathbf{X})\phi(\mathbf{X})^{T}=\begin{bmatrix} \kappa(\mathbf{x}_{1}, \mathbf{x}_{1}) & \cdots & \kappa(\mathbf{x}_{n}, \mathbf{x}_{1}) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{x}_{1}, \mathbf{x}_{n}) & \cdots & \kappa(\mathbf{x}_{n}, \mathbf{x}_{n}) \end{bmatrix}$$
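At this point the feature map never appears explicitly; prediction only needs kernel evaluations. A minimal sketch, using an RBF kernel as an assumed choice of $\kappa$ (the post leaves the kernel generic) and synthetic 1-D data:

```python
import numpy as np

def kappa(A, B, lengthscale=1.0):
    """Example RBF kernel (an assumption; the absorbed sigma_w^2 factor is taken as 1)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(3)
n = 30
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # noisy samples of sin(x)
sigma_n = 0.1                                    # assumed noise std
x_new = np.array([[0.5]])

K = kappa(X, X)                                  # n x n Gram matrix
h = kappa(X, x_new)[:, 0]                        # h_*: kernel between x_new and training inputs
C = K + sigma_n**2 * np.eye(n)
mean = h @ np.linalg.solve(C, y)                 # predictive mean, as in Eq. (2.8)
var = kappa(x_new, x_new)[0, 0] - h @ np.linalg.solve(C, h)  # predictive variance
```

The predicted mean at $x=0.5$ tracks $\sin(0.5)$, and the variance is strictly smaller than the prior variance $\kappa(x_{new}, x_{new})$, reflecting what the training data has taught us.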

Gaussian Process Regression: The Function-Space View

Suppose again that we have training data $\mathcal{D}=\{\mathbf{x}_i, y_i\}_{i=1}^{n}$ consisting of $n$ observations, where $\mathbf{x}_i$ is the $i$-th input with dimension $d$ and $y_i$ is its corresponding output. We assume the latent relationship between inputs and outputs is a linear regression model in feature space with Gaussian noise
$$f(\mathbf{x})=\phi(\mathbf{x})^{T}\mathbf{w}, \quad y=f(\mathbf{x})+e \tag{3.1}$$
where $\mathbf{w}$ is the weight vector, $f$ is the function, and $y$ is the observed output. The additive noise $e$ is i.i.d. Gaussian with zero mean and variance $\sigma_n^2$.
We assume a zero-mean Gaussian prior with covariance matrix $\sigma_\mathbf{w}^2 \mathbf{I}$:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_{\mathbf{w}}^{2}\mathbf{I})$$
$$\begin{aligned} E_{\mathbf{w}}[f(\mathbf{X})] &=E_{\mathbf{w}}[\phi(\mathbf{X})\mathbf{w}]=\phi(\mathbf{X})E_{\mathbf{w}}[\mathbf{w}]=\mathbf{0} \\ \operatorname{cov}\left(f(\mathbf{X}), f(\mathbf{X})^{T}\right) &=E\left[(f(\mathbf{X})-E[f(\mathbf{X})])\left(f(\mathbf{X})^{T}-E\left[f(\mathbf{X})^{T}\right]\right)\right] \\ &=E\left[f(\mathbf{X})f(\mathbf{X})^{T}\right] \\ &=E\left[\phi(\mathbf{X})\mathbf{w}\mathbf{w}^{T}\phi(\mathbf{X})^{T}\right] \\ &=\phi(\mathbf{X})E\left[\mathbf{w}\mathbf{w}^{T}\right]\phi(\mathbf{X})^{T} \\ &=\phi(\mathbf{X})\,\sigma_{\mathbf{w}}^{2}\mathbf{I}\,\phi(\mathbf{X})^{T} \\ &=\sigma_{\mathbf{w}}^{2}\phi(\mathbf{X})\phi(\mathbf{X})^{T} \end{aligned}$$
Let $K=\sigma_{\mathbf{w}}^{2}\phi(\mathbf{X})\phi(\mathbf{X})^{T}$; therefore
$$f(\mathbf{X}) \sim \mathcal{N}(\mathbf{0}, K)$$
If we view $\mathbf{0}$ as a (special) mean function, i.e. $\mu(\mathbf{X})=\mathbf{0}$, the above can be written as
$$f(\mathbf{X}) \sim \mathcal{N}(\mu(\mathbf{X}), K)$$
This says that for any input $\mathbf{x}_i$, $f(\mathbf{x}_i)$ can be viewed as a Gaussian random variable, and the joint distribution of any finite collection of values $f(\mathbf{x}_i)$ is again Gaussian.
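To make this concrete, one can draw "functions" from the prior $f(\mathbf{X}) \sim \mathcal{N}(\mathbf{0}, K)$ at a finite grid of inputs; each draw is one function evaluated on the grid. The RBF covariance below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 100
X = np.linspace(-3, 3, m)[:, None]               # a grid of 1-D inputs
K = np.exp(-0.5 * (X - X.T)**2)                  # example RBF covariance (an assumption)

# Draw three prior functions from N(0, K); each column is one draw on the grid.
L = np.linalg.cholesky(K + 1e-6 * np.eye(m))     # small jitter for numerical stability
samples = L @ rng.normal(size=(m, 3))
```

Plotting the columns of `samples` against `X` shows smooth random curves: the prior over functions induced by the kernel, before any data is observed.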
Since the sum of two independent Gaussian variables is again Gaussian, we obtain
$$\mathbf{y}=f(\mathbf{X})+e \sim \mathcal{N}(\mu(\mathbf{X}), K+\sigma_n^2\mathbf{I})$$

Therefore, for the training set $\mathbf{X}$ we have $\mathbf{y} \sim \mathcal{N}(\mu(\mathbf{X}), K+\sigma_n^2\mathbf{I})$, and for the test set $\mathbf{X}_{new}$ we have $f(\mathbf{X}_{new}) \sim \mathcal{N}(\mu(\mathbf{X}_{new}), K(\mathbf{X}_{new},\mathbf{X}_{new}))$.
By the properties of the Gaussian distribution, the joint distribution of $(\mathbf{y}, f(\mathbf{X}_{new}))$ is still Gaussian:
$$\begin{pmatrix} \mathbf{y} \\ f(\mathbf{X}_{new}) \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu(\mathbf{X}) \\ \mu(\mathbf{X}_{new}) \end{pmatrix}, \begin{pmatrix} K(\mathbf{X}, \mathbf{X})+\sigma_n^{2}\mathbf{I} & K(\mathbf{X}, \mathbf{X}_{new}) \\ K(\mathbf{X}_{new}, \mathbf{X}) & K(\mathbf{X}_{new}, \mathbf{X}_{new}) \end{pmatrix}\right)$$

$$\begin{aligned} p\left(f(\mathbf{X}_{new}) \mid \mathbf{y}\right) &= \mathcal{N}(\mu^{*}, \Sigma^{*}) \\ \mu^{*} &= \mu(\mathbf{X}_{new}) + K(\mathbf{X}_{new}, \mathbf{X})\left(K(\mathbf{X}, \mathbf{X})+\sigma_n^{2}\mathbf{I}\right)^{-1}(\mathbf{y}-\mu(\mathbf{X})) \\ \Sigma^{*} &= K(\mathbf{X}_{new}, \mathbf{X}_{new})-K(\mathbf{X}_{new}, \mathbf{X})\left(K(\mathbf{X}, \mathbf{X})+\sigma_n^{2}\mathbf{I}\right)^{-1}K(\mathbf{X}, \mathbf{X}_{new}) \end{aligned}$$
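The posterior formulas above translate directly into a few lines of linear algebra. A sketch with a zero mean function, an assumed RBF kernel, and synthetic data:

```python
import numpy as np

def k(a, b):
    """Example RBF covariance function (an assumption; any valid kernel works)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(5)
n = 25
X = np.sort(rng.uniform(-3, 3, n))               # 1-D training inputs
y = np.sin(X) + 0.1 * rng.normal(size=n)         # noisy observations
X_new = np.linspace(-3, 3, 50)                   # test inputs
sigma_n = 0.1                                    # assumed noise std; mu(X) = 0 throughout

C = k(X, X) + sigma_n**2 * np.eye(n)             # K(X, X) + sigma_n^2 I
K_sx = k(X_new, X)                               # K(X_new, X)

mu_star = K_sx @ np.linalg.solve(C, y)           # posterior mean mu*
Sigma_star = k(X_new, X_new) - K_sx @ np.linalg.solve(C, K_sx.T)  # posterior covariance Sigma*
```

Note that this is exactly the kernelized prediction obtained from the weight-space view: the two derivations arrive at the same posterior mean and covariance.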
