Gaussian Process Regression: The Weight-Space View
Gaussian Processes for Linear Regression (Bayesian Linear Regression)
Suppose we have training data of $n$ observations, $\mathcal{D}=\{\mathbf{x}_i, y_i\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the $i$-th input, of dimension $d$, and $y_i$ is the corresponding output. We assume the latent relationship between inputs and outputs is a standard linear regression model with Gaussian noise:

$$f(\mathbf{x})=\mathbf{x}^{T} \mathbf{w}, \quad y=f(\mathbf{x})+e \tag{1.1}$$
Here $\mathbf{w}$ is the weight vector of the linear model, $f$ is the latent function, and $y$ is the observed output. The noise $e$ is independent, identically distributed additive Gaussian noise with zero mean and variance $\sigma_n^2$. Under these assumptions, the likelihood of the outputs given the inputs and the weight vector is
$$\begin{aligned} p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) &=\prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) \\ &=\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma_{n}^{2}}} \exp \left(-\frac{\left(y_i-\mathbf{x}_i^{T} \mathbf{w}\right)^{2}}{2 \sigma_{n}^{2}}\right) \\ &=\frac{1}{\left(2 \pi \sigma_{n}^{2}\right)^{n / 2}} \exp \left(-\frac{\sum_{i=1}^{n}\left(y_i-\mathbf{x}_i^{T} \mathbf{w}\right)^{2}}{2 \sigma_{n}^{2}}\right) \\ &=\frac{1}{\left(2 \pi \sigma_{n}^{2}\right)^{n / 2}} \exp \left(-\frac{\left\|\mathbf{y}-\mathbf{X} \mathbf{w}\right\|^{2}}{2 \sigma_{n}^{2}}\right) \\ &=\mathcal{N}\left(\mathbf{X} \mathbf{w}, \sigma_{n}^{2} \mathbf{I}\right) \end{aligned} \tag{1.2}$$
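The factorization in (1.2) can be sanity-checked numerically: the product of the $n$ univariate Gaussian densities must equal the multivariate density $\mathcal{N}(\mathbf{X}\mathbf{w}, \sigma_n^2\mathbf{I})$ evaluated at $\mathbf{y}$. A minimal NumPy sketch (the sizes and noise level below are assumed toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
sigma_n = 0.3
y = X @ w + sigma_n * rng.standard_normal(n)

# Product of the n univariate Gaussian densities (second line of (1.2)).
resid = y - X @ w
per_point = np.exp(-resid**2 / (2 * sigma_n**2)) / np.sqrt(2 * np.pi * sigma_n**2)
lik_prod = per_point.prod()

# Multivariate density N(Xw, sigma_n^2 I) evaluated at y (last line of (1.2)).
lik_mvn = np.exp(-(resid @ resid) / (2 * sigma_n**2)) / (2 * np.pi * sigma_n**2) ** (n / 2)

assert np.isclose(lik_prod, lik_mvn)
```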
where $\mathbf{X}=\left[\mathbf{x}_{1}, \ldots, \mathbf{x}_{n}\right]^{T} \in \mathbb{R}^{n \times d}$ and $\mathbf{y}=\left[y_{1}, \ldots, y_{n}\right]^{T}$.
By Bayes' rule,

$$P(A \mid B)=\frac{P(B \mid A) P(A)}{P(B)}$$

where $P(A \mid B)$ is the posterior: what we want to know, obtained by combining the prior with the evidence. $P(B \mid A)$ is the likelihood: how probable the observed event $B$ (the evidence) is, given that $A$ holds. $P(A)$ is the prior: how probable event $A$ is before seeing any data. $P(B)$ is the evidence: how probable event $B$ is regardless of $A$.
Therefore

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})=\frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \times p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})} \tag{1.3}$$
Clearly, to derive the posterior of the weight vector we need a prior over it. We assume a zero-mean Gaussian prior with covariance matrix $\sigma_\mathbf{w}^2 \mathbf{I}$:

$$p(\mathbf{w})=\mathcal{N}\left(\mathbf{0}, \sigma_{\mathbf{w}}^{2} \mathbf{I}\right)$$
where $\mathbf{I}$ is the $d\times d$ identity matrix. The denominator of Eq. (1.3) is called the evidence or marginal likelihood; it is independent of $\mathbf{w}$ and is given by
$$p(\mathbf{y} \mid \mathbf{X})=\int p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \times p(\mathbf{w})\, d \mathbf{w} \tag{1.4}$$
Since the marginal likelihood is just a normalizing constant, the posterior satisfies

$$\begin{aligned} p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) & \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \times p(\mathbf{w}) \\ & \propto \exp \left(-\frac{\left\|\mathbf{y}-\mathbf{X} \mathbf{w}\right\|^{2}}{2 \sigma_{n}^{2}}\right) \exp \left(-\frac{1}{2} \mathbf{w}^{T}\left(\sigma_{\mathbf{w}}^{2} \mathbf{I}\right)^{-1} \mathbf{w}\right) \\ & \propto \exp \left[-\frac{1}{2}(\mathbf{w}-\widehat{\mathbf{w}})^{T}\left(\frac{1}{\sigma_{n}^{2}} \mathbf{X}^{T} \mathbf{X}+\sigma_{\mathbf{w}}^{-2} \mathbf{I}\right)(\mathbf{w}-\widehat{\mathbf{w}})\right] \end{aligned} \tag{1.5}$$
where

$$\widehat{\mathbf{w}}=\left(\mathbf{X}^{T} \mathbf{X}+\sigma_{n}^{2} \sigma_{\mathbf{w}}^{-2} \mathbf{I}\right)^{-1} \mathbf{X}^{T} \mathbf{y} \tag{1.6}$$
Eq. (1.5) shows that the posterior is a Gaussian with mean $\widehat{\mathbf{w}}$ and covariance matrix $\left(\frac{1}{\sigma_{n}^{2}} \mathbf{X}^{T} \mathbf{X}+\sigma_{\mathbf{w}}^{-2} \mathbf{I}\right)^{-1}$.
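Eqs. (1.5) and (1.6) translate directly into a few lines of linear algebra. The sketch below, with assumed toy values for $\sigma_n$, $\sigma_\mathbf{w}$, and the data, computes the posterior mean and covariance of $\mathbf{w}$ and checks that the two equivalent expressions for $\widehat{\mathbf{w}}$ agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
sigma_n, sigma_w = 0.1, 1.0            # assumed noise and prior standard deviations
w_true = np.array([0.5, -1.2])         # assumed ground-truth weights for the toy data
X = rng.standard_normal((n, d))
y = X @ w_true + sigma_n * rng.standard_normal(n)

# Posterior covariance: inverse of A = X^T X / sigma_n^2 + I / sigma_w^2, from (1.5)
A = X.T @ X / sigma_n**2 + np.eye(d) / sigma_w**2
cov_post = np.linalg.inv(A)

# Posterior mean, directly from (1.6)
w_hat = np.linalg.solve(X.T @ X + (sigma_n**2 / sigma_w**2) * np.eye(d), X.T @ y)

# Equivalent expression: w_hat = A^{-1} X^T y / sigma_n^2
assert np.allclose(cov_post @ (X.T @ y) / sigma_n**2, w_hat)
```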
To predict at a new sample $\mathbf{x}_{new}$, we use the posterior distribution of $f(\mathbf{x}_{new})$:

$$p\left(f\left(\mathbf{x}_{new}\right) \mid \mathbf{x}_{new}, \mathcal{D}\right)=\int p\left(f\left(\mathbf{x}_{new}\right) \mid \mathbf{x}_{new}, \mathbf{w}\right) p(\mathbf{w} \mid \mathcal{D})\, d \mathbf{w}$$
This amounts to averaging the predictions $f\left(\mathbf{x}_{new}\right)$ over all possible weight vectors, weighted by the posterior $p(\mathbf{w} \mid \mathcal{D})$. The predictive posterior $p\left(f\left(\mathbf{x}_{new}\right) \mid \mathbf{x}_{new}, \mathcal{D}\right)$ is again Gaussian, with mean $\mathbf{x}_{new}^T \widehat{\mathbf{w}}$ and variance $\mathbf{x}_{new}^T\left(\frac{1}{\sigma_{n}^{2}} \mathbf{X}^{T} \mathbf{X}+\sigma_{\mathbf{w}}^{-2} \mathbf{I}\right)^{-1}\mathbf{x}_{new}$. Intuitively, $f(\mathbf{x}_{new})$ is simply the fixed vector $\mathbf{x}_{new}$ dotted with a random variable $\mathbf{w}$ distributed according to the Gaussian posterior $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$.
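The predictive mean and variance above can be sketched in a few lines (toy data; `sigma_n` and `sigma_w` are assumed hyperparameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
sigma_n, sigma_w = 0.2, 1.0            # assumed hyperparameters
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + sigma_n * rng.standard_normal(n)

A = X.T @ X / sigma_n**2 + np.eye(d) / sigma_w**2
w_hat = np.linalg.solve(A, X.T @ y) / sigma_n**2

x_new = rng.standard_normal(d)
mean_new = x_new @ w_hat                     # predictive mean  x_new^T w_hat
var_new = x_new @ np.linalg.solve(A, x_new)  # predictive variance  x_new^T A^{-1} x_new

# Observing data can only shrink the prior uncertainty sigma_w^2 * ||x_new||^2
assert 0 < var_new <= sigma_w**2 * (x_new @ x_new)
```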
Gaussian Processes for Nonlinear Regression
Map the original inputs $\mathbf{X}$ into a higher-dimensional space, obtaining $\phi(\mathbf{X})=\left[\phi\left(\mathbf{x}_{1}\right), \ldots, \phi\left(\mathbf{x}_{n}\right)\right]^{T} \in \mathbb{R}^{n \times d^{\prime}}$ with $d'>d$, and let $\mathcal{D'}=\{\phi(\mathbf{x}_i), y_i\}_{i=1}^{n}$. Repeating the Gaussian-process treatment of linear regression on $\mathcal{D'}$ gives the posterior of $f(\phi(\mathbf{x}_{new}))$:

$$p\left(f\left(\phi(\mathbf{x}_{new})\right) \mid \phi(\mathbf{x}_{new}), \mathcal{D'}\right)=\int p\left(f\left(\phi(\mathbf{x}_{new})\right) \mid \phi(\mathbf{x}_{new}), \boldsymbol{\omega}\right) p(\boldsymbol{\omega} \mid \mathcal{D'})\, d \boldsymbol{\omega} \tag{2.1}$$
with mean

$$m(f(\phi(\mathbf{x}_{new})))=\phi(\mathbf{x}_{new})^T \widehat{\boldsymbol{\omega}} \tag{2.2}$$
and variance

$$\sigma^2(f(\phi(\mathbf{x}_{new})))=\phi(\mathbf{x}_{new})^T\left(\frac{1}{\sigma_{n}^{2}} \phi(\mathbf{X})^{T} \phi(\mathbf{X})+\sigma_{\boldsymbol{\omega}}^{-2} \mathbf{I}\right)^{-1} \phi(\mathbf{x}_{new}) \tag{2.3}$$
Using the matrix inversion lemma, (2.2) and (2.3) can be rearranged into the forms (2.4) and (2.5):
$$\begin{aligned} m\left(f\left(\phi(\mathbf{x}_{new})\right)\right) &=\phi\left(\mathbf{x}_{new}\right)^{T} \widehat{\boldsymbol{\omega}} \\ &=\phi\left(\mathbf{x}_{new}\right)^{T}\left(\phi(\mathbf{X})^{T} \phi(\mathbf{X})+\sigma_{n}^{2} \sigma_{\boldsymbol{\omega}}^{-2} \mathbf{I}\right)^{-1} \phi(\mathbf{X})^{T} \mathbf{y} \\ &=\phi\left(\mathbf{x}_{new}\right)^{T} \phi(\mathbf{X})^{T}\left(\phi(\mathbf{X}) \phi(\mathbf{X})^{T}+\sigma_{n}^{2} \sigma_{\boldsymbol{\omega}}^{-2} \mathbf{I}\right)^{-1} \mathbf{y} \end{aligned} \tag{2.4}$$
$$\begin{aligned} \sigma^{2}\left(f\left(\phi(\mathbf{x}_{new})\right)\right) &=\phi\left(\mathbf{x}_{new}\right)^{T}\left(\frac{1}{\sigma_{n}^{2}} \phi(\mathbf{X})^{T} \phi(\mathbf{X})+\sigma_{\omega}^{-2} \mathbf{I}\right)^{-1} \phi\left(\mathbf{x}_{new}\right) \\ &=\sigma_{\omega}^{2} \phi\left(\mathbf{x}_{new}\right)^{T} \phi\left(\mathbf{x}_{new}\right)-\sigma_{\omega}^{2} \phi\left(\mathbf{x}_{new}\right)^{T} \phi(\mathbf{X})^{T}\left(\phi(\mathbf{X}) \phi(\mathbf{X})^{T}+\sigma_{n}^{2} \sigma_{\omega}^{-2} \mathbf{I}\right)^{-1} \phi(\mathbf{X}) \phi\left(\mathbf{x}_{new}\right) \end{aligned} \tag{2.5}$$
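The matrix-inversion-lemma step behind (2.5) is easy to verify numerically: both forms of the variance are scalars and must coincide for arbitrary data. A sketch with assumed toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d_feat = 8, 4                       # assumed toy sizes
sigma_n, sigma_w = 0.3, 0.7            # assumed hyperparameters
Phi = rng.standard_normal((n, d_feat))   # plays the role of phi(X)
phi_new = rng.standard_normal(d_feat)    # plays the role of phi(x_new)

# First line of (2.5)
A = Phi.T @ Phi / sigma_n**2 + np.eye(d_feat) / sigma_w**2
lhs = phi_new @ np.linalg.solve(A, phi_new)

# Second line of (2.5), after the matrix inversion lemma
B = Phi @ Phi.T + (sigma_n**2 / sigma_w**2) * np.eye(n)
rhs = sigma_w**2 * phi_new @ phi_new \
      - sigma_w**2 * (Phi @ phi_new) @ np.linalg.solve(B, Phi @ phi_new)

assert np.isclose(lhs, rhs)
```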
Applying the kernel trick to (2.4) and (2.5), let

$$\widetilde{\mathbf{h}}_{*}^{T}=\phi\left(\mathbf{x}_{new}\right)^{T} \phi(\mathbf{X})^{T}=\left[\widetilde{\kappa}\left(\mathbf{x}_{new}, \mathbf{x}_{1}\right), \cdots, \widetilde{\kappa}\left(\mathbf{x}_{new}, \mathbf{x}_{n}\right)\right]$$
$$\widetilde{\mathbf{K}}=\phi(\mathbf{X}) \phi(\mathbf{X})^{T}=\left[\begin{array}{ccc} \widetilde{\kappa}(\mathbf{x}_{1}, \mathbf{x}_{1}) & \cdots & \widetilde{\kappa}(\mathbf{x}_{1}, \mathbf{x}_{n}) \\ \vdots & \ddots & \vdots \\ \widetilde{\kappa}(\mathbf{x}_{n}, \mathbf{x}_{1}) & \cdots & \widetilde{\kappa}(\mathbf{x}_{n}, \mathbf{x}_{n})\end{array}\right]$$
Then

$$m\left(f\left(\phi(\mathbf{x}_{new})\right)\right)=\widetilde{\mathbf{h}}_{*}^{T}\left(\widetilde{\mathbf{K}}+\sigma_{n}^{2} \sigma_{\omega}^{-2} \mathbf{I}\right)^{-1} \mathbf{y} = \sigma_\omega^2 \widetilde{\mathbf{h}}_{*}^{T}\left(\sigma_\omega^2 \widetilde{\mathbf{K}} + \sigma_n^2 \mathbf{I}\right)^{-1}\mathbf{y} \tag{2.6}$$
$$\sigma^{2}\left(f\left(\phi(\mathbf{x}_{new})\right)\right)=\sigma_{\omega}^{2}\left[\widetilde{\kappa}\left(\mathbf{x}_{new}, \mathbf{x}_{new}\right)-\widetilde{\mathbf{h}}_{*}^{T}\left(\widetilde{\mathbf{K}}+\sigma_{n}^{2} \sigma_{\omega}^{-2} \mathbf{I}\right)^{-1} \widetilde{\mathbf{h}}_{*}\right] \tag{2.7}$$
Absorbing the constant $\sigma_{\omega}^{2}$ into the kernel, define the new kernel $\kappa=\sigma_{\omega}^{2}\widetilde{\kappa}$. In terms of the new kernel, (2.6) and (2.7) can be rewritten as follows:
$$m\left(f\left(\phi(\mathbf{x}_{new})\right)\right)=\mathbf{h}_{*}^{T}\left(\mathbf{K}+\sigma_{n}^{2} \mathbf{I}\right)^{-1} \mathbf{y} \tag{2.8}$$
$$\sigma^{2}\left(f\left(\phi(\mathbf{x}_{new})\right)\right)=\kappa\left(\mathbf{x}_{new}, \mathbf{x}_{new}\right)-\mathbf{h}_{*}^{T}\left(\mathbf{K}+\sigma_{n}^{2} \mathbf{I}\right)^{-1} \mathbf{h}_{*} \tag{2.9}$$

where the prefactor $\sigma_{\omega}^{2}$ of (2.7) has been absorbed by the rescaled kernel, so no extra factor appears,
and where

$$\mathbf{h}_{*}^{T}=\sigma_{\omega}^{2}\,\phi\left(\mathbf{x}_{new}\right)^{T} \phi(\mathbf{X})^{T}=\left[\kappa\left(\mathbf{x}_{new}, \mathbf{x}_{1}\right), \cdots, \kappa\left(\mathbf{x}_{new}, \mathbf{x}_{n}\right)\right]$$
$$\mathbf{K}=\sigma_{\omega}^{2}\,\phi(\mathbf{X}) \phi(\mathbf{X})^{T}=\left[\begin{array}{ccc} \kappa(\mathbf{x}_{1}, \mathbf{x}_{1}) & \cdots & \kappa(\mathbf{x}_{1}, \mathbf{x}_{n}) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{x}_{n}, \mathbf{x}_{1}) & \cdots & \kappa(\mathbf{x}_{n}, \mathbf{x}_{n})\end{array}\right]$$
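The point of (2.8) is that predictions need only inner products of features, never the features themselves. The sketch below uses a hypothetical quadratic feature map `phi` (an illustrative assumption, not prescribed by the text) and checks that the weight-space prediction of (2.4) matches the kernelized form (2.8):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 2
sigma_n, sigma_w = 0.1, 0.8            # assumed hyperparameters

def phi(x):
    # Hypothetical quadratic feature map (illustrative choice, d' = 5)
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2, x1 * x2, x1**2, x2**2], axis=-1)

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
x_new = rng.standard_normal(d)

Phi, phi_new = phi(X), phi(x_new)
d_feat = Phi.shape[1]

# Weight-space form, second line of (2.4)
w_hat = np.linalg.solve(Phi.T @ Phi + (sigma_n**2 / sigma_w**2) * np.eye(d_feat),
                        Phi.T @ y)
mean_weight_space = phi_new @ w_hat

# Kernel form (2.8), with kappa = sigma_w^2 * phi(.)^T phi(.)
K = sigma_w**2 * Phi @ Phi.T
h_star = sigma_w**2 * Phi @ phi_new
mean_kernel = h_star @ np.linalg.solve(K + sigma_n**2 * np.eye(n), y)

assert np.isclose(mean_weight_space, mean_kernel)
```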
Gaussian Process Regression: The Function-Space View
Suppose again we have training data of $n$ observations, $\mathcal{D}=\{\mathbf{x}_i, y_i\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the $i$-th input, of dimension $d$, and $y_i$ is the corresponding output. We assume the latent relationship between inputs and outputs is a linear regression model on the features $\phi(\mathbf{x})$, with Gaussian noise:

$$f(\mathbf{x})=\phi(\mathbf{x})^{T} \mathbf{w}, \quad y=f(\mathbf{x})+e \tag{3.1}$$
Here $\mathbf{w}$ is the weight vector, $f$ is the latent function, and $y$ is the observed output. The noise $e$ is independent, identically distributed additive Gaussian noise with zero mean and variance $\sigma_n^2$.
We again assume a zero-mean Gaussian prior with covariance matrix $\sigma_\mathbf{w}^2 \mathbf{I}$ on the weights, $\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \sigma_{\mathbf{w}}^{2} \mathbf{I}\right)$. Then
$$\begin{aligned} E_{\mathbf{w}}[f(\mathbf{X})] &=E_{\mathbf{w}}\left[\phi(\mathbf{X}) \mathbf{w}\right]=\phi(\mathbf{X}) E_{\mathbf{w}}[\mathbf{w}]=\mathbf{0} \\ \operatorname{cov}\left(f(\mathbf{X}), f(\mathbf{X})^{T}\right) &=E\left[(f(\mathbf{X})-E[f(\mathbf{X})])\left(f(\mathbf{X})^{T}-E\left[f(\mathbf{X})^{T}\right]\right)\right]\\ &=E\left[f(\mathbf{X}) f(\mathbf{X})^{T}\right] \\ &=E\left[\phi(\mathbf{X}) \mathbf{w} \mathbf{w}^{T} \phi(\mathbf{X})^{T}\right] \\ &=\phi(\mathbf{X}) E\left[\mathbf{w} \mathbf{w}^{T}\right] \phi(\mathbf{X})^{T} \\ &=\phi(\mathbf{X})\, \sigma_{\mathbf{w}}^{2} \mathbf{I}\, \phi(\mathbf{X})^{T} \\ &=\sigma_{\mathbf{w}}^{2}\, \phi(\mathbf{X}) \phi(\mathbf{X})^{T} \end{aligned}$$
Let $K=\sigma_{\mathbf{w}}^{2} \phi(\mathbf{X}) \phi(\mathbf{X})^{T}$; therefore

$$f(\mathbf{X}) \sim \mathcal{N}(\mathbf{0},K)$$
If we regard $\mathbf{0}$ as a special (constant) mean function, i.e. $\mu(\mathbf{X})=\mathbf{0}$, this can be written as

$$f(\mathbf{X}) \sim \mathcal{N}(\mu(\mathbf{X}),K)$$
This says that for any input $\mathbf{x}_i$, the value $f(\mathbf{x}_i)$ can be viewed as a Gaussian random variable, and the joint distribution of any finite collection of values $f(\mathbf{x}_i)$ is again Gaussian.
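Because any finite collection of function values is jointly Gaussian, sample functions can be drawn from the prior $f(\mathbf{X}) \sim \mathcal{N}(\mathbf{0}, K)$ via a Cholesky factorization of the kernel matrix. The sketch below uses a squared-exponential kernel as an assumed example (any positive-definite kernel would do):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(A, B, ell=1.0):
    # Assumed squared-exponential kernel; any positive-definite kernel works here
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * ell**2))

X = np.linspace(-3, 3, 50)[:, None]    # 50 one-dimensional inputs
K = rbf(X, X)

# f(X) ~ N(0, K): draw one sample function via the (jittered) Cholesky factor
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
f_sample = L @ rng.standard_normal(len(X))

assert f_sample.shape == (50,)
```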
Since the sum of two independent Gaussian variables is again Gaussian,

$$\mathbf{y}=f(\mathbf{X})+e \sim \mathcal{N}(\mu(\mathbf{X}),K+\sigma_n^2\mathbf{I})$$
Therefore, for the training set $\mathbf{X}$ we have $\mathbf{y} \sim \mathcal{N}(\mu(\mathbf{X}),K+\sigma_n^2\mathbf{I})$, and for the test set $\mathbf{X}_{new}$ to be predicted, $f(\mathbf{X}_{new}) \sim \mathcal{N}(\mu(\mathbf{X}_{new}),K(\mathbf{X}_{new},\mathbf{X}_{new}))$.
By the properties of the Gaussian distribution, the joint distribution of $(\mathbf{y},f(\mathbf{X}_{new}))$ is again Gaussian:

$$\left(\begin{array}{c} \mathbf{y} \\ f\left(\mathbf{X}_{new}\right) \end{array}\right) \sim \mathcal{N}\left(\left(\begin{array}{c} \mu(\mathbf{X}) \\ \mu\left(\mathbf{X}_{new}\right) \end{array}\right),\left(\begin{array}{cc} K(\mathbf{X}, \mathbf{X})+\sigma_n^{2} \mathbf{I} & K\left(\mathbf{X}, \mathbf{X}_{new}\right) \\ K\left(\mathbf{X}_{new}, \mathbf{X}\right) & K\left(\mathbf{X}_{new}, \mathbf{X}_{new}\right) \end{array}\right)\right)$$
Conditioning on the observations then gives the predictive posterior:

$$\begin{array}{c} P\left(f\left(\mathbf{X}_{new}\right) \mid \mathbf{y}\right)=\mathcal{N}\left(\mu^{*}, \Sigma^{*}\right) \\ \mu^{*}=K\left(\mathbf{X}_{new}, \mathbf{X}\right)\left(K(\mathbf{X}, \mathbf{X})+\sigma_n^{2} \mathbf{I}\right)^{-1}(\mathbf{y}-\mu(\mathbf{X}))+\mu\left(\mathbf{X}_{new}\right) \\ \Sigma^{*}=K\left(\mathbf{X}_{new}, \mathbf{X}_{new}\right)-K\left(\mathbf{X}_{new}, \mathbf{X}\right)\left(K(\mathbf{X}, \mathbf{X})+\sigma_n^{2} \mathbf{I}\right)^{-1}K\left(\mathbf{X}, \mathbf{X}_{new}\right) \end{array}$$
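The closed-form posterior $(\mu^{*}, \Sigma^{*})$ can be implemented in a few lines. A minimal sketch, assuming a squared-exponential kernel and $\mu(\mathbf{X})=\mathbf{0}$ (both choices are illustrative, not prescribed by the derivation):

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(A, B, ell=0.5):
    # Assumed squared-exponential kernel (illustrative choice)
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * ell**2))

sigma_n = 0.1
X = np.linspace(-2, 2, 20)[:, None]
y = np.sin(3 * X[:, 0]) + sigma_n * rng.standard_normal(20)
X_new = np.linspace(-2, 2, 7)[:, None]

K_xx = rbf(X, X)
K_xs = rbf(X, X_new)                    # K(X, X_new)
K_ss = rbf(X_new, X_new)                # K(X_new, X_new)

# mu(X) = 0 here, so the posterior formulas simplify accordingly
C = K_xx + sigma_n**2 * np.eye(len(X))  # K(X, X) + sigma_n^2 I
mu_star = K_xs.T @ np.linalg.solve(C, y)
Sigma_star = K_ss - K_xs.T @ np.linalg.solve(C, K_xs)

assert mu_star.shape == (7,)
```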