Machine Learning Notes on Gaussian Processes: Gaussian Process Regression [Function-Space View]

Introduction

The previous post introduced Gaussian process regression from the weight-space view. This post introduces Gaussian process regression from the function-space view.
Review: Handling Non-Linear Regression via a High-Dimensional Transformation

Viewed from the weight space (Weight-Space view), Gaussian process regression has no direct connection to the Gaussian process itself. In essence, it solves a non-linear regression task by combining Bayesian linear regression with the kernel trick:
- For a non-linear regression task, apply a non-linear transformation $\phi(\cdot)$ that maps the original feature space $\mathcal X \in \mathbb R^p$ into a high-dimensional space:
$$\mathcal X \in \mathbb R^p \to \phi(\mathcal X) \in \mathbb R^q \quad q \gg p$$
- Since the sample feature space changes, the posterior distribution $\mathcal P(\mathcal W \mid Data)$ of the random variable $\mathcal W$ changes accordingly:
$$\mathcal P(\mathcal W \mid Data) \sim \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W}) \to \begin{cases} \mu_{\mathcal W} = \frac{\mathcal A^{-1}[\phi(\mathcal X)]^T\mathcal Y}{\sigma^2} \\ \Sigma_{\mathcal W} = \mathcal A^{-1} \\ \mathcal A = \frac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2} + [\Sigma_{prior}^{-1}]_{q \times q} \end{cases}$$
- This lets us predict the label $f[\phi(\hat x)]$ of a given (unseen) sample $\phi(\hat x)$ after the non-linear transformation. The complicated part of the derivation is solving for $\mathcal A^{-1}$; see the previous post for that derivation. Note that what is predicted here is the noise-free $f[\phi(\hat x)]$ rather than $\hat y$; to predict $\hat y$, add $\sigma^2$ to the covariance.
$$\begin{aligned} \mathcal P[f[\phi(\hat x)] \mid Data,\phi(\hat x)] & \sim \mathcal N([\phi(\hat x)]^T \mu_{\mathcal W},[\phi(\hat x)]^T \Sigma_{\mathcal W} \cdot \phi(\hat x)) \\ & = \mathcal N \left\{[\phi(\hat x)]^T \left(\frac{\mathcal A^{-1} [\phi(\mathcal X)]^T\mathcal Y}{\sigma^2}\right),[\phi(\hat x)]^T\mathcal A^{-1} \cdot \phi(\hat x)\right\} \end{aligned}$$
Expanding fully gives the result below, where $[\Sigma_{prior}]_{q \times q}$ is the covariance matrix of the prior distribution, $\mathcal I_{N \times N}$ is the identity matrix, and $\mathcal K(\mathcal X,\mathcal X)_{N \times N}$ denotes $\phi(\mathcal X)\Sigma_{prior}[\phi(\mathcal X)]^T$, with $\phi(\mathcal X) \in \mathbb R^{N \times q}$ the transformed design matrix. (Note the observation vector $\mathcal Y$ in the mean, consistent with the predictive distribution above.)
$$\mathcal P[f(\hat x) \mid Data,\hat x] \sim \mathcal N(\mu_{\hat x},\Sigma_{\hat x}) \\ \begin{cases} \mu_{\hat x} = [\phi(\hat x)]^T \Sigma_{prior} [\phi(\mathcal X)]^T [\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1} \mathcal Y \\ \Sigma_{\hat x} = [\phi(\hat x)]^T \cdot \left\{\Sigma_{prior} - \Sigma_{prior} [\phi(\mathcal X)]^T \left[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I\right]^{-1} \phi(\mathcal X) \Sigma_{prior}\right\} \cdot \phi(\hat x) \end{cases}$$
- For the complicated inner products appearing in the formula, apply the kernel trick. Assume there exists a kernel function $\mathcal K(x,x')$ of the variables $x,x'$, expressed as follows (here $[\Sigma_{prior}]_{q \times q}$ is at least positive semidefinite, so its symmetric square root exists):
$$\begin{aligned} \mathcal K(x,x') & = [\phi(x)]^T \Sigma_{prior} \phi(x') \\ & = \left[\sqrt{\Sigma_{prior}} \; \phi(x)\right]^T\left[\sqrt{\Sigma_{prior}} \; \phi(x')\right] \\ & = \left\langle\sqrt{\Sigma_{prior}} \; \phi(x) ,\sqrt{\Sigma_{prior}} \; \phi(x')\right\rangle \end{aligned}$$
As with any kernel function, this sidesteps the expensive high-dimensional computation of the non-linear map $\phi(\cdot)$ and evaluates the inner product directly.
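A quick numerical sanity check of this factorization, using a made-up pair of feature vectors and a random positive-definite $\Sigma_{prior}$ (both assumptions of this sketch): the bilinear form and the inner product of the $\sqrt{\Sigma_{prior}}$-scaled features agree.

```python
import numpy as np

rng = np.random.default_rng(1)
q = 3
phi_x = rng.standard_normal(q)            # stand-in for phi(x)
phi_xp = rng.standard_normal(q)           # stand-in for phi(x')
B = rng.standard_normal((q, q))
Sigma_prior = B @ B.T + 1e-3 * np.eye(q)  # random positive-definite covariance

# K(x, x') as the bilinear form phi(x)^T Sigma_prior phi(x')
k_bilinear = phi_x @ Sigma_prior @ phi_xp

# Same value as an inner product of sqrt(Sigma_prior)-scaled feature vectors
w, V = np.linalg.eigh(Sigma_prior)
sqrt_Sigma = V @ np.diag(np.sqrt(w)) @ V.T  # symmetric matrix square root
k_inner = (sqrt_Sigma @ phi_x) @ (sqrt_Sigma @ phi_xp)
```

The symmetric square root satisfies $\sqrt{\Sigma}^T\sqrt{\Sigma} = \Sigma$, which is exactly the step used in the derivation above.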
Review: Gaussian Processes
A Gaussian process is, in essence, a collection of high-dimensional random variables:
$$\{\xi_{t}\}_{t \in \mathcal T} = \{\cdots,\xi_{t_1},\xi_{t_2},\cdots,\xi_{t_n},\cdots\} \quad (t_1,t_2,\cdots,t_n \in \mathcal T)$$
where $\mathcal T$ is a continuous domain, possibly a continuous domain of time or of space. The definition of a Gaussian process can be stated as follows: if for any $\{t_1,t_2,\cdots,t_n\} \subseteq \mathcal T$, the corresponding subset $\xi_{t_1 \to t_n} = \{\xi_{t_1},\xi_{t_2},\cdots,\xi_{t_n}\}$ of the stochastic process $\{\xi_t\}_{t \in \mathcal T}$ follows some Gaussian distribution $\mathcal N(\mu_{t_1 \to t_n},\Sigma_{t_1 \to t_n})$, then $\{\xi_{t}\}_{t \in \mathcal T}$ is called a Gaussian process.

Since $t \in \mathcal T$ is dense (intuitively: however close to zero the time gap shrinks, a random variable still exists there), the process can be regarded as an 'infinite-dimensional' Gaussian distribution over the continuous domain $\mathcal T$:
$$\{\xi_t\}_{t \in \mathcal T} \sim \mathcal{GP}[m(t),\mathcal K(t,s)] \quad (s,t \in \mathcal T)$$
Note that the mean function $m(t)$ and the covariance function $\mathcal K(s,t)$ are both expressed in functional form. This means that the mean/covariance at different times/states is not a fixed value, but a function of $s,t$.
By contrast, in a Gaussian network, the random variables follow one fixed Gaussian:
$$\mathcal X \in \mathbb R^p \to \mathcal X \sim \mathcal N(\mu_p,\Sigma_{p \times p})$$
Once the random-variable set $\mathcal X$ is determined, the corresponding probabilistic graphical model is a static model: the mean $\mu_p$ and covariance matrix $\Sigma_{p \times p}$ are constant, and, viewed from the graphical-model perspective, the relations among the random-variable nodes are fixed as well.
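The 'infinite-dimensional Gaussian' reading can be made concrete by sampling: pick a finite subset $\{t_1,\cdots,t_n\} \subset \mathcal T$, evaluate the mean and covariance functions there, and draw from the resulting finite-dimensional Gaussian. The RBF covariance function and zero mean below are illustrative choices, not something fixed by the definition:

```python
import numpy as np

def k_rbf(s, t, length=1.0):
    """Illustrative covariance function K(s, t) on the index set T."""
    return np.exp(-(s - t) ** 2 / (2.0 * length ** 2))

t = np.linspace(0.0, 5.0, 50)          # finite subset {t_1, ..., t_n} of T
m = np.zeros_like(t)                   # mean function m(t) = 0
K = k_rbf(t[:, None], t[None, :])      # n x n covariance matrix with entries K(t_i, t_j)

rng = np.random.default_rng(2)
# Each draw is one realization of the process restricted to {t_1, ..., t_n};
# the small jitter keeps the covariance numerically positive definite.
paths = rng.multivariate_normal(m, K + 1e-8 * np.eye(len(t)), size=3)
```

Refining the grid refines the same process: any denser subset of $\mathcal T$ again yields a consistent multivariate Gaussian, which is exactly the finite-dimensional-marginals definition above.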
Weight-Space View: the Variation of the Model Parameters $\mathcal W$
Start from the linear regression model (without Gaussian noise) $f(\mathcal X) = \mathcal X^T\mathcal W$, and apply the non-linear high-dimensional transformation to the feature space $\mathcal X \in \mathbb R^p$: $\mathcal X \to \phi(\mathcal X) \in \mathbb R^q$;
Assign the model parameters $\mathcal W$ a prior distribution. Since $\mathcal X$ has already undergone the non-linear transformation, $\mathcal W$ is now a $q$-dimensional random variable, and its covariance matrix $\Sigma_{prior}$ likewise needs to be $q \times q$:
$$\mathcal W \sim \mathcal N(0,[\Sigma_{prior}]_{q \times q})$$
Therefore, the expectation $\mathbb E[f(\mathcal X)]$ of the linear model $f(\mathcal X)$ can be expressed as follows. Since the focus here is on the variation of $\mathcal W$, $\phi(\mathcal X)$ is treated as a constant:
$$\mathbb E[f(\mathcal X)] = \mathbb E\left\{[\phi(\mathcal X)]^T \mathcal W\right\} = [\phi(\mathcal X)]^T \mathbb E[\mathcal W] = [\phi(\mathcal X)]^T \cdot 0 = 0$$
For any $x^{(i)},x^{(j)} \in \mathbb R^p$, the covariance $Cov \left[f(x^{(i)}),f(x^{(j)})\right]$ of the corresponding function values is:
$$\begin{aligned} Cov \left[f(x^{(i)}),f(x^{(j)})\right] & = \mathbb E \left\{\left[f(x^{(i)}) -\mathbb E[f(x^{(i)})] \right] \cdot \left[f(x^{(j)}) -\mathbb E[f(x^{(j)})] \right] \right\} \\ & = \mathbb E \left\{\left[f(x^{(i)}) -0 \right] \cdot \left[f(x^{(j)}) -0 \right] \right\} \\ & = \mathbb E \left[f(x^{(i)}) \cdot f(x^{(j)})\right] \\ & = \mathbb E \left[\phi(x^{(i)})^T\mathcal W \cdot \phi(x^{(j)})^T\mathcal W\right] \end{aligned}$$
Since $\phi(x^{(j)})^T \mathcal W$ is a scalar, its transpose $\left[\phi(x^{(j)})^T \mathcal W\right]^T = \mathcal W^T\phi(x^{(j)})$ equals $\phi(x^{(j)})^T \mathcal W$ itself. Letting $\Delta$ denote the result of the derivation above, we have:
$$\begin{aligned} \Delta & = \mathbb E \left[\phi(x^{(i)})^T\mathcal W \cdot \mathcal W^T \phi(x^{(j)})\right] \\ & = [\phi(x^{(i)})]^T \cdot \mathbb E[\mathcal W \cdot \mathcal W^T] \cdot \phi(x^{(j)}) \end{aligned}$$
Observe $\mathbb E[\mathcal W \cdot \mathcal W^T]$; it is in fact:
$$\begin{aligned} \mathbb E[\mathcal W \cdot \mathcal W^T] & = \mathbb E \left[(\mathcal W - 0) \cdot (\mathcal W^T - 0)\right] \\ & = \mathbb E\left\{[\mathcal W - \mathbb E[\mathcal W]] \cdot [\mathcal W - \mathbb E[\mathcal W]]^T\right\} \\ & = Cov(\mathcal W,\mathcal W) \\ & = \Sigma_{prior} \end{aligned}$$
At this point, the covariance $Cov \left[f(x^{(i)}),f(x^{(j)})\right]$ of $f(x^{(i)})$ and $f(x^{(j)})$ can be expressed as:
$$\begin{aligned} Cov\left[f(x^{(i)}),f(x^{(j)})\right] & = [\phi(x^{(i)})]_{1 \times q}^T \cdot [\Sigma_{prior}]_{q \times q} \cdot [\phi(x^{(j)})]_{q \times 1} \\ & = \mathcal K(x^{(i)},x^{(j)}) \end{aligned}$$
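The identity $Cov[f(x^{(i)}),f(x^{(j)})] = [\phi(x^{(i)})]^T\Sigma_{prior}\phi(x^{(j)})$ can be checked by Monte Carlo: sample $\mathcal W \sim \mathcal N(0,\Sigma_{prior})$, form $f = \phi^T\mathcal W$, and compare the empirical covariance with the exact kernel value. The two feature vectors and the diagonal $\Sigma_{prior}$ below are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
phi_i = np.array([1.0, 0.5, -0.2])      # stand-in for phi(x^(i)), q = 3
phi_j = np.array([0.3, -1.0, 0.8])      # stand-in for phi(x^(j))
Sigma_prior = np.diag([1.0, 0.5, 2.0])  # prior covariance of W

# Exact kernel value phi(x^(i))^T Sigma_prior phi(x^(j))
k_exact = phi_i @ Sigma_prior @ phi_j

# Monte Carlo estimate of Cov[f(x^(i)), f(x^(j))] = E[f_i * f_j]
W = rng.multivariate_normal(np.zeros(3), Sigma_prior, size=200_000)  # (n, q)
f_i = W @ phi_i                         # samples of f(x^(i)) = phi(x^(i))^T W
f_j = W @ phi_j                         # samples of f(x^(j))
k_mc = np.mean(f_i * f_j)               # E[W] = 0, so no centering needed
```

The estimate converges to the bilinear form at the usual $O(1/\sqrt{n})$ Monte Carlo rate, mirroring the expectation steps in the derivation above.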
Aside: Necessity Proof That the Notation $\mathcal K$ Is a Kernel Function
Continue expanding $Cov\left[f(x^{(i)}),f(x^{(j)})\right]$. The end of the weight-space post gave the sufficiency proof for the 'notation' $\mathcal K(\cdot,\cdot)$; here we supplement it with the necessity proof.
(With slight abuse of notation, $x_k^{(i)}$ below denotes the $k$-th component of the transformed sample $\phi(x^{(i)}) \in \mathbb R^q$, and $\Sigma_{prior}^{ij} = Cov(w_i,w_j)$ with $w_i,w_j \in \mathcal W$.)
$$\begin{aligned} Cov\left[f(x^{(i)}),f(x^{(j)})\right] & = (x_1^{(i)},x_2^{(i)},\cdots,x_q^{(i)})\begin{pmatrix} \Sigma_{prior}^{11} & \Sigma_{prior}^{12} & \cdots & \Sigma_{prior}^{1q} \\ \Sigma_{prior}^{21} & \Sigma_{prior}^{22} & \cdots & \Sigma_{prior}^{2q} \\ \vdots \\ \Sigma_{prior}^{q1} & \Sigma_{prior}^{q2} & \cdots & \Sigma_{prior}^{qq} \end{pmatrix}\begin{pmatrix} x_1^{(j)} \\ x_2^{(j)} \\ \vdots \\ x_q^{(j)} \end{pmatrix} \\ & = \left[\sum_{k=1}^q x_k^{(i)}\Sigma_{prior}^{k1},\cdots,\sum_{k=1}^q x_k^{(i)}\Sigma_{prior}^{kq}\right]\begin{pmatrix} x_1^{(j)} \\ x_2^{(j)} \\ \vdots \\ x_q^{(j)} \end{pmatrix} \\ & = \sum_{l=1}^q\sum_{k=1}^q x_k^{(i)} \cdot \Sigma_{prior}^{kl} \cdot x_l^{(j)} \end{aligned}$$
Here $x_k^{(i)},\Sigma_{prior}^{kl},x_l^{(j)}$ are all scalars, and $\Sigma_{prior}^{kl} = \Sigma_{prior}^{lk}$ since the covariance matrix is symmetric; therefore:
$$\begin{aligned} & \sum_{l=1}^q\sum_{k=1}^q x_k^{(i)} \cdot \Sigma_{prior}^{kl} \cdot x_l^{(j)} = \sum_{l=1}^q\sum_{k=1}^q x_l^{(j)} \cdot \Sigma_{prior}^{lk} \cdot x_k^{(i)} \\ & \Rightarrow Cov \left[f(x^{(i)}),f(x^{(j)})\right] = Cov \left[f(x^{(j)}),f(x^{(i)})\right] \\ & \Rightarrow \mathcal K(x^{(i)},x^{(j)}) = \mathcal K(x^{(j)},x^{(i)}) \end{aligned}$$
The kernel matrix $\mathbb K$ built from any samples $x^{(1)},\cdots,x^{(N)}$ is therefore real symmetric. Moreover, it is necessarily positive semidefinite, because it is the covariance matrix of the random vector $(f(x^{(1)}),\cdots,f(x^{(N)}))$: for any $c \in \mathbb R^N$, $c^T \mathbb K c = \mathrm{Var}\left[\sum_{i=1}^N c_i f(x^{(i)})\right] \geq 0$.
$$\mathbb K = \begin{bmatrix} \mathcal K(x^{(1)},x^{(1)}) & \mathcal K(x^{(1)},x^{(2)}) & \cdots & \mathcal K(x^{(1)},x^{(N)}) \\ \mathcal K(x^{(2)},x^{(1)}) & \mathcal K(x^{(2)},x^{(2)}) & \cdots & \mathcal K(x^{(2)},x^{(N)}) \\ \vdots \\ \mathcal K(x^{(N)},x^{(1)}) & \mathcal K(x^{(N)},x^{(2)}) & \cdots & \mathcal K(x^{(N)},x^{(N)}) \end{bmatrix}_{N \times N}$$
This establishes that the notation $\mathcal K$ is a positive-definite kernel function. For a reference on the necessity proof for positive-definite kernels, see the linked post.
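A small numerical check of the positive-semidefiniteness claim, using random made-up transformed samples and a random PSD $\Sigma_{prior}$ (both assumptions of this sketch): every eigenvalue of the resulting kernel matrix $\mathbb K$ is non-negative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(4)
N, q = 8, 3
Phi = rng.standard_normal((N, q))   # rows: stand-ins for phi(x^(1)), ..., phi(x^(N))
B = rng.standard_normal((q, q))
Sigma_prior = B @ B.T               # positive semidefinite prior covariance

K = Phi @ Sigma_prior @ Phi.T       # kernel matrix with K[i, j] = K(x^(i), x^(j))
eigvals = np.linalg.eigvalsh(K)     # real eigenvalues, since K is symmetric
```

With $N > q$, $\mathbb K$ has rank at most $q$, so some eigenvalues sit at zero; none dips below it beyond numerical noise.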
Back to the Main Thread
Since $Cov\left[f(x^{(i)}),f(x^{(j)})\right] = \mathcal K(x^{(i)},x^{(j)})$, if the collection $\{f(x)\}_{x \in \mathbb R^p} = \{f(x^{(1)}),f(x^{(2)}),\cdots\}$ is itself regarded as a set of random variables, then the covariance of this set can be expressed by the kernel function.
Recall the definition of a Gaussian process: $\{\xi_t\}_{t \in \mathcal T} \sim \mathcal{GP}[m(t),\mathcal K(t,s)] \; (s,t \in \mathcal T)$. Here $s,t$ are not themselves random variables; they are merely indices describing states/times in the continuous domain and bear no functional relation to the random variable $\xi$. The definition of a Gaussian process can therefore be expressed in the following forms:
$$\begin{cases} \{f(\mathcal X)\}_{\mathcal X \in \mathbb R^p} \sim \mathcal{GP}[m(\mathcal X),\mathcal K(x^{(i)},x^{(j)})] \quad x^{(i)},x^{(j)} \in \mathcal X \\ \{\xi_t\}_{t \in \mathcal T} \sim \mathcal{GP}[m(t),\mathcal K(t,s)] \quad (s,t \in \mathcal T) \end{cases}$$
Summary
Comparing the two expressions of a Gaussian process:
- There is no relation between $t$ and $\xi_t$; $t$ is merely an index. By contrast, $\mathcal X$ and $f(\mathcal X)$ are linked by an explicit functional relation;
- $\xi_t$ is a high-dimensional random variable at time $t$ in the continuous domain $\mathcal T$, while $f(\mathcal X)$ is the high-dimensional random variable corresponding to a sample $\mathcal X$ in the $p$-dimensional real domain $\mathbb R^p$;
- Mean function and covariance function: taking the covariance function as an example, in both cases it yields the kernel matrix of the random-variable collection, with entries indexed by $t_i,t_j$ and $x^{(i)},x^{(j)}$ respectively:
$$\begin{aligned} \mathcal K(s,t) & \Rightarrow \begin{bmatrix} \mathcal K(t_1,t_1) & \mathcal K(t_1,t_2) & \cdots & \mathcal K(t_1,t_n) \\ \mathcal K(t_2,t_1) & \mathcal K(t_2,t_2) & \cdots & \mathcal K(t_2,t_n) \\ \vdots \\ \mathcal K(t_n,t_1) & \mathcal K(t_n,t_2) & \cdots & \mathcal K(t_n,t_n) \end{bmatrix}_{n \times n} \quad s,t \in \{t_1,t_2,\cdots,t_n\} \\ \mathcal K(x^{(i)},x^{(j)}) & \Rightarrow \begin{bmatrix} \mathcal K(x^{(1)},x^{(1)}) & \mathcal K(x^{(1)},x^{(2)}) & \cdots & \mathcal K(x^{(1)},x^{(N)}) \\ \mathcal K(x^{(2)},x^{(1)}) & \mathcal K(x^{(2)},x^{(2)}) & \cdots & \mathcal K(x^{(2)},x^{(N)}) \\ \vdots \\ \mathcal K(x^{(N)},x^{(1)}) & \mathcal K(x^{(N)},x^{(2)}) & \cdots & \mathcal K(x^{(N)},x^{(N)}) \end{bmatrix}_{N \times N} \quad x^{(i)},x^{(j)} \in \mathcal X \end{aligned}$$
For the task of predicting a given sample $\hat x$:
- The weight-space view focuses on the model parameters $\mathcal W$; the prediction task is expressed as:
$$\mathcal P(\hat y \mid \hat x,Data) = \int_{\mathcal W \mid Data} \mathcal P(\hat y \mid \mathcal W,\hat x) \cdot \mathcal P(\mathcal W \mid Data) \, d\mathcal W$$
- The function-space view focuses on $f(\mathcal X)$ itself, treating $f(\mathcal X) = [\phi(\mathcal X)]^T \mathcal W$ as a random variable; the prediction task is expressed as:
$$\mathcal P(\hat y \mid Data,\hat x) = \int_{f(\mathcal X)} \mathcal P(\hat y \mid f(\mathcal X),\hat x) \cdot \mathcal P[f(\mathcal X) \mid Data] \, df(\mathcal X)$$
The core difference between the function-space and weight-space views lies in how $\mathcal K(x^{(i)},x^{(j)})$ is represented. The weight-space view first maps $x^{(i)},x^{(j)} \to \phi(x^{(i)}),\phi(x^{(j)})$, then re-specifies the prior distribution $\mathcal P(\mathcal W) \to \mathcal N(0,\Sigma_{prior})$ of $\mathcal W$ according to the dimensionality of the transformed samples, assembles the expression into the form $\mathcal K(x^{(i)},x^{(j)}) = [\phi(x^{(i)})]^T\Sigma_{prior}\phi(x^{(j)})$, and solves for the posterior distribution $\mathcal P(\mathcal W \mid Data)$ of $\mathcal W$. The function-space view represents $\mathcal K(x^{(i)},x^{(j)})$ directly by $Cov[f(x^{(i)}),f(x^{(j)})]$, so there is no need to solve for $\mathcal W$ separately; it suffices to work with $f(x^{(i)}) = [\phi(x^{(i)})]^T\mathcal W$ and $f(x^{(j)}) = [\phi(x^{(j)})]^T\mathcal W$ directly. In the prediction task, $[\phi(x)]^T\mathcal W$ takes the place of $\mathcal W$.
Related reference:
Machine Learning: Gaussian Process Regression, From Weight-Space To Function-Space