$$\begin{aligned} p(y) &= \int p(y | f)\, p(f | x)\, p(x)\, \mathrm{d}f\, \mathrm{d}x \\ p(x | y) &= \frac{p(y | x)\, p(x)}{p(y)} \end{aligned}$$
Priors that make sense:
- $p(f)$ describes our beliefs/assumptions and defines our notion of complexity in the function space
- $p(x)$ expresses our beliefs/assumptions and defines our notion of complexity in the latent space
- The priors are balanced.
GP prior:
$$\begin{aligned} p(f | x) &\sim \mathcal{N}(0, K) \propto e^{-\frac{1}{2} f^{\mathrm{T}} K^{-1} f} \\ K_{ij} &= e^{-(x_i - x_j)^{\mathrm{T}} M^{\mathrm{T}} M (x_i - x_j)} \end{aligned}$$
Likelihood:
$$p(y | f) \sim \mathcal{N}(y | f, \beta) \propto e^{-\frac{1}{2\beta} \operatorname{tr}\left((y-f)^{\mathrm{T}}(y-f)\right)}$$
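To make the model concrete, here is a minimal NumPy sketch (the sizes and the choice $M = I$ are assumptions for illustration, not part of the notes) that builds $K$ from the kernel above, draws $f \sim \mathcal{N}(0, K)$, and adds observation noise according to the likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and values (assumptions for illustration).
N, q = 50, 2
X = rng.normal(size=(N, q))            # latent inputs x_1, ..., x_N
M = np.eye(q)                          # M^T M acts as a metric on the latent space

# K_ij = exp(-(x_i - x_j)^T M^T M (x_i - x_j))
diff = X[:, None, :] - X[None, :, :]   # (N, N, q) pairwise differences
K = np.exp(-np.einsum('ijk,kl,ijl->ij', diff, M.T @ M, diff))

# Draw f | x ~ N(0, K); a small jitter keeps the Cholesky factorization stable.
L = np.linalg.cholesky(K + 1e-8 * np.eye(N))
f = L @ rng.normal(size=N)

# Observe y under the likelihood p(y | f) = N(y | f, beta).
beta = 0.1
y = f + np.sqrt(beta) * rng.normal(size=N)
```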
The marginal likelihood is analytically intractable (a non-elementary integral), even though the integrand is infinitely differentiable. One way to avoid the integral is to use:
$$\begin{aligned} \hat{x} &= \operatorname{argmax}_{x} \int p(y | f)\, p(f | x)\, \mathrm{d}f \; p(x) \\ &= \operatorname{argmin}_{x}\; \frac{1}{2} y^{\mathrm{T}} \mathbf{K}^{-1} y + \frac{1}{2} \log |\mathbf{K}| - \log p(x) \end{aligned}$$
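As a hedged sketch of this objective (not the notes' implementation), the code below evaluates $\frac{1}{2} y^{\mathrm{T}} K^{-1} y + \frac{1}{2}\log|\mathbf{K}| - \log p(x)$ with a standard-normal prior on $x$ (an assumed choice) and minimizes it numerically:

```python
import numpy as np
from scipy.optimize import minimize

def objective(x_flat, y, N, q):
    """1/2 y^T K^{-1} y + 1/2 log|K| - log p(x), with p(x) = N(0, I) assumed."""
    X = x_flat.reshape(N, q)
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff**2, axis=-1))       # kernel with M = I
    K += 1e-6 * np.eye(N)                       # jitter for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log|K|
    return 0.5 * y @ alpha + 0.5 * log_det + 0.5 * np.sum(x_flat**2)

# Toy data; in practice y would come from the generative model above.
N, q = 50, 2
rng = np.random.default_rng(1)
y = rng.normal(size=N)
res = minimize(objective, rng.normal(size=N * q), args=(y, N, q))
x_hat = res.x.reshape(N, q)
```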
Challenges with ML estimation:
- How do we initialize $x$? (A common heuristic, sketched below, is to start from PCA.)
- What should the dimensionality $q$ be, i.e., how complex does the latent space need to be to represent $y$?
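For the initialization question, one common heuristic (an assumption here, not something the notes prescribe) is to initialize the latent positions with PCA applied to the observations:

```python
import numpy as np

def pca_init(Y, q):
    """Initialize latent positions with the top-q principal components of Y."""
    Yc = Y - Y.mean(axis=0)                    # center the data
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :q] * S[:q]                    # N x q scores along top-q directions

# Hypothetical data: N points in D observed dimensions.
rng = np.random.default_rng(2)
Y = rng.normal(size=(100, 10))
X0 = pca_init(Y, q=2)                          # starting point for the latent x
```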
Variational Bayes:
$$\begin{aligned} \log p(\mathbf{Y}) &= \log \int p(\mathbf{Y}, \mathbf{X})\, \mathrm{d}\mathbf{X} = \log \int p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})\, \mathrm{d}\mathbf{X} \\ &= \log \int \frac{q(\mathbf{X})}{q(\mathbf{X})}\, p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})\, \mathrm{d}\mathbf{X} \end{aligned}$$
For a convex function:
$$\begin{aligned} \lambda f(x_0) + (1-\lambda) f(x_1) &\geq f(\lambda x_0 + (1-\lambda) x_1) \\ x &\in [x_{\min}, x_{\max}] \\ \lambda &\in [0, 1] \end{aligned}$$
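A quick numeric sanity check of this inequality, using the (arbitrarily chosen) convex function $f(x) = x^2$:

```python
import numpy as np

f = lambda x: x**2                                  # an arbitrary convex function
x0, x1 = -1.5, 2.0
for lam in np.linspace(0.0, 1.0, 11):
    chord = lam * f(x0) + (1 - lam) * f(x1)         # left-hand side
    point = f(lam * x0 + (1 - lam) * x1)            # right-hand side
    assert chord >= point                           # holds for every lambda in [0, 1]
```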
In terms of probability (Jensen's inequality), that means:
$$\begin{aligned} \mathbb{E}[f(x)] &\geq f(\mathbb{E}[x]) \\ \int f(x)\, p(x)\, \mathrm{d}x &\geq f\left(\int x\, p(x)\, \mathrm{d}x\right) \end{aligned}$$

Since $\log$ is concave rather than convex, the inequality reverses:

$$\int \log(x)\, p(x)\, \mathrm{d}x \leq \log\left(\int x\, p(x)\, \mathrm{d}x\right)$$
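The expectation form can be checked by Monte Carlo; the samples below come from a log-normal distribution, an arbitrary choice that keeps $x$ positive:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # positive samples

lhs = np.mean(np.log(x))     # Monte Carlo estimate of E[log x] (true value 0)
rhs = np.log(np.mean(x))     # log E[x] (true value 0.5)
assert lhs <= rhs            # Jensen: log is concave, so the inequality flips
```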
Thus,

$$\begin{aligned} \log p(\mathbf{Y}) &= \log \int \frac{q(\mathbf{X})}{q(\mathbf{X})}\, p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})\, \mathrm{d}\mathbf{X} \\ &\geq \int q(\mathbf{X}) \log \frac{p(\mathbf{X} | \mathbf{Y})\, p(\mathbf{Y})}{q(\mathbf{X})}\, \mathrm{d}\mathbf{X} \\ &= \int q(\mathbf{X}) \log \frac{p(\mathbf{X} | \mathbf{Y})}{q(\mathbf{X})}\, \mathrm{d}\mathbf{X} + \int q(\mathbf{X})\, \mathrm{d}\mathbf{X} \log p(\mathbf{Y}) \\ &= -\mathrm{KL}(q(\mathbf{X}) \,\|\, p(\mathbf{X} | \mathbf{Y})) + \log p(\mathbf{Y}) \end{aligned}$$
Here $\mathrm{KL}$ denotes the Kullback–Leibler (KL) divergence, a measure of how one probability distribution differs from a second, reference probability distribution.
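As a concrete illustration, the discrete form $\mathrm{KL}(q \| p) = \sum_i q_i \log(q_i / p_i)$ can be computed directly (the distributions below are made up):

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence: sum_i q_i log(q_i / p_i)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.4, 0.4, 0.2])
p = np.array([0.6, 0.3, 0.1])
print(kl(q, p))   # positive, and asymmetric: kl(q, p) != kl(p, q)
print(kl(q, q))   # 0.0 -- zero exactly when the distributions match
```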
If $q(\mathbf{X})$ is the true posterior, the bound holds with equality; the goal is therefore to match the two distributions.
$$\begin{aligned} \mathrm{KL}(q(\mathbf{X}) \,\|\, p(\mathbf{X} | \mathbf{Y})) &= \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{p(\mathbf{X} | \mathbf{Y})}\, \mathrm{d}\mathbf{X} \\ &= \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{p(\mathbf{X}, \mathbf{Y})}\, \mathrm{d}\mathbf{X} + \log p(\mathbf{Y}) \\ &= -H(q(\mathbf{X})) - \mathbb{E}_{q(\mathbf{X})}[\log p(\mathbf{X}, \mathbf{Y})] + \log p(\mathbf{Y}) \end{aligned}$$

where $H(q) = -\int q \log q\, \mathrm{d}\mathbf{X}$ is the entropy.
Rearranging:

$$\begin{aligned} \log p(\mathbf{Y}) &= \mathrm{KL}(q(\mathbf{X}) \,\|\, p(\mathbf{X} | \mathbf{Y})) + \underbrace{\mathbb{E}_{q(\mathbf{X})}[\log p(\mathbf{X}, \mathbf{Y})] + H(q(\mathbf{X}))}_{\text{ELBO}} \\ &\geq \mathbb{E}_{q(\mathbf{X})}[\log p(\mathbf{X}, \mathbf{Y})] + H(q(\mathbf{X})) = \mathcal{L}(q(\mathbf{X})) \end{aligned}$$
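The decomposition $\log p(\mathbf{Y}) = \mathrm{KL} + \mathrm{ELBO}$ can be verified on a tiny discrete model (the joint table below is invented for illustration):

```python
import numpy as np

# Invented joint p(X, Y) over 3 latent states, for one fixed observation Y.
p_xy = np.array([0.10, 0.25, 0.15])     # p(X = i, Y)
p_y = p_xy.sum()                        # p(Y) = 0.5
posterior = p_xy / p_y                  # exact p(X | Y)

q = np.array([0.5, 0.3, 0.2])           # an arbitrary variational q(X)

elbo = np.sum(q * np.log(p_xy)) - np.sum(q * np.log(q))  # E_q[log p(X,Y)] + H(q)
kl = np.sum(q * np.log(q / posterior))

assert np.isclose(np.log(p_y), kl + elbo)   # log p(Y) = KL + ELBO exactly
assert elbo <= np.log(p_y)                  # and the ELBO is a lower bound
```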
If we maximize the ELBO, it means we:
- find an approximate posterior $q(\mathbf{X})$
- obtain an approximation (a lower bound) to the log marginal likelihood
Maximizing $p(\mathbf{Y})$ is learning; finding the posterior $p(\mathbf{X} | \mathbf{Y})$ is prediction.