一、Writing formulas on CSDN
1. We define (inline formula, `$...$`) $f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t$.
2. We define $f(x)$ as follows (display formula, `$$ ... $$`):

$$f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t \tag{1}$$
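For reference, the raw source of the two examples above, using the `$...$` (inline) and `$$...$$` (display) delimiters that CSDN's Markdown editor accepts:

```latex
% Inline:
$f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t$

% Display, with \tag{1} producing the equation number:
$$ f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t \tag{1} $$
```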
二、KL divergence
$$
\begin{aligned}
KL(p||q) &= \sum p(x) \log \frac{p(x)}{q(x)} \\
KL(p||q) &= \int p(x) \log \frac{p(x)}{q(x)}\,dx
\end{aligned}
$$
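As a quick sanity check of the discrete case, a minimal NumPy sketch (the distributions `p` and `q` below are made-up examples):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p||q) = sum_x p(x) * log(p(x)/q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x)=0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.4, 0.2]
q = [0.3, 0.5, 0.2]
print(kl_divergence(p, q))  # > 0: KL is non-negative
print(kl_divergence(p, p))  # 0.0: KL(p||p) = 0
```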
三、Variational Graph Auto-Encoders
Using Bayes' rule, $\log p(z|X)=\log p(X|z)+\log p(z)-\log p(X)$, and noting that $\int q(z)\,dz=1$:

$$
\begin{aligned}
KL(q(z)||p(z|X)) &= \int q(z) \log \frac{q(z)}{p(z|X)}\,dz\\
&=\int q(z)[\log q(z)-\log p(z|X)]\,dz \\
&=\int q(z)[\log q(z)-\log p(X|z)-\log p(z)+\log p(X)]\,dz\\
&=\int q(z)[\log q(z)-\log p(X|z)-\log p(z)]\,dz+\log p(X)
\end{aligned}
$$
Rearranging,

$$\log p(X) - KL(q(z)||p(z|X)) = \int q(z)\log p(X|z)\,dz - KL(q(z)||p(z))$$
Although $p(X)$ is not easy to compute directly, we know that once $X$ is given, $p(X)$ is a fixed value. So if we want $KL(q(z)||p(z|X))$ to be as small as possible, this is equivalent to making the right-hand side of the equation above as large as possible.
Given the generative model

$$p(A|Z)=\prod_{i=1}^{N}\prod_{j=1}^{N}p(A_{ij}|z_i,z_j)\tag{2}$$

with

$$p(A_{ij}=1|z_i,z_j)=\sigma(z_i^{\top} z_j)\tag{3}$$

where $\sigma(\cdot)$ is the logistic sigmoid function.
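A minimal NumPy sketch of this inner-product decoder, Eq. (3) (the latent matrix `Z` here is a made-up example; in VGAE it would come from the GCN encoder):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(Z):
    """Inner-product decoder: A_hat[i, j] = sigmoid(z_i^T z_j), Eq. (3)."""
    return sigmoid(Z @ Z.T)

Z = np.random.randn(5, 16)   # N=5 nodes, 16-dimensional latent vectors
A_hat = decode(Z)            # (5, 5) matrix of edge probabilities in (0, 1)
print(A_hat.shape)
```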
**Learning** We optimize the variational lower bound $\mathcal L$ w.r.t. the variational parameters $W_i$:
$$\mathcal L = \mathbb E_{q(Z|X,A)}[\log p(A|Z)]-KL[q(Z|X,A)||p(Z)]\tag{4}$$
where $KL[q(\cdot)||p(\cdot)]$ is the Kullback-Leibler divergence between $q(\cdot)$ and $p(\cdot)$. We further take a Gaussian prior $p(Z)=\prod_i p(z_i)=\prod_i\mathcal N(z_i|0,1)$. For very sparse $A$, it can be beneficial to re-weight terms with $A_{ij}=1$ in $\mathcal L$, or alternatively to sub-sample terms with $A_{ij}=0$. We choose the former for the following experiments. We perform full-batch gradient descent and make use of the reparameterization trick for training. For a featureless approach, we simply drop the dependence on $X$ and replace $X$ with the identity matrix in the GCN.
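To make Eq. (4) concrete, here is a hedged NumPy sketch of one single-sample evaluation of $\mathcal L$; `mu`, `log_sigma2`, and `A` are made-up stand-ins for the GCN encoder outputs and the adjacency matrix, and the analytic KL term uses the closed form derived below:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8                                      # nodes, latent dimension
mu = rng.standard_normal((N, D))                 # stand-in encoder means
log_sigma2 = 0.1 * rng.standard_normal((N, D))   # stand-in log-variances
A = (rng.random((N, N)) < 0.3).astype(float)     # made-up adjacency matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Reparameterized sample: Z = mu + sigma * eps, with eps ~ N(0, I)
sigma = np.exp(0.5 * log_sigma2)
Z = mu + sigma * rng.standard_normal((N, D))

# Single-sample estimate of E_q[log p(A|Z)] with the Eq. (3) decoder,
# using the full Bernoulli likelihood over all (i, j) pairs
P = sigmoid(Z @ Z.T)
recon = np.sum(A * np.log(P + 1e-10) + (1 - A) * np.log(1 - P + 1e-10))

# Analytic KL[q(Z|X,A) || p(Z)] against the N(0, I) prior (derived below)
kl = -0.5 * np.sum(1 + log_sigma2 - mu**2 - np.exp(log_sigma2))

elbo = recon - kl   # Eq. (4), to be maximized
print(elbo)
```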
The second term on the right-hand side of Eq. (4):
$$
\begin{aligned}
\int q_{\theta}(z) \log p(z)\,dz &= \int \mathcal N(z;\mu,\sigma^2) \log \mathcal N(z;0,1)\,dz\\
&=-\frac{J}{2} \log (2\pi)-\frac{1}{2}\sum_{j=1}^{J}(\mu_j^2+\sigma_j^2)
\end{aligned}
$$
And
$$
\begin{aligned}
\int q_{\theta}(z) \log q_\theta(z)\,dz &= \int \mathcal N(z;\mu,\sigma^2) \log \mathcal N(z;\mu,\sigma^2)\,dz\\
&=-\frac{J}{2} \log (2\pi)-\frac{1}{2}\sum_{j=1}^{J}(1+\log \sigma_j^2)
\end{aligned}
$$
Therefore:
$$
\begin{aligned}
-D_{KL}(q_\phi(z)||p_\theta(z))&=\int q_\theta(z)(\log p_{\theta}(z)-\log q_\theta(z))\,dz\\
&=\frac{1}{2}\sum_{j=1}^J\left(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2\right)
\end{aligned}
$$
Here, the variational lower bound (the objective to be maximized) contains a KL term that can often be integrated analytically. Here we give the solution when both the prior $p_\theta(z) = \mathcal N(0, I)$ and the posterior approximation $q_\phi(z|x^{(i)})$ are Gaussian. Let $J$ be the dimensionality of $z$. Let $\mu$ and $\sigma$ denote the variational mean and s.d. evaluated at datapoint $i$, and let $\mu_j$ and $\sigma_j$ simply denote the $j$-th element of these vectors.
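A small NumPy check of this closed-form result against a Monte Carlo estimate of $\int q_\theta(z)(\log p_\theta(z)-\log q_\theta(z))\,dz$ (the values of `mu` and `sigma` are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([1.2, 0.8, 0.5])
J = len(mu)

# Closed form: -D_KL = 1/2 * sum_j (1 + log(sigma_j^2) - mu_j^2 - sigma_j^2)
neg_kl_closed = 0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo: sample z ~ q = N(mu, sigma^2), average log p(z) - log q(z)
z = mu + sigma * rng.standard_normal((100_000, J))
log_p = -0.5 * (J * np.log(2 * np.pi) + np.sum(z**2, axis=1))
log_q = -0.5 * (J * np.log(2 * np.pi)
                + np.sum(np.log(sigma**2) + ((z - mu) / sigma) ** 2, axis=1))
neg_kl_mc = np.mean(log_p - log_q)

print(neg_kl_closed, neg_kl_mc)  # the two values should nearly agree
```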
The first term on the right-hand side of Eq. (4):
$$
\begin{aligned}
\mathbb E_{q(Z|X,A)}[\log p(A|Z)] &= \int q(Z|X,A) \log p(A|Z)\,dZ \\
&= \int \prod_{i=1}^N q(z_i|X,A) \log p(A|Z)\,dZ \\
&=\int \prod_{i=1}^N q(z_i|X,A) \sum_{i=1}^N \sum_{j=1}^N \log p(A_{ij}|z_i,z_j)\,dZ\\
&=\int \prod_{i=1}^N \mathcal N(z_i\mid \mu_i, \mathrm{diag}(\sigma_i^2)) \sum_{i=1}^N \sum_{j=1}^N \log \sigma(z_i^{\top} z_j)\,dZ\\
&=\int \prod_{i=1}^N \mathcal N(\eta_i\mid 0,1)\,f(\eta_i;W)\,d\eta_i\\
&\approx\frac{1}{S} \sum_{s=1}^{S} f(\eta_i^{(s)};W)
\end{aligned}
$$
where

$$\eta_i=(z_i-\mu_i(W))/\sigma_i(W)\sim \mathcal N(\eta\mid 0,1).$$
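A hedged NumPy sketch of this Monte Carlo estimate: draw $\eta \sim \mathcal N(0,1)$, set $z_i = \mu_i + \sigma_i \eta_i$, and average over $S$ samples. Here `f` uses the full Bernoulli log-likelihood for all $A_{ij}$ terms, whereas the derivation above writes only the $\sigma(z_i^\top z_j)$ factors; `mu`, `sigma`, and `A` are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, S = 5, 8, 200                                 # nodes, latent dim, samples
mu = rng.standard_normal((N, D))                    # stand-in for mu_i(W)
sigma = np.exp(0.1 * rng.standard_normal((N, D)))   # stand-in for sigma_i(W)
A = (rng.random((N, N)) < 0.3).astype(float)        # made-up adjacency matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(eta):
    """log p(A|Z) evaluated at the reparameterized Z = mu + sigma * eta."""
    Z = mu + sigma * eta
    P = sigmoid(Z @ Z.T)
    return np.sum(A * np.log(P + 1e-10) + (1 - A) * np.log(1 - P + 1e-10))

# (1/S) * sum_s f(eta^(s); W), with eta^(s) ~ N(0, I)
estimate = np.mean([f(rng.standard_normal((N, D))) for _ in range(S)])
print(estimate)
```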
$$
\begin{aligned}
\mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X},\mathbf{A})}\left[\log p(\mathbf{y},\mathbf{A}\mid \mathbf{Z})\right] &=\mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X},\mathbf{A})}\left[\log [p(\mathbf{y}\mid \mathbf{Z})p(\mathbf{A} \mid \mathbf{Z})]\right]\\
&=\mathbb E_{q(\mathbf Z\mid \mathbf X,\mathbf A)}[\log p(\mathbf y\mid \mathbf Z)]+\mathbb E_{q(\mathbf Z\mid \mathbf X,\mathbf A)}[\log p(\mathbf A \mid \mathbf Z)]\\
\mathbb E_{q(Z|X,A)}[\log p(A|Z)] &\approx \frac{1}{L} \sum_{l=1}^{L}\log p_\theta(x^{(i)}\mid z^{(i,l)})
\end{aligned}
$$
Proof of the reparameterization trick
We invoke an alternative method for generating samples from $q_\phi(z|x)$; the essential parameterization is quite simple. Let $z$ be a continuous random variable with conditional distribution $z \sim q_\phi(z|x)$. It is then often possible to express the random variable as a deterministic variable $z=g_\phi(\epsilon,x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$.
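A minimal sketch of the Gaussian case, where $g_\phi(\epsilon,x)=\mu+\sigma\odot\epsilon$ with $\epsilon\sim\mathcal N(0,I)$ (`mu` and `sigma` below are made-up encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])     # made-up encoder output mu_phi(x)
sigma = np.array([1.2, 0.8])   # made-up encoder output sigma_phi(x)

# z = g_phi(eps, x) = mu + sigma * eps, with eps ~ N(0, I):
# all randomness lives in eps, so gradients can flow through mu and sigma.
eps = rng.standard_normal((10_000, 2))
z = mu + sigma * eps

print(z.mean(axis=0))  # approximately mu
print(z.std(axis=0))   # approximately sigma
```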