For more details, see Su Jianlin's blog, Scientific Spaces (科学空间).
Variational Inference
Let $x$ be the observed variable and $z$ the latent variable, and let $\tilde{p}(x)$ be the evidence distribution of $x$. Then

$$q(x)=q_{\theta}(x)=\int q_{\theta}(x,z)\,\mathrm{d}z$$
Typically, we want $q_{\theta}(x)$ to approximate $\tilde{p}(x)$, i.e., to minimize the KL divergence (equivalent to maximizing the likelihood, or minimizing the cross-entropy):
$$KL(\tilde{p}(x)\,\|\,q(x))=\int \tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x$$
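As a concrete illustration (a minimal sketch of my own, not from the original text), the discrete analogue of this KL divergence can be computed directly; the example distributions below are made up:

```python
import numpy as np

# Discrete analogue of KL(p || q) = sum p * log(p / q); the example
# distributions are illustrative, not from the text.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.4, 0.6]
q = [0.5, 0.5]
print(kl_divergence(p, q))  # positive, since p differs from q
print(kl_divergence(p, p))  # 0.0: KL vanishes when the two agree
```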
We now introduce the joint distribution $p(x,z)$. By the relation between joint and marginal distributions, $\tilde{p}(x)=\int p(x,z)\,\mathrm{d}z$. The essence of variational inference is to replace the KL divergence of the marginals, $KL(\tilde{p}(x)\,\|\,q(x))$, with the KL divergence of the joints, $KL(p(x,z)\,\|\,q(x,z))$, which gives
$$KL(p(x,z)\,\|\,q(x,z))=\iint p(x,z)\log\frac{p(x,z)}{q(x,z)}\,\mathrm{d}x\,\mathrm{d}z$$
By Bayes' rule, $p(x,z)=p(z|x)\tilde{p}(x)$ and $q(x,z)=q(z|x)q(x)$, so
$$KL(p(x,z)\,\|\,q(x,z))=\iint p(z|x)\tilde{p}(x)\log\frac{p(z|x)\tilde{p}(x)}{q(z|x)q(x)}\,\mathrm{d}x\,\mathrm{d}z$$
Splitting the $\log$ gives
$$\iint p(z|x)\tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x\,\mathrm{d}z + \iint p(z|x)\tilde{p}(x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}x\,\mathrm{d}z$$
where

$$\iint p(z|x)\tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x\,\mathrm{d}z = \int \tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x \int p(z|x)\,\mathrm{d}z = KL(\tilde{p}(x)\,\|\,q(x))$$
and

$$\iint p(z|x)\tilde{p}(x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}x\,\mathrm{d}z = \int \tilde{p}(x)\int p(z|x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}z\,\mathrm{d}x = \int \tilde{p}(x)\,KL(p(z|x)\,\|\,q(z|x))\,\mathrm{d}x$$
Therefore

$$KL(p(x,z)\,\|\,q(x,z))=KL(\tilde{p}(x)\,\|\,q(x))+\int \tilde{p}(x)\,KL(p(z|x)\,\|\,q(z|x))\,\mathrm{d}x \ge KL(\tilde{p}(x)\,\|\,q(x))$$
This means the joint KL divergence is an upper bound on the marginal KL divergence, so minimizing it also drives the marginal KL down. In general, $KL(p(x,z)\,\|\,q(x,z))$ is easier to compute than $KL(\tilde{p}(x)\,\|\,q(x))$, so variational inference provides a tractable scheme.
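The inequality above is easy to check numerically. The sketch below (a toy example of my own, not from the text) builds a small discrete joint distribution and confirms that the joint KL upper-bounds the marginal KL:

```python
import numpy as np

# Toy 2x2 joint distributions over (x, z); rows index x, columns index z.
p_xz = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
q_xz = np.array([[0.25, 0.25],
                 [0.25, 0.25]])

kl_joint = float(np.sum(p_xz * np.log(p_xz / q_xz)))

# Marginalize out z, then compute the marginal KL.
p_x, q_x = p_xz.sum(axis=1), q_xz.sum(axis=1)
kl_marginal = float(np.sum(p_x * np.log(p_x / q_x)))

assert kl_joint >= kl_marginal  # KL(p(x,z)||q(x,z)) >= KL(p(x)||q(x))
```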
VAE
Let $q(x,z)=q(x|z)q(z)$ and $p(x,z)=\tilde{p}(x)p(z|x)$. Substituting into the joint KL divergence,
$$KL(p(x,z)\,\|\,q(x,z))=\iint \tilde{p}(x)p(z|x)\log\frac{\tilde{p}(x)p(z|x)}{q(x|z)q(z)}\,\mathrm{d}x\,\mathrm{d}z$$
Splitting the $\log$ gives
$$\iint \tilde{p}(x)p(z|x)\log\tilde{p}(x)\,\mathrm{d}x\,\mathrm{d}z - \iint \tilde{p}(x)p(z|x)\log q(x|z)\,\mathrm{d}x\,\mathrm{d}z + \int \tilde{p}(x)\,KL(p(z|x)\,\|\,q(z))\,\mathrm{d}x$$
Comparing exact integration with sampling-based (Monte Carlo) computation, we know that

$$\mathbb{E}_{x \sim p(x)}[f(x)]=\int f(x)p(x)\,\mathrm{d}x \approx \frac{1}{n}\sum_{i=1}^n f(x_i)$$
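This Monte Carlo approximation can be sketched as follows (a toy example with $f(x)=x^2$, where the exact expectation under $\mathcal{N}(0,1)$ is 1; the sample size is arbitrary):

```python
import numpy as np

# Estimate E_{x ~ N(0,1)}[x^2] by averaging f over n samples;
# the exact value is 1 (the variance of a standard normal).
rng = np.random.default_rng(0)
n = 100_000
samples = rng.standard_normal(n)
estimate = float(np.mean(samples ** 2))  # (1/n) * sum_i f(x_i)
print(estimate)  # close to 1
```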
Moreover, since $\log\tilde{p}(x)$ does not involve the parameters being optimized, it can be treated as a constant, and we obtain
$$\mathbb{E}_{x \sim \tilde{p}(x)}\left[-\int p(z|x)\log q(x|z)\,\mathrm{d}z + KL(p(z|x)\,\|\,q(z))\right]=\mathbb{E}_{x \sim \tilde{p}(x)}\left[\mathbb{E}_{z \sim p(z|x)}[-\log q(x|z)]+KL(p(z|x)\,\|\,q(z))\right]$$
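When the posterior $p(z|x)$ is Gaussian $\mathcal{N}(\mu,\sigma^2)$ and the prior $q(z)$ is $\mathcal{N}(0,1)$, the KL term in this objective has a well-known closed form per latent dimension, $KL=\frac{1}{2}(\mu^2+\sigma^2-\log\sigma^2-1)$. A minimal sketch with toy values of my own:

```python
import numpy as np

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) for one latent dimension.
def gaussian_kl(mu, sigma):
    return 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

print(gaussian_kl(0.0, 1.0))  # 0.0: the posterior equals the prior
print(gaussian_kl(1.0, 0.5))  # positive otherwise
```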
Encoder
Paper [1] introduces the evidence lower bound (ELBO), which is the VAE's optimization objective. For the latent variable $z$, approximating $p(z)$ with $q(z|x)$ gives
$$KL(p(z)\,\|\,q(z|x))=\int p(z)\log\frac{p(z)}{q(z|x)}\,\mathrm{d}z=\mathbb{E}_{z\sim p(z)}[\log p(z)] - \mathbb{E}_{z \sim p(z)}[\log q(z|x)]$$
Using Bayes' rule ($q(z|x)=q(z,x)/q(x)$), this becomes

$$\mathbb{E}[\log p(z)]-\mathbb{E}[\log q(z,x)]+\log q(x)$$
Define

$$\mathrm{ELBO}(p)=\mathbb{E}[\log q(z,x)]-\mathbb{E}[\log p(z)]=\mathbb{E}[\log q(x|z)]-KL(p(z)\,\|\,q(z))$$
This is consistent with the objective derived earlier: $KL(p(z)\,\|\,q(z|x))$ and $KL(p(z)\,\|\,q(z))$ serve the same purpose!
Reparameterization
In the expression above, $\mathbb{E}_{z \sim p(z|x)}$ requires sampling the latent variable $z$, but sampling is not differentiable. Instead, sample $\xi$ from $\mathcal{N}(0,1)$ and set $z=\mu+\sigma \times \xi$. Drawing a single sample per step, the VAE objective becomes
$$\mathbb{E}_{x \sim \tilde{p}(x)}\left[-\log q(x|\mu,\sigma)+KL(p(\mu,\sigma|x)\,\|\,q(\mu,\sigma))\right]$$
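The trick itself can be sketched in a few lines (toy $\mu$ and $\sigma$ values, not from the text): sampling $\xi$ from $\mathcal{N}(0,1)$ and shifting/scaling it reproduces samples from $\mathcal{N}(\mu,\sigma^2)$ while keeping $\mu$ and $\sigma$ on a differentiable path.

```python
import numpy as np

# z = mu + sigma * xi with xi ~ N(0,1) gives z ~ N(mu, sigma^2);
# mu and sigma enter through plain arithmetic, so gradients can flow.
rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
xi = rng.standard_normal(100_000)
z = mu + sigma * xi

print(float(z.mean()))  # approximately mu
print(float(z.std()))   # approximately sigma
```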
GAN
GAN assumes $q(z) \sim N(z;0,I)$ and sets $q(x|z)=\delta(x-G(z))$, where $\delta(x)$ is the Dirac $\delta$ function and $G(z)$ is the generator. GAN also introduces a binary variable $y$ to form the joint distribution
$$q(x,y) = \begin{cases} \tilde{p}(x)p_1, & y=1 \\ q(x)p_0, & y = 0 \end{cases}$$
Let $p(x,y)=p(y|x)\tilde{p}(x)$; then
$$KL(q(x,y)\,\|\,p(x,y))=\int \tilde{p}(x)p_1\log\frac{\tilde{p}(x)p_1}{p(1|x)\tilde{p}(x)}\,\mathrm{d}x + \int q(x)p_0 \log\frac{q(x)p_0}{p(0|x)\tilde{p}(x)}\,\mathrm{d}x$$
Let $D(x)=p(1|x)$ be the discriminator, and optimize alternately. First fix $G(z)$, so that $q(x)$ is constant; then (dropping the minus sign)
$$D=\mathop{\arg\max}_D \int \tilde{p}(x) \log D(x)\,\mathrm{d}x+\int q(x) \log(1-D(x))\,\mathrm{d}x=\mathop{\arg\max}_D \mathbb{E}_{x \sim \tilde{p}(x)}[\log D(x)] + \mathbb{E}_{x \sim q(x)}[\log(1-D(x))]$$
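As a sanity check of this objective (a toy sketch with made-up discriminator outputs, not an actual training loop), the negated objective acts as a loss that is smaller when $D$ scores real samples high and fake samples low:

```python
import numpy as np

# Negated discriminator objective: minimizing this is equivalent to
# maximizing E[log D(x_real)] + E[log(1 - D(x_fake))].
def d_loss(d_real, d_fake):
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

confident = d_loss([0.9, 0.8], [0.1, 0.2])  # D separates real from fake
unsure = d_loss([0.5, 0.5], [0.5, 0.5])     # D guesses at chance
assert confident < unsure
```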
Next, fix $D(x)$; then
$$G=\mathop{\arg\min}_G \int q(x)\log\frac{q(x)}{(1-D(x))\tilde{p}(x)}\,\mathrm{d}x$$
From the optimal solution for $D(x)$, where $q^o(x)$ denotes the generator distribution against which $D$ was optimized,

$$D(x)=\frac{\tilde{p}(x)}{\tilde{p}(x)+q^o(x)}$$
we can rearrange to get

$$\tilde{p}(x) = \frac{D(x)q^o(x)}{1-D(x)}$$
Substituting this into the generator objective,

$$\int q(x) \log\frac{q(x)}{D(x)q^o(x)}\,\mathrm{d}x=-\mathbb{E}_{z \sim q(z)}[\log D(G(z))]+KL(q(x)\,\|\,q^o(x))$$
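The substitution used above is easy to verify numerically at a single point $x$ (toy density values of my own choosing):

```python
import numpy as np

# With the optimal discriminator D = p~ / (p~ + q_o), rearranging
# recovers p~ = D * q_o / (1 - D), as used in the derivation above.
p_tilde, q_o = 0.3, 0.7  # made-up density values at one point x
D = p_tilde / (p_tilde + q_o)
assert np.isclose(p_tilde, D * q_o / (1.0 - D))
```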
Other Notes
The content here mainly comes from Su Jianlin's articles; I have simply recorded the derivations in more detail while studying them (his other articles in fact contain even more detailed derivations). Understanding the theory of variational inference makes the derivation of the VAE accessible, which in turn makes it possible to apply these models to other concrete domains and problems, and to improve and optimize them there.
The Bayesian mixture of Gaussians material from paper [1] has not yet been covered here.
References
[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." Journal of the American Statistical Association 112.518 (2017): 859-877.
[2] Su, Jianlin. “Variational inference: A unified framework of generative models and some revelations.” arXiv preprint arXiv:1807.05936 (2018).