Notes on Variational Inference

For more details, see Su Jianlin's blog, Scientific Spaces (科学空间).

Variational Inference

Let $x$ be the observed variable, $z$ the latent variable, and $\tilde{p}(x)$ the evidence distribution of $x$. Then

$$q(x)=q_{\theta}(x)=\int q_{\theta}(x,z)\,\mathrm{d}z$$
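As a concrete toy instance of this marginalization (my own illustration, not from the original notes; all numbers are arbitrary), take $z$ to be discrete, so the integral becomes a sum:

```python
import numpy as np
from scipy.stats import norm

# Toy latent-variable model q_theta(x, z) with a discrete latent z:
# z ~ Categorical(pi), x | z ~ N(mu_z, sigma_z^2).
# The marginal q_theta(x) = sum_z pi_z * N(x; mu_z, sigma_z^2)
# is the integral above with dz replaced by a sum over z.
pi = np.array([0.3, 0.7])      # mixture weights (part of theta)
mu = np.array([-2.0, 1.0])     # component means (part of theta)
sigma = np.array([0.5, 1.0])   # component std devs (part of theta)

def q_marginal(x):
    """Marginal density q_theta(x) = sum_z q_theta(x, z)."""
    return sum(p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

print(q_marginal(0.0))  # density of the marginal at x = 0
```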
Usually we want to approximate $\tilde{p}(x)$ with $q_{\theta}(x)$, i.e. to minimize the KL divergence (equivalent to maximizing the likelihood, or minimizing the cross entropy):

$$KL(\tilde{p}(x)\|q(x))=\int \tilde{p}(x)\log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x$$
Now introduce the joint distribution $p(x,z)$. From the relation between joint and marginal distributions, $\tilde{p}(x)=\int p(x,z)\,\mathrm{d}z$. The essence of variational inference is to replace the KL divergence between the marginal distributions, $KL(\tilde{p}(x)\|q(x))$, with the KL divergence between the joint distributions, $KL(p(x,z)\|q(x,z))$, which gives

$$KL(p(x,z)\|q(x,z)) = \iint p(x,z)\log\frac{p(x,z)}{q(x,z)}\,\mathrm{d}x\,\mathrm{d}z$$
By Bayes' rule, $p(x,z)=p(z|x)\tilde{p}(x)$ and $q(x,z)=q(z|x)q(x)$, so

$$KL(p(x,z)\|q(x,z)) =\iint p(z|x)\tilde{p}(x) \log\frac{p(z|x)\tilde{p}(x)}{q(z|x)q(x)}\,\mathrm{d}x\,\mathrm{d}z$$
log ⁡ \log log拆分可以得到
∬ p ( z ∣ x ) p ~ ( x ) log ⁡ p ~ ( x ) q ( x ) d x d z + ∬ p ( z ∣ x ) p ~ ( x ) log ⁡ p ( z ∣ x ) q ( z ∣ x ) d x d z \iint{p(z|x)\tilde{p}(x) \log{\frac{\tilde{p}(x)}{q(x)}} \mathrm{d}x\mathrm{d}z} + \iint{ p(z|x)\tilde{p}(x) \log{\frac{p(z|x)}{q(z|x)}} \mathrm{d}x\mathrm{d}z} p(zx)p~(x)logq(x)p~(x)dxdz+p(zx)p~(x)logq(zx)p(zx)dxdz
where (using $\int p(z|x)\,\mathrm{d}z=1$)

$$\iint p(z|x)\tilde{p}(x) \log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x\,\mathrm{d}z = \int \tilde{p}(x) \log\frac{\tilde{p}(x)}{q(x)}\,\mathrm{d}x \int p(z|x)\,\mathrm{d}z=KL(\tilde{p}(x)\|q(x))$$

$$\iint p(z|x)\tilde{p}(x) \log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}x\,\mathrm{d}z=\int \tilde{p}(x) \int p(z|x)\log\frac{p(z|x)}{q(z|x)}\,\mathrm{d}z\,\mathrm{d}x=\int \tilde{p}(x)\,KL(p(z|x)\|q(z|x))\,\mathrm{d}x$$
Therefore

$$KL(p(x,z)\|q(x,z))=KL(\tilde{p}(x)\|q(x))+\int \tilde{p}(x)\,KL(p(z|x)\|q(z|x))\,\mathrm{d}x \ge KL(\tilde{p}(x)\|q(x))$$

which means the joint KL divergence is an upper bound on the marginal KL divergence. In general, $KL(p(x,z)\|q(x,z))$ is easier to compute than $KL(\tilde{p}(x)\|q(x))$, so variational inference provides a tractable scheme: minimize the joint KL as a surrogate for the marginal KL.
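A quick numerical sanity check of this inequality (my own toy example with discrete $x$ and $z$, where both KLs can be computed exactly):

```python
import numpy as np

# Random discrete joints p(x,z) and q(x,z) over a 3x4 grid (axis 0 = x, axis 1 = z).
rng = np.random.default_rng(1)
p_joint = rng.random((3, 4)); p_joint /= p_joint.sum()
q_joint = rng.random((3, 4)); q_joint /= q_joint.sum()

kl_joint = np.sum(p_joint * np.log(p_joint / q_joint))   # KL(p(x,z) || q(x,z))
p_marg, q_marg = p_joint.sum(axis=1), q_joint.sum(axis=1)
kl_marg = np.sum(p_marg * np.log(p_marg / q_marg))       # KL(p~(x) || q(x))

print(kl_joint >= kl_marg)  # True: the joint KL upper-bounds the marginal KL
```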

VAE

q ( x , z ) = q ( x ∣ z ) q ( z ) q(x,z)=q(x|z)q(z) q(x,z)=q(xz)q(z) p ( x , z ) = p ~ ( x ) p ( z ∣ x ) p(x,z)=\tilde{p}(x)p(z|x) p(x,z)=p~(x)p(zx),带入联合分布KL散度,
K L ( p ( x , z ) ∣ ∣ q ( x , z ) ) = ∬ p ~ ( x ) p ( z ∣ x ) log ⁡ p ~ ( x ) p ( z ∣ x ) q ( x ∣ z ) q ( z ) d x d z KL(p(x,z)||q(x,z))=\iint{\tilde{p}(x)p(z|x) \log{\frac{\tilde{p}(x)p(z|x)}{q(x|z)q(z)}} \mathrm{d}x \mathrm{d}z} KL(p(x,z)q(x,z))=p~(x)p(zx)logq(xz)q(z)p~(x)p(zx)dxdz
log ⁡ \log log拆开可以得到
∬ p ~ ( x ) p ( z ∣ x ) log ⁡ p ~ ( x ) d x d z − ∬ p ~ ( x ) p ( z ∣ x ) log ⁡ q ( x ∣ z ) d x d z + ∫ p ~ ( x ) K L ( p ( z ∣ x ) ∣ ∣ q ( z ) ) d x \iint{\tilde{p}(x)p(z|x)\log{\tilde{p}(x)} \mathrm{d}x \mathrm{d}z}-\iint{\tilde{p}(x)p(z|x)\log{q(x|z)} \mathrm{d}x \mathrm{d}z} + \int{\tilde{p}(x) KL(p(z|x)||q(z)) \mathrm{d}x} p~(x)p(zx)logp~(x)dxdzp~(x)p(zx)logq(xz)dxdz+p~(x)KL(p(zx)q(z))dx
Recall the relation between exact integration and sampling-based (Monte Carlo) estimation:

$$\mathbb{E}_{x \sim p(x)}[f(x)]=\int f(x)p(x)\,\mathrm{d}x \approx \frac{1}{n}\sum_{i=1}^n f(x_i)$$
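A quick numpy check of this approximation (my own example; the choices $f(x)=x^2$ and $p(x)=\mathcal{N}(0,1)$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x ~ N(0,1)}[x^2], which equals 1 exactly, by the
# sample average (1/n) * sum_i f(x_i) from the formula above.
n = 100_000
samples = rng.standard_normal(n)   # x_i ~ p(x) = N(0, 1)
estimate = np.mean(samples ** 2)
print(estimate)  # close to 1.0 for large n
```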
log ⁡ p ~ ( x ) \log{\tilde{p}(x)} logp~(x)不包含优化目标,可以视为常数,可以得到
E x ∼ p ~ ( x ) [ − ∫ p ( z ∣ x ) log ⁡ q ( x ∣ z ) d z + K L ( p ( z ∣ x ) ∣ ∣ q ( z ) ) ] = E x ∼ p ~ ( x ) [ E z ∼ p ( z ∣ x ) [ − log ⁡ q ( x ∣ z ) ] + K L ( p ( z ∣ x ) ∣ ∣ q ( z ) ) ] \mathbb{E}_{x \sim \tilde{p}(x)}[-\int{p(z|x)\log{q(x|z)} \mathrm{d}z} + KL(p(z|x)||q(z))]=\mathbb{E}_{x \sim \tilde{p}(x)}[\mathbb{E}_{z \sim p(z|x)}[-\log{q(x|z)}]+KL(p(z|x)||q(z))] Exp~(x)[p(zx)logq(xz)dz+KL(p(zx)q(z))]=Exp~(x)[Ezp(zx)[logq(xz)]+KL(p(zx)q(z))]
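As an aside (the Gaussian choices below are assumptions of the usual VAE setup, not stated above): with a Gaussian encoder $p(z|x)=\mathcal{N}(z;\mu(x),\mathrm{diag}\,\sigma^2(x))$ and prior $q(z)=\mathcal{N}(z;0,I)$, the KL term has the closed form

$$KL(p(z|x)\|q(z))=\frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2(x)+\sigma_i^2(x)-\log\sigma_i^2(x)-1\right)$$

so only the reconstruction term $\mathbb{E}_{z \sim p(z|x)}[-\log q(x|z)]$ needs to be estimated by sampling.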

Encoder

Paper [1] introduces the evidence lower bound (ELBO), which is exactly the VAE's optimization objective. For the latent variable $z$, approximate $p(z)$ with $q(z|x)$; then

$$KL(p(z)\|q(z|x))=\int p(z) \log\frac{p(z)}{q(z|x)}\,\mathrm{d}z=\mathbb{E}_{z\sim p(z)}[\log p(z)] - \mathbb{E}_{z \sim p(z)}[\log q(z|x)]$$
Using Bayes' rule, $q(z|x)=q(z,x)/q(x)$, this becomes

$$\mathbb{E}_{z\sim p(z)}[\log p(z)]-\mathbb{E}_{z\sim p(z)}[\log q(z,x)]+\log q(x)$$

Since the KL divergence is non-negative, rearranging gives $\log q(x) \ge \mathbb{E}[\log q(z,x)]-\mathbb{E}[\log p(z)]$, and the right-hand side is exactly the evidence lower bound:

$$\mathrm{ELBO}(p)=\mathbb{E}[\log q(z,x)]-\mathbb{E}[\log p(z)]=\mathbb{E}[\log q(x|z)]-KL(p(z)\|q(z))$$

This is consistent with the optimization objective obtained earlier (here $p(z)$ plays the role of the encoder distribution $p(z|x)$ of the previous section): $KL(p(z)\|q(z|x))$ and $KL(p(z)\|q(z))$ serve the same purpose!

Reparameterization

The expectation $\mathbb{E}_{z \sim p(z|x)}$ above requires sampling the latent variable $z$, but sampling is not differentiable. Instead, sample $\xi$ from $\mathcal{N}(0,1)$ and set $z=\mu+\sigma \xi$, so the randomness no longer depends on the parameters. With a single sample per data point, the VAE objective becomes

$$\mathbb{E}_{x \sim \tilde{p}(x)}\left[-\log q(x|\mu,\sigma)+KL(p(\mu,\sigma|x)\|q(\mu,\sigma))\right]$$
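A minimal PyTorch sketch of this objective with one sample of $\xi$ per data point (a Gaussian encoder, $\mathcal{N}(0,I)$ prior, and Bernoulli decoder are assumed; the architecture and every name here are my own illustrative choices, not from the original derivation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)      # encoder mean mu(x)
        self.logvar = nn.Linear(h_dim, z_dim)  # encoder log sigma^2(x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * xi with xi ~ N(0, I),
        # so the sampling step stays differentiable w.r.t. mu and sigma.
        xi = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * xi
        return self.dec(z), mu, logvar

def vae_loss(x, x_logits, mu, logvar):
    # Reconstruction term: -log q(x|z) for a Bernoulli decoder,
    # estimated with the single sample of z drawn in forward().
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL(p(z|x) || q(z)) in closed form for the Gaussian encoder / N(0,I) prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```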

GAN

GAN fixes $q(z)=\mathcal{N}(z;0,I)$ and takes $q(x|z)=\delta(x-G(z))$, where $\delta(\cdot)$ is the Dirac delta function and $G(z)$ is the generator. GAN then introduces a binary variable $y$ to form the joint distribution

$$q(x,y) = \begin{cases} \tilde{p}(x)p_1, & y=1 \\ q(x)p_0, & y = 0 \end{cases}$$

with $p_1=p(y=1)$ and $p_0=p(y=0)$ the prior class probabilities. Let $p(x,y)=p(y|x)\tilde{p}(x)$; then

$$KL(q(x,y)\|p(x,y))=\int \tilde{p}(x)p_1\log\frac{\tilde{p}(x)p_1}{p(1|x)\tilde{p}(x)}\,\mathrm{d}x + \int q(x)p_0 \log\frac{q(x)p_0}{p(0|x)\tilde{p}(x)}\,\mathrm{d}x$$

Let $D(x)=p(1|x)$ be the discriminator, and optimize alternately. First fix $G(z)$, i.e. treat $q(x)$ as a constant; then, dropping the additive constants and the overall minus sign (so the minimization becomes a maximization), we get

$$D=\argmax_D \int \tilde{p}(x) \log D(x)\,\mathrm{d}x+\int q(x) \log(1-D(x))\,\mathrm{d}x=\argmax_D \mathbb{E}_{x \sim \tilde{p}(x)}[\log D(x)] + \mathbb{E}_{x \sim q(x)}[\log(1-D(x))]$$
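A sketch of one discriminator update implementing this maximization, written as minimizing the negated objective via binary cross-entropy (`D`, `G`, `opt_D`, and `z_dim` are hypothetical names assumed to be defined elsewhere):

```python
import torch
import torch.nn.functional as F

def d_step(D, G, x_real, z_dim, opt_D):
    """Maximize E_{x~p~}[log D(x)] + E_{x~q}[log(1 - D(x))] over D."""
    z = torch.randn(x_real.size(0), z_dim)  # z ~ q(z) = N(0, I)
    x_fake = G(z).detach()                  # G is fixed: block its gradients
    logits_real, logits_fake = D(x_real), D(x_fake)
    # BCE with target 1 gives -log D(x); with target 0 gives -log(1 - D(x)).
    loss = F.binary_cross_entropy_with_logits(
        logits_real, torch.ones_like(logits_real)
    ) + F.binary_cross_entropy_with_logits(
        logits_fake, torch.zeros_like(logits_fake)
    )
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()
    return loss.item()
```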
Next fix $D(x)$; then

$$G=\argmin_G \int q(x)\log\frac{q(x)}{(1-D(x))\tilde{p}(x)}\,\mathrm{d}x$$

When $D(x)$ attains its optimum

$$D(x)=\frac{\tilde{p}(x)}{\tilde{p}(x)+q^o(x)}$$

where $q^o(x)$ denotes the generator distribution from the previous step (the one $D$ was trained against),
we can rearrange to get

$$\tilde{p}(x) = \frac{D(x)q^o(x)}{1-D(x)}$$
Substituting this back into the generator objective yields

$$\int q(x) \log\frac{q(x)}{D(x)q^o(x)}\,\mathrm{d}x=-\mathbb{E}_{z \sim q(z)}[\log D(G(z))]+KL(q(x)\|q^o(x))$$
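In practice the $KL(q(x)\|q^o(x))$ term is not computed explicitly; a single small gradient step is assumed to keep $q$ close to $q^o$, leaving the familiar non-saturating generator loss $-\mathbb{E}_{z \sim q(z)}[\log D(G(z))]$. A sketch under that reading (again with hypothetical names):

```python
import torch
import torch.nn.functional as F

def g_step(D, G, batch_size, z_dim, opt_G):
    """Minimize -E_{z~q(z)}[log D(G(z))] over G, with D held fixed."""
    z = torch.randn(batch_size, z_dim)  # z ~ q(z) = N(0, I)
    logits = D(G(z))
    # BCE with target 1 gives -log D(G(z)); only G's parameters are
    # updated because opt_G holds only G's parameters.
    loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```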

Closing Remarks

The content here mainly follows Su Jianlin's articles; these notes simply record the derivations in more detail as I worked through them (Su's other posts actually contain even more detailed derivations). Understanding the theory of variational inference makes the theoretical derivation of the VAE accessible, which in turn makes it possible to apply, improve, and optimize it in other concrete domains and problems.

I have not yet written up the Bayesian mixture of Gaussians material from paper [1].

References

[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational Inference: A Review for Statisticians." Journal of the American Statistical Association 112.518 (2017): 859-877.
[2] Su, Jianlin. "Variational Inference: A Unified Framework of Generative Models and Some Revelations." arXiv preprint arXiv:1807.05936 (2018).
