VAE Explained

1. Background

1.1 ELBo

1.1.1 Why introduce a latent variable $z$?

Because the objects we observe in the real world can be thought of as arising from higher-level representations, which capture abstract attributes such as color, size, and shape.

1.1.2 How is the ELBo (Evidence Lower Bound) derived?

An unconditional generative model learns to model the true data distribution $p(x)$, so we have:

$$
\begin{align}
\log{\underbrace{p\left(x\right)}_{\text{evidence}}} &= \log{p\left(x\right)}\int \underbrace{q_{\phi}\left(z\vert x\right)}_{\text{approximate posterior}}dz \\
&=\int q_{\phi}\left(z\vert x\right)\left(\log{p\left(x\right)}\right)dz \\
&=\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{p\left(x\right)}\right] \\
&=\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(x,z\right)}{p\left(z\vert x\right)}}\right] \\
&=\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(x,z\right)q_{\phi}\left(z\vert x\right)}{p\left(z\vert x\right)q_{\phi}\left(z\vert x\right)}}\right] \\
&=\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(x,z\right)}{q_{\phi}\left(z\vert x\right)}}\right] + \mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{q_{\phi}\left(z\vert x\right)}{p\left(z\vert x\right)}}\right] \\
&=\underbrace{\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(x,z\right)}{q_{\phi}\left(z\vert x\right)}}\right]}_{\text{ELBo}} + \underbrace{D_{KL}\left(\underbrace{q_{\phi}\left(z\vert x\right)}_{\text{approximate posterior}} \Vert \underbrace{p\left(z\vert x\right)}_{\text{true posterior}}\right)}_{\geq 0} \\
&\geq \underbrace{\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(x,z\right)}{q_{\phi}\left(z\vert x\right)}}\right]}_{\text{ELBo}}
\end{align}
$$

1.1.3 Why maximize the ELBo?

Reason 1: We want the learned approximate posterior $q_{\phi}(z\vert x)$ to be as close as possible to the true posterior $p(z\vert x)$, but the $D_{KL}$ term in Eq. (7) cannot be minimized directly, because the true posterior $p(z\vert x) = p(x,z)/p(x)$ involves the intractable evidence $p(x)$:

$$
\begin{align}
\min_{\phi}{\underbrace{D_{KL}\left(\underbrace{\underbrace{q_{\phi}\left(z\vert x\right)}_{\text{approximate posterior}}}_{\text{learnable Encoder}} \Vert \underbrace{\underbrace{p\left(z\vert x\right)}_{\text{true posterior}}}_{\text{unknown}}\right)}_{\text{intractable}}}
\end{align}
$$

Reason 2: For any sample $x_i \sim p(x)$, $p(x_i)$ is a constant, so maximizing the ELBo over $\phi$ is equivalent to minimizing the $D_{KL}$ term:
$$
\begin{align}
\because\log{\underbrace{p\left(x_i\right)}_{\text{constant}}} &= \underbrace{\mathbb{E}_{q_{\phi}\left(z\vert x_i\right)}\left[\log{\frac{p\left(x_i,z\right)}{q_{\phi}\left(z\vert x_i\right)}}\right]}_{\text{ELBo}} + \underbrace{D_{KL}\left(\underbrace{q_{\phi}\left(z\vert x_i\right)}_{\text{approximate posterior}} \Vert \underbrace{p\left(z\vert x_i\right)}_{\text{true posterior}}\right)}_{\geq 0} \\
\therefore \min_{\phi}{D_{KL}} &\iff \max_{\phi}{\text{ELBo}}
\end{align}
$$

2. VAE (Variational Autoencoder)

2.1 Why "Variational"?

Because the $q_{\phi}(z\vert x)$ we optimize is chosen from a family of distributions parameterized by $\phi$; optimizing over such a family of distributions is where the name "Variational" comes from.

2.2 Why "Autoencoder"?

Because, like an AE (Autoencoder), the model compresses the data into a lower-dimensional representation and extracts the informative structure in the data.

2.3 What is the VAE optimization objective?

$$
\begin{align}
&\max_{\phi}\underbrace{\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(x,z\right)}{q_{\phi}\left(z\vert x\right)}}\right]}_{\text{ELBo}} \\
&= \max_{\phi,\theta}\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p_{\theta}\left(x\vert z\right)p\left(z\right)}{q_{\phi}\left(z\vert x\right)}}\right] \\
&=\max_{\phi,\theta}\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{p_{\theta}\left(x\vert z\right)}\right] + \mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\frac{p\left(z\right)}{q_{\phi}\left(z\vert x\right)}}\right] \\
&=\max_{\phi,\theta}\underbrace{\mathbb{E}_{q_{\phi}\left(z\vert x\right)}\left[\log{\underbrace{p_{\theta}\left(x\vert z\right)}_{\text{Decoder}}}\right]}_{\text{reconstruction term}} - \underbrace{D_{KL}\left(\underbrace{q_{\phi}\left(z\vert x\right)}_{\text{Encoder}} \Vert \underbrace{p\left(z\right)}_{\text{prior}}\right)}_{\text{prior matching term}} \\
&\overset{\text{Monte Carlo estimate}}{\approx}\max_{\phi,\theta}\frac{1}{L}\sum_{l=1}^{L}\log{p_{\theta}\left(x\vert z^{(l)}\right)} - D_{KL}\left(\underbrace{q_{\phi}\left(z\vert x\right)}_{\sim N\left(\mu,\sigma^2\right)}\Vert \underbrace{p\left(z\right)}_{\sim N\left(0,1\right)}\right) \\
&=\max_{\phi,\theta}\frac{1}{L}\sum_{l=1}^{L}\log{p_{\theta}\left(x\vert z^{(l)}\right)} - \frac{1}{2}\left(-\log{\sigma^2} + \mu^2 + \sigma^2 - 1\right)
\end{align}
$$
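The last step above uses the closed-form KL divergence between two univariate Gaussians. As a quick sanity check (not part of the original derivation, but standard), it can be verified directly:

$$
\begin{aligned}
D_{KL}\left(N\left(\mu,\sigma^2\right)\Vert N\left(0,1\right)\right)
&= \mathbb{E}_{z\sim N\left(\mu,\sigma^2\right)}\left[\log\frac{N\left(z;\mu,\sigma^2\right)}{N\left(z;0,1\right)}\right] \\
&= \mathbb{E}_{z\sim N\left(\mu,\sigma^2\right)}\left[-\frac{1}{2}\log\sigma^2 - \frac{\left(z-\mu\right)^2}{2\sigma^2} + \frac{z^2}{2}\right] \\
&= -\frac{1}{2}\log\sigma^2 - \frac{1}{2} + \frac{\mu^2+\sigma^2}{2} \\
&= \frac{1}{2}\left(-\log\sigma^2 + \mu^2 + \sigma^2 - 1\right)
\end{aligned}
$$

For a $d$-dimensional diagonal Gaussian, the same expression is summed over the latent dimensions.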

From Eqs. (15) and (16), the objective consists of two terms: the reconstruction term forces the model's decoder (Decoder) to learn to recover the original sample from the latent variable $z$, while the prior matching term forces the model's encoder (Encoder) to learn to map the original samples onto the prior distribution (a standard normal distribution).
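Putting Eqs. (15)-(17) into code, the following is a minimal sketch of the (negative) ELBo loss in PyTorch, assuming a single Monte Carlo sample ($L=1$), a Bernoulli decoder (so the reconstruction term becomes binary cross entropy), and an encoder that outputs $\log\sigma^2$; the function and argument names are illustrative, not from the original post.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    """Negative ELBo for one batch (Eqs. 15-17), assuming a Bernoulli decoder.

    x_hat  : decoder output in [0, 1], shape (B, D)
    x      : original input in [0, 1], shape (B, D)
    mu     : encoder mean, shape (B, latent_dim)
    log_var: encoder log(sigma^2), shape (B, latent_dim)
    """
    # Reconstruction term: -E_q[log p_theta(x|z)], Monte Carlo estimate with L = 1.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Prior matching term: D_KL(N(mu, sigma^2) || N(0, I)) in closed form (Eq. 17).
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
    # Maximizing the ELBo is equivalent to minimizing (recon + kl).
    return (recon + kl) / x.size(0)
```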

2.4 VAE Model Architecture

(Figure: VAE model architecture)
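As a concrete reference for the architecture above, here is a minimal MLP-based VAE sketch in PyTorch; the layer sizes (e.g. 784-dimensional inputs for flattened MNIST-like images) and module names are illustrative assumptions, not taken from the original post.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal MLP VAE; layer sizes are illustrative assumptions."""

    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder q_phi(z|x): shared hidden layer, then heads for mu and log(sigma^2).
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)
        self.fc_log_var = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): maps z back to pixel space in [0, 1].
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_log_var(h)

    def reparameterize(self, mu, log_var):
        # z = mu + sigma * eps, eps ~ N(0, I): gradients flow through mu and sigma.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.dec(z), mu, log_var
```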

3. Training a VAE

During training, batches of images are fed into the model. For each image, the Encoder produces $\mu$ and $\sigma$, a latent variable $z \sim N(\mu, \sigma^2)$ is sampled from them, and the Decoder then generates a reconstructed image. The overall pipeline is:

$$
\underbrace{x}_{x \sim p\left(x\right)} \rightarrow \underbrace{\text{Encoder}}_{q_{\phi}\left(z\vert x\right)} \rightarrow \mu,\sigma \rightarrow \underbrace{\underbrace{z\sim N\left(\mu,\sigma^2\right)}_{z=\mu + \sigma \odot \epsilon,\ \text{with } \epsilon \sim N\left(0,I\right)}}_{\text{reparameterization trick}} \rightarrow \underbrace{\text{Decoder}}_{p_{\theta}\left(x\vert z\right)} \rightarrow \hat{x}
$$

The reparameterization trick is applied during training so that the whole pipeline stays differentiable: $\mu$ and $\sigma$ become learnable, differentiable parameters, while the random noise $\epsilon$ is treated as a constant that requires no gradient and is kept out of the computation graph.
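A minimal training-loop sketch tying the pieces together, assuming the `VAE` class and `vae_loss` function sketched above and a hypothetical `data_loader` yielding batches of images with values in $[0, 1]$:

```python
# Minimal training loop sketch (assumes VAE, vae_loss, and a data_loader exist).
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for x in data_loader:
    x = x.view(x.size(0), -1)               # flatten to (B, 784)
    x_hat, mu, log_var = model(x)           # Encoder -> reparameterize -> Decoder
    loss = vae_loss(x_hat, x, mu, log_var)  # negative ELBo
    optimizer.zero_grad()
    loss.backward()                         # gradients reach mu, sigma via the trick
    optimizer.step()
```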

4. VAE Inference

At inference time, new samples are generated simply by sampling the latent variable $z$ from the standard normal distribution, because the prior matching term in the VAE objective pushes $z$ toward the standard normal distribution. The overall pipeline is:

$$
\underbrace{z}_{z \sim N\left(0,I\right)} \rightarrow \underbrace{\text{Decoder}}_{p_{\theta}\left(x\vert z\right)} \rightarrow \underbrace{\hat{x}}_{\text{new sample}}
$$
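A minimal sampling sketch, assuming the `VAE` class from Section 2.4; note that only the Decoder is used at inference time.

```python
# Generate new samples: draw z from the standard normal prior and decode it.
model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)   # 16 latent vectors, z ~ N(0, I), z_dim = 20
    x_new = model.dec(z)      # Decoder p_theta(x|z) -> 16 new samples
```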
