A generative model learns the process that generates the samples by modeling the joint distribution of the variables, whereas a discriminative model only models the mapping between variables. The advantages of generative models are:
- A generative model encodes a deep understanding of the underlying physical laws and is highly interpretable: we hypothesize a generating process for the variables, test the hypothesis against observed data, and finally accept or reject it, arriving at a deeper understanding of how the world works.
- A generative model states hypotheses about the relationships between variables; once such a hypothesis passes testing, the model reveals causal relationships: causal relationships generalize better than mere correlations and remain applicable even when the i.i.d. assumption does not hold.
The VAE is exactly such a generative model, and its optimization objective is the so-called ELBO loss. Below we derive the ELBO objective and, along the way, reveal its relationship to maximum likelihood estimation:
$$
\begin{aligned}
\log p_{\theta}(x) &= E_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x)] \\
&= E_{z\sim q_{\phi}(z|x)}\!\left[\log\frac{p_{\theta}(x,z)}{p_{\theta}(z|x)}\right] \\
&= E_{z\sim q_{\phi}(z|x)}\!\left[\log\frac{p_{\theta}(x,z)\,q_{\phi}(z|x)}{q_{\phi}(z|x)\,p_{\theta}(z|x)}\right] \\
&= E_{z\sim q_{\phi}(z|x)}\!\left[\log\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right] + E_{z\sim q_{\phi}(z|x)}\!\left[\log\frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}\right] \\
&= E_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x,z) - \log q_{\phi}(z|x)] + D_{KL}(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x))
\end{aligned}
$$

Defining the ELBO as the first term,

$$
\begin{aligned}
\mathcal{L}_{\theta,\phi}(x) &\triangleq E_{z\sim q_{\phi}(z|x)}[\log p_{\theta}(x,z) - \log q_{\phi}(z|x)] \\
&= \log p_{\theta}(x) - D_{KL}(q_{\phi}(z|x)\,\|\,p_{\theta}(z|x)) \\
&\le \log p_{\theta}(x)
\end{aligned}
$$
Therefore, maximizing the ELBO $\mathcal{L}_{\theta,\phi}(x)$ amounts to maximizing a lower bound on the sample log-likelihood $\log p_{\theta}(x)$; the bound is tight exactly when $q_{\phi}(z|x)$ matches the true posterior $p_{\theta}(z|x)$.
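As a numerical sanity check of the identity and the bound above, here is a small NumPy sketch with a toy discrete latent-variable model (the distributions are my own arbitrary choices, not from the text). It computes the exact log-likelihood, the ELBO under an arbitrary variational $q$, and the KL term, and confirms $\mathcal{L} = \log p_{\theta}(x) - D_{KL} \le \log p_{\theta}(x)$:

```python
import numpy as np

# Toy model: latent z in {0, 1}, observed x in {0, 1} (arbitrary numbers).
p_z = np.array([0.4, 0.6])            # prior p(z)
p_x_given_z = np.array([[0.9, 0.1],   # p(x | z=0)
                        [0.2, 0.8]])  # p(x | z=1)

x = 1                                 # an observed sample
p_xz = p_z * p_x_given_z[:, x]        # joint p(x, z) for each z
log_px = np.log(p_xz.sum())           # exact log-likelihood log p(x)

q = np.array([0.5, 0.5])              # an arbitrary variational q(z|x)
# ELBO = E_q[log p(x,z) - log q(z|x)]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))
# KL(q(z|x) || p(z|x)) against the true posterior p(z|x) = p(x,z)/p(x)
p_post = p_xz / p_xz.sum()
kl = np.sum(q * (np.log(q) - np.log(p_post)))

print(log_px, elbo, kl)  # elbo + kl recovers log_px, and elbo <= log_px
```

Swapping in the true posterior for `q` drives `kl` to zero and makes the ELBO equal the log-likelihood.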
The choice of the model $q_{\phi}(z|x)$ usually has to satisfy two conditions:
1. its density is easy to evaluate;
2. it is easy to sample from.
These two requirements restrict the range of choices for $q_{\phi}(z|x)$. A common model satisfying both is to assume the posterior is a multivariate Gaussian with independent dimensions:
$$
\begin{aligned}
(\mu, \log\sigma) &= \mathrm{NeuralNet}(x) \\
z &\sim \mathcal{N}(\mu, \sigma) \\
\log q_{\phi}(z|x) &= \log\mathcal{N}(z;\mu, \sigma)
\end{aligned}
$$
To let gradients flow back to $\mu$ and $\sigma$, the model above is implemented with the reparameterization trick:
$$
\begin{aligned}
(\mu, \log\sigma) &= \mathrm{NeuralNet}(x) \\
\epsilon &\sim \mathcal{N}(0, I) \\
z &= f(\epsilon, \mu, \sigma) = \mu + \sigma \odot \epsilon \\
\log q_{\phi}(z|x) &= \log\mathcal{N}(z;\mu, \sigma)
\end{aligned}
$$
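The trick above can be sketched in a few lines of NumPy (the values of $\mu$ and $\log\sigma$ stand in for a network's output; they are arbitrary choices for illustration). Because the randomness comes only from the fixed $\mathcal{N}(0, I)$ noise, $z$ is a deterministic, differentiable function of $(\mu, \sigma)$, and the samples reproduce the intended Gaussian statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])          # stand-in for NeuralNet(x) outputs
log_sigma = np.array([0.0, 0.5])
sigma = np.exp(log_sigma)

# Reparameterization: sample fixed noise, then transform deterministically.
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps

print(z.mean(axis=0))  # close to mu
print(z.std(axis=0))   # close to sigma
```

With `eps` held fixed, perturbing `mu` or `sigma` shifts every sample smoothly, which is precisely what lets autodiff frameworks backpropagate through the sampling step.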
The independence assumption across dimensions is a strong one and limits the model's hypothesis space. To enlarge the hypothesis space while still meeting the two requirements above, one line of work increases the expressive power of the function $f(\epsilon)$ and stacks multiple layers of it:
$$
z = f(\epsilon) = f_n(\dots f_2(f_1(\epsilon)))
$$
Researchers have proposed many such functions $f$, but this approach runs into the problem of computing $\log q_{\phi}(z|x)$: when $q_{\phi}(z|x)$ becomes complex, the distribution of $z$ is no longer Gaussian. Here we give the computation of $\log q_{\phi}(z|x)$, and the proof reveals the design principle for $f$:
$$
\log q_{\phi}(z|x) = \log p(\epsilon) - \log\left|\det\!\left(\frac{\partial z}{\partial \epsilon}\right)\right|
$$

where

$$
\epsilon \sim \mathcal{N}(0, I), \qquad z = f(\epsilon, x, \phi)
$$
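For a composed map $f_n \circ \dots \circ f_1$, the determinant factorizes layer by layer, so the log-determinants simply add up. The sketch below checks this numerically in one dimension with two toy invertible layers of my own choosing (not any specific published flow), comparing the accumulated per-layer terms against a finite-difference derivative of the composed map:

```python
import numpy as np

def f1(v):                 # affine layer: derivative is 2
    return 2.0 * v + 1.0

def f2(v):                 # smooth monotone layer: derivative is 2 - tanh(v)^2
    return v + np.tanh(v)

eps = 0.7
h = f1(eps)
z = f2(h)

# Accumulate log|det| layer by layer (1-D, so det = scalar derivative);
# note the second term is evaluated at f2's input h, not at eps.
log_det = np.log(abs(2.0)) + np.log(abs(2.0 - np.tanh(h)**2))

# Finite-difference check of the composed derivative dz/d(eps)
delta = 1e-6
num = (f2(f1(eps + delta)) - f2(f1(eps - delta))) / (2 * delta)
print(log_det, np.log(abs(num)))  # the two agree
```

This is why flow layers are designed so that each per-layer determinant is cheap (e.g. triangular Jacobians): the total cost is then linear in the number of layers.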
Taking a one-dimensional random variable as an example, the proof is as follows:
$$
\begin{aligned}
F_Z(z|x) &\triangleq P(Z < z \mid x) \\
&= P(f(E, x, \phi) < z \mid x) \\
&= P(E < f^{-1}(z) \mid x) \\
&= P(E < f^{-1}(z)) \\
&= F_{E}(f^{-1}(z)) \\
&= \int_{-\infty}^{f^{-1}(z)} p(\epsilon)\, d\epsilon
\end{aligned}
$$
$$
\begin{aligned}
q(z|x) &= \frac{dF_Z(z|x)}{dz} \\
&= \frac{d}{dz}\int_{-\infty}^{f^{-1}(z)} p(\epsilon)\, d\epsilon \\
&= p(f^{-1}(z))\,\frac{df^{-1}(z)}{dz} \\
&= p(\epsilon)\,\frac{d\epsilon}{dz}
\end{aligned}
$$

where the third line applies the Leibniz integral rule together with the chain rule, and the last line substitutes $\epsilon = f^{-1}(z)$.
The proof above assumes that $f(\epsilon, x, \phi)$ is an increasing function; when $f$ is an arbitrary invertible function, the same argument yields:
$$
q(z|x) = p(\epsilon)\left|\frac{d\epsilon}{dz}\right|,
\qquad
\log q(z|x) = \log p(\epsilon) - \log\left|\frac{dz}{d\epsilon}\right|
$$
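The one-dimensional formula is easy to verify numerically. The sketch below uses a *decreasing* affine map (my own toy example) to show that the absolute value is what makes the formula hold for any invertible $f$: for $z = a\epsilon + b$ with $\epsilon \sim \mathcal{N}(0,1)$, we have $z \sim \mathcal{N}(b, |a|)$, and the change-of-variables result matches the analytic density exactly:

```python
import numpy as np

def log_normal_pdf(v, mean=0.0, std=1.0):
    """Log density of a univariate Gaussian N(mean, std)."""
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean)**2 / (2 * std**2)

eps = 0.3
a, b = -2.0, 1.0                 # a < 0: a decreasing but invertible map
z = a * eps + b

# Change of variables: log q(z) = log p(eps) - log|dz/d(eps)|
log_q = log_normal_pdf(eps) - np.log(abs(a))

# Analytic check: z ~ N(b, |a|)
log_q_exact = log_normal_pdf(z, mean=b, std=abs(a))
print(log_q, log_q_exact)        # identical up to floating point
```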
For multivariate random variables, one can similarly show:
$$
\log q(\mathbf{z}|\mathbf{x}) = \log p(\boldsymbol{\epsilon}) - \log\left|\det\!\left(\frac{\partial \mathbf{z}}{\partial \boldsymbol{\epsilon}}\right)\right|
$$
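The multivariate formula can be checked the same way with a linear map $\mathbf{z} = A\boldsymbol{\epsilon}$ (a toy invertible $A$ of my own choosing): since $\boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$ implies $\mathbf{z} \sim \mathcal{N}(0, AA^{\top})$, the change-of-variables result with $\log|\det A|$ must agree with the analytic multivariate Gaussian log density:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])              # invertible map z = A @ eps
eps = rng.standard_normal(2)
z = A @ eps
d = len(eps)

# log p(eps) for eps ~ N(0, I)
log_p_eps = -0.5 * d * np.log(2 * np.pi) - 0.5 * eps @ eps
# Change of variables: subtract log|det(dz/d eps)| = log|det A|
log_q_z = log_p_eps - np.log(abs(np.linalg.det(A)))

# Analytic check: z ~ N(0, A A^T)
cov = A @ A.T
log_q_exact = (-0.5 * d * np.log(2 * np.pi)
               - 0.5 * np.log(np.linalg.det(cov))
               - 0.5 * z @ np.linalg.solve(cov, z))
print(log_q_z, log_q_exact)             # identical up to floating point
```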