CS236 Deep Generative Models (4)

Overview

Differences between Latent Variable Models and Autoregressive Models

  • An autoregressive model uses a DAG and assumed parameters to approximate the chain rule, i.e. $p(x_1,...,x_n)=p(x_1)p(x_2|x_1)\cdots p(x_n|x_1,...,x_{n-1})$
  • In an autoregressive model the data are fully observed, i.e. the values of the random variables $x_i$, such as pixels
  • To belabor the point: the samples we observe can be understood as values taken by certain random variables; what we model is the relationship among those random variables (structure and parameters), so the whole process is estimating the relationships among random variables from their observed values
  • A latent variable model assumes that some random variables $x_i,...,x_j$ in $p(x_1,...,x_n)$ are unobservable, yet still related to the values the samples take
  • An autoregressive model cannot learn an unsupervised representation vector, but a latent variable model can (more on this below)

Below we first start from discrete latent variables (GMM), extend to continuous latent variables (VAE), and finally introduce learning via Variational Inference.

1. GMM

1.1 GMM Modeling

pic-1
For an image, in the earlier modeling each pixel was treated as an observable random variable $x_i$, but the relationships among some of the pixels in $p(x_1,...,x_n)$ are governed by latent variables $z_i$, such as gender, ethnicity, and so on. We therefore model the joint distribution $P(X,Z)$ that includes the latent variables.

For a Gaussian mixture model (GMM) with $K$ Gaussian components:

  • $z\sim \mathrm{Categorical}(p_1,...,p_K)$: the latent variable $Z$ follows a categorical distribution with parameters $p_1,p_2,...,p_K$ (often assumed uniform)
  • $p(x|z=k)=N(u_k,\Sigma_k)$: for the $k$-th value of the latent variable $Z$, the random variable $X$ follows a Gaussian distribution with parameters $u_k,\Sigma_k$
  • A special note (see the sketch after this list):
    Once the GMM parameters $p_1,p_2,...,p_K$ and $u_1,...,u_K,\sigma_1,...,\sigma_K$ have been learned from $X_{train}$, feeding in a test sample $x^{(i)}_{test}$ yields the probability vector $(P_1,...,P_K)$ of it belonging to each of the $K$ Gaussians. This is why a latent variable model can learn an unsupervised representation, which is what distinguishes it from an autoregressive model:
    $$p(X_{train})=\sum_{k=1}^K p(X_{train}|Z=k)p(Z=k)$$
    $$p(Z=k|x^{(i)}_{test})=\frac{p(Z=k,x^{(i)}_{test})}{p(x^{(i)}_{test})}=\frac{p(Z=k,x^{(i)}_{test})}{\sum_{k=1}^K p(x^{(i)}_{test}|Z=k)p(Z=k)}$$
    The GMM parameters can be learned with the EM algorithm, which is not covered here; see 刘建平 (Liu Jianping)'s EM algorithm write-up.
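To make the posterior computation above concrete, here is a minimal sketch with hypothetical, hand-picked parameters for a 1-D GMM with $K=3$ components (assumed already learned from $X_{train}$), computing the responsibility vector $(P_1,...,P_K)$ for a test point:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical, hand-picked 1-D GMM parameters (K = 3), assumed already learned from X_train
p_z   = np.array([0.3, 0.5, 0.2])      # p(Z=k), categorical prior
mu    = np.array([-2.0, 0.0, 3.0])     # means u_k
sigma = np.array([0.5, 1.0, 0.8])      # standard deviations sigma_k

x_test = 0.4

# p(x_test | Z=k) for each component
lik = norm.pdf(x_test, loc=mu, scale=sigma)

# p(Z=k | x_test) = p(x_test|Z=k) p(Z=k) / sum_k p(x_test|Z=k) p(Z=k)
responsibilities = lik * p_z / np.sum(lik * p_z)

print(responsibilities)        # the (P_1, ..., P_K) representation of x_test
print(responsibilities.sum())  # sums to 1
```

This vector is exactly the unsupervised representation mentioned above: each test point gets mapped to a point on the probability simplex over the $K$ components.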

1.2 The Three Basic Problems of Generative Modeling

  • Representation: assume $p(Z)$ is a categorical distribution $Cat$ and $p(X|Z)$ is a Gaussian $N$, which models the joint distribution $p(X,Z)=p(X|Z)p(Z)$
  • Inference: from a data sample $x^{(i)}=(x_1^{(i)},x_2^{(i)},...,x_n^{(i)})$, infer the posterior over the latent variable, i.e. $p(Z=k|x^{(i)})$
  • Learning: use Maximum Likelihood Learning (MLE), i.e. $\arg\max_\theta\sum_i\log p_\theta(x^{(i)})$ (see Deep Generative Models (3)); here, however, learning is done with the EM algorithm, which differs slightly from plain MLL.

2. VAE

2.1 VAE Modeling

pic-2
For the VAE, take the GMM and let $K\rightarrow\infty$:

  • $z\sim N(0,I)$
  • $p(x|z)=N(u_\theta(z),\Sigma_\theta(z))$ (the conditional is still assumed Gaussian, but its parameters are modeled as functions $u_\theta(z),\Sigma_\theta(z)$ with parameters $\theta$)
  • The problem appears in $p(x)=\int p(z)p(x|z)dz$: integrating over infinitely many Gaussians is intractable, which means the earlier MLL approach is doomed!
  1. Analyzing the problem
    With $D=\{x^{(1)},x^{(2)},\cdots,x^{(N)}\}$, MLL maximizes the likelihood $L(\theta,D)$:
    $$\begin{aligned} L(\theta,D)=\log\prod_{i\in D}p(x^{(i)};\theta)&=\sum_{i\in D}\log p(x^{(i)};\theta)=\sum_{i\in D}\log\int p(x^{(i)},z;\theta)dz\\ &\theta^{t+1}\leftarrow\theta^t+\nabla_\theta L(\theta) \end{aligned}$$

    This $\int p(x^{(i)},z;\theta)dz$ is intractable, so how do we compute it?

  2. The only way out is to approximate it with Importance Sampling
    $$\begin{aligned} \log p(x^{(i)};\theta)=\log\int p(x^{(i)},z;\theta)dz&=\log\int \frac{q_i(z)}{q_i(z)}p(x^{(i)},z;\theta)dz\\ &=\log E_{z\sim q_i(z)}\Big[\frac{p(x^{(i)},z;\theta)}{q_i(z)}\Big]\quad (1)\\ &\geq E_{z\sim q_i(z)}\Big[\log\frac{p(x^{(i)},z;\theta)}{q_i(z)}\Big] \quad (2)\\ &=E_{z\sim q_i(z)}\Big[\log p(x^{(i)},z;\theta)\Big]+H(q_i(z))\\ &=ELBO\ (\text{Evidence Lower Bound}) \end{aligned}$$

    Switching notation: $p(x^{(i)},z;\theta)=p_\theta(x^{(i)},z)$

    So for each sample we have:
    $$\log p_\theta(x^{(i)})\geq E_{z\sim q_i(z)}\Big[\log p_\theta(x^{(i)},z)\Big]+H(q_i)$$

    Denote $L_i(p_\theta,q_i)=E_{z\sim q_i(z)}\Big[\log p_\theta(x^{(i)},z)\Big]+H(q_i)$

  3. A beautiful conclusion
    First, the result (a numerical sketch after this list illustrates it):
    $$\log p_\theta(x^{(i)})=L_i(p_\theta,q_i)+KL\big(q_i(z)||p_\theta(z|x^{(i)})\big)$$

    Then the proof:
    $$\begin{aligned} KL(q_i(z)||p_\theta(z|x^{(i)}))&=\int q_i(z)\log\frac{q_i(z)}{p(z|x^{(i)})}dz\\ &=\int q_i(z)\log q_i(z)dz-\int q_i(z)\log p(z|x^{(i)})dz\\ &=-H(q_i)-\int q_i(z)\log\frac{p(x^{(i)},z)}{p(x^{(i)})}dz\\ &=-H(q_i)-E_{z\sim q_i(z)}\log p(x^{(i)},z)+\int q_i(z)\log p(x^{(i)})dz\\ &=-L_i(p_\theta,q_i)+\log p_\theta(x^{(i)}) \end{aligned}$$
    $$\arg\max_\theta\sum_i\log p_\theta(x^{(i)})\equiv \arg\max_\theta L_i(p_\theta,q_i)\ \text{while}\ \arg\min_{q_i} KL(q_i||p_\theta(z|x^{(i)}))$$

  4. Where the conclusion comes from
    Being a bit more curious: where does the $p_\theta(z|x^{(i)})$ inside $KL(q_i(z)||p_\theta(z|x^{(i)}))$ come from?

    The step from $(1)$ to $(2)$ above used Jensen's inequality $\log E[f]\geq E[\log f]$, whose equality holds when $\frac{p_\theta(x^{(i)},z)}{q_i(z)}=c$ with $c$ a constant; combined with $\int q_i(z)dz=1$, we get

    $$\int q_i(z)dz=\int\frac{p_\theta(x^{(i)},z)}{c}dz=1\\ \int p_\theta(x^{(i)},z)dz=p_\theta(x^{(i)})=c\\ q_i(z)=\frac{p_\theta(x^{(i)},z)}{c}=\frac{p_\theta(x^{(i)},z)}{p_\theta(x^{(i)})}=p_\theta(z|x^{(i)})$$
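To make the bound and its tightness tangible, here is a minimal numerical sketch (a toy example of my own, not from the lecture) with a discrete latent $z\in\{0,1\}$, where $\log p(x)$ can be computed exactly and compared with the ELBO under an arbitrary $q(z)$ and under the optimal choice $q(z)=p(z|x)$:

```python
import numpy as np
from scipy.stats import norm

# Toy model with a discrete latent z in {0, 1}: p(z) uniform, p(x|z) Gaussian with unit variance.
p_z = np.array([0.5, 0.5])
mu  = np.array([-1.0, 2.0])

x = 0.7
p_x_given_z = norm.pdf(x, loc=mu, scale=1.0)
joint = p_x_given_z * p_z                  # p(x, z) for z = 0, 1
log_px = np.log(joint.sum())               # exact log p(x)

def elbo(q):
    # ELBO = E_q[log p(x,z)] + H(q)
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

q_arbitrary = np.array([0.9, 0.1])         # some arbitrary q(z)
q_posterior = joint / joint.sum()          # q(z) = p(z|x), the optimal choice

print(log_px)                  # exact log-likelihood
print(elbo(q_arbitrary))       # strictly smaller: ELBO <= log p(x)
print(elbo(q_posterior))       # equals log p(x): the bound is tight when q = p(z|x)
```

The gap between the first two numbers is exactly $KL(q||p(z|x))$, matching the decomposition proved above.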

2.2 Summary of VAE Modeling

Let's take stock at this point (getting the logic straight is the key!):

  • The log-likelihood of a single sample: $$\begin{aligned} \log p_\theta(x^{(i)})&=L_i(p_\theta,q_i)+KL\big(q_i(z)||p_\theta(z|x^{(i)})\big)\\&=E_{z\sim q_i(z)}\Big[\log p_\theta(x^{(i)},z)\Big]+H(q_i)+KL\big(q_i(z)||p_\theta(z|x^{(i)})\big) \end{aligned}$$

  • The VAE assumptions: $p_\theta(x|z)=N(u_\theta(z),\sigma_\theta(z))$, $z\sim N(0,I)$, $p_\theta(x,z)=p_\theta(x|z)p(z)$, $p_\theta(z|x)=\frac{p_\theta(x,z)}{p_\theta(x)}$

  • The problem is that $p_\theta(x^{(i)})$ is intractable, hence $p_\theta(z|x^{(i)})$ is also intractable. To maximize the log-likelihood $\log p_\theta(x^{(i)})$ we would need $p_\theta(z|x^{(i)})$, which is itself intractable, so the only option is to approximate $p_\theta(z|x^{(i)})$ with a simple, easy-to-sample distribution $q_i(z)$

  • Parameterize $q_i(z;\phi^{(i)})$ and adjust $\phi^{(i)}$ to minimize $KL\big(q_i(z;\phi^{(i)})||p_\theta(z|x^{(i)})\big)$, which is equivalent to maximizing $L_i(p_\theta,q_{\phi^{(i)}})$.

3. Variational Inference and Learning

3.1 The Meaning and Idea of Variational Inference

Based on the VAE analysis above, with $D=\{x^{(1)},x^{(2)},\cdots,x^{(M)}\}$, maximize the likelihood $L(\theta,D)$:
$$\begin{aligned} \max_\theta L(\theta,D)&=\max_\theta \log\prod_{i\in D}p(x^{(i)};\theta)\\ &=\max_\theta\sum_{i\in D}\log p_\theta(x^{(i)})\\ &=\max_\theta \sum_{i\in D}L_i(p_\theta,q_i)+KL\big(q_i(z)||p_\theta(z|x^{(i)})\big)\\ &=\max_\theta \sum_{i\in D}E_{z\sim q_i(z)}\Big[\log p_\theta(x^{(i)},z)\Big]+H(q_i)+KL\big(q_i(z)||p_\theta(z|x^{(i)})\big)\\ &\geq \max_{\theta,\phi^{(1)},...,\phi^{(M)}}\sum_{i\in D}E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p_\theta(x^{(i)},z)-\log q_{\phi^{(i)}}(z)\Big]\\ &=\max_{\theta,\phi^{(1)},...,\phi^{(M)}}\sum_{i\in D}L(x^{(i)};\theta,\phi^{(i)}) \end{aligned}$$

  • The key idea of Learning
  1. After parameterizing each distribution $q_i(z)$ with $\phi^{(i)}$, the above can be viewed as: fix $\theta$ and minimize $KL\big(q_i(z;\phi^{(i)})||p_\theta(z|x^{(i)})\big)$ over the parameters $\phi^{(i)}$
  2. Then fix $\phi^{(i)},i=1,...,M$ and maximize $L(x^{(i)};\theta,\phi^{(i)})$ over the parameters $\theta$
  3. Iterate the updates of $\theta,\phi$; once the parameters converge, the learning process is done, as in the figure below.
    pic-4
  • The meaning of variational inference
    The "Variational" in Variational Inference means that for a fixed $\theta$, each sample's $\phi^{(i)}$ is varied according to the objective function; "Inference", as mentioned at the beginning, roughly means estimating from samples, here the posterior $p_\theta(z|x^{(i)})$. Since the inference is obtained approximately by this variational procedure, it is called Variational Inference.

3.2 Variational Inference Algorithms

3.2.1 Stochastic Variational Inference

Dataset $D=\{x^{(1)},x^{(2)},\cdots,x^{(M)}\}$
Optimization objective:
$$\max_{\theta,\phi^{(1)},...,\phi^{(M)}}\sum_{i\in D}L(x^{(i)};\theta,\phi^{(i)})=\max_{\theta,\phi^{(1)},...,\phi^{(M)}}\sum_{i\in D}E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p_\theta(x^{(i)},z)-\log q_{\phi^{(i)}}(z)\Big]$$

The SVI algorithm:

  1. Initialize $\theta,\phi^{(1)},...,\phi^{(M)}$
  2. Randomly sample a data point $x^{(i)}\in D$
  3. Fix $\theta$ and optimize $\phi^{(i)}$ in $L(x^{(i)};\theta,\phi^{(i)})$:
    1. Repeat $\phi^{(i)}\leftarrow\phi^{(i)}+\alpha_1\nabla_{\phi^{(i)}}L(x^{(i)};\theta,\phi^{(i)})$
    2. Until $\phi^{(i)}\approx \phi^{(i)}_*=\arg\max_{\phi}L(x^{(i)};\theta,\phi)$
  4. Fix $\phi^{(i)}_*$ and compute $\nabla_{\theta}L(x^{(i)};\theta,\phi^{(i)}_*)$
  5. $\theta\leftarrow\theta+\alpha_2\nabla_{\theta}L(x^{(i)};\theta,\phi^{(i)}_*)$

The algorithm flow is now fairly clear, but the gradient computations are not yet, namely $\nabla_{\phi^{(i)}}L(x^{(i)};\theta,\phi^{(i)})$ and $\nabla_{\theta}L(x^{(i)};\theta,\phi^{(i)}_*)$; let's work them out below.

$$\begin{aligned} L(x^{(i)};\theta,\phi^{(i)})&=E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p(x^{(i)},z;\theta)-\log q(z;\phi^{(i)})\Big]\\ &\approx \frac{1}{K}\sum_{k=1}^K\Big[\log p(x^{(i)},z^k;\theta)-\log q(z^k;\phi^{(i)})\Big] \end{aligned}$$

Because $q(z;\phi^{(i)})$ is assumed to be easy to sample from and tractable, the expectation can be estimated with MC (Monte Carlo) estimation.
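Here is a minimal sketch of that Monte Carlo estimate, under a hypothetical 1-D toy model of my own choosing: $p(z)=N(0,1)$, a deliberately simple decoder mean $u_\theta(z)=\theta z$ so that $p(x|z;\theta)=N(\theta z,1)$, and $q(z;\phi^{(i)})=N(u,\sigma^2)$ with hand-picked $u,\sigma$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D model (my own assumption, just to make the estimate concrete):
# p(z) = N(0, 1),  p(x|z; theta) = N(theta * z, 1)
theta = 1.5
x_i   = 0.8

# q(z; phi^{(i)}) = N(u, sigma^2), the per-sample variational distribution
u, sigma = 0.4, 0.7

K = 10000
z = rng.normal(u, sigma, size=K)                       # z^k ~ q(z; phi^{(i)})

log_p_xz = norm.logpdf(x_i, loc=theta * z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)
log_q_z  = norm.logpdf(z, loc=u, scale=sigma)

elbo_estimate = np.mean(log_p_xz - log_q_z)            # (1/K) sum_k [log p(x,z^k) - log q(z^k)]
print(elbo_estimate)
```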

3.2.2 Computing the gradient with respect to $\theta$

$$\begin{aligned} \nabla_\theta L(x^{(i)};\theta,\phi^{(i)})&=\nabla_\theta E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p(x^{(i)},z;\theta)-\log q(z;\phi^{(i)})\Big]\\ &=\nabla_\theta E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p(x^{(i)},z;\theta)\Big]\\ &\approx \frac{1}{K}\sum_{k=1}^K\nabla_\theta\log p(x^{(i)},z^k;\theta) \end{aligned}$$

$$\log p(x^{(i)},z^k;\theta)=\log p(x^{(i)}|z^k;\theta)p(z^k)\\ p(z)= N(0,I)\\ p(x|z^k;\theta)=N\big(u_\theta(z^k),\Sigma_\theta(z^k)\big)$$

So the densities $p(z^k)$ and $p(x^{(i)}|z^k;\theta)$ can both be evaluated, and this gradient with respect to $\theta$ is easy to compute!
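A minimal sketch of this $\theta$-gradient using PyTorch autograd (the two-layer decoder network and the fixed unit variance are my own assumptions, not from the slides): the samples $z^k\sim q_{\phi^{(i)}}$ are treated as constants, and only $\log p(x^{(i)},z^k;\theta)$ is differentiated.

```python
import torch

torch.manual_seed(0)

# Hypothetical decoder mean network u_theta(z); variance fixed to 1 for simplicity
mu_theta = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

x_i = torch.tensor([[0.8]])

# z^k ~ q(z; phi^{(i)}) = N(u, sigma^2); for the theta-gradient these samples are just constants
u, sigma, K = 0.4, 0.7, 64
z = u + sigma * torch.randn(K, 1)

prior = torch.distributions.Normal(0.0, 1.0)
lik   = torch.distributions.Normal(mu_theta(z), 1.0)

# (1/K) sum_k log p(x^{(i)}, z^k; theta) = (1/K) sum_k [log p(x|z^k;theta) + log p(z^k)]
log_p_xz = lik.log_prob(x_i) + prior.log_prob(z)
loss = -log_p_xz.mean()                  # negate because optimizers minimize

loss.backward()                          # gradients w.r.t. theta via autograd
print(mu_theta[0].weight.grad.shape)     # the decoder parameters now carry gradients
```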

$$\nabla_{\phi^{(i)}}L(x^{(i)};\theta,\phi^{(i)})=\nabla_{\phi^{(i)}}E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p(x^{(i)},z;\theta)-\log q(z;\phi^{(i)})\Big]$$

This gradient with respect to $\phi^{(i)}$ is trickier, because $\phi^{(i)}$ appears in the sampling distribution of the expectation $E_{z\sim q_{\phi^{(i)}}(z)}$ =.=

Abstracting the notation a bit:

$$\nabla_{\phi^{(i)}} E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p(x^{(i)},z;\theta)-\log q(z;\phi^{(i)})\Big]=\nabla_{\phi^{(i)}}E_{q(z;\phi^{(i)})}\Big[r(z;\phi^{(i)})\Big]\\ r(z;\phi^{(i)})=\log p(x^{(i)},z;\theta)-\log q(z;\phi^{(i)})$$

3.2.3 Computing the gradient with respect to $\phi$

  • Reparameterization trick
    $$E_{q(z;\phi)}[r(z)]=\int q(z;\phi)r(z)dz,\qquad z\sim q(z;\phi)=N(u,\sigma^2I)$$

So $\phi=\{u,\sigma^2\}$, and $z$ must be a continuous variable.

N ( u , σ 2 I ) N(u,\sigma^2I) N(u,σ2I)中采样 z z z相当于增加一个自由参数 ϵ ∼ N ( 0 , I ) , z = u + ϵ σ = g ( ϵ ; ϕ = { u , σ } ) \epsilon\sim N(0,I),z=u+\epsilon\sigma=g(\epsilon;\phi=\{u,\sigma\}) ϵN(0,I),z=u+ϵσ=g(ϵ;ϕ={u,σ})

E q ( z ; ϕ ) [ r ( z ) ] E_{q(z;\phi)}[r(z)] Eq(z;ϕ)[r(z)]代入 z = u + ϵ σ , ϕ = { u , σ } z=u+\epsilon\sigma,\phi=\{u,\sigma\} z=u+ϵσ,ϕ={u,σ}得:

$$\begin{aligned} &E_{q(z;\phi)}[r(z)]=E_{\epsilon\sim N(0,I)}[r(g(\epsilon;\phi))]=\int p(\epsilon)r(u+\sigma\epsilon)d\epsilon\\ &\nabla_\phi E_{q(z;\phi)}[r(z)]=\nabla_\phi E_{\epsilon}[r(g(\epsilon;\phi))]=E_{\epsilon}[\nabla_\phi r(g(\epsilon;\phi))]\approx \frac{1}{K}\sum_{k=1}^K \nabla_\phi r(g(\epsilon^k;\phi)) \end{aligned}$$

So finally:

$$\begin{aligned} \nabla_{\phi^{(i)}}E_{q(z;\phi^{(i)})}\Big[r(z;\phi^{(i)})\Big]&=\nabla_{\phi^{(i)}} E_{z\sim q_{\phi^{(i)}}(z)}\Big[\log p(x^{(i)},z;\theta)-\log q(z;\phi^{(i)})\Big]\\ &=\nabla_{\phi^{(i)}}E_{\epsilon\sim N(0,I)}\Big[r(g(\epsilon;\phi^{(i)});\phi^{(i)})\Big]\\ &\approx\nabla_{\phi^{(i)}}\frac{1}{K}\sum_{k=1}^K\Big[r(g(\epsilon^k;\phi^{(i)});\phi^{(i)})\Big] \end{aligned}$$
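Here is a minimal sketch of the reparameterized gradient, continuing the same hypothetical 1-D toy model used above ($p(z)=N(0,1)$, $p(x|z;\theta)=N(\theta z,1)$); `u` and `log_sigma` play the role of $\phi^{(i)}$ and receive gradients through $z=u+\sigma\epsilon$:

```python
import torch

torch.manual_seed(0)

theta = 1.5                       # decoder parameter of the toy model, held fixed here
x_i   = torch.tensor(0.8)

# phi^{(i)} = {u, sigma}; parameterize sigma via its log so it stays positive
u         = torch.tensor(0.4, requires_grad=True)
log_sigma = torch.tensor(-0.3, requires_grad=True)

K = 1000
eps = torch.randn(K)                               # epsilon^k ~ N(0, I)
z = u + torch.exp(log_sigma) * eps                 # z = g(eps; phi), differentiable in phi

prior = torch.distributions.Normal(0.0, 1.0)
lik   = torch.distributions.Normal(theta * z, 1.0)
q     = torch.distributions.Normal(u, torch.exp(log_sigma))

# r(z; phi) = log p(x, z; theta) - log q(z; phi), averaged over the K samples
elbo = (lik.log_prob(x_i) + prior.log_prob(z) - q.log_prob(z)).mean()
elbo.backward()                                    # gradients flow through z into u, log_sigma

print(u.grad, log_sigma.grad)
```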

  • The REINFORCE approach

$$J(\theta)=E_{\tau\sim\pi_\theta(\tau)}\Big[r(\tau)\Big],\qquad \nabla_\theta J(\theta)=E_{\pi_\theta(\tau)}\big[\nabla_\theta \log\pi_\theta(\tau)r(\tau)\big]$$

Therefore:

$$\begin{aligned} \nabla_{\phi^{(i)}} E_{q(z;\phi^{(i)})}[r(z;\phi^{(i)})]&=E_{q(z;\phi^{(i)})}\Big[\nabla_{\phi^{(i)}} \log q_{\phi^{(i)}}(z)\,r(z;\phi^{(i)})\Big]\\ &\approx \frac{1}{K}\sum_{k=1}^K\Big[\nabla_{\phi^{(i)}} \log q_{\phi^{(i)}}(z^k)\,r(z^k;\phi^{(i)})\Big] \end{aligned}$$

This works for both discrete and continuous $z$; the problem is high variance!
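For contrast, a minimal sketch of the score-function (REINFORCE) estimator on the same hypothetical toy model: here the samples $z^k$ are drawn without gradients, and the gradient comes only through $\log q_\phi(z^k)$ weighted by $r(z^k;\phi)$, which is treated as a fixed weight.

```python
import torch

torch.manual_seed(0)

theta = 1.5
x_i   = torch.tensor(0.8)

u         = torch.tensor(0.4, requires_grad=True)
log_sigma = torch.tensor(-0.3, requires_grad=True)

K = 1000
q = torch.distributions.Normal(u, torch.exp(log_sigma))
z = q.sample((K,))                                  # z^k ~ q(z; phi), no gradient through sampling

prior = torch.distributions.Normal(0.0, 1.0)
lik   = torch.distributions.Normal(theta * z, 1.0)

# r(z^k; phi): treated as a fixed weight (detached), per the REINFORCE formula
r = (lik.log_prob(x_i) + prior.log_prob(z) - q.log_prob(z)).detach()

# surrogate whose gradient is (1/K) sum_k grad log q_phi(z^k) * r(z^k; phi)
surrogate = (q.log_prob(z) * r).mean()
surrogate.backward()

print(u.grad, log_sigma.grad)                       # typically much noisier than the reparam estimate
```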

3.3 Amortized Inference

With the SVI above, for each data sample $x^{(i)}$ we approximate its $p(z|x^{(i)})$ with a separate set of Gaussian parameters $\phi^{(i)}=\{u^{(i)},\sigma^{(i)}\}$, which is cumbersome. Why not use a single mapping from samples to the parameter space instead? That is, feed in an $x^{(i)}$ and output the parameters $\{u^{(i)},\sigma^{(i)}\}$ of the Gaussian it follows, written $q_\phi(z|x)$.

The objective function is:
$$L(x;\theta,\phi)=E_{q_\phi(z|x)}\Big[\log p(x,z;\theta)-\log q_\phi(z|x)\Big]$$

So the algorithm is as follows (a code sketch follows the list):

  1. Initialize $\theta^{(0)},\phi^{(0)}$
  2. Sample $x^{(i)}$ from the dataset $D=\{x^{(1)},x^{(2)},...,x^{(M)}\}$
  3. Compute the gradients $\nabla_\theta L(x^{(i)};\theta,\phi)$ and $\nabla_\phi L(x^{(i)};\theta,\phi)$
  4. Update the parameters $\theta,\phi$
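A minimal sketch of what amortization means in code (the layer sizes and 1-D data are my own assumptions): a single encoder network maps any $x$ to the Gaussian parameters $(u,\log\sigma)$ of $q_\phi(z|x)$, instead of keeping one $\phi^{(i)}$ per data point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Encoder(nn.Module):
    """Amortized q_phi(z|x): one network outputs (u, log_sigma) for any input x."""
    def __init__(self, x_dim=1, z_dim=1, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_sigma = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_sigma(h)

encoder = Encoder()

x_batch = torch.randn(5, 1)                 # five data points
u, log_sigma = encoder(x_batch)             # per-point Gaussian parameters, shared weights phi
z = u + torch.exp(log_sigma) * torch.randn_like(u)   # reparameterized samples z ~ q_phi(z|x)

print(u.shape, log_sigma.shape, z.shape)    # each is (5, 1): one (u, sigma) per x, one network
```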

4. VAE Summary

4.1 VAE Perspective

pic-3
$$\begin{aligned} L(x;\theta,\phi)&=E_{q_\phi(z|x)}\Big[\log p(x,z;\theta)-\log q_\phi(z|x)\Big]\\ &=E_{q_\phi(z|x)}\Big[\log p(x,z;\theta)-\log p(z)+\log p(z)-\log q_\phi(z|x)\Big]\\ &=E_{q_\phi(z|x)}\Big[\log p(x|z;\theta)\Big]-D_{KL}\big(q_\phi(z|x)||p(z)\big) \end{aligned}$$

To explain:

  • Feed in a real sample $x^{(i)}$
  • The encoder maps $x^{(i)}$ to the latent variable: it outputs the parameters of the Gaussian $q_\phi(z|x^{(i)})$, from which $\hat z$ is sampled
  • The decoder reconstructs a sample $\hat x$ from the latent $\hat z$ by sampling from $p(x|\hat z;\theta)$

Interpreting what each term of the objective does (a training-step sketch follows this list):

  • $E_{q_\phi(z|x)}\Big[\log p(x|z;\theta)\Big]$ pushes $\hat x\approx x^{(i)}$ (reconstruction)
  • $D_{KL}\big(q_\phi(z|x)||p(z)\big)$ pushes the latent representation toward the prior $p(z)$, which is usually $z\sim N(0,I)$
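Putting the pieces together, here is a minimal single-training-step sketch of a VAE on toy 1-D data (all architecture choices are my own assumptions, not the course's reference implementation); the loss is exactly $-\big(E_{q_\phi(z|x)}[\log p(x|z;\theta)]-D_{KL}(q_\phi(z|x)||p(z))\big)$ with a one-sample reparameterized estimate of the expectation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_dim, z_dim, hidden = 1, 1, 32

# Encoder q_phi(z|x) and decoder p_theta(x|z), both Gaussian, with unit decoder variance
enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * z_dim))
dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(), nn.Linear(hidden, x_dim))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(64, x_dim)                              # a toy minibatch

u, log_sigma = enc(x).chunk(2, dim=-1)                  # parameters of q_phi(z|x)
z = u + torch.exp(log_sigma) * torch.randn_like(u)      # reparameterized sample

# Reconstruction term: E_q[log p(x|z;theta)] with p(x|z) = N(dec(z), I)
recon = torch.distributions.Normal(dec(z), 1.0).log_prob(x).sum(-1)

# KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians
kl = 0.5 * (torch.exp(2 * log_sigma) + u**2 - 1 - 2 * log_sigma).sum(-1)

loss = -(recon - kl).mean()                             # negative ELBO, averaged over the batch
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```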

4.2 Summary

As usual, I'm being lazy with the summary
pic7
pic-8

Postscript

If you are a beginner, you may well be confused the first time you read this; keep the following questions in mind as you go:

  • Why do we need a distribution $q(z)$ to approximate $p(z|x)$?
  • Notation-wise, $p(x;\theta)$ and $p_\theta(x)$ mean the same thing; switching between them sometimes makes things clearer and more concise, but it can confuse readers who are not used to it
  • When modeling, what does it mean to parameterize a distribution?

I think the whole problem has been laid out fairly completely here. I have always felt that using formulas precisely can be tedious at times, but it helps in understanding the problem. Unfortunately, the notation in many references is simplified to some degree, which makes it hard to work out from the formulas alone exactly how everything is done and computed.
I will add some figures and some experiments later.

(I still feel my understanding is not fully thorough, as if something is missing; it is probably the dynamic gradient-update process of the network, which I cannot yet picture in my head orz= =)

References:
CS236 slides, lectures 5-6
