Learning in Implicit Generative Models
An implicit generative model directly defines a sampling process, as the generator of a GAN does, and has no likelihood function, so this class of models cannot be learned by maximising a likelihood as a VAE can. Learning instead starts from the hypothesis that the true data distribution equals the distribution defined by the generative model, $p^{\star}(\mathbf x)=q_\theta(\mathbf x)$, and proceeds in two steps: comparison and estimation. In the comparison step, a density difference $r(\mathbf x)=p^{\star}(\mathbf x)-q_\theta(\mathbf x)$ or a density ratio $r(\mathbf x)=p^{\star}(\mathbf x)/q_\theta(\mathbf x)$ is used: the comparator $r(\mathbf x)$ measures how far the data generated by the model are from the real data. In the estimation step, the information supplied by the comparator is used to update the parameters $\theta$ of the implicit generative model.

There are four approaches to learning implicit models, as shown in the figure.
Class Probability Estimation
Let the data space be $\mathcal X \subset \mathbb R^d$. Draw $n$ samples from the true data distribution, $\mathcal X_p=\{\mathbf x_1^{(p)},\dots,\mathbf x_n^{(p)}\}$, and likewise $n'$ samples from the model distribution, $\mathcal X_q=\{\mathbf x_1^{(q)},\dots,\mathbf x_{n'}^{(q)}\}$. Assign the label $y=1$ to samples from the true distribution and $y=0$ to samples from the model distribution. We can then write $p^{\star}(\mathbf x)=p(\mathbf x\mid y=1)$ and $q_\theta(\mathbf x)=p(\mathbf x\mid y=0)$, so that

$$\begin{aligned} \frac{p^{\star}(\mathbf x)}{q_\theta(\mathbf x)} &= \frac{p(\mathbf x\mid y=1)}{p(\mathbf x\mid y=0)} = \frac{p(y=1\mid\mathbf x)\,p(\mathbf x)}{p(y=1)}\Big/\frac{p(y=0\mid\mathbf x)\,p(\mathbf x)}{p(y=0)} \\ &= \frac{p(y=1\mid\mathbf x)}{p(y=0\mid\mathbf x)}\cdot\frac{1-\pi}{\pi}. \end{aligned}$$

Ratio estimation is thus really class-probability estimation. Here $p(y=1)=\pi$ is the class marginal, usually set by hand: a common choice is $\pi=1/2$, or for imbalanced data one can take $\frac{1-\pi}{\pi}\approx n'/n$.
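To make this concrete, here is a minimal sketch of class-probability-based ratio estimation, assuming $\pi=1/2$ and two illustrative 1-D Gaussians (the densities, sample sizes, and learning-rate here are my own choices, not from the text). Since the true log-ratio between $\mathcal N(0,1)$ and $\mathcal N(1,1)$ is $0.5-x$, a logistic classifier with weights $(0.5,-1)$ is Bayes-optimal:

```python
import numpy as np

rng = np.random.default_rng(0)

# p* = N(0,1) is the "real" distribution (y=1); q = N(1,1) plays the model (y=0).
# Bayes-optimal log-odds: log p*(x)/q(x) = 0.5 - x, so with pi = 1/2 the ideal
# logistic classifier has weights w0 = 0.5, w1 = -1.
n = 20000
xp = rng.normal(0.0, 1.0, n)          # samples from p*
xq = rng.normal(1.0, 1.0, n)          # samples from q_theta
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic regression D(x) = sigmoid(w0 + w1*x), trained by full-batch
# gradient descent on the Bernoulli (logistic) loss.
X = np.stack([np.ones_like(x), x], axis=1)
w = np.zeros(2)
for _ in range(2000):
    d = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (d - y) / len(y)

def ratio(t):
    # Recover the density ratio via r(x) = D(x) / (1 - D(x)).
    d = 1.0 / (1.0 + np.exp(-(w[0] + w[1] * t)))
    return d / (1.0 - d)
```

The learned weights should land near the Bayes-optimal $(0.5, -1)$, and the recovered ratio at $x=0.5$ (where $p^{\star}=q$) should be close to 1.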
Our task now becomes specifying a scoring function, i.e. a discriminator, $\mathcal D(\mathbf x;\boldsymbol\phi)=p(y=1\mid\mathbf x)\in[0,1]$. With $\pi=1/2$, the density ratio and the discriminator output are related by

$$\mathcal D=\frac{r}{r+1};\qquad r=\frac{\mathcal D}{1-\mathcal D}.$$

Several scoring functions are in common use.
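A quick numerical check of these two identities (the Gaussian densities here are illustrative choices, not from the text): the Bayes-optimal discriminator for $\pi=1/2$ is $\mathcal D = p^{\star}/(p^{\star}+q)$, and inverting it recovers the ratio exactly.

```python
import numpy as np

def gauss(x, mu, sigma):
    # Standard Gaussian pdf, written out to stay dependency-free.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-3.0, 3.0, 7)
p_star = gauss(x, 0.0, 1.0)   # true density p*
q = gauss(x, 1.0, 1.0)        # model density q_theta

r = p_star / q                # density ratio
D = r / (r + 1.0)             # D = r/(r+1), the Bayes-optimal discriminator
```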
Generally one chooses the Bernoulli (logistic) loss:

$$\begin{aligned} \mathcal L(\boldsymbol\phi,\boldsymbol\theta) &= \mathbb E_{p(\mathbf x\mid y)p(y)}\big[-y\log\mathcal D(\mathbf x;\boldsymbol\phi)-(1-y)\log(1-\mathcal D(\mathbf x;\boldsymbol\phi))\big] \\ &= \pi\,\mathbb E_{p^{\star}(\mathbf x)}[-\log\mathcal D(\mathbf x;\boldsymbol\phi)] + (1-\pi)\,\mathbb E_{q_\theta(\mathbf x)}[-\log(1-\mathcal D(\mathbf x;\boldsymbol\phi))]. \end{aligned}$$
Since $q_\theta(\mathbf x)$ is defined through a generator, this becomes

$$\begin{aligned} \mathcal L(\boldsymbol\phi,\boldsymbol\theta) &= \pi\,\mathbb E_{p^{\star}(\mathbf x)}[-\log\mathcal D(\mathbf x;\boldsymbol\phi)] \\ &\quad+(1-\pi)\,\mathbb E_{q(\mathbf z)}[-\log(1-\mathcal D(\mathcal G(\mathbf z;\boldsymbol\theta);\boldsymbol\phi))], \end{aligned}$$

which is exactly the objective function used by GANs. It is optimised as a bi-level problem:
$$\begin{aligned} \text{Ratio loss: } &\min_{\boldsymbol\phi}\ \pi\,\mathbb E_{p^{\star}(\mathbf x)}[-\log\mathcal D(\mathbf x;\boldsymbol\phi)] + (1-\pi)\,\mathbb E_{q_\theta(\mathbf x)}[-\log(1-\mathcal D(\mathbf x;\boldsymbol\phi))] \\ \text{Generative loss: } &\min_{\boldsymbol\theta}\ \mathbb E_{q(\mathbf z)}[\log(1-\mathcal D(\mathcal G(\mathbf z;\boldsymbol\theta)))] \end{aligned}$$
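As a sketch of how these two losses are evaluated on samples, here is a single evaluation step with a toy linear generator and logistic discriminator, assuming $\pi = 1/2$ (the functional forms and parameter values are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
pi = 0.5

x_real = rng.normal(0.0, 1.0, 1000)   # samples from p*
z = rng.normal(0.0, 1.0, 1000)        # latent noise q(z)

def G(z, theta):
    # Toy linear generator G(z; theta) = theta0 + theta1 * z.
    return theta[0] + theta[1] * z

def D(x, phi):
    # Toy logistic discriminator D(x; phi) = sigmoid(phi0 + phi1 * x).
    return 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * x)))

theta = np.array([1.0, 1.0])
phi = np.array([0.5, -1.0])           # Bayes-optimal for N(0,1) vs N(1,1)

x_fake = G(z, theta)
ratio_loss = (pi * np.mean(-np.log(D(x_real, phi)))
              + (1 - pi) * np.mean(-np.log(1.0 - D(x_fake, phi))))
gen_loss = np.mean(np.log(1.0 - D(x_fake, phi)))
```

In a real bi-level loop, one would alternate gradient steps on `phi` to minimise `ratio_loss` and on `theta` to minimise `gen_loss`.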
Divergence Minimisation
The second approach computes a divergence between $p^{\star}$ and $q$, which brings us to the $f$-divergence:

$$\begin{aligned} D_f[p^{\star}(\mathbf x)\,\|\,q_\theta(\mathbf x)] &= \int q_\theta(\mathbf x)\,f\!\left(\frac{p^{\star}(\mathbf x)}{q_\theta(\mathbf x)}\right)d\mathbf x = \mathbb E_{q_\theta(\mathbf x)}[f(r(\mathbf x))] \\ &\ge \sup_t\ \mathbb E_{p^{\star}(\mathbf x)}[t(\mathbf x)] - \mathbb E_{q_\theta(\mathbf x)}[f^{\dagger}(t(\mathbf x))], \end{aligned}$$

where $f$ is convex and $f^{\dagger}$ is its Fenchel conjugate (which exists provided $f$ is convex and lower semicontinuous),

$$f^{\dagger}(t)=\sup_{u\in\mathrm{dom}_f}\{ut-f(u)\}.$$

The conjugate inherits these properties of $f$ and has a Fenchel conjugate of its own, with $f^{\dagger\dagger}=f$. Hence

$$\begin{aligned} D_f(P\,\|\,Q) &= \int_{\mathcal X} q(x)\sup_{t\in\mathrm{dom}_{f^\dagger}}\left\{t\,\frac{p(x)}{q(x)}-f^{\dagger}(t)\right\}\mathrm dx \\ &\ge \sup_{t\in\mathcal T}\left(\int_{\mathcal X}p(x)\,t(x)\,\mathrm dx-\int_{\mathcal X}q(x)\,f^{\dagger}(t(x))\,\mathrm dx\right) \\ &= \sup_{t\in\mathcal T}\left(\mathbb E_{x\sim P}[t(x)]-\mathbb E_{x\sim Q}\big[f^{\dagger}(t(x))\big]\right), \end{aligned}$$

which enables min–max training. The inequality is tight at $t^{*}(\mathbf x)=f'(r(\mathbf x))$; substituting this back gives

$$\mathcal L=\mathbb E_{p^{\star}(\mathbf x)}\big[-f'(r_\phi(\mathbf x))\big]+\mathbb E_{q_\theta(\mathbf x)}\big[f^{\dagger}(f'(r_\phi(\mathbf x)))\big],$$

where $r_\phi$ estimates $r^{*}=p^{\star}/q_\theta$. The optimisation objectives are then

$$\begin{aligned} \text{Ratio loss: } &\min_{\boldsymbol\phi}\ \mathbb E_{p^{\star}(\mathbf x)}\big[-f'(r_\phi(\mathbf x))\big]+\mathbb E_{q_\theta(\mathbf x)}\big[f^{\dagger}(f'(r_\phi(\mathbf x)))\big] \\ \text{Generative loss: } &\min_{\boldsymbol\theta}\ \mathbb E_{q(\mathbf z)}\big[-f^{\dagger}(f'(r(\mathcal G(\mathbf z;\boldsymbol\theta))))\big] \end{aligned}$$

The density ratio suggests the approximation $p^{\star}(\mathbf x)\approx\tilde p(\mathbf x)=r_\phi(\mathbf x)\,q_\theta(\mathbf x)$, whence

$$D_{KL}[p^{\star}(\mathbf x)\,\|\,\tilde p(\mathbf x)]=\int p^{\star}(\mathbf x)\log\frac{p^{\star}(\mathbf x)}{r_\phi(\mathbf x)\,q_\theta(\mathbf x)}\,d\mathbf x+\int\big(r_\phi(\mathbf x)\,q_\theta(\mathbf x)-p^{\star}(\mathbf x)\big)\,d\mathbf x,$$

the KL divergence for unnormalised distributions. Expanding,

$$\mathcal L=\mathbb E_{p^{\star}(\mathbf x)}[-\log r_\phi(\mathbf x)]+\mathbb E_{q_\theta(\mathbf x)}[r_\phi(\mathbf x)-1]-\mathbb E_{p^{\star}(\mathbf x)}[\log q_\theta(\mathbf x)]+\mathbb E_{p^{\star}(\mathbf x)}[\log p^{\star}(\mathbf x)].$$

The ratio loss, i.e. the terms involving $\phi$, follows immediately, but no generative loss can be extracted this way: the third term requires $\log q_\theta(\mathbf x)$, which is unavailable for an implicit model.
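As a numerical sanity check of the variational bound (a sketch assuming $f(u)=u\log u$, which gives the KL divergence, with conjugate $f^{\dagger}(t)=e^{t-1}$; the Gaussian densities and grid integration are my own illustrative setup):

```python
import numpy as np

# P = N(0,1), Q = N(1,1); the true KL(P||Q) is (mu_p - mu_q)^2 / 2 = 0.5.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
q = np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)
r = p / q

def bound(t):
    # E_P[t(x)] - E_Q[f†(t(x))] with f†(t) = exp(t - 1), by grid integration.
    return np.sum(p * t) * dx - np.sum(q * np.exp(t - 1.0)) * dx

t_opt = 1.0 + np.log(r)   # t* = f'(r) = 1 + log r makes the bound tight
t_bad = 0.5 * t_opt       # any other t gives a strictly smaller value

kl = np.sum(p * np.log(r)) * dx
```

At `t_opt` the bound equals the divergence exactly, while the suboptimal `t_bad` falls visibly below it, which is what the supremum form predicts.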
Ratio Matching
Directly match the true density ratio $r^{*}(\mathbf x)=p^{\star}(\mathbf x)/q_\theta(\mathbf x)$ and the estimated ratio $r_\phi(\mathbf x)$ under a squared loss:

$$\begin{aligned} \mathcal L &= \tfrac12\int q_\theta(\mathbf x)\big(r_\phi(\mathbf x)-r^{*}(\mathbf x)\big)^2\,d\mathbf x \\ &= \tfrac12\,\mathbb E_{q_\theta(\mathbf x)}[r_\phi(\mathbf x)^2]-\mathbb E_{p^{\star}(\mathbf x)}[r_\phi(\mathbf x)]+\tfrac12\,\mathbb E_{p^{\star}(\mathbf x)}[r^{*}(\mathbf x)] \\ &= \tfrac12\,\mathbb E_{q_\theta(\mathbf x)}[r_\phi(\mathbf x)^2]-\mathbb E_{p^{\star}(\mathbf x)}[r_\phi(\mathbf x)] \quad\text{s.t. } r_\phi(\mathbf x)\ge 0, \end{aligned}$$

from which both a ratio loss and a generative loss are easily obtained. Beyond this squared error, one can also use a Bregman divergence, of which the squared error above is a special case:

$$\begin{aligned} B_f\big(r^{*}(\mathbf x)\,\|\,r_\phi(\mathbf x)\big) &= \mathbb E_{q_\theta(\mathbf x)}\big[f(r^{*}(\mathbf x))-f(r_\phi(\mathbf x))-f'(r_\phi(\mathbf x))\,[r^{*}(\mathbf x)-r_\phi(\mathbf x)]\big] \\ &= \mathbb E_{q_\theta(\mathbf x)}\big[r_\phi(\mathbf x)\,f'(r_\phi(\mathbf x))-f(r_\phi(\mathbf x))\big]-\mathbb E_{p^{\star}}\big[f'(r_\phi(\mathbf x))\big]+D_f[p^{\star}(\mathbf x)\,\|\,q_\theta(\mathbf x)] \\ &= \mathcal L_B(r_\phi(\mathbf x))+D_f[p^{\star}(\mathbf x)\,\|\,q_\theta(\mathbf x)]. \end{aligned}$$

The ratio loss is $\mathcal L_B(r_\phi(\mathbf x))$:

$$\begin{aligned} \mathcal L_B(r_\phi(\mathbf x)) &= \mathbb E_{p^{\star}}\big[-f'(r_\phi(\mathbf x))\big]+\mathbb E_{q_\theta(\mathbf x)}\big[r_\phi(\mathbf x)\,f'(r_\phi(\mathbf x))-f(r_\phi(\mathbf x))\big] \\ &= \mathbb E_{p^{\star}}\big[-f'(r_\phi(\mathbf x))\big]+\mathbb E_{q_\theta(\mathbf x)}\big[f^{\dagger}(f'(r_\phi(\mathbf x)))\big], \end{aligned}$$

where the second line uses $f^{\dagger}(f'(x))=\max_r\{r f'(x)-f(r)\}$, recovering the objective obtained in the previous section. Going further and collecting the terms involving $\theta$ gives the generative loss

$$\mathcal L(q_\theta)=\mathbb E_{q_\theta(\mathbf x)}\big[r_\phi(\mathbf x)\,f'(r_\phi(\mathbf x))\big]-\mathbb E_{q_\theta(\mathbf x)}\big[f(r_\phi(\mathbf x))\big]+D_f[p^{\star}(\mathbf x)\,\|\,q_\theta(\mathbf x)],$$

which still contains $q_\theta(\mathbf x)$ and so cannot be evaluated directly. Using $p^{\star}\approx r_\phi\,q_\theta$,

$$D_f[p^{\star}(\mathbf x)\,\|\,q_\theta(\mathbf x)]=\mathbb E_{q_\theta(\mathbf x)}\!\left[f\!\left(\frac{p^{\star}(\mathbf x)}{q_\theta(\mathbf x)}\right)\right]\approx\mathbb E_{q_\theta(\mathbf x)}\!\left[f\!\left(\frac{r_\phi(\mathbf x)\,q_\theta(\mathbf x)}{q_\theta(\mathbf x)}\right)\right]=\mathbb E_{q_\theta(\mathbf x)}\big[f(r_\phi(\mathbf x))\big].$$

This yields

$$\begin{aligned} \text{Ratio loss: } &\min_{\boldsymbol\phi}\ \mathbb E_{q_\theta(\mathbf x)}\big[r_\phi(\mathbf x)\,f'(r_\phi(\mathbf x))-f(r_\phi(\mathbf x))\big]-\mathbb E_{p^{\star}}\big[f'(r_\phi(\mathbf x))\big] \\ \text{Generative loss: } &\min_{\boldsymbol\theta}\ \mathbb E_{q_\theta(\mathbf x)}\big[r_\phi(\mathbf x)\,f'(r_\phi(\mathbf x))\big] \end{aligned}$$
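A quick grid check that the squared-error objective is indeed minimised by the true ratio (the two Gaussian densities are illustrative choices, not from the text): dropping the $\phi$-independent constant, $\mathcal L(r)=\tfrac12\mathbb E_q[r^2]-\mathbb E_p[r]$, and any perturbation away from $r^{*}=p/q$ increases the loss.

```python
import numpy as np

# P = N(0,1), Q = N(1,1), evaluated on a dense grid.
x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
q = np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)

def loss(r):
    # L(r) = 0.5*E_q[r^2] - E_p[r], by grid integration.
    return 0.5 * np.sum(q * r**2) * dx - np.sum(p * r) * dx

r_star = p / q   # true ratio; for these Gaussians E_p[r*] = e,
                 # so the minimum loss value is -e/2.
```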
Moment Matching
The last approach tests whether the moments of $p^{\star}$ and $q$ agree:

$$\begin{aligned} \mathcal L(\boldsymbol\phi,\boldsymbol\theta) &= \big(\mathbb E_{p^{\star}(\mathbf x)}[s(\mathbf x)]-\mathbb E_{q_\theta(\mathbf x)}[s(\mathbf x)]\big)^2 \\ &= \big(\mathbb E_{p^{\star}(\mathbf x)}[s(\mathbf x)]-\mathbb E_{q(\mathbf z)}[s(\mathcal G(\mathbf z;\boldsymbol\theta))]\big)^2, \end{aligned}$$

where $s(\mathbf x)$ is some test statistic. The choice of $s$ is crucial; ideally we would like all moments to match.
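To make this concrete, here is a minimal sketch that fits a toy linear generator by gradient descent on the moment-matching loss, using $s(\mathbf x)=(x,\,x^2)$ (the generator form, target distribution, statistic, and step size are all illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target data from N(2, 0.5^2); generator G(z; theta) = theta0 + theta1*z
# with z ~ N(0,1). Matching mean and second moment should recover
# theta ~ (2, +/-0.5).
x_real = rng.normal(2.0, 0.5, 50000)
z = rng.normal(0.0, 1.0, 50000)
s_real = np.array([x_real.mean(), (x_real**2).mean()])

theta = np.array([0.0, 1.0])
for _ in range(5000):
    g = theta[0] + theta[1] * z
    s_fake = np.array([g.mean(), (g**2).mean()])
    diff = s_fake - s_real
    # Chain rule: d s_fake / d theta for each statistic.
    grad = np.array([
        2 * diff[0] * 1.0 + 2 * diff[1] * (2 * g.mean()),
        2 * diff[0] * z.mean() + 2 * diff[1] * (2 * (g * z).mean()),
    ])
    theta -= 0.01 * grad
```

With only these two statistics the scale is identified up to sign, which is why the check below compares `abs(theta[1])` to 0.5.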