Derivation of the DDPM Loss
The figure uses the MHVAE notation; substitute $x \rightarrow x_0$ and $z_i \rightarrow x_i$.
The noising (forward) process $q(x_{t}|x_{t-1})$ is specified by hand (see [[001 DDPM-v2]] for the exact form), so it carries no $\phi$ parameters;
The denoising (reverse) process $p_{\theta}(x_{t-1}|x_t)$ has to be learned, so it is parameterized by a neural network with parameters $\theta$.
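As a concrete reference point, here is a minimal numpy sketch of one step of each process, assuming the standard DDPM forward kernel $q(x_t|x_{t-1})=N(\sqrt{1-\beta_t}\,x_{t-1},\beta_t I)$ and a linear $\beta_t$ schedule (both assumptions here; see [[001 DDPM-v2]]); `denoise_model` is a hypothetical stand-in for the learned network.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T + 1)        # betas[1..T]; index 0 unused (assumed linear schedule)

def forward_step(x_prev, t):
    """One fixed, parameter-free noising step: sample from q(x_t | x_{t-1})."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps

def reverse_step(x_t, t, denoise_model):
    """One learned denoising step: sample from p_theta(x_{t-1} | x_t).
    `denoise_model(x_t, t)` is a hypothetical network returning (mean, std)."""
    mean, std = denoise_model(x_t, t)
    return mean + std * rng.standard_normal(x_t.shape)
```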
The DDPM ELBO
Following the MLE derivation above, the ELBO is obtained as
$$
\begin{aligned}
\log p_{\theta}(x) &= \log\int p_{\theta}(x_{0:T})dx_{1:T}\\
&= \log\int\frac{p_{\theta}(x_{0:T})q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)}dx_{1:T} \\
&= \log\mathbb{E}_{q(x_{1:T}|x_0)}[\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}] \\
&\geq \mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}] = \mathrm{ELBO}
\end{aligned}
$$
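The only inequality in this chain is Jensen's, $\log\mathbb{E}[w]\geq\mathbb{E}[\log w]$, applied to the positive ratio $w=p_{\theta}(x_{0:T})/q(x_{1:T}|x_0)$. A quick numeric sanity check with arbitrary positive weights (a toy sketch, not tied to any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.1, 5.0, size=100_000)   # stand-in for the positive ratio p/q

log_of_mean = np.log(w.mean())            # log E[w]
mean_of_log = np.log(w).mean()            # E[log w]  (the ELBO side)
assert log_of_mean >= mean_of_log         # Jensen's inequality
print(log_of_mean, mean_of_log)
```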
Here the two distributions in the ratio factorize as
$$
\begin{aligned}
p_{\theta}(x_{0:T}) &= p(x_T)p(x_{0:T-1}|x_T)\\
&= p(x_T)p(x_{T-1}|x_{T})p(x_{0:T-2}|x_{T-1},x_T)\\
&= p(x_T)p(x_{T-1}|x_{T})p(x_{0:T-2}|x_{T-1}) \\
&= \cdots \\
&= p(x_T)p(x_{T-1}|x_{T})\cdots p(x_0|x_1) \\
&= p(x_T)\prod_{t=1}^T p(x_{t-1}|x_t)
\end{aligned}
$$
$$
\begin{aligned}
q(x_{1:T}|x_0) &= q(x_{2:T}|x_1)q(x_1|x_0) \\
&= q(x_{3:T}|x_2,x_1)q(x_2|x_1)q(x_1|x_0) \\
&= q(x_{3:T}|x_2)q(x_2|x_1)q(x_1|x_0) \\
&= \cdots \\
&= q(x_{T}|x_{T-1})\cdots q(x_2|x_1)q(x_1|x_0)\\
&= \prod_{t=1}^T q(x_t|x_{t-1})
\end{aligned}
$$
Substituting these factorizations gives
$$
\begin{aligned}
\log p_{\theta}(x)&\geq\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^Tq(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)\prod_{t=2}^Tp_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^Tq(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)\prod_{t=1}^{T-1}p_{\theta}(x_{t}|x_{t+1})}{q(x_T|x_{T-1})\prod_{t=1}^{T-1}q(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{q(x_T|x_{T-1})}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=1}^{T-1}\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log p_{\theta}(x_0|x_1)]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\sum_{t=1}^{T-1}\log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log p_{\theta}(x_0|x_1)]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}]+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1}|x_0)}[\log p_{\theta}(x_0|x_1)]+\mathbb{E}_{q(x_{T},x_{T-1}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}]+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1},x_t,x_{t+1}|x_0)}[\log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}]
\end{aligned}
$$
[!NOTE]
- $\prod_{t=2}^T p_{\theta}(x_{t-1}|x_t)$ can be rewritten by the change of index $t \rightarrow t+1$ as $\prod_{t=1}^{T-1}p_{\theta}(x_{t}|x_{t+1})$;
- the expectation of a sum equals the sum of the expectations;
- in the last line, each expectation keeps only the variables that actually appear in its integrand; the remaining variables are integrated out and need not be sampled.
Eliminating variables
$$
\begin{aligned}
\mathbb{E}_{q(x_{T},x_{T-1}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}] &= \iint q(x_T,x_{T-1}|x_0)\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}dx_{T-1}dx_T \\
&=\iint \log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})} q(x_T|x_{T-1},x_0)q(x_{T-1}|x_0)dx_{T-1}dx_T \\
&=\iint \log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})} q(x_T|x_{T-1})q(x_{T-1}|x_0)dx_{T-1}dx_T \\
&=\int [\int \log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}q(x_T|x_{T-1})dx_T]q(x_{T-1}|x_0)dx_{T-1} \\
&=\int q(x_{T-1}|x_0)[-D_{KL}(q(x_T|x_{T-1})||p_{\theta}(x_T))]dx_{T-1} \\
&=\mathbb{E}_{q(x_{T-1}|x_0)}[-D_{KL}(q(x_T|x_{T-1})||p_{\theta}(x_T))]
\end{aligned}
$$
Pay close attention here: the order of integration is critical!
The pitfall I fell into was the following derivation:
$$
\begin{aligned}
\mathbb{E}_{q(x_{T},x_{T-1}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}] &= \iint q(x_T,x_{T-1}|x_0)\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}dx_{T-1}dx_T \\
&=\iint \log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})} q(x_T|x_{T-1},x_0)q(x_{T-1}|x_0)dx_{T-1}dx_T \\
&=\iint \log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})} q(x_T|x_{T-1})q(x_{T-1}|x_0)dx_{T-1}dx_T \\
&=\int [\int q(x_{T-1}|x_0)dx_{T-1}]\log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}q(x_T|x_{T-1})dx_T \\
&=\int 1 \times \log\frac{p_{\theta}(x_T)}{q(x_T|x_{T-1})}q(x_T|x_{T-1})dx_T \\
&=-D_{KL}(q(x_T|x_{T-1})||p_{\theta}(x_T))
\end{aligned}
$$
[!important]
- $\int p(x_1|x_2)dx_1=1$: note carefully that this is an integral over the variable $x_1$, and after integrating, $x_1$ is gone;
- in the expression above, if $x_{T-1}$ were integrated out first, the remaining integrand would still contain $x_{T-1}$ as a conditioning variable, so the integration could not be completed;
- therefore $x_T$ must be integrated out first, because the remaining factor in $x_{T-1}$ does not condition on $x_T$ (see the toy check right after this note).
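To make the correct grouping concrete, here is a small discrete toy check (hypothetical categorical distributions standing in for the continuous densities) confirming that integrating $x_T$ first reproduces the inner $-D_{KL}$ term exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                             # toy discrete state space
q_prev = rng.dirichlet(np.ones(K))                # q(x_{T-1} | x_0)
q_trans = rng.dirichlet(np.ones(K), size=K)       # q(x_T | x_{T-1}); row i conditions on x_{T-1}=i
p_T = rng.dirichlet(np.ones(K))                   # p_theta(x_T)

joint = q_prev[:, None] * q_trans                 # q(x_{T-1}, x_T | x_0)
lhs = np.sum(joint * np.log(p_T[None, :] / q_trans))

neg_kl = np.sum(q_trans * np.log(p_T[None, :] / q_trans), axis=1)   # -KL(q(.|x_{T-1}) || p)
rhs = np.sum(q_prev * neg_kl)                     # E_{q(x_{T-1}|x_0)}[ -KL ]

assert np.isclose(lhs, rhs)
```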
Similarly,
$$
\begin{aligned}
\mathbb{E}_{q(x_{t-1},x_t,x_{t+1}|x_0)}[\log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}] &= \iiint q(x_{t-1},x_t,x_{t+1}|x_0)\log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}dx_{t-1}dx_t dx_{t+1} \\
&=\iiint q(x_{t+1},x_{t-1}|x_0)q(x_t|x_{t-1})\log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}dx_{t-1}dx_t dx_{t+1} \\
&=\iint [\int \log\frac{p_{\theta}(x_{t}|x_{t+1})}{q(x_t|x_{t-1})}q(x_t|x_{t-1})dx_t]q(x_{t+1},x_{t-1}|x_0)dx_{t-1}dx_{t+1} \\
&=\iint q(x_{t+1},x_{t-1}|x_0)[-D_{KL}(q(x_t|x_{t-1})||p_{\theta}(x_{t}|x_{t+1}))]dx_{t-1}dx_{t+1} \\
&=\mathbb{E}_{q(x_{t-1},x_{t+1}|x_0)}[-D_{KL}(q(x_t|x_{t-1})||p_{\theta}(x_{t}|x_{t+1}))]
\end{aligned}
$$
At this point we have
$$
\begin{aligned}
\log p_{\theta}(x)\geq\ &\mathbb{E}_{q(x_{1}|x_0)}[\log p_{\theta}(x_0|x_1)]+\mathbb{E}_{q(x_{T-1}|x_0)}[-D_{KL}(q(x_T|x_{T-1})||p_{\theta}(x_T))] \\
&+\sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1},x_{t+1}|x_0)}[-D_{KL}(q(x_t|x_{t-1})||p_{\theta}(x_{t}|x_{t+1}))]
\end{aligned}
$$
A problem arises here: expectations taken over several variables at once produce high-variance estimates. Can some of those variables be eliminated?
Bayes' rule with the Markov property
$$
\begin{aligned}
\log p_{\theta}(x)&\geq\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^Tq(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)\prod_{t=2}^Tp_{\theta}(x_{t-1}|x_t)}{\prod_{t=1}^Tq(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)\prod_{t=2}^{T}p_{\theta}(x_{t-1}|x_{t})}{q(x_1|x_0)\prod_{t=2}^{T}q(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{q(x_1|x_{0})}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_t|x_{t-1})}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{q(x_1|x_{0})}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_t|x_{t-1},x_0)}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{q(x_1|x_{0})}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}\frac{p_{\theta}(x_{t-1}|x_{t})}{\frac{q(x_{t-1}|x_{t},x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)}}]
\end{aligned}
$$
[!NOTE]
- The Markov property says that $x_{t}$ depends only on $x_{t-1}$, hence $q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1})$;
- the reverse direction, however, is not Markovian in this sense, i.e. $q(x_{t-1}|x_{t},x_0) \neq q(x_{t-1}|x_{t})$, which is why the $x_0$ in $q(x_{t-1}|x_{t},x_0)$ is never dropped in what follows;
- notably, when deriving DDPM directly from the forward viewpoint, one adds an $x_0$ condition to $p(x_{t}|x_{t-1})$ under the non-Markovian reverse process and then removes it by predicting $\hat{x}_0=f(x_t,t)$. That route feels like a leap and is hard to picture; in the MLE/ELBO derivation here, applying $q(x_t|x_{t-1})=q(x_t|x_{t-1},x_0)$ under the Markov property is far more natural, which makes me strongly suspect that the forward derivation's "add $x_0$" trick was reverse-engineered from this ELBO derivation.
For the second expectation,
$$
\begin{aligned}
\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}\frac{q(x_{t-1}|x_{t},x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)}] &=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}q(x_{t-1}|x_{t},x_0)]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{\cancel{q(x_2|x_0)}}{q(x_1|x_0)}+\log\frac{\cancel{q(x_3|x_0)}}{\cancel{q(x_2|x_0)}}+\cdots+\log\frac{q(x_T|x_0)}{\cancel{q(x_{T-1}|x_0)}}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}q(x_{t-1}|x_{t},x_0)]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{q(x_T|x_0)}{q(x_{1}|x_0)}]
\end{aligned}
$$
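The cancellations above are just a telescoping sum of log-ratios; a one-line numeric check with arbitrary positive stand-in values:

```python
import numpy as np

a = np.random.default_rng(0).uniform(0.1, 1.0, size=11)   # a[t] stands in for q(x_t | x_0), t = 0..10
T = len(a) - 1
telescoped = sum(np.log(a[t] / a[t - 1]) for t in range(2, T + 1))
assert np.isclose(telescoped, np.log(a[T] / a[1]))         # only the endpoints survive
```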
Substituting back into the original expression gives
$$
\begin{aligned}
\log p_{\theta}(x)&\geq\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{q(x_1|x_{0})}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}\frac{p_{\theta}(x_{t-1}|x_{t})}{\frac{q(x_{t-1}|x_{t},x_0)q(x_t|x_0)}{q(x_{t-1}|x_0)}}] \\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{\cancel{q(x_1|x_{0})}}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\prod_{t=2}^{T}\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{\cancel{q(x_1|x_0)}}{q(x_{T}|x_0)}]\\
&=\mathbb{E}_{q(x_{1:T}|x_0)}[\log\frac{p_{\theta}(x_T)p_{\theta}(x_0|x_1)}{q(x_{T}|x_0)}]+\mathbb{E}_{q(x_{1:T}|x_0)}[\sum_{t=2}^{T}\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}]\\
&=\mathbb{E}_{q(x_{1}|x_0)}[\log p_{\theta}(x_0|x_1)]+\mathbb{E}_{q(x_{T}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_{T}|x_0)}]+\mathbb{E}_{q(x_{t-1},x_t|x_0)}[\sum_{t=2}^{T}\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}] \\
&=\mathbb{E}_{q(x_{1}|x_0)}[\log p_{\theta}(x_0|x_1)]+\mathbb{E}_{q(x_{T}|x_0)}[\log\frac{p_{\theta}(x_T)}{q(x_{T}|x_0)}]+\sum_{t=2}^{T}\mathbb{E}_{q(x_{t-1},x_t|x_0)}[\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}] \\
&=\mathbb{E}_{q(x_{1}|x_0)}[\log p_{\theta}(x_0|x_1)]-D_{KL}(q(x_{T}|x_0)||p_{\theta}(x_T))+\sum_{t=2}^{T}\mathbb{E}_{q(x_{t-1},x_t|x_0)}[\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}]
\end{aligned}
$$
where
$$
\begin{aligned}
\mathbb{E}_{q(x_{t-1},x_t|x_0)}[\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}] &= \iint q(x_{t-1},x_t|x_0)\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}dx_{t-1}dx_t \\
&=\iint q(x_{t-1}|x_{t},x_0)q(x_{t}|x_{0})\log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}dx_{t-1}dx_t \\
&=\int [\int \log\frac{p_{\theta}(x_{t-1}|x_{t})}{q(x_{t-1}|x_{t},x_0)}q(x_{t-1}|x_{t},x_0)dx_{t-1}]q(x_{t}|x_0)dx_{t} \\
&=\mathbb{E}_{q(x_{t}|x_0)}[-D_{KL}(q(x_{t-1}|x_{t},x_0)||p_{\theta}(x_{t-1}|x_{t}))]
\end{aligned}
$$
[!NOTE]
Note that $q(x_{t-1},x_t|x_0)$ cannot be factored here as $q(x_t|x_{t-1})q(x_{t-1}|x_0)$: no matter whether $x_{t-1}$ or $x_t$ is integrated first, it would still appear as a conditioning variable in the remaining integrand.
With the Markov property, the elimination of the multi-variable expectations is now complete:
$$
\begin{aligned}
\log p_{\theta}(x)&\geq\underbrace{\mathbb{E}_{q(x_{1}|x_0)}[\log p_{\theta}(x_0|x_1)]}_{\text{reconstruction term}}-\underbrace{D_{KL}(q(x_{T}|x_0)||p_{\theta}(x_T))}_{\text{regularization term}}+\underbrace{\sum_{t=2}^{T}\mathbb{E}_{q(x_{t}|x_0)}[-D_{KL}(q(x_{t-1}|x_{t},x_0)||p_{\theta}(x_{t-1}|x_{t}))]}_{\text{denoising matching term}}
\end{aligned}
$$
Note that the first two terms have the same form as in the VAE.
When $T=1$, there is only a single latent variable $x_1=z$, and the bound reduces to exactly the VAE ELBO.
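The regularization term contains no $\theta$ and can be computed in closed form. A sketch assuming the standard closed-form marginal $q(x_T|x_0)=N(\sqrt{\bar{\alpha}_T}x_0,(1-\bar{\alpha}_T)I)$ (i.e. $\bar{\beta}_T=1-\bar{\alpha}_T$ in this note's notation) and the prior $p_{\theta}(x_T)=N(0,I)$:

```python
import numpy as np

def regularization_term(x0, alpha_bar_T):
    """D_KL( q(x_T|x_0) || N(0, I) ), summed over dimensions; no theta anywhere."""
    mu = np.sqrt(alpha_bar_T) * x0            # mean of q(x_T | x_0)
    var = 1.0 - alpha_bar_T                   # per-dimension variance of q(x_T | x_0)
    # Gaussian KL per dimension: log(1/sigma) + (var + mu^2)/2 - 1/2, then summed
    return np.sum(-0.5 * np.log(var) + (var + mu**2) / 2.0 - 0.5)

x0 = np.random.default_rng(0).standard_normal(16)
print(regularization_term(x0, alpha_bar_T=1e-4))   # ~0: x_T is essentially pure noise
```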
Interpreting the ELBO
The denoising matching term $\sum_{t=2}^{T}\mathbb{E}_{q(x_{t}|x_0)}[-D_{KL}(q(x_{t-1}|x_{t},x_0)||p_{\theta}(x_{t-1}|x_{t}))]$ accounts for most of the ELBO, so we examine it first.
Here $p_{\theta}(x_{t-1}|x_{t})$ is the model's parameterization, and $q(x_{t-1}|x_{t},x_0)$ is the ground truth it must match.
The [[001 DDPM-v2#后向生成过程|derivation of this ground truth]] is not repeated here; the final result is
$$
q(x_{t-1}|x_{t},x_0) = N(\frac{1}{\sqrt{\alpha_{t}}}[x_{t}-\frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}}\bar{\epsilon}_{t}],\ \frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}I)
$$
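Stated as code under this note's notation, and assuming $\bar{\beta}_t=1-\bar{\alpha}_t$ with $\bar{\epsilon}_t$ the total noise relating $x_0$ and $x_t$ (as defined in [[001 DDPM-v2]]), the posterior's parameters would look like this sketch:

```python
import numpy as np

def q_posterior(x_t, eps_bar_t, t, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0), taking beta_bar_t := 1 - alpha_bar_t."""
    beta_t = 1.0 - alphas[t]
    beta_bar_t = 1.0 - alpha_bars[t]
    beta_bar_prev = 1.0 - alpha_bars[t - 1]
    mean = (x_t - beta_t / np.sqrt(beta_bar_t) * eps_bar_t) / np.sqrt(alphas[t])
    var = beta_t * beta_bar_prev / beta_bar_t
    return mean, var
```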
Since the parameterized $p_{\theta}(x_{t-1}|x_{t})$ only needs to approximate $q(x_{t-1}|x_{t},x_0)$, we might as well:
- use the variance of $q(x_{t-1}|x_{t},x_0)$ directly: $\frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}$;
- mirror the form of the mean of $q(x_{t-1}|x_{t},x_0)$ and let the network predict the noise: $\frac{1}{\sqrt{\alpha_{t}}}[x_{t}-\frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}}\bar{\epsilon}_{\theta}]$ (see the sampling sketch right after this list).
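Under these choices, sampling from $p_{\theta}(x_{t-1}|x_t)$ is the ground-truth posterior with $\bar{\epsilon}_t$ replaced by the network prediction; `eps_model(x_t, t)` below is a hypothetical stand-in for $\bar{\epsilon}_{\theta}$, with the same $\bar{\beta}_t=1-\bar{\alpha}_t$ assumption as before:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_theta_step(x_t, t, eps_model, alphas, alpha_bars):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t): posterior-shaped mean + posterior variance."""
    beta_t = 1.0 - alphas[t]
    beta_bar_t = 1.0 - alpha_bars[t]
    beta_bar_prev = 1.0 - alpha_bars[t - 1]
    eps_hat = eps_model(x_t, t)                        # network prediction of the total noise
    mean = (x_t - beta_t / np.sqrt(beta_bar_t) * eps_hat) / np.sqrt(alphas[t])
    std = np.sqrt(beta_t * beta_bar_prev / beta_bar_t)
    return mean + std * rng.standard_normal(x_t.shape)
```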
Substituting these choices and expanding $D_{KL}(q(x_{t-1}|x_{t},x_0)||p_{\theta}(x_{t-1}|x_{t}))$:
$$
\begin{aligned}
D_{KL}(q(x_{t-1}|x_{t},x_0)||p_{\theta}(x_{t-1}|x_{t})) &= D_{KL}(N(\frac{1}{\sqrt{\alpha_{t}}}[x_{t}-\frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}}\bar{\epsilon}_{t}], \frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}I)\,||\,N(\frac{1}{\sqrt{\alpha_{t}}}[x_{t}-\frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}}\bar{\epsilon}_{\theta}], \frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}I))
\end{aligned}
$$
Using the standard Gaussian KL formula
$$
D_{KL}(N(\mu_1,\sigma_1^2I)||N(\mu_2,\sigma_2^2I))=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}
$$
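Written as a helper, the quoted formula (one dimension; with equal variances it collapses to $\frac{(\mu_1-\mu_2)^2}{2\sigma^2}$, which is exactly the simplification used in the next step):

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), one dimension."""
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

# With equal variances only the squared mean difference survives:
assert np.isclose(gaussian_kl(1.0, 0.3, 2.0, 0.3), (1.0 - 2.0)**2 / (2 * 0.3**2))
```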
the final value is
$$
\begin{aligned}
D_{KL}(q(x_{t-1}|x_{t},x_0)||p_{\theta}(x_{t-1}|x_{t})) &= \log\frac{\sqrt{\frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}}}{\sqrt{\frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}}}+\frac{\frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}+(\frac{1}{\sqrt{\alpha_{t}}}[x_{t}-\frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}}\bar{\epsilon}_{t}]-\frac{1}{\sqrt{\alpha_{t}}}[x_{t}-\frac{\beta_{t}}{\sqrt{\bar{\beta}_{t}}}\bar{\epsilon}_{\theta}])^2}{2\frac{\beta_{t}\bar{\beta}_{t-1}}{\bar{\beta}_{t}}}-\frac{1}{2} \\
&=\frac{\beta_t}{2\alpha_t\bar{\beta}_{t-1}}\Vert \bar{\epsilon}_{\theta}-\bar{\epsilon}_{t} \Vert^2
\end{aligned}
$$
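Up to the time-dependent weight $\frac{\beta_t}{2\alpha_t\bar{\beta}_{t-1}}$, maximizing the ELBO therefore reduces to a noise-prediction regression. A minimal PyTorch-style training-step sketch, assuming the closed-form marginal $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\bar{\epsilon}_t$; `eps_model` is a hypothetical network:

```python
import torch

def ddpm_denoising_loss(eps_model, x0, alphas, alpha_bars, weighted=False):
    """One stochastic estimate of the denoising matching term (up to constants).
    alphas, alpha_bars: 1-D tensors indexed 0..T, with alpha_bars[0] = 1 (assumption)."""
    B = x0.shape[0]
    T = len(alpha_bars) - 1
    t = torch.randint(2, T + 1, (B,))                             # t ~ Uniform{2..T}, as in the ELBO sum
    a_bar = alpha_bars[t].view(B, *[1] * (x0.dim() - 1))
    eps = torch.randn_like(x0)                                    # \bar{\epsilon}_t
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps    # closed-form q(x_t | x_0)
    err = (eps_model(x_t, t) - eps) ** 2                          # ||eps_theta - eps_t||^2
    if weighted:                                                  # keep the beta_t / (2 alpha_t beta_bar_{t-1}) weight
        w = (1 - alphas[t]) / (2 * alphas[t] * (1 - alpha_bars[t - 1]))
        err = w.view(B, *[1] * (x0.dim() - 1)) * err
    return err.reshape(B, -1).sum(dim=1).mean()                   # unweighted case = the "simplified" DDPM loss
```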