Python Full-Stack Series [Stage 4] (50)

Chapter 5: Deep Learning

11. Diffusion Models

4. Appendix: The Mathematical Derivation of Diffusion

4.2 Derivation of the Forward Diffusion Process

Let the initial data $x_0$ follow the distribution $q(x_0)$, i.e. the distribution of the training set. Gaussian noise is then added to it step by step. The noise itself involves no trainable parameters; in other words, its mean and variance are fixed, and the strength of the added noise is controlled by the variance schedule $\beta_1, \cdots, \beta_T$, small values between 0 and 1 that generally increase over time. The process is fixed as a Markov chain whose per-step conditional transition distribution is $q(x_t|x_{t-1})$, and the overall posterior $q(x_{1:T}|x_0)$ factorizes as a product:

$$q(x_t|x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right), \qquad q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1})$$
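As a concrete illustration, a single forward step can be simulated directly from this definition. The following is a minimal NumPy sketch; the linear $\beta$ schedule and the step count $T = 1000$ are hypothetical choices for illustration, not values fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear beta schedule: small values in (0, 1), increasing with t
T = 1000
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """One Markov step: sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps

x = rng.standard_normal(3)   # a toy data point standing in for x_0
for t in range(T):           # running all T steps drives x toward pure noise
    x = forward_step(x, t)
```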

A key property of the forward process is that the transition distribution $q(x_t|x_0)$ at any time step can be computed directly from the schedule $\beta$ and $x_0$, as follows:

$$q(x_t|x_0) = \mathcal{N}\!\left(x_t; \sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I\right)$$

where $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$.

4.2.1 Deriving the Forward Diffusion Process

Using the reparameterization trick, $x_t$ can be written as a scaled $x_{t-1}$ plus a noise term:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1}; \qquad \text{where } \epsilon_{t-1}, \epsilon_{t-2}, \cdots \sim \mathcal{N}(0, I)$$

In turn, $x_{t-1}$ can be written as a scaled $x_{t-2}$ plus a noise term:

$$\begin{aligned} x_t &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\, \epsilon_{t-2}\right) + \sqrt{1-\alpha_t}\, \epsilon_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1-\alpha_t}\, \epsilon_{t-1} \end{aligned}$$

The second and third terms above are a sum of two independent Gaussian noise variables. The sum of two zero-mean Gaussians is still zero-mean, and their variances add: $\sigma_1^2 + \sigma_2^2 = \alpha_t(1-\alpha_{t-1}) + 1 - \alpha_t = 1 - \alpha_t\alpha_{t-1}$.
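Before merging, this variance-addition step can be verified numerically. A quick sketch (the values of $\alpha_t$ and $\alpha_{t-1}$ are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
a_t, a_tm1 = 0.98, 0.99   # arbitrary example values for alpha_t and alpha_{t-1}
n = 1_000_000

# Sum of the two independent zero-mean Gaussian terms from the derivation
s = (np.sqrt(a_t * (1 - a_tm1)) * rng.standard_normal(n)
     + np.sqrt(1 - a_t) * rng.standard_normal(n))

print(s.mean())                   # ~ 0: the sum stays zero-mean
print(s.var(), 1 - a_t * a_tm1)   # empirical variance ~ 1 - alpha_t * alpha_{t-1}
```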

So the second and third terms can be merged and written as:

$$\begin{aligned} x_t &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\, \bar\epsilon_{t-2}; \qquad \text{where } \bar\epsilon_{t-2} \text{ merges the two Gaussians} \\ &= \cdots \\ &= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon \end{aligned}$$

In this way, the distribution at any time step $t$ can be obtained directly from the initial data and $t$, without simulating the intermediate steps.
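In code, this means $x_t$ can be sampled in one shot rather than by iterating $t$ Markov steps. A minimal sketch under the same hypothetical schedule as above:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # hypothetical linear schedule, as above
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bar = np.cumprod(alphas)       # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) directly."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

x0 = rng.standard_normal(3)
x500, eps = q_sample(x0, t=500)      # jump straight to step 500, no loop needed
```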

4.2.2 Derivation of the Reverse Denoising Process

If the forward diffusion process is likened to ink diffusing in water, the reverse process is like extracting the ink back out of the water. To simplify the analysis, it is also assumed to be a Markov chain with Gaussian transition distributions. This turns the problem into parameter estimation: a neural network is used to learn the transition distribution.

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t), \qquad p_\theta(x_{t-1}|x_t) = \mathcal{N}\!\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$$

Here the network's inputs are $x_t$ and $t$; the transition distribution has mean $\mu$ and covariance $\Sigma$; $\theta$ denotes the model parameters to be estimated, and the transition probability $p_\theta$ is unknown. The reverse process is harder than the forward one (dissolving ink into water is easy; extracting it back out is hard). The Diffusion model's approach is to derive, through formula manipulation, the forward-process posterior $q(x_{t-1}|x_t, x_0)$ and use it as the target that the reverse transition $p_\theta(x_{t-1}|x_t)$ should approximate:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(x_{t-1}; \tilde\mu(x_t, x_0), \tilde\beta_t I\right)$$

By Bayes' theorem:

$$\begin{aligned} q(x_{t-1}|x_t, x_0) &= \frac{q(x_{t-1})\, q(x_t, x_0 | x_{t-1})}{q(x_t, x_0)} \\ &= \frac{q(x_{t-1})\, q(x_t|x_{t-1})\, q(x_0|x_{t-1})}{q(x_0)\, q(x_t|x_0)} \\ &= \frac{q(x_{t-1})\, q(x_t|x_{t-1})}{q(x_0)\, q(x_t|x_0)} \times \frac{q(x_0)\, q(x_{t-1}|x_0)}{q(x_{t-1})} \\ &= \frac{q(x_t|x_{t-1})\, q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &= q(x_t|x_{t-1}, x_0)\, \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \end{aligned}$$

(The last step uses the Markov property, $q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0)$.)

Writing this in exponential form, dropping the leading normalization constants, expanding, and collecting terms into standard quadratic form:

$$\begin{aligned} & q(x_t|x_{t-1}, x_0)\, \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto \exp\left( -\frac{1}{2}\left( \frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{\beta_t} + \frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}}\, x_0)^2}{1-\bar\alpha_{t-1}} - \frac{(x_t - \sqrt{\bar\alpha_t}\, x_0)^2}{1-\bar\alpha_t} \right)\right) \\ &= \exp\left( -\frac{1}{2}\left( \frac{x_t^2 - 2\sqrt{\alpha_t}\, x_t x_{t-1} + \alpha_t x_{t-1}^2}{\beta_t} + \frac{x_{t-1}^2 - 2\sqrt{\bar\alpha_{t-1}}\, x_0 x_{t-1} + \bar\alpha_{t-1} x_0^2}{1-\bar\alpha_{t-1}} - \frac{(x_t - \sqrt{\bar\alpha_t}\, x_0)^2}{1-\bar\alpha_t} \right)\right) \\ &= \exp\left( -\frac{1}{2}\left( \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\right) x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t} x_t + \frac{2\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}} x_0\right) x_{t-1} + C(x_t, x_0) \right)\right) \end{aligned}$$

By completing the square, $ax^2 + bx + c = a\left(x + \frac{b}{2a}\right)^2 + \left(c - \frac{b^2}{4a}\right)$, and comparing with the Gaussian density $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, the mean is $-\frac{b}{2a}$ and the variance is $\frac{1}{a}$. Substituting gives the mean and variance (constants omitted below):

$$\begin{aligned} \tilde\mu(x_t, x_0) &= \left(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}} x_0\right) \Big/ \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\right) \\ &= \left(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}} x_0\right) \cdot \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \cdot \beta_t \\ &= \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t} x_0 \\ \tilde\beta_t &= 1 \Big/ \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\right) = 1 \Big/ \frac{\alpha_t - \bar\alpha_t + \beta_t}{\beta_t(1-\bar\alpha_{t-1})} = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \cdot \beta_t \end{aligned}$$

Here the earlier definitions are used:

$$\alpha_t = 1 - \beta_t, \qquad \bar\alpha_t = \prod_{i=1}^{t} \alpha_i$$

This gives a closed-form expression for the forward-process posterior $q(x_{t-1}|x_t, x_0)$, which is again a Gaussian. Its mean is an expression in $\alpha, \bar\alpha, \beta_t$ and $x_0, x_t$, while its variance is purely a constant independent of $x$. Further, using the relationship between $x_0$ and $x_t$ derived via the reparameterization trick in the forward process:

$$x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_t\right)$$

Substituting this into the mean expression above to eliminate $x_0$, $\tilde\mu_t$ becomes:

$$\begin{aligned} \tilde\mu_t &= \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t} \cdot \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_t\right) \\ &= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_t\right) \end{aligned}$$
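The second equality deserves unpacking. Since $\bar\alpha_t = \alpha_t \bar\alpha_{t-1}$, we have $\sqrt{\bar\alpha_{t-1}}/\sqrt{\bar\alpha_t} = 1/\sqrt{\alpha_t}$, so the $x_0$ term contributes $\frac{\beta_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}$ to the coefficient of $x_t$. Collecting that coefficient (using $\beta_t = 1-\alpha_t$):

$$\frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} + \frac{\beta_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} = \frac{\alpha_t(1-\bar\alpha_{t-1}) + 1 - \alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} = \frac{1-\bar\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}} = \frac{1}{\sqrt{\alpha_t}}$$

while the $\epsilon_t$ coefficient collapses to $-\frac{1-\alpha_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}$, which gives the second line.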

Here $\epsilon_t$ is a random value sampled from the standard normal distribution at step $t$. This completes the derivation of the forward-process posterior $q(x_{t-1}|x_t, x_0)$: it is still a Gaussian, its mean depends only on $x_t$ and the standard-normal noise, and its variance depends only on the constants $\alpha$ and $\beta$. The significance of diffusion models is that they offer a new generative-modeling paradigm that better describes how data evolves. In summary, both the forward and reverse processes are Markov chains. The forward process is deterministic and controllable, adding noise step by step through the schedule $\beta_t$, with Gaussian transition distributions. The reverse process, though harder, can also be assumed to have Gaussian transitions $p_\theta(x_{t-1}|x_t)$, approximated by a neural network. Because direct estimation lacks a usable target, the more tractable, closed-form forward posterior $q(x_{t-1}|x_t, x_0)$ was derived first and used as the target for $p_\theta(x_{t-1}|x_t)$. These three distributions in a sense characterize the entire evolution of a Diffusion model, and they are used below when deriving the loss function.
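To make the summary concrete, the sketch below computes $\tilde\mu_t$ and $\tilde\beta_t$ from $x_t$ and a noise estimate and then draws $x_{t-1}$, which is exactly what one reverse step does at sampling time. It reuses the hypothetical schedule from earlier; `noise_estimate` is a stand-in for a trained $\epsilon_\theta(x_t, t)$ and is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # hypothetical linear schedule, as above
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_step(xt, t, noise_estimate):
    """One reverse step using the closed-form posterior:
    mean = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
    var  = (1 - abar_{t-1}) / (1 - abar_t) * beta_t   (the constant \tilde{beta}_t)."""
    mean = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * noise_estimate) / np.sqrt(alphas[t])
    var = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t] if t > 0 else 0.0
    return mean + np.sqrt(var) * rng.standard_normal(xt.shape)

# In a real sampler, noise_estimate would come from the trained network eps_theta(x_t, t)
xt = rng.standard_normal(3)
x_prev = reverse_step(xt, t=500, noise_estimate=rng.standard_normal(3))
```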

4.2.3 Variational Derivation of the Loss Function

We start from the negative log-likelihood of the data. It is hard to optimize directly, so instead we look for an upper bound by adding a KL divergence term, which is always non-negative:

$$-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + D_{KL}\!\left(q(x_{1:T}|x_0)\,\|\, p_\theta(x_{1:T}|x_0)\right)$$

Minimizing the negative log-likelihood is thus equivalent to minimizing its upper bound (the right-hand side above). Transform the right-hand side: first expand the KL term using Bayes' theorem, then cancel the final, now-irrelevant term:

$$\begin{aligned} -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) + D_{KL}\!\left(q(x_{1:T}|x_0)\,\|\,p_\theta(x_{1:T}|x_0)\right) \\ &= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \\ &= -\log p_\theta(x_0) + \mathbb{E}_q\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0)\right] \\ &= \mathbb{E}_q\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \end{aligned}$$

Taking the expectation with respect to $q(x_0)$ on both sides, the left side becomes the cross-entropy and the right side an expectation over $q(x_{0:T})$:

$$\mathbb{E}_{q(x_0)}\left[-\log p_\theta(x_0)\right] \leq \mathbb{E}_{q(x_{0:T})}\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right]$$

Minimizing the cross-entropy is then equivalent to minimizing this bound. The right-hand side is the negated Evidence Lower Bound (ELBO) from variational inference: because we minimize the negative log-likelihood rather than maximize the log-likelihood, the lower bound becomes an upper bound. The Diffusion model takes the cross-entropy of the target data as its loss, bounds it variationally as above, and then keeps simplifying the bound: the numerator is the conditional distribution of the forward process, the denominator the joint distribution of the reverse process. What follows is a long but mechanical derivation that puts the right-hand side into an iterative form:

$$\begin{aligned} L_{VLB} &= \mathbb{E}_{q(x_{0:T})}\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \\ &= \mathbb{E}_q\left[\log\frac{\prod_{t=1}^{T} q(x_t|x_{t-1})}{p_\theta(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}\right] \\ &= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=1}^{T} \log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\right] \\ &= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^{T} \log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\right] \\ &= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^{T} \log\left(\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} \cdot \frac{q(x_t|x_0)}{q(x_{t-1}|x_0)}\right) + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\right] \\ &= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^{T} \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \sum_{t=2}^{T} \log\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\right] \\ &= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^{T} \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_T|x_0)}{q(x_1|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\right] \\ &= \mathbb{E}_q\left[\log\frac{q(x_T|x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} - \log p_\theta(x_0|x_1)\right] \\ &= \mathbb{E}_q\Big[\underbrace{D_{KL}(q(x_T|x_0)\,\|\,p_\theta(x_T))}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t))}_{L_{t-1}} \underbrace{- \log p_\theta(x_0|x_1)}_{L_0}\Big] \end{aligned}$$

4.2.4 Parameterizing the Loss Function

The KL divergence between two Gaussians $p$ and $q$ can in fact be computed in closed form; it depends only on their means and variances:

$$KL(p, q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
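As a sanity check, the closed form can be compared with a Monte Carlo estimate of $\mathbb{E}_p[\log p(x) - \log q(x)]$. A small sketch with arbitrary example parameters:

```python
import numpy as np

def gaussian_kl(mu1, s1, mu2, s2):
    """Closed-form KL(p || q) for univariate Gaussians p = N(mu1, s1^2), q = N(mu2, s2^2)."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.5, 1.2, -0.3, 0.8   # arbitrary example parameters
x = rng.normal(mu1, s1, size=1_000_000)  # samples from p

log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1 * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2 * np.sqrt(2 * np.pi))

print(gaussian_kl(mu1, s1, mu2, s2))  # closed form
print((log_p - log_q).mean())         # Monte Carlo estimate, approximately equal
```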

In the KL terms of the loss derived above, the variances of both Gaussians are constants, so they contribute nothing to the optimization and can be dropped. What remains is the part containing the two means:

$$L_t = \mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\|\Sigma_\theta(x_t, t)\|_2^2}\, \|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right]$$

Here $\tilde\mu(x_t, x_0)$ is the mean of $q(x_{t-1}|x_t, x_0)$ and $\mu_\theta(x_t, t)$ is the mean of $p_\theta(x_{t-1}|x_t)$. The optimization objective is for the reverse-process mean $\mu_\theta$ to approach the forward posterior mean $\tilde\mu$; in other words, training teaches $\mu_\theta$ to predict $\tilde\mu$. The former was already derived above in closed form:

$$\begin{aligned} \tilde\mu_t &= \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t} \cdot \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_t\right) \\ &= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_t\right) \end{aligned}$$

Since $x_t$ is known during training, the mean $\mu_\theta$ can likewise be written, via the reparameterization trick, in terms of $x_t$ and a parameterized Gaussian noise term $\epsilon_\theta$ (second step below). Collecting like terms cancels $x_t$, leaving only the difference between the two $\epsilon$ values. Finally, using the forward-process reparameterization, $x_t$ is replaced by its expression in $x_0$ and $\epsilon_t$:

$$\begin{aligned} L_t &= \mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\|\Sigma_\theta(x_t, t)\|_2^2}\, \|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right] \\ &= \mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\|\Sigma_\theta\|_2^2}\, \left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_t\right) - \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right)\right\|^2\right] \\ &= \mathbb{E}_{x_0, \epsilon}\left[\frac{(1-\alpha_t)^2}{2\alpha_t(1-\bar\alpha_t)\|\Sigma_\theta\|_2^2}\, \|\epsilon_t - \epsilon_\theta(x_t, t)\|^2\right] \\ &= \mathbb{E}_{x_0, \epsilon}\left[\frac{(1-\alpha_t)^2}{2\alpha_t(1-\bar\alpha_t)\|\Sigma_\theta\|_2^2}\, \left\|\epsilon_t - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_t, t\right)\right\|^2\right] \end{aligned}$$

The expression above says: a neural network takes $x_0$, $\epsilon_t$, and the time step $t$ as input and outputs a prediction $\epsilon_\theta$ that approximates the forward-process noise $\epsilon_t$. This realizes the optimization of the negative log-likelihood. The paper's authors further found that the leading coefficient can be dropped without affecting results, and training is even more stable, so the loss can be simplified to:

$$\begin{aligned} L_t^{simple} &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t}\left[\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2\right] \\ &= \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t}\left[\left\|\epsilon_t - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_t, t\right)\right\|^2\right] \end{aligned}$$

Here $\epsilon_\theta$ is the value predicted by a parameterized neural network, and $\epsilon_t$ is the random noise at the current step. This completes the derivation of the loss function: the process is fairly involved, but the result is remarkably simple.
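The simplified objective translates almost directly into a training step. Below is a minimal PyTorch sketch under stated assumptions: `eps_model` is a placeholder for any network mapping $(x_t, t)$ to predicted noise, and the schedule is the same hypothetical linear one used earlier.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # hypothetical linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def simple_loss(eps_model, x0):
    """L_simple: MSE between the true noise and the predicted noise
    eps_theta(sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, t)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # t sampled uniformly over steps
    eps = torch.randn_like(x0)                           # eps_t ~ N(0, I)
    ab = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # closed-form forward sample
    return ((eps - eps_model(xt, t)) ** 2).mean()        # noise-prediction MSE
```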
