Preface
Last time we covered OpenAI's DDPM (see the earlier DDPM principles and code walkthrough) and IDDPM (IDDPM principles and code walkthrough), as well as Stanford's DDIM (Denoising Diffusion Implicit Models). This time we look at another OpenAI work, Diffusion Models Beat GANs on Image Synthesis.
github: https://github.com/openai/guided-diffusion
This post mainly follows the walkthrough 66、Classifier Guided Diffusion条件扩散模型论文与PyTorch代码详细解读.
The code in this part is largely based on the codebase of the IDDPM paper; see the earlier IDDPM principles and code walkthrough.
Leaving a placeholder here for now... the code-analysis part is still not finished.
Theory
Preliminaries
(1) The authors first ran a large set of ablation experiments on unconditional diffusion models, drew conclusions from them, and used those conclusions to design the architecture.
(2) A straightforward way to condition a diffusion model is to embed the label and add it to the time embedding, but on its own this does not work very well. So the paper adds classifier guidance on top (without discarding the conventional conditional-generation approach above).
Concretely, the gradient of a classifier with respect to the (noisy) image $X_t$ is used to steer the model during sampling, as sketched below.
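In the paper's Algorithm 1, each reverse step samples from a Gaussian whose mean is shifted by the scaled classifier gradient:

$X_{t-1} \sim \mathcal{N}\big(\mu_{\theta}(X_t, t) + s\,\Sigma_{\theta}(X_t, t)\,\nabla_{X_t} \log p_{\phi}(y \mid X_t),\; \Sigma_{\theta}(X_t, t)\big)$

where $s$ is the guidance scale (classifier_scale in the code).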
Introduction
(1) Diffusion models are likelihood-based models.
(2) The model reuses the learned variance interpolation from Improved DDPM (the $v$ in the formula below):
$\Sigma_{\theta}(X_t, t) = \exp\big(v \log \beta_t + (1-v) \log \widetilde{\beta}_t\big)$
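In code terms this is the LEARNED_RANGE parameterization; a minimal sketch (the network actually emits a raw value in $[-1, 1]$ that is first mapped to $v \in [0, 1]$; names here are illustrative, not the exact repo variables):

def interp_log_variance(model_var_values, log_beta_t, log_beta_tilde_t):
    # map the raw network output from [-1, 1] to v in [0, 1]
    v = (model_var_values + 1) / 2
    # interpolate between the two log-variance bounds:
    # log(beta_t) (upper) and log(beta_tilde_t) (lower)
    return v * log_beta_t + (1 - v) * log_beta_tilde_t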
(3) Changes to the UNet architecture:
We explore the following architectural changes:
• Increasing depth versus width, holding model size relatively constant.
• Increasing the number of attention heads.
• Using attention at 32×32, 16×16, and 8×8 resolutions rather than only at 16×16.
• Using the BigGAN residual block for upsampling and downsampling the activations.
• Rescaling residual connections with $\frac{1}{\sqrt{2}}$, following [60, 27, 28] (see the small sketch after this list).
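As a concrete (and deliberately generic) illustration of the last point, rescaling a residual connection by $\frac{1}{\sqrt{2}}$ keeps the variance of the summed branches roughly constant; this is my own minimal sketch, not the repo's ResBlock:

import math

def rescaled_residual(x, fx):
    # summing two roughly unit-variance branches doubles the variance;
    # dividing by sqrt(2) keeps it approximately unchanged
    return (x + fx) / math.sqrt(2.0)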
Adaptive Group Normalization
The time embedding and label embedding are used to produce $y_s$ and $y_b$:
$\mathrm{AdaGN}(h, y) = y_s\,\mathrm{GroupNorm}(h) + y_b$
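A minimal PyTorch sketch of this operation (my own simplified module for illustration; in guided_diffusion/unet.py the same idea is folded into ResBlock via use_scale_shift_norm, where the scale is applied as $(1 + y_s)$):

import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive GroupNorm: scale/shift the normalized activations with an embedding."""

    def __init__(self, channels, emb_dim, num_groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, channels)
        # project the combined time/label embedding to a per-channel scale and shift
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h, emb):
        y_s, y_b = self.proj(emb).chunk(2, dim=1)  # each of shape (N, C)
        y_s = y_s[:, :, None, None]                # broadcast over H, W
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b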
Further details are given in Appendix H of the paper (pp. 25-26).
Code
Despite all the derivation above, the code is largely the same as before; here we only go through the parts that differ.
p_sample
guided_diffusion/gaussian_diffusion.py
def p_sample(
    self,
    model,
    x,
    t,
    clip_denoised=True,
    denoised_fn=None,
    cond_fn=None,
    model_kwargs=None,
):
    """
    Sample x_{t-1} from the model at the given timestep.

    :param cond_fn: if not None, this is a gradient function that acts
                    similarly to the model.
    """
    out = self.p_mean_variance(
        model,
        x,
        t,
        clip_denoised=clip_denoised,
        denoised_fn=denoised_fn,
        model_kwargs=model_kwargs,
    )
    noise = th.randn_like(x)
    nonzero_mask = (
        (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
    )  # no noise when t == 0
    if cond_fn is not None:
        out["mean"] = self.condition_mean(
            cond_fn, out, x, t, model_kwargs=model_kwargs
        )
    sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
    return {"sample": sample, "pred_xstart": out["pred_xstart"]}
Compared with the unconditional version, the only extra step is this:
if cond_fn is not None:
    out["mean"] = self.condition_mean(
        cond_fn, out, x, t, model_kwargs=model_kwargs
    )
condition_mean
guided_diffusion/gaussian_diffusion.py
def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
    """
    Compute the mean for the previous step, given a function cond_fn that
    computes the gradient of a conditional log probability with respect to
    x. In particular, cond_fn computes grad(log(p(y|x))), and we want to
    condition on y.

    This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
    """
    gradient = cond_fn(x, self._scale_timesteps(t), **model_kwargs)
    new_mean = (
        p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
    )
    return new_mean
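In other words, out["mean"] becomes $\mu_{\theta}(X_t, t) + \Sigma_{\theta}(X_t, t)\,\big(s\,\nabla_{X_t}\log p_{\phi}(y \mid X_t)\big)$, i.e. exactly the Algorithm 1 update shown in the theory section; the guidance scale $s$ is already folded into cond_fn (see below).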
cond_fn
scripts/classifier_sample.py

What cond_fn returns here is $s \times \nabla_{X_t} \log p_{\phi}(y \mid X_t)$, where $s$ is args.classifier_scale.
def cond_fn(x, t, y=None):
    assert y is not None
    with th.enable_grad():
        x_in = x.detach().requires_grad_(True)
        logits = classifier(x_in, t)
        log_probs = F.log_softmax(logits, dim=-1)
        selected = log_probs[range(len(logits)), y.view(-1)]
        return th.autograd.grad(selected.sum(), x_in)[0] * args.classifier_scale
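For context, this is roughly how scripts/classifier_sample.py wires cond_fn (and the target labels) into the sampling loop; a simplified sketch, where batch_size, image_size, device and model_fn stand in for the script's actual variables and NUM_CLASSES is the number of classifier classes:

model_kwargs = {"y": th.randint(low=0, high=NUM_CLASSES, size=(batch_size,), device=device)}
sample = diffusion.p_sample_loop(
    model_fn,                                   # wrapper around the (class-conditional) UNet
    (batch_size, 3, image_size, image_size),
    clip_denoised=True,
    model_kwargs=model_kwargs,                  # y is also fed to the UNet's label embedding
    cond_fn=cond_fn,                            # classifier gradient, scaled by classifier_scale
    device=device,
)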
ddim_sample
This is the DDIM sampling method; it was covered in the IDDPM原理和代码剖析 post, so please refer there if anything is unclear. Here we only go over the main changes.
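For reference, the DDIM update implemented below (the "Equation 12" referenced in the code comment, written in this post's notation, with $\hat{X}_0$ the predicted clean image and $\hat{\epsilon}$ the possibly guidance-corrected noise) is

$X_{t-1} = \sqrt{\overline{\alpha}_{t-1}}\,\hat{X}_0 + \sqrt{1-\overline{\alpha}_{t-1}-\sigma_t^2}\,\hat{\epsilon} + \sigma_t z, \qquad \sigma_t = \eta\,\sqrt{\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}}\,\sqrt{1-\frac{\overline{\alpha}_t}{\overline{\alpha}_{t-1}}}$

with $z \sim \mathcal{N}(0, I)$; $\eta = 0$ gives deterministic DDIM.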
def ddim_sample(
    self,
    model,
    x,
    t,
    clip_denoised=True,
    denoised_fn=None,
    cond_fn=None,
    model_kwargs=None,
    eta=0.0,
):
    """
    Sample x_{t-1} from the model using DDIM.

    Same usage as p_sample().
    """
    out = self.p_mean_variance(
        model,
        x,
        t,
        clip_denoised=clip_denoised,
        denoised_fn=denoised_fn,
        model_kwargs=model_kwargs,
    )
    if cond_fn is not None:
        out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)

    # Usually our model outputs epsilon, but we re-derive it
    # in case we used x_start or x_prev prediction.
    eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"])

    alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
    alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape)
    sigma = (
        eta
        * th.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar))
        * th.sqrt(1 - alpha_bar / alpha_bar_prev)
    )
    # Equation 12.
    noise = th.randn_like(x)
    mean_pred = (
        out["pred_xstart"] * th.sqrt(alpha_bar_prev)
        + th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps
    )
    nonzero_mask = (
        (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
    )  # no noise when t == 0
    sample = mean_pred + nonzero_mask * sigma * noise
    return {"sample": sample, "pred_xstart": out["pred_xstart"]}
As with p_sample, the only addition is this step:

if cond_fn is not None:
    out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs)
condition_score

def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
    """
    Compute what the p_mean_variance output would have been, should the
    model's score function be conditioned by cond_fn.

    See condition_mean() for details on cond_fn.

    Unlike condition_mean(), this instead uses the conditioning strategy
    from Song et al (2020).
    """
    alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)

    eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
    eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
        x, self._scale_timesteps(t), **model_kwargs
    )

    out = p_mean_var.copy()
    out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
    out["mean"], _, _ = self.q_posterior_mean_variance(
        x_start=out["pred_xstart"], x_t=x, t=t
    )
    return out
Here, alpha_bar is $\overline{\alpha}_t$:
alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape)
And eps is $\epsilon_{\theta}(X_t) - \sqrt{1-\overline{\alpha}_t}\,\nabla_{X_t} \log p_{\phi}(y \mid X_t)$, where cond_fn returns (the scaled) $\nabla_{X_t} \log p_{\phi}(y \mid X_t)$:
eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"])
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
    x, self._scale_timesteps(t), **model_kwargs
)
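This is the score-based conditioning from the paper (Song et al.'s strategy): since the noise prediction relates to the score via $\nabla_{X_t}\log p_{\theta}(X_t) = -\frac{1}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_{\theta}(X_t)$, adding the classifier score $\nabla_{X_t}\log p_{\phi}(y \mid X_t)$ to the unconditional score corresponds to the corrected noise prediction

$\hat{\epsilon}(X_t) = \epsilon_{\theta}(X_t) - \sqrt{1-\overline{\alpha}_t}\;\nabla_{X_t}\log p_{\phi}(y \mid X_t).$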
From here on, it is the same as the original DDIM formula.

However, looking at the code, the mean recomputed here is really the DDPM posterior mean

$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$

(note that ddim_sample itself only uses pred_xstart from this output; the recomputed mean is not part of the DDIM update above).
out = p_mean_var.copy()
out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps)
out["mean"], _, _ = self.q_posterior_mean_variance(
    x_start=out["pred_xstart"], x_t=x, t=t
)
The mean returned by q_posterior_mean_variance is computed like this:
posterior_mean = (
    _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
    + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
)
This matches $\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$:
posterior_mean_coef1 is $\frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}$,
and posterior_mean_coef2 is $\frac{\sqrt{\alpha_t}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}$:
self.posterior_mean_coef1 = (
    betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)
self.posterior_mean_coef2 = (
    (1.0 - self.alphas_cumprod_prev)
    * np.sqrt(alphas)
    / (1.0 - self.alphas_cumprod)
)