What are Diffusion Models?

Article link: https://blog.csdn.net/starzhou/article/details/136649246

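This post introduces diffusion models, which are inspired by non-equilibrium thermodynamics and generate data by adding noise in a forward diffusion process and removing it in a reverse diffusion process. It analyzes their connection to stochastic gradient Langevin dynamics and covers methods for speeding up sampling, conditioned generation, and scaling up resolution and quality. It notes that these models are both tractable and flexible, but that generating samples is slow.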

[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)].
[Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen.]
[Updated on 2022-08-31: Added latent diffusion model.]

So far, I’ve written about three types of generative models: GAN, VAE, and Flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own. GAN models are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAE relies on a surrogate loss. Flow models have to use specialized architectures to construct reversible transforms.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).

Fig. 1. Overview of different types of generative models.

What are Diffusion Models?

Several diffusion-based generative models have been proposed with similar ideas underneath, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score network (NCSN; Song & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al. 2020).

Forward diffusion process

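Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, let us define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$.

$$
q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad
q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})
$$

The data sample $\mathbf{x}_0$ gradually loses its distinguishable features as the step $t$ becomes larger. Eventually when $T \to \infty$, $\mathbf{x}_T$ is equivalent to an isotropic Gaussian distribution.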

Fig. 2. The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations)

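A nice property of the above process is that we can sample $\mathbf{x}_t$ at any arbitrary time step $t$ in a closed form using the reparameterization trick. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$:

$$
\begin{aligned}
\mathbf{x}_t
&= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{; where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}_{t-2} & \text{; where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*).} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \\
q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})
\end{aligned}
$$

(*) Recall that when we merge two Gaussians with different variance, $\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$, the new distribution is $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$. Here the merged standard deviation is $\sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} = \sqrt{1 - \alpha_t\alpha_{t-1}}$.

Usually, we can afford a larger update step when the sample gets noisier, so $\beta_1 < \beta_2 < \dots < \beta_T$ and therefore $\bar{\alpha}_1 > \dots > \bar{\alpha}_T$.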
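To make the closed-form property concrete, here is a minimal sketch in plain Python. The function names (`linear_betas`, `alpha_bars`, `q_sample`, `q_sample_stepwise`) and the toy 3-dimensional `x0` are illustrative assumptions, not part of any referenced codebase; the linear schedule endpoints are borrowed from the values quoted later in this post.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing variance schedule beta_1..beta_T (assumed values)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t; entry t-1 holds alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, abars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed. Returns (x_t, eps)."""
    ab = abars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

def q_sample_stepwise(x0, t, betas, rng):
    """Same marginal, obtained by iterating q(x_s | x_{s-1}) for s = 1..t."""
    x = list(x0)
    for beta in betas[:t]:
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0) for v in x]
    return x

rng = random.Random(0)
betas = linear_betas(1000)
abars = alpha_bars(betas)
x0 = [1.0, -0.5, 0.25]          # toy "data point"
xt, eps = q_sample(x0, 500, abars, rng)
```

The one-shot version is what makes training cheap: any time step can be sampled directly without walking the chain.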

Connection with stochastic gradient Langevin dynamics

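Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, stochastic gradient Langevin dynamics (Welling & Teh 2011) can produce samples from a probability density $p(\mathbf{x})$ using only the gradients $\nabla_\mathbf{x} \log p(\mathbf{x})$ in a Markov chain of updates:

$$
\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\delta}{2} \nabla_\mathbf{x} \log p(\mathbf{x}_{t-1}) + \sqrt{\delta}\,\boldsymbol{\epsilon}_t
,\quad\text{where } \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
$$

where $\delta$ is the step size. When $T \to \infty$ and $\delta \to 0$, the distribution of $\mathbf{x}_T$ converges to the true probability density $p(\mathbf{x})$.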

Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapses into local minima.
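As a rough illustration of the update rule, the sketch below runs Langevin dynamics on a 1D Gaussian target whose score is known analytically. The target parameters, step size, chain length and function names are all assumptions chosen for a quick demo; with a finite step size the chain is only approximately distributed as the target.

```python
import math
import random

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_chain(score_fn, x_init, step_size, n_steps, rng):
    """Run x <- x + (delta / 2) * score(x) + sqrt(delta) * eps for n_steps."""
    x = x_init
    for _ in range(n_steps):
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
mu, sigma = 2.0, 0.5            # assumed toy target
samples = [
    langevin_chain(lambda x: gaussian_score(x, mu, sigma), 0.0, 0.05, 300, rng)
    for _ in range(1000)
]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance should land close to `mu` and `sigma ** 2`, with a small bias that shrinks as the step size decreases.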

Reverse diffusion process

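If we can reverse the above process and sample from $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$, we will be able to recreate the true sample from a Gaussian noise input, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Note that if $\beta_t$ is small enough, $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ will also be Gaussian. Unfortunately, we cannot easily estimate $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ because it needs to use the entire dataset and therefore we need to learn a model $p_\theta$ to approximate these conditional probabilities in order to run the reverse diffusion process.

$$
p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad
p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))
$$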

Fig. 3. An example of training a diffusion model for modeling a 2D swiss roll data. (Image source: Sohl-Dickstein et al., 2015)

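It is noteworthy that the reverse conditional probability is tractable when conditioned on $\mathbf{x}_0$:

$$
q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})
$$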

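Using Bayes’ rule, we have:

$$
\begin{aligned}
q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)
&= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\
&\propto \exp \Big(-\frac{1}{2} \Big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \Big) \Big) \\
&= \exp \Big(-\frac{1}{2} \Big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \mathbf{x}_{t-1} + \alpha_t \mathbf{x}_{t-1}^2 }{\beta_t} + \frac{ \mathbf{x}_{t-1}^2 - 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 \mathbf{x}_{t-1} + \bar{\alpha}_{t-1} \mathbf{x}_0^2 }{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \Big) \Big) \\
&= \exp\Big( -\frac{1}{2} \Big( \Big(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\Big) \mathbf{x}_{t-1}^2 - \Big(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0\Big) \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0) \Big) \Big)
\end{aligned}
$$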

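where $C(\mathbf{x}_t, \mathbf{x}_0)$ is some function not involving $\mathbf{x}_{t-1}$ and details are omitted. Following the standard Gaussian density function, the mean and variance can be parameterized as follows (recall that $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$):

$$
\begin{aligned}
\tilde{\beta}_t
&= 1/\Big(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\Big)
= 1/\Big(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}\Big)
= \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\
\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0)
&= \Big(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0\Big)/\Big(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\Big) \\
&= \Big(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0\Big) \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\
&= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0
\end{aligned}
$$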

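Thanks to the nice property, we can represent $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t)$ and plug it into the above equation and obtain:

$$
\begin{aligned}
\tilde{\boldsymbol{\mu}}_t
&= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t) \\
&= \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)
\end{aligned}
$$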
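The two expressions for the posterior mean (one in terms of $\mathbf{x}_0$, one in terms of the noise) can be checked numerically. This is only a scalar sanity check under made-up values for $\alpha_t$, $\bar{\alpha}_{t-1}$, $\mathbf{x}_0$ and the noise; the function names are illustrative.

```python
import math

def posterior_params(x_t, x0, alpha_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0), written in terms of x_0."""
    beta_t = 1.0 - alpha_t
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    return coef_xt * x_t + coef_x0 * x0, var

def posterior_mean_from_eps(x_t, eps, alpha_t, abar_t):
    """Same mean, rewritten in terms of the noise eps that produced x_t."""
    return (x_t - (1.0 - alpha_t) / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

# Assumed toy values; abar_t must equal alpha_t * abar_prev by definition.
alpha_t, abar_prev = 0.98, 0.60
abar_t = alpha_t * abar_prev
x0, eps = 0.7, -1.3
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

mean_a, var = posterior_params(x_t, x0, alpha_t, abar_t, abar_prev)
mean_b = posterior_mean_from_eps(x_t, eps, alpha_t, abar_t)
```

Both means agree to floating-point precision, and the posterior variance $\tilde{\beta}_t$ comes out slightly smaller than $\beta_t$.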

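As demonstrated in Fig. 2, such a setup is very similar to VAE and thus we can use the variational lower bound to optimize the negative log-likelihood.

$$
\begin{aligned}
- \log p_\theta(\mathbf{x}_0)
&\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\vert\mathbf{x}_0) ) \\
&= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] \\
&= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\
&= \mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\
\text{Let }L_\text{VLB}
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \geq - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0)
\end{aligned}
$$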

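It is also straightforward to get the same result using Jensen’s inequality. Say we want to minimize the cross entropy as the learning objective,

$$
\begin{aligned}
L_\text{CE}
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \\
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int p_\theta(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T} \Big) \\
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \, d\mathbf{x}_{1:T} \Big) \\
&= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \mathbb{E}_{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \Big) \\
&\leq - \mathbb{E}_{q(\mathbf{x}_{0:T})} \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log \frac{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})}{p_\theta(\mathbf{x}_{0:T})} \Big] = L_\text{VLB}
\end{aligned}
$$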

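To convert each term in the equation to be analytically computable, the objective can be further rewritten to be a combination of several KL-divergence and entropy terms (See the detailed step-by-step process in Appendix B in Sohl-Dickstein et al., 2015):

$$
\begin{aligned}
L_\text{VLB}
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\
&= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\
&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\
&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\
&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\
&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\
&= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\
&= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\
&= \mathbb{E}_q \Big[ \underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} \Big]
\end{aligned}
$$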

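Let’s label each component in the variational lower bound loss separately:

$$
\begin{aligned}
L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\
\text{where } L_T &= D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)) \\
L_t &= D_\text{KL}(q(\mathbf{x}_t \vert \mathbf{x}_{t+1}, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_t \vert\mathbf{x}_{t+1})) \text{ for }1 \leq t \leq T-1 \\
L_0 &= - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)
\end{aligned}
$$

Every KL term in $L_\text{VLB}$ (except for $L_0$) compares two Gaussian distributions and therefore they can be computed in closed form. $L_T$ is constant and can be ignored during training because $q$ has no learnable parameters and $\mathbf{x}_T$ is a Gaussian noise. Ho et al. 2020 models $L_0$ using a separate discrete decoder derived from $\mathcal{N}(\mathbf{x}_0; \boldsymbol{\mu}_\theta(\mathbf{x}_1, 1), \boldsymbol{\Sigma}_\theta(\mathbf{x}_1, 1))$.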
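For reference, the closed-form KL divergence between two Gaussians that these terms rely on can be sketched as follows. This assumes diagonal covariances passed as plain lists, and the helper names are illustrative rather than taken from any paper's code.

```python
import math

def kl_gaussian(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for scalar Gaussians."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kl_diag_gaussian(mu1, var1, mu2, var2):
    """Sum of per-dimension KL terms for diagonal Gaussians given as lists."""
    return sum(kl_gaussian(a, b, c, d) for a, b, c, d in zip(mu1, var1, mu2, var2))

same = kl_gaussian(0.0, 1.0, 0.0, 1.0)        # identical distributions
shifted = kl_gaussian(0.0, 1.0, 1.0, 1.0)     # unit shift in the mean
multi = kl_diag_gaussian([0.0, 1.0], [1.0, 0.5], [0.5, 1.0], [1.0, 2.0])
```

When both variances are equal, the KL reduces to a scaled squared difference between the means, which is the form the training loss takes.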

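Parameterization of $L_t$ for Training Loss

Recall that we need to learn a neural network to approximate the conditioned probability distributions in the reverse diffusion process, $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$. We would like to train $\boldsymbol{\mu}_\theta$ to predict $\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)$. Because $\mathbf{x}_t$ is available as input at training time, we can reparameterize the Gaussian noise term instead to make it predict $\boldsymbol{\epsilon}_t$ from the input $\mathbf{x}_t$ at time step $t$:

$$
\begin{aligned}
\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big) \\
\text{Thus }\mathbf{x}_{t-1} &\sim \mathcal{N}\Big(\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\Big)
\end{aligned}
$$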

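The loss term $L_t$ is parameterized to minimize the difference from $\tilde{\boldsymbol{\mu}}$:

$$
\begin{aligned}
L_t
&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \| \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) \|^2_2} \| \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \|^2 \Big] \\
&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \|\boldsymbol{\Sigma}_\theta \|^2_2} \Big\| \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big) - \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big) \Big\|^2 \Big] \\
&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\
&= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big]
\end{aligned}
$$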

Simplification

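Empirically, Ho et al. (2020) found that training the diffusion model works better with a simplified objective that ignores the weighting term:

$$
\begin{aligned}
L_t^\text{simple}
&= \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\
&= \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big]
\end{aligned}
$$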

The final simple objective is:

$$
L_\text{simple} = L_t^\text{simple} + C
$$

where $C$ is a constant not depending on $\theta$.

Fig. 4. The training and sampling algorithms in DDPM (Image source: Ho et al. 2020)
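The simplified loss and the ancestral sampling loop can be sketched in plain Python as below. This is a toy under stated assumptions, not the DDPM reference code: there is no neural network or optimizer, `make_point_oracle` is a hypothetical stand-in for a trained $\boldsymbol{\epsilon}_\theta$ that is exact only when all data sit at a single point `c`, the reverse variance is fixed to $\sigma_t^2 = \beta_t$, and `T = 200` is chosen only to keep the demo fast.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus their cumulative alpha_bar products."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    abars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abars.append(prod)
    return betas, abars

def simple_loss(eps_model, x0, betas, abars, rng):
    """One Monte Carlo draw of L_simple: pick t, noise x0, compare the noises."""
    t = rng.randint(1, len(betas))                     # t ~ Uniform{1, ..., T}
    ab = abars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

def ddpm_sample(eps_model, dim, betas, abars, rng):
    """Ancestral sampling from x_T ~ N(0, I) down to x_0 with sigma_t^2 = beta_t."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):
        beta, ab = betas[t - 1], abars[t - 1]
        pred = eps_model(x, t)
        coef = beta / math.sqrt(1.0 - ab)
        mean = [(v - coef * p) / math.sqrt(1.0 - beta) for v, p in zip(x, pred)]
        sigma = math.sqrt(beta) if t > 1 else 0.0      # no noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

def make_point_oracle(c, abars):
    """Hypothetical perfect noise predictor for data concentrated at point c."""
    def eps_model(x_t, t):
        ab = abars[t - 1]
        return [(v - math.sqrt(ab) * ci) / math.sqrt(1.0 - ab) for v, ci in zip(x_t, c)]
    return eps_model

rng = random.Random(0)
betas, abars = make_schedule(200)
c = [0.5, -1.0]
oracle = make_point_oracle(c, abars)
loss = simple_loss(oracle, c, betas, abars, rng)
sample = ddpm_sample(oracle, len(c), betas, abars, rng)
```

With the oracle, the loss is zero up to rounding and the sampler lands on `c`; a real model would replace `eps_model` with a trained network and average the loss over a batch.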

Connection with noise-conditioned score networks (NCSN)

Song & Ermon (2019) proposed a score-based generative modeling method where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. The score of each sample $\mathbf{x}$’s density probability is defined as its gradient $\nabla_{\mathbf{x}} \log q(\mathbf{x})$. A score network $\mathbf{s}_\theta: \mathbb{R}^D \to \mathbb{R}^D$ is trained to estimate it, $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log q(\mathbf{x})$.

To make it scalable with high-dimensional data in the deep learning setting, they proposed to use either denoising score matching (Vincent, 2011) or sliced score matching (use random projections; Song et al., 2019). Denoising score matching adds a pre-specified small noise to the data $q(\tilde{\mathbf{x}} \vert \mathbf{x})$ and estimates $q(\tilde{\mathbf{x}})$ with score matching.

Recall that Langevin dynamics can sample data points from a probability density distribution using only the score $\nabla_{\mathbf{x}} \log q(\mathbf{x})$ in an iterative process.

However, according to the manifold hypothesis, most of the data is expected to concentrate in a low-dimensional manifold, even though the observed data may appear arbitrarily high-dimensional. This hurts score estimation because the data points cannot cover the whole space: in regions where data density is low, the score estimate is less reliable. After adding a small Gaussian noise to make the perturbed data distribution cover the full space $\mathbb{R}^D$, the training of the score estimator network becomes more stable. Song & Ermon (2019) improved it by perturbing the data with noise of different levels and training a noise-conditioned score network to jointly estimate the scores of all the perturbed data at different noise levels.

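The schedule of increasing noise levels resembles the forward diffusion process. If we use the diffusion process notation, the score approximates $\mathbf{s}_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$. Given a Gaussian distribution $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I})$, we can write the derivative of the logarithm of its density function as $\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \Big(-\frac{1}{2\sigma^2}(\mathbf{x} - \boldsymbol{\mu})^2 \Big) = - \frac{\mathbf{x} - \boldsymbol{\mu}}{\sigma^2} = - \frac{\boldsymbol{\epsilon}}{\sigma}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Recall that $q(\mathbf{x}_t \vert \mathbf{x}_0) \sim \mathcal{N}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$ and therefore,

$$
\mathbf{s}_\theta(\mathbf{x}_t, t)
\approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)
= \mathbb{E}_{q(\mathbf{x}_0)} [\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \vert \mathbf{x}_0)]
= \mathbb{E}_{q(\mathbf{x}_0)} \Big[ - \frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}} \Big]
= - \frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
$$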
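The link between the score and the noise can be verified numerically for the conditional Gaussian $q(\mathbf{x}_t \vert \mathbf{x}_0)$. The values below are arbitrary and the function names are illustrative assumptions.

```python
import math

def conditional_score(x_t, x0, abar_t):
    """grad_{x_t} log q(x_t | x_0) for q = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    return [-(v - math.sqrt(abar_t) * m) / (1.0 - abar_t) for v, m in zip(x_t, x0)]

def score_from_eps(eps, abar_t):
    """Score implied by a noise estimate: s = -eps / sqrt(1 - abar_t)."""
    return [-e / math.sqrt(1.0 - abar_t) for e in eps]

abar_t = 0.3
x0 = [0.2, -0.4]
eps = [1.1, -0.7]
x_t = [math.sqrt(abar_t) * m + math.sqrt(1.0 - abar_t) * e for m, e in zip(x0, eps)]

score_a = conditional_score(x_t, x0, abar_t)
score_b = score_from_eps(eps, abar_t)
```

Both routes give the same vector, which is why a noise-prediction network can double as a score estimator up to the factor $-1/\sqrt{1 - \bar{\alpha}_t}$.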

Parameterization of $\beta_t$

The forward variances are set to be a sequence of linearly increasing constants in Ho et al. (2020), from $\beta_1=10^{-4}$ to $\beta_T=0.02$. They are relatively small compared to the normalized image pixel values between $[-1, 1]$. Diffusion models in their experiments showed high-quality samples but still could not achieve a model log-likelihood competitive with other generative models.

Nichol & Dhariwal (2021) proposed several improvement techniques to help diffusion models to obtain lower NLL. One of the improvements is to use a cosine-based variance schedule. The choice of the scheduling function can be arbitrary, as long as it provides a near-linear drop in the middle of the training process and subtle changes around $t=0$ and $t=T$.

$$
\beta_t = \text{clip}\Big(1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}, 0.999\Big) \quad
\bar{\alpha}_t = \frac{f(t)}{f(0)} \quad
\text{where } f(t)=\cos\Big(\frac{t/T+s}{1+s}\cdot\frac{\pi}{2}\Big)^2
$$

where the small offset $s$ is to prevent $\beta_t$ from being too small when close to $t=0$.

Fig. 5. Comparison of linear and cosine-based scheduling of $\beta_t$ during training. (Image source: Nichol & Dhariwal, 2021)
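Both schedules are short enough to write out. The sketch below is an illustration under assumptions: the function names are made up, and the default offset `s=0.008` is the value I recall from the Nichol & Dhariwal paper rather than something stated in this post.

```python
import math

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas from beta_start to beta_end (inclusive)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """beta_t = clip(1 - abar_t / abar_{t-1}, max_beta) with abar_t = f(t) / f(0)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    abars = [f(t) / f(0) for t in range(T + 1)]        # abar_0 .. abar_T
    return [min(1.0 - abars[t] / abars[t - 1], max_beta) for t in range(1, T + 1)]

def cumulative_alpha_bar(betas):
    """Running product of (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

T = 1000
lin = linear_beta_schedule(T)
cos_betas = cosine_beta_schedule(T)
lin_abar = cumulative_alpha_bar(lin)
cos_abar = cumulative_alpha_bar(cos_betas)
```

Comparing `lin_abar` and `cos_abar` halfway through shows the cosine schedule keeping considerably more signal in the middle of the process, which matches the near-linear drop described above.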

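Parameterization of reverse process variance $\boldsymbol{\Sigma}_\theta$

Ho et al. (2020) chose to fix $\beta_t$ as constants instead of making them learnable and set $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma^2_t \mathbf{I}$, where $\sigma^2_t$ is not learned but set to $\beta_t$ or $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t$, because they found that learning a diagonal variance $\boldsymbol{\Sigma}_\theta$ leads to unstable training and poorer sample quality.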

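Nichol & Dhariwal (2021) proposed to learn $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ as an interpolation between $\beta_t$ and $\tilde{\beta}_t$ by having the model predict a mixing vector $\mathbf{v}$:

$$
\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \exp(\mathbf{v} \log \beta_t + (1-\mathbf{v}) \log \tilde{\beta}_t)
$$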
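A small sketch of this log-space interpolation, with the mixing vector supplied by hand instead of by a network. The numeric values and the function name are assumptions for illustration.

```python
import math

def interpolated_variance(v, beta_t, beta_tilde_t):
    """exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)), elementwise in v."""
    log_hi, log_lo = math.log(beta_t), math.log(beta_tilde_t)
    return [math.exp(vi * log_hi + (1.0 - vi) * log_lo) for vi in v]

beta_t, abar_prev = 0.02, 0.60                         # assumed toy values
abar_t = (1.0 - beta_t) * abar_prev
beta_tilde_t = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t

variances = interpolated_variance([0.0, 0.5, 1.0], beta_t, beta_tilde_t)
```

Here `v = 0` recovers $\tilde{\beta}_t$, `v = 1` recovers $\beta_t$, and `v = 0.5` gives their geometric mean.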

However, the simple objective $L_\text{simple}$ does not depend on $\boldsymbol{\Sigma}_\theta$. To add the dependency, they constructed a hybrid objective $L_\text{hybrid} = L_\text{simple} + \lambda L_\text{VLB}$ where $\lambda=0.001$ is small and stop gradient on $\boldsymbol{\mu}_\theta$ in the $L_\text{VLB}$ term such that $L_\text{VLB}$ only guides the learning of $\boldsymbol{\Sigma}_\theta$. Empirically they observed that $L_\text{VLB}$ is pretty challenging to optimize likely due to noisy gradients, so they proposed to use a time-averaging smoothed version of $L_\text{VLB}$ with importance sampling.

Fig. 6. Comparison of negative log-likelihood of improved DDPM with other likelihood-based generative models. NLL is reported in the unit of bits/dim. (Image source: Nichol & Dhariwal, 2021)

Speed up Diffusion Model Sampling

It is very slow to generate a sample from DDPM by following the Markov chain of the reverse diffusion process, as $T$ can be up to one or a few thousand steps. One data point from Song et al. 2020: “For example, it takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU.”

One simple way is to run a strided sampling schedule (Nichol & Dhariwal, 2021) by taking the sampling update every $\lceil T/S \rceil$ steps to reduce the process from $T$ to $S$ steps. The new sampling schedule for generation is $\{\tau_1, \dots, \tau_S\}$ where $\tau_1 < \tau_2 < \dots < \tau_S \in [1, T]$ and $S < T$.
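One possible way to build such a strided schedule is shown below. Starting the stride at step 1 is an assumption made for simplicity; other implementations may space the steps differently.

```python
import math

def strided_timesteps(T, S):
    """Pick every ceil(T / S)-th step from 1..T, giving at most S timesteps."""
    stride = math.ceil(T / S)
    return list(range(1, T + 1, stride))

taus = strided_timesteps(1000, 50)      # stride 20: 1, 21, ..., 981
```

At generation time the reverse process then visits `taus` in descending order instead of all $T$ steps.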

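For another approach, let’s rewrite $q_\sigma(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ to be parameterized by a desired standard deviation $\sigma_t$ according to the nice property:

$$
\begin{aligned}
\mathbf{x}_{t-1}
&= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \boldsymbol{\epsilon}_t + \sigma_t\boldsymbol{\epsilon} \\
&= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_t}} + \sigma_t\boldsymbol{\epsilon} \\
q_\sigma(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)
&= \mathcal{N}\Big(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_t}}, \sigma_t^2 \mathbf{I}\Big)
\end{aligned}
$$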

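Recall that in $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$, therefore we have:

$$
\tilde{\beta}_t = \sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t
$$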

Let $\sigma_t^2 = \eta \cdot \tilde{\beta}_t$ such that we can adjust $\eta \in \mathbb{R}^+$ as a hyperparameter to control the sampling stochasticity. The special case of $\eta = 0$ makes the sampling process deterministic. Such a model is named the denoising diffusion implicit model (DDIM; Song et al., 2020). DDIM has the same marginal noise distribution but deterministically maps noise back to the original data samples.

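During generation, we only sample a subset of $S$ diffusion steps $\{\tau_1, \dots, \tau_S\}$ and the inference process becomes:

$$
q_{\sigma, \tau}(\mathbf{x}_{\tau_{i-1}} \vert \mathbf{x}_{\tau_i}, \mathbf{x}_0)
= \mathcal{N}\Big(\mathbf{x}_{\tau_{i-1}}; \sqrt{\bar{\alpha}_{\tau_{i-1}}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{\tau_{i-1}} - \sigma_{\tau_i}^2}\, \frac{\mathbf{x}_{\tau_i} - \sqrt{\bar{\alpha}_{\tau_i}}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_{\tau_i}}}, \sigma_{\tau_i}^2 \mathbf{I}\Big)
$$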
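A single generalized reverse step can be sketched as follows. It follows the convention used in this post, $\sigma_t^2 = \eta \cdot \tilde{\beta}_t$, and assumes $\eta \in [0, 1]$ so the square root stays real. The function name and toy values are illustrative, and the "model output" here is simply the true noise.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, noise=None):
    """One reverse step from x_t to the previous (possibly strided) time step."""
    beta_t = 1.0 - abar_t / abar_prev
    sigma2 = eta * (1.0 - abar_prev) / (1.0 - abar_t) * beta_t   # eta * beta_tilde
    if noise is None:
        if eta > 0.0:
            raise ValueError("pass a noise vector when eta > 0")
        noise = [0.0] * len(x_t)
    out = []
    for v, e, z in zip(x_t, eps_pred, noise):
        x0_pred = (v - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        direction = math.sqrt(1.0 - abar_prev - sigma2) * e
        out.append(math.sqrt(abar_prev) * x0_pred + direction + math.sqrt(sigma2) * z)
    return out

# Toy check: build x_t from a known x0 and noise, then step back deterministically.
abar_t, abar_prev = 0.5, 0.8
x0, eps = [0.3, -0.2], [1.0, -0.5]
x_t = [math.sqrt(abar_t) * m + math.sqrt(1.0 - abar_t) * e for m, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0)
```

With `eta=0.0` and an exact noise estimate, the step returns $\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\boldsymbol{\epsilon}$, i.e. the same noise re-applied at the lower noise level.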

While all the models are trained with $T=1000$ diffusion steps in the experiments, they observed that DDIM ($\eta=0$) can produce the best quality samples when $S$ is small, while DDPM ($\eta=1$) performs much worse on small $S$. DDPM does perform better when we can afford to run the full reverse Markov diffusion steps ($S=T=1000$). With DDIM, it is possible to train the diffusion model up to any arbitrary number of forward steps but only sample from a subset of steps in the generative process.

Fig. 7. FID scores on CIFAR10 and CelebA datasets by diffusion models of different settings, including DDIM ($\eta=0$) and DDPM ($\hat{\sigma}$). (Image source: Song et al., 2020)

Compared to DDPM, DDIM is able to:

Generate higher-quality samples using far fewer steps.
Have “consistency” property since the generative process is deterministic, meaning that multiple samples conditioned on the same latent variable should have similar high-level features.
Because of the consistency, DDIM can do semantically meaningful interpolation in the latent variable.

Latent diffusion model (LDM; Rombach & Blattmann, et al. 2022) runs the diffusion process in the latent space instead of pixel space, making training cost lower and inference speed faster. It is motivated by the observation that most bits of an image contribute to perceptual details and the semantic and conceptual composition still remains after aggressive compression. LDM loosely decomposes the perceptual compression and semantic compression with generative modeling learning by first trimming off pixel-level redundancy with an autoencoder and then manipulating/generating semantic concepts with a diffusion process on the learned latent.

Fig. 8. The plot for tradeoff between compression rate and distortion, illustrating two-stage compressions - perceptual and semantic compression. (Image source: Rombach & Blattmann, et al. 2022)

The perceptual compression process relies on an autoencoder model. An encoder $\mathcal{E}$ is used to compress the input image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ to a smaller 2D latent vector $\mathbf{z} = \mathcal{E}(\mathbf{x}) \in \mathbb{R}^{h \times w \times c}$, where the downsampling rate $f=H/h=W/w=2^m, m \in \mathbb{N}$. Then a decoder $\mathcal{D}$ reconstructs the images from the latent vector, $\tilde{\mathbf{x}} = \mathcal{D}(\mathbf{z})$. The paper explored two types of regularization in autoencoder training to avoid arbitrarily high variance in the latent spaces.

KL-reg: A small KL penalty towards a standard normal distribution over the learned latent, similar to VAE.
VQ-reg: Uses a vector quantization layer within the decoder, like VQVAE but the quantization layer is absorbed by the decoder.

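The diffusion and denoising processes happen on the latent vector $\mathbf{z}$. The denoising model is a time-conditioned U-Net, augmented with the cross-attention mechanism to handle flexible conditioning information for image generation (e.g. class labels, semantic maps, blurred variants of an image). The design is equivalent to fusing representations of different modalities into the model with a cross-attention mechanism. Each type of conditioning information is paired with a domain-specific encoder $\tau_\theta$ to project the conditioning input $y$ to an intermediate representation that can be mapped into the cross-attention component, $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$:

$$
\begin{aligned}
&\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\Big) \cdot \mathbf{V} \\
&\text{where }\mathbf{Q} = \mathbf{W}^{(i)}_Q \cdot \varphi_i(\mathbf{z}_i),\;
\mathbf{K} = \mathbf{W}^{(i)}_K \cdot \tau_\theta(y),\;
\mathbf{V} = \mathbf{W}^{(i)}_V \cdot \tau_\theta(y) \\
&\text{and }
\mathbf{W}^{(i)}_Q \in \mathbb{R}^{d \times d^i_\epsilon},\;
\mathbf{W}^{(i)}_K, \mathbf{W}^{(i)}_V \in \mathbb{R}^{d \times d_\tau},\;
\varphi_i(\mathbf{z}_i) \in \mathbb{R}^{N \times d^i_\epsilon},\;
\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}
\end{aligned}
$$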

Fig. 9. The architecture of latent diffusion model. (Image source: Rombach & Blattmann, et al. 2022)
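The cross-attention computation can be written out with nested lists to make the shapes explicit. This is a single-head toy under assumptions: the helper names are made up, the projections are applied row-wise (features times the transposed weight matrix) to match the stated dimensions, and real implementations would use a tensor library.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(z_feats, cond_feats, w_q, w_k, w_v):
    """z_feats: N x d_eps; cond_feats: M x d_tau; w_q: d x d_eps; w_k, w_v: d x d_tau."""
    q = matmul(z_feats, transpose(w_q))        # N x d, queries from the latent
    k = matmul(cond_feats, transpose(w_k))     # M x d, keys from the conditioning
    v = matmul(cond_feats, transpose(w_v))     # M x d, values from the conditioning
    d = len(w_q)
    scores = matmul(q, transpose(k))           # N x M
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy shapes: N=2 latent positions, M=1 conditioning token, d=2.
z_feats = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
cond_feats = [[1.0, 2.0]]
w_q = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]
out, weights = cross_attention(z_feats, cond_feats, w_q, w_k, w_v)
```

With a single conditioning token every latent position attends to it with weight 1, so each output row equals that token's value vector.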

Conditioned Generation

While training generative models on images with conditioning information such as ImageNet dataset, it is common to generate samples conditioned on class labels or a piece of descriptive text.

Classifier Guided Diffusion

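To explicitly incorporate class information into the diffusion process, Dhariwal & Nichol (2021) trained a classifier $f_\phi(y \vert \mathbf{x}_t, t)$ on noisy image $\mathbf{x}_t$ and used gradients $\nabla_\mathbf{x} \log f_\phi(y \vert \mathbf{x}_t)$ to guide the diffusion sampling process toward the conditioning information $y$ (e.g. a target class label) by altering the noise prediction. Recall that $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) = - \frac{1}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ and we can write the score function for the joint distribution $q(\mathbf{x}_t, y)$ as follows,

$$
\begin{aligned}
\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t, y)
&= \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log q(y \vert \mathbf{x}_t) \\
&\approx - \frac{1}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t) \\
&= - \frac{1}{\sqrt{1 - \bar{\alpha}_t}} \big(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t)\big)
\end{aligned}
$$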

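Thus, a new classifier-guided predictor $\bar{\boldsymbol{\epsilon}}_\theta$ would take the form as follows,

$$
\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t)
$$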

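To control the strength of the classifier guidance, we can add a weight $w$ to the delta part,

$$
\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t}\; w \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t)
$$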
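In code, the guided noise prediction is a one-line adjustment once the classifier gradient is available. The sketch below takes that gradient as a plain list; in practice it would come from automatic differentiation through a noisy-image classifier, which is not shown. All names and numbers are illustrative assumptions.

```python
import math

def classifier_guided_eps(eps_pred, classifier_grad, abar_t, w=1.0):
    """eps_bar = eps - sqrt(1 - abar_t) * w * grad_x log f_phi(y | x_t)."""
    scale = math.sqrt(1.0 - abar_t) * w
    return [e - scale * g for e, g in zip(eps_pred, classifier_grad)]

eps_pred = [0.4, -0.1]
grad = [2.0, -1.0]          # placeholder for grad of log f_phi(y | x_t)
guided = classifier_guided_eps(eps_pred, grad, abar_t=0.75, w=2.0)
unguided = classifier_guided_eps(eps_pred, grad, abar_t=0.75, w=0.0)
```

Setting `w=0.0` returns the unmodified prediction, and larger `w` pushes the sample further along the classifier gradient.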

The resulting ablated diffusion model (ADM) and the one with additional classifier guidance (ADM-G) are able to achieve better results than SOTA generative models (e.g. BigGAN).

Fig. 10. The algorithms use guidance from a classifier to run conditioned generation with DDPM and DDIM. (Image source: Dhariwal & Nichol, 2021)

Additionally with some modifications on the U-Net architecture, Dhariwal & Nichol (2021) showed performance better than GAN with diffusion models. The architecture modifications include larger model depth/width, more attention heads, multi-resolution attention, BigGAN residual blocks for up/downsampling, residual connection rescale by $1/\sqrt{2}$ and adaptive group normalization (AdaGN).

Classifier-Free Guidance

Without an independent classifier $f_\phi$, it is still possible to run conditional diffusion steps by incorporating the scores from a conditional and an unconditional diffusion model (Ho & Salimans, 2021). Let the unconditional denoising diffusion model $p_\theta(\mathbf{x})$ be parameterized through a score estimator $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ and the conditional model $p_\theta(\mathbf{x} \vert y)$ be parameterized through $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$. These two models can be learned via a single neural network. Precisely, a conditional diffusion model $p_\theta(\mathbf{x} \vert y)$ is trained on paired data $(\mathbf{x}, y)$, where the conditioning information $y$ gets discarded periodically at random such that the model knows how to generate images unconditionally as well, i.e. $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y=\varnothing)$.

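The gradient of an implicit classifier can be represented with conditional and unconditional score estimators. Once plugged into the classifier-guided modified score, the score contains no dependency on a separate classifier.

$$
\begin{aligned}
\nabla_{\mathbf{x}_t} \log p(y \vert \mathbf{x}_t)
&= \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \vert y) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) \\
&= - \frac{1}{\sqrt{1 - \bar{\alpha}_t}}\Big( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big) \\
\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y)
&= \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \sqrt{1 - \bar{\alpha}_t}\; w \nabla_{\mathbf{x}_t} \log p(y \vert \mathbf{x}_t) \\
&= \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) + w \big(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \big) \\
&= (w+1) \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)
\end{aligned}
$$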
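The final line of the derivation translates directly into code. In this sketch the two noise predictions are given as plain lists; in a real sampler they would come from two forward passes of the same network, one with the label and one with the label dropped. The function name and numbers are illustrative.

```python
def classifier_free_eps(eps_cond, eps_uncond, w):
    """eps_bar = (w + 1) * eps(x_t, t, y) - w * eps(x_t, t)."""
    return [(w + 1.0) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, -0.5]       # stand-in for eps_theta(x_t, t, y)
eps_uncond = [0.5, 0.0]      # stand-in for eps_theta(x_t, t, y = empty)
guided = classifier_free_eps(eps_cond, eps_uncond, w=2.0)
plain = classifier_free_eps(eps_cond, eps_uncond, w=0.0)
```

With `w=0.0` this reduces to the conditional prediction; increasing `w` extrapolates away from the unconditional one.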

Their experiments showed that classifier-free guidance can achieve a good balance between FID (distinguishing between real and generated images) and IS (quality and diversity).

The guided diffusion model, GLIDE (Nichol, Dhariwal & Ramesh, et al. 2022), explored both guiding strategies, CLIP guidance and classifier-free guidance, and found that the latter is preferred. They hypothesized that it is because CLIP guidance exploits the model with adversarial examples towards the CLIP model, rather than optimizing for better-matched image generation.

Scale up Generation Resolution and Quality

To generate high-quality images at high resolution, Ho et al. (2021) proposed to use a pipeline of multiple diffusion models at increasing resolutions. Noise conditioning augmentation between pipeline models is crucial to the final image quality, which is to apply strong data augmentation to the conditioning input $\mathbf{z}$ of each super-resolution model $p_\theta(\mathbf{x} \vert \mathbf{z})$. The conditioning noise helps reduce compounding error in the pipeline setup. U-Net is a common choice of model architecture in diffusion modeling for high-resolution image generation.

Fig. 11. A cascaded pipeline of multiple diffusion models at increasing resolutions. (Image source: Ho et al. 2021)

They found the most effective noise is to apply Gaussian noise at low resolution and Gaussian blur at high resolution. In addition, they also explored two forms of conditioning augmentation that require small modification to the training process. Note that conditioning noise is only applied to training but not at inference.

Truncated conditioning augmentation stops the diffusion process early at step $t > 0$ for low resolution.
Non-truncated conditioning augmentation runs the full low resolution reverse process until step 0 but then corrupts it by $\mathbf{z}_t \sim q(\mathbf{x}_t \vert \mathbf{x}_0)$ and then feeds the corrupted $\mathbf{z}_t$ into the super-resolution model.

The two-stage diffusion model unCLIP (Ramesh et al. 2022) heavily utilizes the CLIP text encoder to produce text-guided images at high quality. Given a pretrained CLIP model $\mathbf{c}$ and paired training data for the diffusion model, $(\mathbf{x}, y)$, where $\mathbf{x}$ is an image and $y$ is the corresponding caption, we can compute the CLIP text and image embedding, $\mathbf{c}^t(y)$ and $\mathbf{c}^i(\mathbf{x})$, respectively. The unCLIP learns two models in parallel:

A prior model $P(\mathbf{c}^i \vert y)$: outputs CLIP image embedding $\mathbf{c}^i$ given the text $y$.
A decoder $P(\mathbf{x} \vert \mathbf{c}^i, [y])$: generates the image $\mathbf{x}$ given CLIP image embedding $\mathbf{c}^i$ and optionally the original text $y$.

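These two models enable conditional generation, because

$$
\underbrace{P(\mathbf{x} \vert y) = P(\mathbf{x}, \mathbf{c}^i \vert y)}_{\mathbf{c}^i\text{ is deterministic given }\mathbf{x}} = P(\mathbf{x} \vert \mathbf{c}^i, y)P(\mathbf{c}^i \vert y)
$$

Fig. 12. The architecture of unCLIP. (Image source: Ramesh et al. 2022)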

unCLIP follows a two-stage image generation process:

Given a text $y$, a CLIP model is first used to generate a text embedding $\mathbf{c}^t(y)$. Using CLIP latent space enables zero-shot image manipulation via text.
A diffusion or autoregressive prior $P(\mathbf{c}^i \vert y)$ processes this CLIP text embedding to construct an image prior and then a diffusion decoder $P(\mathbf{x} \vert \mathbf{c}^i, [y])$ generates an image, conditioned on the prior. This decoder can also generate image variations conditioned on an image input, preserving its style and semantics.

Instead of a CLIP model, Imagen (Saharia et al. 2022) uses a pre-trained large LM (i.e. a frozen T5-XXL text encoder) to encode text for image generation. There is a general trend that larger model size can lead to better image quality and text-image alignment. They found that T5-XXL and CLIP text encoder achieve similar performance on MS-COCO, but human evaluation prefers T5-XXL on DrawBench (a collection of prompts covering 11 categories).

When applying classifier-free guidance, increasing $w$ may lead to better image-text alignment but worse image fidelity. They attributed this to a train-test mismatch: because the training data $\mathbf{x}$ stays within the range $[-1, 1]$, the test data should as well. Two thresholding strategies are introduced:

Static thresholding: clip $\mathbf{x}$ prediction to $[-1, 1]$
Dynamic thresholding: at each sampling step, compute $s$ as a certain percentile absolute pixel value; if $s > 1$, clip the prediction to $[-s, s]$ and divide by $s$.
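The two thresholding strategies can be sketched on a flat list of predicted pixel values. A few choices here are assumptions: the percentile helper uses linear interpolation, the default `p=99.5` is a placeholder, and when the percentile is at most 1 the code falls back to `s = 1` (so it behaves like static clipping), which is how I recall the Imagen pseudocode handling that case.

```python
def static_threshold(x0_pred):
    """Clip every predicted pixel to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x0_pred]

def percentile_abs(values, p):
    """p-th percentile (0..100) of |values| using linear interpolation."""
    xs = sorted(abs(v) for v in values)
    if len(xs) == 1:
        return xs[0]
    pos = (len(xs) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def dynamic_threshold(x0_pred, p=99.5):
    """Clip to [-s, s] and rescale by s, where s = max(percentile, 1)."""
    s = max(percentile_abs(x0_pred, p), 1.0)
    return [max(-s, min(s, v)) / s for v in x0_pred]

static_out = static_threshold([-2, 0.5, 3])
dynamic_out = dynamic_threshold([0.5, -2.0, 4.0], p=100)
```

Static thresholding saturates the out-of-range pixels, while dynamic thresholding rescales the whole prediction so fewer pixels pile up at the boundary.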

Imagen modifies several designs in the U-Net to make it an Efficient U-Net:

Shift model parameters from high resolution blocks to low resolution by adding more residual blocks for the lower resolutions;
Scale the skip connections by $1/\sqrt{2}$;
Reverse the order of downsampling (move it before convolutions) and upsampling operations (move it after convolution) in order to improve the speed of forward pass.

They found that noise conditioning augmentation, dynamic thresholding and efficient U-Net are critical for image quality, but scaling text encoder size is more important than U-Net size.

Quick Summary

Pros: Tractability and flexibility are two conflicting objectives in generative modeling. Tractable models can be analytically evaluated and cheaply fit data (e.g. via a Gaussian or Laplace), but they cannot easily describe the structure in rich datasets. Flexible models can fit arbitrary structures in data, but evaluating, training, or sampling from these models is usually expensive. Diffusion models are both analytically tractable and flexible.
Cons: Diffusion models rely on a long Markov chain of diffusion steps to generate samples, so it can be quite expensive in terms of time and compute. New methods have been proposed to make the process much faster, but the sampling is still slower than GAN.