Date: July 11, 2021 | Estimated Reading Time: 25 min | Author: Lilian Weng
[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)].
[Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen.]
[Updated on 2022-08-31: Added latent diffusion model.]
So far, I’ve written about three types of generative models: GAN, VAE, and flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own. GAN models are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAE relies on a surrogate loss. Flow models have to use specialized architectures to construct reversible transforms.
Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
Fig. 1. Overview of different types of generative models.
What are Diffusion Models?
Several diffusion-based generative models have been proposed with similar ideas underneath, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score networks (NCSN; Song & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al. 2020).
Forward diffusion process
Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, let us define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$.
$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad
q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$
The data sample $\mathbf{x}_0$ gradually loses its distinguishable features as the step $t$ becomes larger. Eventually when $T \to \infty$, $\mathbf{x}_T$ is equivalent to an isotropic Gaussian distribution.
Fig. 2. The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations)
A nice property of the above process is that we can sample $\mathbf{x}_t$ at any arbitrary time step $t$ in a closed form using the reparameterization trick. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$:
$$\begin{aligned}
\mathbf{x}_t
&= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} & \text{;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*).} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \\
q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})
\end{aligned}$$
(*) Recall that when we merge two Gaussians with different variance, $\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$, the new distribution is $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$. Here the merged standard deviation is $\sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} = \sqrt{1 - \alpha_t\alpha_{t-1}}$.
Usually, we can afford a larger update step when the sample gets noisier, so $\beta_1 < \beta_2 < \dots < \beta_T$ and therefore $\bar{\alpha}_1 > \dots > \bar{\alpha}_T$.
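To make the closed-form sampling concrete, here is a minimal NumPy sketch. The schedule length `T`, the linear `betas` range, and the helper name `q_sample` are illustrative choices rather than anything prescribed by the papers above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # increasing variance schedule: beta_1 < ... < beta_T
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{i<=t} alpha_i (decreasing)

def q_sample(x0, t, eps=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    if eps is None:
        eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Example: noising a toy data point; by t = T-1 it is close to an isotropic Gaussian sample.
x0 = np.random.randn(3)
x_mid, x_last = q_sample(x0, T // 2), q_sample(x0, T - 1)
```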
Connection with stochastic gradient Langevin dynamics
Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, stochastic gradient Langevin dynamics (Welling & Teh 2011) can produce samples from a probability density $p(\mathbf{x})$ using only the gradients $\nabla_\mathbf{x} \log p(\mathbf{x})$ in a Markov chain of updates:
$$\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\delta}{2} \nabla_\mathbf{x} \log p(\mathbf{x}_{t-1}) + \sqrt{\delta} \boldsymbol{\epsilon}_t, \quad\text{where } \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
where $\delta$ is the step size. When $T \to \infty, \delta \to 0$, $\mathbf{x}_T$ converges to a sample from the true probability density $p(\mathbf{x})$.
Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
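As a small illustration, the Langevin update can be run directly whenever the score $\nabla_\mathbf{x} \log p(\mathbf{x})$ is available in closed form. The sketch below (step size and step count are arbitrary choices) samples from a standard Gaussian, whose score is simply $-\mathbf{x}$:

```python
import numpy as np

def langevin_sampling(score_fn, x_init, step_size=0.01, n_steps=1000):
    """x_t = x_{t-1} + step/2 * grad_x log p(x_{t-1}) + sqrt(step) * epsilon_t."""
    x = x_init
    for _ in range(n_steps):
        eps = np.random.randn(*x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * eps
    return x

# For p(x) = N(0, I) the score is -x; chains started far from the mode still converge.
x0 = np.random.uniform(-4.0, 4.0, size=(5000, 2))
samples = langevin_sampling(lambda x: -x, x0)   # approximately standard-normal samples
```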
Reverse diffusion process
If we can reverse the above process and sample from $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$, we will be able to recreate the true sample from a Gaussian noise input, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Note that if $\beta_t$ is small enough, $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ will also be Gaussian. Unfortunately, we cannot easily estimate $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ because it needs to use the entire dataset, and therefore we need to learn a model $p_\theta$ to approximate these conditional probabilities in order to run the reverse diffusion process.
$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad
p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$$
Fig. 3. An example of training a diffusion model for modeling 2D Swiss roll data. (Image source: Sohl-Dickstein et al., 2015)
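A minimal sketch of how the learned reverse chain would be used at sampling time, assuming a trained network `model(x_t, t)` that returns the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$, and taking the variance fixed to $\sigma_t^2 = \beta_t$ (a common simplification in DDPM); the function name and shapes are illustrative:

```python
import numpy as np

def p_sample_loop(model, shape, betas):
    """Run the reverse chain p_theta(x_{t-1} | x_t) from x_T ~ N(0, I) down to x_0."""
    T = len(betas)
    x = np.random.randn(*shape)              # start from pure Gaussian noise
    for t in reversed(range(T)):
        mu = model(x, t)                     # predicted mean mu_theta(x_t, t)
        if t > 0:
            x = mu + np.sqrt(betas[t]) * np.random.randn(*shape)  # fixed sigma_t^2 = beta_t
        else:
            x = mu                           # no noise is added at the final step
    return x
```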
It is noteworthy that the reverse conditional probability is tractable when conditioned on $\mathbf{x}_0$:
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$
Using Bayes’ rule, we have:
$$\begin{aligned}
q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)
&= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\
&\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\
&= \exp \Big(-\frac{1}{2} \big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \mathbf{x}_{t-1} + \alpha_t \mathbf{x}_{t-1}^2}{\beta_t} + \frac{ \mathbf{x}_{t-1}^2 - 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 \mathbf{x}_{t-1} + \bar{\alpha}_{t-1} \mathbf{x}_0^2 }{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\
&= \exp\Big( -\frac{1}{2} \big( (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \mathbf{x}_{t-1}^2 - (\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big)
\end{aligned}$$
where $C(\mathbf{x}_t, \mathbf{x}_0)$ is some function not involving $\mathbf{x}_{t-1}$ and details are omitted. Following the standard Gaussian density function, the mean and variance can be parameterized as follows (recall that $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$):
$$\begin{aligned}
\tilde{\beta}_t
&= 1/\big(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\big)
= 1/\big(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}\big)
= \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\
\tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0)
&= \big(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0\big) / \big(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\big) \\
&= \big(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0\big) \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\
&= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0
\end{aligned}$$
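As a sanity check on the algebra, $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$ can be computed numerically; a small sketch, assuming 0-indexed schedule arrays `betas`, `alphas`, and `alpha_bars` defined as in the earlier snippet:

```python
import numpy as np

def q_posterior(x_t, x0, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0), with 0-indexed schedule arrays."""
    alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]             # tilde{beta}_t
    mean = (np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t]) * x_t
            + np.sqrt(alpha_bar_prev) * betas[t] / (1.0 - alpha_bars[t]) * x0)  # tilde{mu}_t
    return mean, var
```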
Thanks to the