Stable Diffusion is a latent diffusion model that generates AI images from text. Instead of operating in the high-dimensional image space, it first compresses the image into the latent space.
We will dig deep into understanding how it works under the hood.
Why do you need to know how it works? Apart from being a fascinating subject in its own right, some understanding of the inner mechanics will make you a better artist: you will be able to use the tool correctly and achieve results with higher precision.
How does text-to-image differ from image-to-image? What’s the CFG scale? What’s denoising strength? You will find the answer in this article.
Let’s dive in.
What can Stable Diffusion do?
In the simplest form, Stable Diffusion is a text-to-image model. Give it a text prompt. It will return an AI image matching the text.
Stable diffusion turns text prompts into images.
Diffusion model
Stable Diffusion belongs to a class of deep learning models called diffusion models. They are generative models, meaning they are designed to generate new data similar to what they have seen in training. In the case of Stable Diffusion, the data are images.
Why is it called a diffusion model? Because its math looks very much like diffusion in physics. Let’s go through the idea.
Let’s say I trained a diffusion model with only two kinds of images: cats and dogs. In the figure below, the two peaks on the left represent the groups of cat and dog images.
Forward diffusion
Forward diffusion turns a photo into noise. (Figure modified from this article)
A forward diffusion process adds noise to a training image, gradually turning it into unrecognizable noise. The forward process will turn any cat or dog image into a noise image. Eventually, you won’t be able to tell whether it was initially a dog or a cat. (This is important.)
It’s like a drop of ink falling into a glass of water. The ink drop diffuses through the water. After a few minutes, it is randomly distributed throughout the water. You can no longer tell whether it initially fell at the center or near the rim.
Below is an example of an image undergoing forward diffusion. The cat image turns to random noise.
Forward diffusion of a cat image.
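If you like code, here is a minimal sketch of what one jump of forward diffusion looks like in a DDPM-style model. The noise schedule values and the random "cat" tensor are purely illustrative, not the actual Stable Diffusion schedule.

```python
import torch

# Illustrative DDPM-style noise schedule (not the actual SD schedule).
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)        # noise added per step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal kept

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Jump straight to step t: scale the image down and mix in Gaussian noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

cat = torch.rand(3, 512, 512)        # stand-in for a cat photo, values in [0, 1]
slightly_noisy = add_noise(cat, 50)   # still recognizable
pure_noise = add_noise(cat, 999)      # indistinguishable from random noise
```

The larger the step number, the less of the original image survives in the mix, which is exactly the behavior shown in the figure above.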
Reverse diffusion
Now comes the exciting part. What if we can reverse the diffusion? Like playing a video backward. Going backward in time. We will see where the ink drop was initially added.
The reverse diffusion process recovers an image.
Starting from a noisy, meaningless image, reverse diffusion recovers a cat OR a dog image. This is the main idea.
Technically, every diffusion process has two parts: (1) drift and (2) random motion. The reverse diffusion drifts towards either cat OR dog images but nothing in between. That’s why the result can either be a cat or a dog.
How training is done
The idea of reverse diffusion is undoubtedly clever and elegant. But the million-dollar question is, “How can it be done?”
To reverse the diffusion, we need to know how much noise was added to an image. The answer is to teach a neural network model to predict the noise added. It is called the noise predictor in Stable Diffusion, and it is a U-Net model. The training goes as follows.
- Pick a training image, like a photo of a cat.
- Generate a random noise image.
- Corrupt the training image by adding this noise, repeated up to a certain number of steps.
- Teach the noise predictor to tell us how much noise was added. This is done by tuning its weights and showing it the correct answer.
Noise is sequentially added at each step. The noise predictor estimates the total noise added up to each step.
After training, we have a noise predictor capable of estimating the noise added to an image.
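Here is a hedged sketch of that training loop in PyTorch. `noise_predictor` stands in for the U-Net, and `dataloader`, the optimizer settings, and the schedule are placeholders; this is the general recipe, not the actual Stable Diffusion training code.

```python
import torch
import torch.nn.functional as F

# Placeholder noise schedule, same idea as the forward-diffusion sketch above.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

optimizer = torch.optim.Adam(noise_predictor.parameters(), lr=1e-4)

for images in dataloader:                       # 1. pick training images
    noise = torch.randn_like(images)            # 2. generate random noise
    t = torch.randint(0, num_steps, (images.shape[0],))
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise   # 3. corrupt

    pred_noise = noise_predictor(noisy, t)      # 4. predict the added noise...
    loss = F.mse_loss(pred_noise, noise)        # ...and compare to the true answer

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The "correct answer" the model is shown is simply the noise we ourselves added, which is why the loss compares the prediction against it directly.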
Reverse diffusion
Now we have the noise predictor. How do we use it?
We first generate a completely random image and ask the noise predictor to tell us the noise. We then subtract this estimated noise from the image. Repeat this process a few times, and you will get an image of either a cat or a dog.
Reverse diffusion works by subtracting the predicted noise from the image successively.
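Below is a hedged sketch of that loop, using the simplest possible DDPM-style update rule. Real samplers (Euler, DPM++, etc.) use more refined update rules, and `noise_predictor` is the same placeholder model as in the training sketch.

```python
import torch

# Same illustrative schedule as before.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 3, 512, 512)   # start from pure random noise
for t in reversed(range(num_steps)):
    pred_noise = noise_predictor(x, torch.tensor([t]))
    # Subtract the predicted noise contribution for this step.
    x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * pred_noise) / alphas[t].sqrt()
    if t > 0:
        # Add back a little fresh noise, except at the very last step.
        x = x + betas[t].sqrt() * torch.randn_like(x)
# x is now a generated image of either a cat or a dog
```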
You may notice we have no control over whether we get a cat or a dog image. We will address this when we talk about conditioning. For now, image generation is unconditioned.
You can read more about reverse diffusion sampling and samplers in this article.
Stable Diffusion model
Now I need to tell you some bad news: what we just talked about is NOT how Stable Diffusion works! The reason is that the above diffusion process is in image space. It is computationally very, very slow. You won’t be able to run it on any single GPU, let alone the crappy GPU on your laptop.
The image space is enormous. Think about it: a 512×512 image with three color channels (red, green, and blue) is a 786,432-dimensional space! (You need to specify that many values for ONE image.)
Diffusion models like Google’s Imagen and OpenAI’s DALL-E operate in pixel space. They have used some tricks to make the models faster, but it’s still not enough.
Latent diffusion model
Stable Diffusion is designed to solve the speed problem. Here’s how.
Stable Diffusion is a latent diffusion model. Instead of operating in the high-dimensional image space, it first compresses the image into the latent space. The latent space is 48 times smaller, so it reaps the benefit of crunching a lot fewer numbers. That’s why it’s a lot faster.
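Here is where the 48× factor comes from, assuming Stable Diffusion’s usual latent shape of 4 channels at 64×64 for a 512×512 image:

```python
# Pixel space vs. latent space for a 512x512 image
# (assuming SD's 4-channel, 8x-downsampled latent of shape 4x64x64).
pixel_dims  = 512 * 512 * 3      # 786,432 values per image
latent_dims = 64 * 64 * 4        # 16,384 values per image
print(pixel_dims / latent_dims)  # 48.0 -- the "48 times smaller" factor
```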
Variational Autoencoder
It is done using a technique called the variational autoencoder (VAE).