Paper Close Reading -- DALL·E 2

DALL·E 2 uses features from a pretrained CLIP model for hierarchical, text-conditioned image generation. "Hierarchical" means a low-resolution image is generated first and then repeatedly upsampled by further models into a high-resolution image.

CLIP turns the input text into a text embedding; DALL·E 2 then trains a prior model whose input is the text embedding and whose output is an image embedding, and finally feeds the image embedding to a decoder to obtain the image.

DALLE2 = CLIP + GLIDE

Abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

Summary:

Prior: the text embedding from CLIP is turned into an image embedding by an autoregressive model or a diffusion model (the latter works better).

Decoder: a diffusion model generates the image from the image embedding.

GANs give high quality but not enough diversity, which is why diffusion is the better choice here.

Introduction

Recent progress in computer vision has been driven by scaling models on large datasets of captioned images collected from the internet [10, 44, 60, 39, 31, 16]. Within this framework, CLIP [39] has emerged as a successful representation learner for images. CLIP embeddings have a number of desirable properties: they are robust to image distribution shift, have impressive zero-shot capabilities, and have been fine-tuned to achieve state-of-the-art results on a wide variety of vision and language tasks [45]. Concurrently, diffusion models [46, 48, 25] have emerged as a promising generative modeling framework, pushing the state-of-the-art on image and video generation tasks [11, 26, 24]. To achieve best results, diffusion models leverage a guidance technique [11, 24] which improves sample fidelity (for images, photorealism) at the cost of sample diversity.

Summary:

The features learned by CLIP are very good.

Diffusion models get their diversity from sampling the data distribution, but their realism used to lag behind GANs, whose training objective explicitly pushes samples toward being indistinguishable from real images. With DDPM and the line of work that followed, the most useful trick has been the guidance technique: by trading away some diversity for fidelity, diffusion model quality has become much better.

Classifier-guided diffusion:

Here f is the U-Net that performs the diffusion denoising step. The classifier is trained on noised ImageNet images (since the images being denoised still contain noise). Its gradient is used to help the model sample and generate images: the gradient implicitly encodes information such as whether an object is present and how realistic it looks, effectively telling the U-Net that the current sample should look more like a particular class of object.
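
A minimal sketch of how the classifier gradient can steer one reverse-diffusion step, in the spirit of classifier guidance; `unet`, `classifier`, and the mean/variance handling are placeholder assumptions rather than the paper's exact sampler:

```python
import torch

def classifier_guided_step_mean(unet, classifier, x_t, t, y, guidance_scale=1.0):
    # Predicted Gaussian parameters for p(x_{t-1} | x_t); in a real sampler these
    # come from the noise schedule and the U-Net's epsilon prediction.
    mean, var = unet(x_t, t)
    # Gradient of log p(y | x_t) from a classifier trained on noised images.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    # Shift the mean so the sample looks more like class y.
    return mean + guidance_scale * var * grad
```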

Classifier-free guidance:

Without a classifier, can we find a guidance signal that makes the model generate better samples?

Take the output with the condition minus the output without it. For example, in text-to-image generation the condition y is the text; during training y is randomly dropped, replacing the text with an empty sequence, which yields another (unconditional) prediction. The difference between the two gives a direction that moves an unconditionally generated image toward the conditionally generated one.
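
A one-function sketch of classifier-free guidance as just described; `unet` and its `cond` argument are assumptions, and the key line is the extrapolation from the unconditional prediction toward the conditional one:

```python
def cfg_prediction(unet, x_t, t, text_cond, guidance_scale=3.0):
    # Run the same network once with the condition and once with it dropped
    # (an empty sequence / null condition), then extrapolate along the
    # direction that turns the unconditional output into the conditional one.
    eps_cond = unet(x_t, t, cond=text_cond)
    eps_uncond = unet(x_t, t, cond=None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```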

In this work, we combine these two approaches for the problem of text-conditional image generation. We first train a diffusion decoder to invert the CLIP image encoder. Our inverter is non-deterministic, and can produce multiple images corresponding to a given image embedding. The presence of an encoder and its approximate inverse (the decoder) allows for capabilities beyond text-to-image translation. As in GAN inversion [62, 55], encoding and decoding an input image produces semantically similar output images (Figure 3). We can also interpolate between input images by inverting interpolations of their image embeddings (Figure 4). However, one notable advantage of using the CLIP latent space is the ability to semantically modify images by moving in the direction of any encoded text vector (Figure 5), whereas discovering these directions in GAN latent space involves luck and diligent manual examination. Furthermore, encoding and decoding images also provides us with a tool for observing which features of the image are recognized or disregarded by CLIP.

To obtain a full generative model of images, we combine the CLIP image embedding decoder with a prior model, which generates possible CLIP image embeddings from a given text caption. We compare our text-to-image system with other systems such as DALL-E [40] and GLIDE [35], finding that our samples are comparable in quality to GLIDE, but with greater diversity in our generations. We also develop methods for training diffusion priors in latent space, and show that they achieve comparable performance to autoregressive priors, while being more compute-efficient. We refer to our full text-conditional image generation stack as unCLIP, since it generates images by inverting the CLIP image encoder.

In the paper's overview figure, the top half is CLIP and the bottom half is DALL·E 2.

One could also take features straight from the text encoder, learn some fused features with a large intermediate model, and generate images from those; this works, but not as well as explicit image-representation generation, i.e. converting the text embedding into an image embedding and then generating from it.

CLIP itself has learned from text-image pairs, so the output of the CLIP image encoder on the input image is used as ground truth to supervise DALL·E 2's image embedding: the text embedding is trained to predict the ground-truth image embedding, and this is how the prior is learned.

unCLIP is DALL·E 2: it goes from an embedding back to an image, the reverse of CLIP, hence the "un".
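
A small sketch of where the prior's supervision comes from, assuming a frozen CLIP model with hypothetical `encode_image` / `encode_text` methods; the plain MSE here is only an illustration of the target, not the actual AR/diffusion prior objective described later:

```python
import torch
import torch.nn.functional as F

def prior_supervision(clip_model, prior, image, caption):
    # The frozen CLIP image encoder supplies the ground-truth image embedding;
    # the prior learns to predict it from the caption / CLIP text embedding.
    with torch.no_grad():
        z_i_gt = clip_model.encode_image(image)   # ground-truth zi
        z_t = clip_model.encode_text(caption)     # conditioning zt
    z_i_pred = prior(caption, z_t)                # hypothetical prior forward pass
    return F.mse_loss(z_i_pred, z_i_gt)           # simplified stand-in loss
```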

Method

Our training dataset consists of pairs (x, y) of images x and their corresponding captions y. Given an image x, let zi and zt be its CLIP image and text embeddings, respectively. We design our generative stack to produce images from captions using two components:

• A prior P(zi|y) that produces CLIP image embeddings zi conditioned on captions y.

• A decoder P(x|zi,y) that produces images x conditioned on CLIP image embeddings zi (and optionally text captions y).

The decoder allows us to invert images given their CLIP image embeddings, while the prior allows us to learn a generative model of the image embeddings themselves. Stacking these two components yields a generative model P(x|y) of images x given captions y:

P(x|y) = P(x, zi|y) = P(x|zi, y) P(zi|y)

The first equality holds because zi is a deterministic function of x. The second equality holds because of the chain rule. Thus, we can sample from the true conditional distribution P(x|y) by first sampling zi using the prior, and then sampling x using the decoder. In the following sections, we describe our decoder and prior stacks. For training details and hyperparameters, refer to Appendix C.

Summary:

CLIP's encoders are frozen, so zi is deterministic given x. In the factorization above, the right-hand factor P(zi|y) is the prior and the left-hand factor P(x|zi, y) is the decoder.
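
A minimal sketch of the two-stage sampling this factorization implies; `prior.sample`, `decoder.sample`, and `clip_text_encoder` are illustrative names, not the authors' API:

```python
def sample_unclip(prior, decoder, clip_text_encoder, caption):
    # Sample from P(x|y) in two stages.
    z_t = clip_text_encoder(caption)               # CLIP text embedding (deterministic given y)
    z_i = prior.sample(caption=caption, z_t=z_t)   # stage 1: zi ~ P(zi | y)
    x = decoder.sample(z_i=z_i, caption=caption)   # stage 2: x ~ P(x | zi, y)
    return x
```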

Decoder

We use diffusion models [25, 48] to produce images conditioned on CLIP image embeddings (and optionally text captions). Specifically, we modify the architecture described in Nichol et al. (2021) by projecting and adding CLIP embeddings to the existing timestep embedding, and by projecting CLIP embeddings into four extra tokens of context that are concatenated to the sequence of outputs from the GLIDE text encoder. We retained the text conditioning pathway present in the original GLIDE model, hypothesizing that it could allow the diffusion model to learn aspects of natural language that CLIP fails to capture (e.g. variable binding), but find that it offers little help in this regard (Section 7).
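
A rough sketch of the two injection paths described in this paragraph: projecting the CLIP image embedding into the timestep embedding, and projecting it into four extra context tokens appended to the GLIDE text-encoder outputs. Dimensions and module names are assumptions:

```python
import torch
import torch.nn as nn

class CLIPImageConditioning(nn.Module):
    def __init__(self, clip_dim=768, time_dim=1024, ctx_dim=1024, n_ctx_tokens=4):
        super().__init__()
        self.to_time = nn.Linear(clip_dim, time_dim)               # added to the timestep embedding
        self.to_ctx = nn.Linear(clip_dim, n_ctx_tokens * ctx_dim)  # four extra context tokens
        self.n_ctx_tokens, self.ctx_dim = n_ctx_tokens, ctx_dim

    def forward(self, z_i, t_emb, text_ctx):
        # z_i: (B, clip_dim), t_emb: (B, time_dim), text_ctx: (B, L, ctx_dim)
        t_emb = t_emb + self.to_time(z_i)
        extra = self.to_ctx(z_i).view(-1, self.n_ctx_tokens, self.ctx_dim)
        return t_emb, torch.cat([text_ctx, extra], dim=1)          # extended context sequence
```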

While we can sample from the conditional distribution of the decoder directly, past work using diffusion models shows that using guidance on the conditioning information [11, 24, 35] improves sample quality a lot. We enable classifier-free guidance [24] by randomly setting the CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training.
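
A sketch of the training-time conditioning dropout that enables classifier-free guidance; `null_text_tokens` (a stand-in for the empty sequence) and the zeroing choice are assumptions consistent with the description above:

```python
import torch

def drop_conditioning(z_i, text_tokens, null_text_tokens,
                      p_drop_clip=0.1, p_drop_text=0.5):
    # Randomly remove each conditioning signal so the decoder also learns an
    # unconditional model that guidance can extrapolate away from.
    if torch.rand(()) < p_drop_clip:
        z_i = torch.zeros_like(z_i)       # the paper alternatively uses a learned null embedding
    if torch.rand(()) < p_drop_text:
        text_tokens = null_text_tokens    # drop the caption
    return z_i, text_tokens
```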

To generate high resolution images, we train two diffusion upsampler models [34, 43]: one to upsample images from 64×64 to 256×256 resolution, and another to further upsample those to 1024×1024 resolution.

To improve the robustness of our upsamplers, we slightly corrupt the conditioning images during training. For the first upsampling stage, we use gaussian blur [43], and for the second, we use a more diverse BSR degradation [42, 59]. To reduce training compute and improve numerical stability, we follow Rombach et al. [42] and train on random crops of images that are one-fourth the target size. We use only spatial convolutions in the model (i.e., no attention layers) and at inference time directly apply the model at the target resolution, observing that it readily generalizes to the higher resolution. We found no benefit from conditioning the upsamplers on the caption, and use unconditional ADMNets [11] with no guidance.
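
A toy sketch of the conditioning-image corruption used to make the upsamplers robust; the blur range is an assumption, and the second stage's BSR degradation is replaced here by simple noise as a stand-in:

```python
import torch
import torchvision.transforms.functional as TF

def corrupt_conditioning_image(low_res, stage):
    if stage == 1:                                          # 64x64 -> 256x256 upsampler
        sigma = float(torch.empty(1).uniform_(0.4, 0.6))    # assumed blur-strength range
        return TF.gaussian_blur(low_res, kernel_size=3, sigma=sigma)
    else:                                                   # 256x256 -> 1024x1024 upsampler
        noisy = low_res + 0.05 * torch.randn_like(low_res)  # stand-in for the BSR degradation
        return noisy.clamp(0.0, 1.0)
```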

Summary:

The decoder is a GLIDE variant with CLIP guidance added.

Besides the CLIP conditioning, classifier-free guidance is used: the CLIP embedding is dropped 10% of the time and the text caption 50% of the time during training.

Generation is hierarchical: the image is upsampled stage by stage up to 1024×1024 (64×64 → 256×256 → 1024×1024). The conditioning images are corrupted with noise during training. Because only convolutions are used and there are no attention layers, there is no fixed-sequence-length constraint, so the upsamplers can be applied directly at higher resolutions.

Prior

While a decoder can invert CLIP image embeddings zi to produce images x, we need a prior model that produces zi from captions y to enable image generations from text captions. We explore two different model classes for the prior model: 

• Autoregressive (AR) prior: the CLIP image embedding zi is converted into a sequence of discrete codes and predicted autoregressively conditioned on the caption y.

• Diffusion prior: The continuous vector zi is directly modelled using a Gaussian diffusion model conditioned on the caption y.

In addition to the caption, we can condition the prior on the CLIP text embedding zt since it is a deterministic function of the caption. To improve sample quality we also enable sampling using classifier-free guidance for both the AR and diffusion prior, by randomly dropping this text conditioning information 10% of the time during training.

To train and sample from the AR prior more efficiently, we first reduce the dimensionality of the CLIP image embeddings zi by applying Principal Component Analysis (PCA) [37]. In particular, we find that the rank of the CLIP representation space is drastically reduced when training CLIP with SAM [15] while slightly improving evaluation metrics. We are able to preserve nearly all of the information by retaining only 319 principal components out of the original 1,024. After applying PCA, we order the principal components by decreasing eigenvalue magnitude, quantize each of the 319 dimensions into 1,024 discrete buckets, and predict the resulting sequence using a Transformer [53] model with a causal attention mask. This results in a threefold reduction in the number of tokens predicted during inference, and improves training stability.
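
A from-scratch sketch of the PCA reduction and bucketing described here; the paper's exact PCA fitting and quantization scheme are not specified in this excerpt, so the uniform bucketing below is an assumption:

```python
import numpy as np

def quantize_clip_embeddings(Z, n_components=319, n_buckets=1024):
    # Z: (N, 1024) matrix of CLIP image embeddings.
    Zc = Z - Z.mean(axis=0)
    # Principal directions, already ordered by decreasing eigenvalue magnitude.
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    coords = Zc @ Vt[:n_components].T                      # keep 319 of the 1,024 dimensions
    # Quantize each retained dimension into 1,024 discrete buckets.
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    codes = np.floor((coords - lo) / (hi - lo + 1e-8) * (n_buckets - 1))
    return codes.astype(np.int64)                          # (N, 319) tokens for the AR Transformer
```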

We condition the AR prior on the text caption and the CLIP text embedding by encoding them as a prefix to the sequence. Additionally, we prepend a token indicating the (quantized) dot product between the text embedding and image embedding, zi · zt. This allows us to condition the model on a higher dot product, since higher text-image dot products correspond to captions which better describe the image. In practice, we find it beneficial to sample the dot product from the top half of the distribution.
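
A sketch of how the AR prior's input might be assembled at the embedding level: a token for the quantized zi · zt dot product, the encoded caption and CLIP text embedding as a prefix, then the quantized image-embedding codes predicted under a causal mask. The exact ordering and shapes are illustrative assumptions:

```python
import torch

def build_ar_prior_inputs(dot_token_emb, caption_embs, z_t, image_code_embs):
    # dot_token_emb:   (B, 1, d)    embedding of the quantized zi · zt bucket
    # caption_embs:    (B, L, d)    encoded text caption
    # z_t:             (B, d)       CLIP text embedding
    # image_code_embs: (B, 319, d)  embedded quantized image codes (the targets)
    prefix = torch.cat([dot_token_emb, caption_embs, z_t.unsqueeze(1)], dim=1)
    return torch.cat([prefix, image_code_embs], dim=1)     # fed to a causal Transformer
```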

Summary:

Two kinds of prior are tried: an autoregressive one and a diffusion one.

Autoregressive prior: the input is the text embedding, and the CLIP image embedding is masked and predicted autoregressively. Training this way is inefficient, so methods such as PCA are used to shrink the problem.

Both the autoregressive and the diffusion prior use classifier-free guidance, and it works well.

For the diffusion prior, we train a decoder-only Transformer with a causal attention mask on a sequence consisting of, in order: the encoded text, the CLIP text embedding, an embedding for the diffusion timestep, the noised CLIP image embedding, and a final embedding whose output from the Transformer is used to predict the unnoised CLIP image embedding. We choose not to condition the diffusion prior on zi · zt like in the AR prior; instead, we improve quality during sampling time by generating two samples of zi and selecting the one with a higher dot product with zt. Instead of using the ε-prediction formulation from Ho et al. [25], we find it better to train our model to predict the unnoised zi directly, and use a mean-squared error loss on this prediction:

L_prior = E_{t∼[1,T], zi^(t)∼qt} [ || fθ(zi^(t), t, y) − zi ||² ]
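
A compact sketch of one diffusion-prior training step matching the description and loss above; the noise schedule, the Transformer's signature, and the batching details are assumptions:

```python
import torch
import torch.nn.functional as F

def diffusion_prior_loss(prior_transformer, z_i, text_tokens, z_t, num_timesteps=1000):
    # Noise the CLIP image embedding zi at a random timestep (assumed cosine-style schedule).
    t = torch.randint(0, num_timesteps, (z_i.shape[0],), device=z_i.device)
    alpha_bar = torch.cos(t.float() / num_timesteps * torch.pi / 2) ** 2
    noise = torch.randn_like(z_i)
    z_i_t = alpha_bar.sqrt()[:, None] * z_i + (1 - alpha_bar).sqrt()[:, None] * noise
    # The decoder-only Transformer sees the text, zt, the timestep, and the noised zi,
    # and predicts the unnoised zi directly (not the noise epsilon).
    z_i_pred = prior_transformer(text_tokens, z_t, t, z_i_t)
    return F.mse_loss(z_i_pred, z_i)
```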

Summary:

The inputs and outputs are embedding vectors rather than images, so a U-Net is not a good fit; a decoder-only Transformer is trained instead.

The input sequence has many parts: the encoded text, the CLIP text embedding, the diffusion timestep embedding, the noised CLIP image embedding, and a final learned embedding (playing the role of a CLS token) whose output is used for the prediction.

DDPM proposed predicting the noise rather than the image itself, but here the authors find that predicting the unnoised image embedding works better than predicting the noise, so the regression target in the loss is zi rather than ε.

Image Manipulations

Generating variations of an image that keep its content and style.

Image interpolation: blending two images by interpolating their image embeddings.

Text interpolation: moving an image embedding in the direction of a text embedding to edit the image.

Limitations and Risks

It does not bind objects to their attributes well: GLIDE does much better than DALL·E 2 here. The authors believe this is because CLIP, when learning from text-image pairs, only cares about similarity and never learns what "on top of" means.

Text rendered in generated images comes out in the wrong order: probably because the text is encoded with BPE, which can be thought of as encoding sub-word pieces (roots and affixes) rather than whole words.

Details in complex scenes are insufficient.
