UNIFYING DIFFUSION MODELS’ LATENT SPACE, WITH APPLICATIONS TO CYCLEDIFFUSION AND GUIDANCE [ICCV 2023]

JennnyZhang

Last modified: 2023-08-19 19:16:10

Tags: artificial intelligence, deep learning

First published: 2023-08-19 17:16:57

Article link: https://blog.csdn.net/qq_53826699/article/details/132381987

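Summary: The paper proposes a new Gaussian formulation of the latent space of diffusion models and a reconstructable DPM-Encoder, and shows the benefits of a shared latent space for image translation, zero-shot editing, and cross-model guidance. The authors also propose CycleDiffusion, an unpaired image-to-image translation method built on DPM-Encoder, and a way to control latent codes with energy-based-model guidance.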

Paper: https://arxiv.org/abs/2210.05559
GitHub: https://github.com/ChenWu98/cycle-diffusion

Abstract

Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of diffusion models, as well as a reconstructable DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences.

(1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors.

(2) One can guide pretrained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.

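In short, the paper gives an alternative, Gaussian formulation of the diffusion-model latent space, together with a reconstructable DPM-Encoder that maps images into it. The formulation follows directly from the definition of diffusion models, yet it has several notable consequences:

(1) Two diffusion models trained independently on related domains turn out, empirically, to share a common latent space. This motivates CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Applied to text-to-image diffusion models, CycleDiffusion turns large-scale text-to-image models into zero-shot image-to-image editors.

(2) Pretrained diffusion models and GANs can both be guided by controlling their latent codes through a unified, plug-and-play formulation based on energy-based models. With CLIP and a face recognition model as guidance, diffusion models cover low-density sub-populations and individuals better than GANs do.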

Introduction

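By reformulating various diffusion models as deterministic mappings from a Gaussian latent code z to an image x, this paper provides a unified view of image generative models (Figure 1). The next question is encoding: how to map an image x to a latent code z. Encoding has been studied for many generative models. For example, VAEs and normalizing flows are designed with encoders, GAN inversion (Xia et al., 2021) builds post hoc encoders for GANs, and deterministic diffusion probabilistic models (DPMs) (Song et al., 2021a;b) build encoders with forward ODEs.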

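However, it remains unclear how to build an encoder for stochastic DPMs such as DDPM (Ho et al., 2020), non-deterministic DDIM (Song et al., 2021a), and latent diffusion models (Rombach et al., 2022). We propose DPM-Encoder (Section 3.2), a reconstructable encoder for stochastic DPMs.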

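Several intriguing consequences follow from our definition of the latent space of diffusion models and from DPM-Encoder. First, it has been observed that, given two diffusion models, a fixed "random seed" produces similar images (Nichol et al., 2022). Under our formulation, we formalize "similar images" via an upper bound on image distances. Since the defined latent code contains all the randomness used during sampling, DPM-Encoder is similar in spirit to inferring the "random seed" from a real image. Based on this intuition and the upper bound on image distances, we propose CycleDiffusion (Section 3.3), a method for unpaired image-to-image translation that uses DPM-Encoder. Like the GAN-based UNIT method (Liu et al., 2017), CycleDiffusion encodes and decodes images through a common latent space. Our experiments show that CycleDiffusion outperforms previous methods based on GANs or diffusion models (Section 4.1). Furthermore, by applying CycleDiffusion to large-scale text-to-image diffusion models (e.g., Stable Diffusion; Rombach et al., 2022), we obtain zero-shot image-to-image editors (Section 4.2).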
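
To make the "random seed" intuition concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the latent code is the final noisy sample x_T together with the per-step noises, and that a stochastic sampler applies x_{t-1} = mu(x_t, t) + sigma_t * eps_t. The function names `make_mean_fn`, `encode`, and `decode`, the toy noise schedule, and the "domain center" mean functions are all hypothetical stand-ins for trained networks.

```python
import numpy as np

T = 10  # number of diffusion steps (toy value, assumption)
# Index 0 is padding so the arrays can be indexed by t = 1..T.
betas = np.concatenate([[0.0], np.linspace(1e-4, 0.2, T)])
alphas = 1.0 - betas
sigmas = np.sqrt(betas)  # one common choice for the reverse-step std


def make_mean_fn(center):
    """Hypothetical stand-in for a trained model's reverse-step mean mu(x_t, t)."""
    def mean_fn(x_t, t):
        # Toy rule: pull the sample slightly toward a domain-specific center.
        return (1.0 - betas[t]) * x_t + betas[t] * center
    return mean_fn


def decode(z, mean_fn):
    """Deterministic map from latent code z = (x_T, {eps_t}) to an image x_0."""
    x, eps = z
    for t in range(T, 0, -1):
        x = mean_fn(x, t) + sigmas[t] * eps[t]
    return x


def encode(x0, mean_fn, rng):
    """Sample a forward trajectory x_1..x_T, then solve for the noises eps_t
    that make the stochastic sampler retrace that trajectory exactly."""
    xs = [x0]
    for t in range(1, T + 1):
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(alphas[t]) * xs[-1] + np.sqrt(betas[t]) * noise)
    eps = {t: (xs[t - 1] - mean_fn(xs[t], t)) / sigmas[t] for t in range(1, T + 1)}
    return xs[T], eps


rng = np.random.default_rng(0)
mu_source = make_mean_fn(center=-1.0)  # toy "source domain" model
mu_target = make_mean_fn(center=+1.0)  # toy "target domain" model

x = rng.standard_normal(4)            # a toy 4-dimensional "image"
z = encode(x, mu_source, rng)         # infer the "random seed" of x
x_rec = decode(z, mu_source)          # same model: reconstructs x
x_translated = decode(z, mu_target)   # other model: CycleDiffusion-style translation
print(np.allclose(x_rec, x))
```

Decoding with the same mean function retraces the sampled trajectory, which is what "reconstructable" means here. Decoding the same latent code with the other mean function mirrors the encode-with-one-model, decode-with-another idea behind CycleDiffusion, although with toy models it says nothing about translation quality.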