按示例绘画：基于示例的图像编辑与扩散模型

杨彬鑫 ${}^{1 * }$ 舒阳 ${\mathrm{{Gu}}}^{2}$ 张博 ${}^{2 \dagger }$ 张汀 ${}^{1}$ 陈学进

孙晓燕 ${}^{1}$ 陈东 ${}^{2}$ 温方

图1. 按示例绘画。用户可以通过绘制条件图像来编辑场景。我们的方法可以自动修改参考图像并将其融合到源图像中，实现高质量的结果。

摘要

语言指导的图像编辑最近取得了巨大成功。在本文中，我们研究示例指导的图像编辑以实现更精确的控制。我们通过利用自监督训练来分离和重新组织源图像和示例图像，达到这一目标。然而，简单的方法将导致明显的融合伪影。我们仔细分析了这一点，并提出了内容瓶颈和强增强，以避免直接复制和粘贴示例图像的简单解决方案。同时，为确保编辑过程的可控性，我们为示例图像设计了一个任意形状的遮罩，并利用无分类器指导来提高与示例图像的相似性。整个框架涉及扩散模型的单一前向传播，无需任何迭代优化。我们证明了我们的方法在性能上令人印象深刻，并能够在野外图像上实现高保真度的可控编辑。代码和预训练模型可在 https://github.com/Fantasy-Studio/Paint-by-Example 获取。

1. 引言

由于众多社交媒体平台的进步，创意图片编辑已经成为一种普遍需求。基于AI的技术 [37] 显著降低了传统上需要专业软件和劳动密集型手动操作的精美图像编辑的门槛。深度神经网络现在可以通过学习丰富的配对数据，为各种低级图像编辑任务产生令人信服的结果，例如图像修复 $\left\lbrack {{61},{68}}\right\rbrack$ 、合成 $\left\lbrack {{41},{67},{72}}\right\rbrack$ 、着色 $\left\lbrack {{53},{71},{74}}\right\rbrack$ 和美学增强 $\left\lbrack {7,{11}}\right\rbrack$ 。另一方面，更具挑战性的场景是语义图像编辑，它旨在操纵图像内容的高级语义，同时保持图像的真实性。沿着这条路，已经做出了巨大的努力 $\left\lbrack {2,5,{36},{49},{55},{57}}\right\rbrack$ ，主要依赖于生成模型的语义潜在空间，例如GANs [16,28,70]，然而，大多数现有工作仅限于特定的图像类型。

最近的大型语言-图像（LLI）模型，基于自回归模型 $\left\lbrack {{12},{69}}\right\rbrack$ 或扩散模型 $\left\lbrack {{19},{45},{50},{54}}\right\rbrack$ ，在建模复杂图像方面展示了前所未有的生成能力。这些模型使得各种图像操作任务 $\left\lbrack {3,{22},{30},{52}}\right\rbrack$ 成为可能，这些任务以前是无法攻克的，允许在文本提示的指导下对通用图像进行编辑。然而，即使是详细的文本描述也不可避免地引入了模糊性，可能无法准确反映用户期望的效果；实际上，许多细粒度的物体外观很难仅用简单语言来指定。因此，开发一种更直观的方法来简化初学者和非母语使用者的细粒度图像编辑至关重要。

在这项工作中，我们提出了一种基于示例的图像编辑方法，它允许根据用户提供或从数据库中检索到的示例图像，对图像内容进行精确的语义操作。俗话说，“一图胜千言”。我们相信，图像比文字更能以更细致的方式传达用户期望的图像定制效果。这项任务与主要关注在合成前景对象时进行颜色和光照校正的图像协调 $\left\lbrack {{20},{64}}\right\rbrack$ 完全不同，我们的目标是更为复杂的任务：语义转换示例，例如，生成不同的姿态、形变或视角，以便编辑的内容可以根据图像上下文无缝植入。实际上，我们的方法自动化了传统的图像编辑工作流程，其中艺术家需要对图像资源进行繁琐的转换以实现图像的连贯融合。

为了实现我们的目标，我们训练了一个基于样本图像的扩散模型 [24, 59]。与基于文本引导的模型不同，核心挑战在于收集包含源图像、样本以及相应的编辑真实标签的足够三胞胎训练对是不切实际的。一种解决方法是从输入图像中随机裁剪对象，作为训练修复模型时的参考。然而，在这种自我参考设置中训练的模型无法泛化到真实的样本，因为模型仅仅学习将参考对象复制并粘贴到最终输出中。我们确定了几种绕过这个问题的关键因素。首先是利用生成先验。具体来说，一个预训练的文本到图像模型有能力生成高质量期望结果，我们利用它作为初始化以避免陷入复制和粘贴的平凡解。然而，长时间的微调仍可能导致模型偏离先验知识并最终退化。因此，我们引入了内容瓶颈，用于自我参考调节，我们丢弃了空间标记，只将全局图像嵌入视为条件。这样，我们强制网络理解样本图像的高级语义和源图像的上下文，从而在自我监督训练期间防止产生平凡结果。此外，我们对自我参考图像应用了激进的增强，这可以有效地减少训练-测试差距。

我们进一步从两个方面提高了我们方法的编辑性。一个是我们的训练使用了不规则的随机掩码，以模仿实际编辑中使用的随意用户画笔。我们还证明了无分类器指导 [25] 有助于提高图像质量和参考样式的相似度。

提出的方法，按示例绘画（Paint by Example），很好地解决了语义图像组合问题，其中参考图像在融合到另一幅图像之前进行了语义转换和协调，如图1所示。我们的方法在类似设置下比之前的工作显示出显著的质量优势。特别地，我们的编辑仅涉及扩散模型的一次前向传播，而不进行任何图像特定的优化，这对于许多实际应用来说是必要的。总结来说，我们的贡献如下：

我们提出了一种新的图像编辑方法，按示例绘画（Paint by Example），它基于示例图像语义地改变图像内容。这种方法在使用方便的同时提供了细致的控制。
我们使用自我监督方式训练的图像条件扩散模型解决了该问题。我们提出了一组技术来应对退化的挑战。
我们的方法在野外图像编辑方面优于之前的技术，这一点通过定量指标和主观评估都有所体现。

2. 相关工作

图像合成。将一张图像的前景切割并粘贴到另一张图像上以创建逼真的合成图像是照片编辑中常见且广泛使用的操作。许多方法 $\left\lbrack {8 - {10},{27},{43},{47},{60},{63},{64}}\right\rbrack$ 已经被提出，专注于图像调和，以使合成图像看起来更加逼真。传统方法 $\left\lbrack {8,{27},{63}}\right\rbrack$ 倾向于提取手工特征来匹配颜色分布。近期的工作 $\left\lbrack {6,{21}}\right\rbrack$ 利用深度语义特征来提高鲁棒性。语义生成金字塔 [58] 通过结合前景图像和背景图像的深度特征实现语义图像合成，但结果与前景图像并不十分相似。较新的工作 DCCF [67] 提出了一种以金字塔方式排列的四种人类可理解的神经滤波器，并实现了最先进的颜色调和效果。然而，它们都假设前景和背景在语义上是和谐的，并且只在低级颜色空间中调整合成图像，而保持结构不变。在本文中，我们针对语义图像合成，考虑到具有挑战性的语义不和谐问题。

语义图像编辑。语义图像编辑，即编辑图像的高级语义，由于其潜在应用的吸引力，在视觉和图形学社区引起了极大的兴趣。一系列稳步发展的工作 $\lbrack 2,5,{48}$ ， ${55},{57}\rbrack$ 仔细剖析了生成对抗网络（GAN）的潜在空间，旨在找到语义解耦的潜在因素。而其他研究工作利用判别模型，如属性分类器 $\left\lbrack {{15},{26}}\right\rbrack$ 或人脸识别 $\left\lbrack {{34},{56}}\right\rbrack$ 模型，来帮助解耦和操作图像。另一个流行的方向是 $\left\lbrack {5,{18},{36},{65},{73},{75}}\right\rbrack$ 利用语义掩码来控制编辑。然而，大多数现有方法仅限于特定的图像类型，如人脸、汽车、鸟、猫等。在这项工作中，我们专注于引入一个适用于通用且复杂图像的高精度模型。

文本驱动的图像编辑。在众多语义图像编辑中，文本引导的图像编辑最近越来越受到关注。早期工作 $\lbrack 1,4$ ，14,42,66] 利用预训练的 GAN 生成器 [29] 和文本编码器 [44] 按照文本提示逐步优化图像。然而，由于 GAN 的建模能力有限，这些基于 GAN 的操作方法在编辑复杂场景或各种对象的图像时遇到了困难。扩散模型 $\left\lbrack {{45},{46},{54}}\right\rbrack$ 的迅速崛起和发展已经显示出在生成高质量和多样化图像方面的强大能力。许多工作 $\left\lbrack {3,{13},{22},{30},{31},{38},{39},{52}}\right\rbrack$ 利用扩散模型进行文本驱动的图像编辑。例如，DiffusionCLIP [31]，dreambooth [52]，和 Imagic [30] 针对不同文本提示专门微调扩散模型。Blended Diffusion [3] 提出了一个多步骤混合过程，使用用户提供的遮罩进行局部操作。虽然这些方法取得了令人印象深刻的结果，但我们认为语言引导仍然缺乏精确控制，而图像可以更好地表达一个人的具体想法。因此，在这项工作中，我们感兴趣的是基于示例的图像编辑。

3. 方法

我们致力于基于示例的图像编辑，这种编辑方法能够自动将参考图像（从数据库检索或由用户提供）以一个使合并后的图像看起来逼真且照片般真实的方式合并到源图像中。尽管基于文本的图像编辑最近取得了显著的成功，但仍然难以仅用口头描述来表达复杂和多种想法。而另一方面，图像可能是传达人们意图的更好选择，正如谚语所说：“一图胜千言”。

形式上，将源图像表示为 ${\mathbf{x}}_{s} \in {\mathbb{R}}^{H \times W \times 3}$ ，其中 $H$ 和 $W$ 分别代表宽度与高度。编辑区域可以是矩形或不规则形状（至少是连通的），并且用一个二值掩码 $\mathbf{m} \in$ $\{ 0,1{\} }^{H \times W}$ 来表示，其中值1指定了 ${\mathbf{x}}_{s}$ 中的可编辑位置。给定一个包含所需物体的参考图像 ${\mathbf{x}}_{r} \in {\mathbb{R}}^{{H}^{\prime } \times {W}^{\prime } \times 3}$ ，我们的目标是合成一个从 $\left\{ {{\mathbf{x}}_{s},{\mathbf{x}}_{r},\mathbf{m}}\right\}$ 来的图像 $\mathbf{y}$ ，使得 $\mathbf{m} = 0$ 的区域尽可能与源图像 ${\mathbf{x}}_{s}$ 保持一致，而 $\mathbf{m} = 1$ 的区域描绘的物体与参考图像 ${\mathbf{x}}_{r}$ 相似并且和谐地融合。

这个任务非常具有挑战性和复杂性，因为它隐含地涉及了几个非平凡的过程。首先，模型需要理解参考图像中的物体，捕捉其形状和纹理，同时忽略背景的噪声。其次，关键是要能够合成物体的一个转换视图（不同的姿态、大小、光照等），使其能够很好地适应源图像。第三，模型需要修补物体周围的区域以生成逼真的照片，展示在合并边界上的平滑过渡。最后，参考图像的分辨率可能低于编辑区域。模型在处理过程中应涉及超分辨率。

3.1.预备知识

自监督训练。由于收集和注释成对数据，即 $\left\{ {\left( {{\mathbf{x}}_{s},{\mathbf{x}}_{r},\mathbf{m}}\right) ,\mathbf{y}}\right\}$ ，对于基于示例的图像编辑训练来说是不可能的。手动绘制合理的输出可能需要巨大的花费和劳动力。因此，我们提出进行自监督训练。具体来说，给定一个图像和图像中一个对象的边界框，为了模拟训练数据，我们使用对象的边界框作为二值掩码 $\mathbf{m}$ 。我们直接将源图像边界框内的图像块视为参考图像 ${\mathbf{x}}_{r} = \mathbf{m} \odot {\mathbf{x}}_{s}$ 。自然地，图像编辑结果应该是原始源图像 ${\mathbf{x}}_{s}$ 。因此，我们的训练数据由 $\left\{ {\left( {\overline{\mathbf{m}} \odot {\mathbf{x}}_{s},{\mathbf{x}}_{r},\mathbf{m}}\right) ,{\mathbf{x}}_{s}}\right\}$ 组成，其中 $\overline{\mathbf{m}} = \mathbb{1} - \mathbf{m}$ 代表掩码 $\mathbf{m}$ 的补集，而 $\mathbb{1}$ 表示全为1的矩阵。

一个朴素解法。扩散模型在合成前所未有的图像质量方面取得了显著进展，并且已经成功应用于许多基于文本的图像编辑工作中 $\left\lbrack {{30},{31},{40},{52}}\right\rbrack$ 。对于我们的基于示例的图像编辑任务，一个朴素的解法是直接将文本条件替换为参考图像条件。

具体来说，扩散模型通过逐渐逆转一个马尔可夫前向过程来生成图像 $\mathbf{y}$ 。从 ${\mathbf{y}}_{0} = {\mathbf{x}}_{s}$ 开始，前向过程产生一系列噪声逐渐增加的图像 $\left\{ {{\mathbf{y}}_{t} \mid t \in \left\lbrack {1,T}\right\rbrack }\right\}$ ，其中 ${\mathbf{y}}_{t} =$ $\sqrt{{\bar{\alpha }}_{t}}{\mathbf{y}}_{0} + \sqrt{1 - {\bar{\alpha }}_{t}}\epsilon ,\epsilon$ 是高斯噪声，而 ${\alpha }_{t}$ 随时间步长 $t$ 减小。对于生成过程，扩散模型通过最小化以下损失函数，在给定条件 $\mathbf{c}$ 的情况下，逐步去噪最后一步的噪声图像：

$\mathcal{L} = {\mathbb{E}}_{t,{\mathbf{y}}_{0},\epsilon }{\begin{Vmatrix}{\epsilon }_{\theta }\left( {\mathbf{y}}_{t},\overline{\mathbf{m}} \odot {\mathbf{x}}_{s},\mathbf{c},t\right) - \epsilon \end{Vmatrix}}_{2}^{2}. \tag{1}$

图2. 朴素解法的复制和粘贴伪影示例。生成的图像极其不自然。

对于文本引导的修复模型，条件 $\mathbf{c}$ 是给定的文本，通常由预训练的CLIP [44] 文本编码器处理，输出77个token。同样，一个简单的方法是直接用CLIP图像嵌入替换它。我们利用预训练的CLIP图像编码器输出257个token，包括1个类别token和256个补丁token，表示为 $\mathbf{c} = {\operatorname{CLIP}}_{\text{all }}\left( {\mathbf{x}}_{r}\right)$ 。

这个简单的方法在训练集上收敛得很好。然而，当我们应用到测试图像时，我们发现生成的结果远非令人满意。编辑区域存在明显的复制粘贴痕迹，使得生成的图像极其不自然，如图2所示。我们认为这是因为，在简单的训练方案下，模型学习了一个平凡映射函数： $\overline{\mathbf{m}} \odot {\mathbf{x}}_{s} + {\mathbf{x}}_{r} = {\mathbf{x}}_{s}$ 。这阻碍了网络理解参考图像中的内容和与源图像的联系，导致无法将泛化失败应用到测试场景，其中参考图像任意给定，而不是原始图像的补丁。

我们的动机。如何防止模型学习这样的平凡映射函数，并以自监督训练方式促进模型理解是一个具有挑战性的问题。在本文中，我们提出了三个原则。1) 我们引入了内容瓶颈，以强制网络理解和重新生成参考图像的内容，而不仅仅是复制。2) 我们采用强烈的增强来减轻训练-测试不匹配问题。这有助于网络不仅从示例对象学习转换，还从背景学习转换。3) 对于基于示例的图像编辑的另一个关键特性是可控性。我们能够控制编辑区域的形状以及编辑区域与参考图像之间的相似度。

3.2. 模型设计

3.2.1 内容瓶颈

压缩表示。我们重新分析文本条件与图像条件之间的差异。对于文本条件，模型自然被迫学习语义，因为文本是一种内在的语义信号。关于图像条件，记住信息而不是理解上下文信息并复制内容，很容易得到一个平凡解。为了避免这种情况，我们打算通过压缩参考图像的信息来增加重建遮罩区域的难度。具体来说，我们只利用示例图像的预训练CLIP图像编码器的类令牌作为条件。它将参考图像从空间大小 $224 \times {224} \times 3$ 压缩到一个维数为1024的一维向量。

图3。我们的训练流程。

我们发现这种高度压缩的表示倾向于忽略高频细节，同时保留语义信息。它迫使网络理解参考内容，并防止生成器直接复制粘贴以达到训练的最优结果。出于表现力的考虑，我们添加了几个额外的全连接（FC）层来解码特征，并通过交叉注意力将其注入扩散过程。

图像先验。为了进一步避免直接记住参考图像的平凡解，我们利用一个训练有素的扩散模型进行初始化，作为一个强图像先验。具体来说，我们采用了一个文本到图像生成模型，Stable Diffusion [51]，主要基于两个原因。首先，它具有很强的生成高质量野外图像的能力，这要归功于其给定任何位于潜在空间中的向量都将导致一个可信图像的特性。其次，一个预训练的CLIP [44] 模型被用来提取语言信息，它与我们所采用的CLIP图像嵌入具有相似的表示，因此是一个好的初始化。

3.2.2 强增强

自监督训练的另一个潜在问题是训练和测试之间的领域差距。这种训练-测试不匹配源于两个方面。

参考图像增强。第一个不匹配之处在于，训练过程中参考图像 ${\mathbf{x}}_{r}$ 是从源图像 ${\mathbf{x}}_{s}$ 派生出来的，这在测试场景中几乎不会出现。为了减少这种差异，我们对参考图像采用多种数据增强技术（包括翻转、旋转、模糊和弹性变换），以打破与源图像的联系。我们将这些数据增强称为 $\mathcal{A}$ 。正式地，输入扩散模型的条件表示为：

$\mathbf{c} = \operatorname{MLP}\left( {\operatorname{CLIP}\left( {\mathcal{A}\left( {x}_{r}\right) }\right) }\right) \tag{2}$

掩码形状增强。另一方面，来自边界框的掩码区域 $\mathbf{m}$ 确保参考图像包含一个完整的对象。因此，生成器学习尽可能完整地填充一个对象。然而，在实际场景中这可能并不成立。为了解决这个问题，我们基于边界框生成一个任意形状的掩码并在训练中使用它。具体来说，对于边界框的每一条边，我们首先构建一条贝塞尔曲线来拟合它，然后在这条曲线上均匀地采样20个点，并随机地将其坐标偏移 $1 - 5$ 个像素。最后，我们依次用直线连接这些点形成一个任意形状的掩码。掩码 $\mathbf{m}$ 上的随机扭曲 $\mathcal{D}$ 打破了归纳偏置，减少了训练和测试之间的差异。即，

$\overline{\mathbf{m}} = \mathbb{1} - \mathcal{D}\left( \mathbf{m}\right) \tag{3}$

我们发现这两种增强在面对不同的参考引导时可以大大提高鲁棒性。

3.2.3 控制掩码形状

遮罩形状增强的另一个好处是它提高了在推理阶段对遮罩形状的控制。在实际应用场景中，矩形遮罩通常无法精确表示遮罩区域，例如图1中的太阳伞。在某些情况下，人们希望编辑特定区域，同时尽可能保留其他区域，这导致了处理不规则遮罩形状的需求。通过将这些不规则遮罩纳入训练，我们的模型能够生成给定各种形状遮罩的照片级真实感结果。

3.2.4 控制相似度

为了控制编辑区域与参考图像之间的相似度，我们发现无分类器采样策略 [25] 是一个强大的工具。之前的工作 [62] 发现无分类器指导实际上是先验和后验约束的结合。

$\log p\left( {{\mathbf{y}}_{t} \mid \mathbf{c}}\right) + \left( {s - 1}\right) \log p\left( {\mathbf{c} \mid {\mathbf{y}}_{t}}\right) \tag{4}$

$\propto \log p\left( {\mathbf{y}}_{t}\right) + s\left( {\log p\left( {{\mathbf{y}}_{t} \mid \mathbf{c}}\right) - \log p\left( {\mathbf{y}}_{t}\right) }\right) ,$

其中 $s$ 表示无分类器指导的尺度。它也可以被视为控制生成图像与参考图像相似度的尺度。较大的尺度因子 $s$ 表示融合结果更依赖于条件参考输入。在我们的实验中，我们遵循 [62] 中的设置，并在训练期间用可学习向量 $\mathbf{v}$ 替换 ${20}\%$ 参考条件。这个术语旨在在固定条件输入 $p\left( {{\mathbf{y}}_{t} \mid \mathbf{v}}\right)$ 的帮助下建模 $p\left( {\mathbf{y}}_{t}\right)$ 。在推理阶段，每个去噪步骤都使用修改后的预测：

${\widetilde{\epsilon }}_{\theta }\left( {{\mathbf{y}}_{t},\mathbf{c}}\right) = {\epsilon }_{\theta }\left( {{\mathbf{y}}_{t},\mathbf{v}}\right) + s\left( {{\epsilon }_{\theta }\left( {{\mathbf{y}}_{t},\mathbf{c}}\right) - {\epsilon }_{\theta }\left( {{\mathbf{y}}_{t},\mathbf{v}}\right) }\right) . \tag{5}$

为了避免混淆，这里省略了参数 $t$ 和 $\overline{\mathbf{m}} \odot {\mathbf{x}}_{s}$ 。首先，我们方法的整体框架在图3中有所说明。

4. 实验

4.1 实施细节与评估

实施细节。为了操作现实世界中的图像，首先我们使用一个强大的文本到图像生成模型，Stable Diffusion [51]，作为初始化来提供强大的图像先验。然后我们选择OpenImages [32]作为我们的训练数据集。它包含了总共1600万个边界框，涵盖600个对象类别，分布在190万张图像上。在训练过程中，我们将图像分辨率预处理为 $512 \times {512}$ ，并训练我们的模型40个周期，这在大约64个NVIDIA V100 GPU上需要大约7天时间。

测试基准。据我们所知，之前没有 works 针对基于样本的语义图像编辑（或语义图像组合）。因此，我们构建了一个测试基准用于定性和定量分析。具体来说，我们从MSCOCO [35] 验证集中手动选择了3500张源图像 $\left( {\mathbf{x}}_{s}\right)$ ，每张图像只包含一个边界框 $\left( \mathbf{m}\right)$ ，并且遮罩区域不超过整个图像的一半。然后我们从MSCOCO训练集中手动检索一个参考图像补丁 $\left( {\mathbf{x}}_{r}\right)$ 。参考图像通常与遮罩区域具有相似的语义，以确保组合是合理的。我们将其命名为基于样本的图像编辑基准，简称为COCOEE。我们将发布这个基准，希望吸引更多后续工作在这一领域。

评估指标。我们的目标是将军参考图像融合到源图像中，同时编辑区域应与参考图像相似，融合结果应具有照片级真实感。为了独立衡量这两个方面，我们使用以下三个指标来评估生成的图像。1) FID [23] 分数，这是广泛用于评估生成结果的指标。我们遵循 [33] 并使用 CLIP 模型提取特征，计算 3,500 张生成图像与 COCO 测试集中所有图像之间的 FID 分数。2) 质量分数（QS）[17]，旨在评估每张单一图像的真实性。我们计算其平均值以衡量生成图像的整体质量。3) CLIP 分数 [44]，评估编辑区域与参考图像之间的相似度。具体来说，我们将这两个图像缩放到 $224 \times {224}$ 大小，通过 CLIP 图像编码器提取特征，并计算它们的余弦相似度。更高的 CLIP 分数表明编辑区域与参考图像更相似。

4.2. 对比分析

考虑到之前没有研究旨在基于示例图像进行语义上的局部图像编辑，我们选择了四种相关方法作为我们方法的基线：1) Blended Diffusion [3]，它利用CLIP模型提供梯度来指导扩散采样过程。我们使用文本提示“一张 $C$ 的照片”来计算CLIP损失，其中 $C$ 表示示例图像中的对象类别。2) 我们通过使用参考图像来计算CLIP损失，对Blended Diffusion进行了轻微修改，记作Blended Diffusion (image)。3) Stable Diffusion [51]。我们将文本提示作为条件来表示参考图像，并修复遮罩区域。4) 我们还选择了最先进的图像协调方法DCCF [67]作为基线。考虑到它只能将前景图像融合到背景中，我们首先使用无条件图像修复模型LAMA [61]来修复整个遮罩区域，然后通过额外的语义遮罩提取参考图像的前景，最后将其协调到源图像中。

图4. 与其他方法的定性比较。我们的方法能够生成在感知质量上与输入参考图像语义一致的结果。

定性分析。我们在图4中提供了这些方法的定性比较。文本引导的混合扩散能够在期望区域内生成物体，但它们不真实且与源图像不兼容。另一种基于文本的方法——稳定扩散能够生成更加真实的结果，但由于文本信息的表示有限，仍然无法保留参考图像的特征。同时，图像引导的混合扩散也受到与参考图像不相似的困扰。我们认为这可能是由于梯度引导策略无法保留足够的内容信息。最后，图像调和生成的结果几乎与示例图像相同，与背景非常不协调。内在原因是示例图像的外观在大多数情况下无法直接与源图像匹配。生成模型应该自动转换形状、大小或姿态以适应源图像。在图4的最后一列中，我们的方法在保持与参考图像相似的同时，实现了照片级真实感的结果。

表1。不同方法的定量比较。我们通过FID和QS评估生成图像的质量，并通过CLIP得分评估与参考图像的语义一致性。

Method	QS (↑)CLIP Score (↑)
Blended Diffusion-Image [3]	4.60	67.14	80.65
Blended Diffusion-Text [3]	7.52	55.89	72.62
DCCF [67]	3.78	71.49	82.18
Stable Diffusion [51]	3.66	73.20	75.33
Ours	3.18	77.80	84.97

定量分析。表1展示了定量比较的结果。基于图像的编辑方法（包括混合扩散（图像）和DCCF）达到了很高的CLIP得分，表明它们能够保留条件图像的信息，而生成图像的质量较差。稳定扩散生成的结果根据FID和QS更加可信。然而，它几乎无法融入图像的条件信息。我们的方法在这三个指标上都取得了最佳性能，验证了它不仅能够生成高质量图像，还能保持条件信息。

用户研究。为了获取用户对生成图像的主观评价，我们对50名无相关背景的学生参与者进行了一项用户研究。在研究中，我们使用了30组图像，每组包含两个输入和五个输出。每组中的所有结果并排随机展示给参与者。参与者有无限的时间对两个独立的角度（图像质量和与参考图像的相似度）进行评分，从1分（最佳）到5分（最差）。我们在表2中报告了平均排名分数。总体而言，图像和谐化方法DCCF与参考图像最为相似，因为它直接复制了参考图像。尽管如此，用户由于我们结果的逼真质量，更偏好我们的结果。

图5。我们方法中各个组件的视觉消融研究。我们通过这些技术逐渐消除了边界伪影，最终得到了可信的生成结果。

表2。图像质量和语义一致性的平均排名分数。1分最佳，5分最差。用户评价我们的质量最佳，语义一致性仅次于复制示例图像的图像和谐化方法。

Method	Quality ( $\downarrow$ )	Consistency $\left( \downarrow \right)$
Blended Diffusion-Image [3]	3.83	3.84
Blended Diffusion-Text [3]	3.93	3.95
DCCF [67]	3.09	1.66
Stable Diffusion [51]	2.36	3.48
Ours	1.79	2.07

4.3. 消融研究

为了实现高质量的基于样本的图像编辑，我们引入了四种关键技术，即利用图像先验、强增强、内容瓶颈和无分类器引导。在本节中，我们进行了五个逐渐变化的设置来验证它们：1）我们将第3.1节中的朴素解法作为基线。它是直接从文本引导的修复模型修改而来，将文本替换为条件信号。2）我们利用预训练的文本到图像生成模型进行初始化作为图像先验。3）为了减少训练-测试差距，我们对参考图像采用强增强。4）为了进一步避免陷入平凡解，我们在训练过程中通过高度压缩图像信息来增加重构输入图像的难度，我们将其称为内容瓶颈。5）最后，我们使用无分类器引导来进一步提高性能。

表3. 不同变体的定量比较。

Method	FID()	QS (↑)	CLIP Score (↑)
Baseline	3.61	76.71	85.90
+ Prior	3.40	77.63	88.79
+ Augmentation	3.44	76.86	81.68
+ Bottleneck	3.26	76.62	81.41
+ Classfier-Free	3.18	77.80	84.97

我们在表3和图5中展示了结果。基线解决方案存在明显的边界伪影，使得生成的图像极其不自然。通过利用图像先验，根据较低的FID得分，图像质量得到提高。然而，它仍然受到复制和粘贴问题的困扰。添加增强可以部分缓解这个问题。当我们进一步利用内容瓶颈技术压缩信息时，这些边界伪影可以被完全消除。同时，由于遮罩区域应该生成而不是直接复制，该区域的质量会下降，因为生成器的难度显著增加。最后，我们添加了无分类器引导，使得生成区域更接近参考图像，这极大地提高了整体图像质量并实现了最佳性能。

图6. 无分类器引导尺度 $\lambda$ 的效果。较大的 $\lambda$ 使得生成区域更接近参考图像。

图 7. 逐渐精确的文本描述与图像引导的比较。使用图像作为条件可以保持更细粒度的细节。

图 8. 实际场景中的示例基于图像编辑结果。

同时，我们还研究了无分类器规模如何影响我们的结果。如图 6 所示，随着规模 $\lambda$ 的增长，生成的区域越来越像参考输入。在我们的实验中，我们默认设置 $\lambda = 5$ 。

4.4. 从语言到图像条件

在图 7 中，我们提供了语言和图像之间可控性的比较。从左到右，我们尝试使用逐渐详细的语言描述来修复蒙版区域。随着语言描述的精确度提高，生成的结果确实越来越接近参考图像。但与图像引导的结果相比，仍然存在很大的差距。我们的方法能够保持毛发、表情，甚至是脖子上的衣领。

4.5. 实际场景图像编辑

得益于扩散过程中的随机性，我们的方法可以从相同的输入生成多个输出。我们在图 9 中展示了多样化的生成结果。尽管合成的图像各不相同，但它们都保留了参考图像的关键特征。例如，所有狗都有黄色的毛发、白色的胸部和下垂的耳朵。更多基于示例的图像编辑结果展示在图 8 和附录中。

图 9. 我们的框架能够从相同的源图像和示例图像合成逼真且多样化的结果。

5. 结论

在本文中，我们介绍了一种新颖的图像编辑方法，即“示例绘画”，该方法旨在基于示例图像语义地改变图像内容。我们通过利用基于扩散模型的自监督训练来实现这一目标。朴素方法导致了边界伪影问题，我们仔细分析了这一问题，并通过提出一组技术解决了它，这些技术包括利用图像先验、强增强、内容瓶颈和无分类器引导。我们的算法使用户能够精确控制编辑，并在实际图像上取得了令人印象深刻的表现。我们希望这项工作能够作为一个坚实的基础，并帮助支持基于示例的图像编辑领域的未来研究。

致谢：这项工作得到了中国国家级重点研发计划的部分资助，项目编号为2020AAA0108600，并得到了中国国家自然科学基金的部分资助，项目编号为62076230和62032006。

参考文献

[1] Rameen Abdal, Peihao Zhu, John Femiani, Niloy Mitra, and Peter Wonka. Clip2stylegan: Unsupervised extraction of stylegan edit directions. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1-9, 2022. 3

[2] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. Hyperstyle: Stylegan inversion with hy-pernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18511-18521, 2022. 1, 3

[3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208-18218, 2022. 2, 3, 5, 6,7

[4] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021. 3

[5] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020. 1, 3

[6] Bor-Chun Chen and Andrew Kae. Toward realistic image compositing with adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415-8424, 2019. 2

[7] Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang. Deep photo enhancer: Unpaired learning for image enhancement from photographs with gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6306-6314, 2018. 1

[8] Daniel Cohen-Or, Olga Sorkine, Ran Gal, Tommer Leyvand, and Ying-Qing Xu. Color harmonization. In ACM SIG-GRAPH 2006 Papers, pages 624-630. 2006. 2

[9] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Dovenet: Deep image harmonization via domain verification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8394-8403, 2020. 2

[10] Xiaodong Cun and Chi-Man Pun. Improving the harmony of the composite image by spatial-separated attention module. IEEE Transactions on Image Processing, 29:4759- 4771, 2020. 2

[11] Yubin Deng, Chen Change Loy, and Xiaoou Tang. Aesthetic-driven image enhancement by adversarial learning. In Proceedings of the 26th ACM international conference on Multimedia, pages 870-878, 2018. 1

[12] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022. 2

[13] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash-nik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3

[14] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1-13, 2022. 3

[15] Yue Gao, Fangyun Wei, Jianmin Bao, Shuyang Gu, Dong Chen, Fang Wen, and Zhouhui Lian. High-fidelity and arbitrary face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16115-16124, 2021. 3

[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing $\mathrm{{Xu}}$ ,David Warde-Farley,Sherjil Ozair,Aaron Courville,and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139-144, 2020. 1

[17] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Giqa: Generated image quality assessment. In European Conference on Computer Vision, pages 369-385. Springer, 2020. 5

[18] Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, and Lu Yuan. Mask-guided portrait editing with conditional gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3436- 3445, 2019. 3

[19] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696-10706, 2022. 2

[20] Zonghui Guo, Dongsheng Guo, Haiyong Zheng, Zhaorui $\mathrm{{Gu}}$ ,Bing Zheng,and Junyu Dong. Image harmonization with transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14870-14879, 2021. 2

[21] Zonghui Guo, Haiyong Zheng, Yufeng Jiang, Zhaorui Gu, and Bing Zheng. Intrinsic image harmonization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16367-16376, 2021. 2

[22] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 2, 3

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017. 5

[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020. 2

[25] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 2, 5

[26] Xianxu Hou, Xiaokang Zhang, Hanbang Liang, Linlin Shen, Zhihui Lai, and Jun Wan. Guidedstyle: Attribute knowledge guided style manipulation for semantic face editing. Neural Networks, 145:209-220, 2022. 3

[27] Jiaya Jia, Jian Sun, Chi-Keung Tang, and Heung-Yeung Shum. Drag-and-drop pasting. ACM Transactions on Graphics (TOG), 25(3):631-637, 2006. 2

[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401-4410, 2019. 1

[29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110-8119, 2020. 3

[30] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 2, 3

[31] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Dif-fusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426- 2435, 2022. 3

[32] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Ui-jlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4. International Journal of Computer Vision, 128(7):1956-1981, 2020. 5

[33] Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of imagenet classes in fr\'echet inception distance. arXiv preprint arXiv:2203.06026, 2022. 5

[34] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019. 3

[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014. 5

[36] Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler. Editgan: High-precision semantic image editing. Advances in Neural Information Processing Systems, 34:16331-16345, 2021. 1, 3

[37] Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, and Arun Mallya. Generative adversarial networks for image and video synthesis: Algorithms and applications. Proceedings of the IEEE, 109(5):839-862, 2021. 1

[38] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. arXiv preprint arXiv:2112.05744, 2021. 3

[39] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 3

[40] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 3

[41] Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, and Liqing Zhang. Making images real again: A comprehensive survey on deep image composition. arXiv preprint arXiv:2106.14490, 2021. 1

[42] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085-2094, 2021. 3

[43] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. In ACM SIGGRAPH 2003 Papers, pages 313- 318. 2003. 2

[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021. 3, 4, 5

[45] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. 2, 3

[46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821-8831. PMLR, 2021. 3

[47] Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34-41, 2001. 2

[48] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2287-2296, 2021. 3

[49] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 42(1):1-13, 2022. 1

[50] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 2

[51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684-10695, 2022. 4, 5, 6, 7

[52] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 2, 3

[53] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1- 10, 2022. 1

[54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed

Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2, 3

[55] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243-9252, 2020. 1, 3

[56] Yujun Shen, Ping Luo, Junjie Yan, Xiaogang Wang, and Xi-aoou Tang. Faceid-gan: Learning a symmetry three-player gan for identity-preserving face synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 821-830, 2018. 3

[57] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1532-1540, 2021. 1, 3

[58] Assaf Shocher, Yossi Gandelsman, Inbar Mosseri, Michal Yarom, Michal Irani, William T Freeman, and Tali Dekel. Semantic pyramid for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7457-7466, 2020. 2

[59] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab-hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 2

[60] Kalyan Sunkavalli, Micah K Johnson, Wojciech Matusik, and Hanspeter Pfister. Multi-scale image harmonization. ACM Transactions on Graphics (TOG), 29(4):1-10, 2010. 2

[61] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2149-2159, 2022. 1, 6

[62] Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Improved vector quantized diffusion models. arXiv preprint arXiv:2205.16007, 2022. 5

[63] Michael W Tao, Micah K Johnson, and Sylvain Paris. Error-tolerant image compositing. In European Conference on Computer Vision, pages 31-44. Springer, 2010. 2

[64] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3789-3797, 2017. 2

[65] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022. 3

[66] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2256- 2265, 2021. 3

[67] Ben Xue, Shenghui Ran, Quan Chen, Rongfei Jia, Binqiang Zhao, and Xing Tang. Dccf: Deep comprehensible color filter learning framework for high-resolution image harmonization. arXiv preprint arXiv:2207.04788, 2022. 1, 2, 6, 7

[68] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5505- 5514, 2018. 1

[69] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun-jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin-fei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022. 2

[70] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11304-11314, 2022. 1

[71] Bo Zhang, Mingming He, Jing Liao, Pedro V Sander, Lu Yuan, Amine Bermak, and Dong Chen. Deep exemplar-based video colorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8052-8061, 2019. 1

[72] He Zhang, Jianming Zhang, Federico Perazzi, Zhe Lin, and Vishal M Patel. Deep image compositing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 365-374, 2021. 1

[73] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen. Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5143-5153, 2020. 3

[74] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649-666. Springer, 2016. 1

[75] Xingran Zhou, Bo Zhang, Ting Zhang, Pan Zhang, Jianmin Bao, Dong Chen, Zhongfei Zhang, and Fang Wen. Cocosnet v2: Full-resolution correspondence learning for image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11465-11475, 2021. 3