Generative Photomontage: letting users create a desired image by compositing multiple generated images

Generative Photomontage

Generative Photomontage Project

Abstract

Text-to-image models are powerful tools for image creation. However, the generation process is akin to a dice roll, making it difficult to achieve a single image that captures everything a user wants [motivation of the paper].

In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images [core idea], in essence forming a Generative Photomontage.

Given a stack of images generated by ControlNet using the same input condition and different seeds, we let users select desired parts from the generated results using a brush stroke interface. We introduce a novel technique that takes in the user's brush strokes, segments the generated images using a graph-based optimization in diffusion feature space, and then composites the segmented regions via a new feature-space blending method. [two sentences describing the concrete method]

Our method faithfully preserves the user-selected regions while compositing them harmoniously. [strengths]

We demonstrate that our flexible framework can be used for many applications, including generating new appearance combinations, fixing incorrect shapes and artifacts, and improving prompt alignment. [applications]

We show compelling results for each application and demonstrate that our method outperforms existing image blending methods and various baselines. [results]


Problem

Text-to-image models are powerful tools for image creation. However, these models may not achieve exactly what a user envisions. For example, the prompt "a robot from the future" can map to any sample in a large space of robot images. From the user's perspective, this process is akin to a dice roll. In particular, it is often challenging to achieve a single image that includes everything the user wants: the user may like the robot from one generated result and the background in a different result. They may also like certain parts of the robot (e.g., the arm) in another result.


Key Idea

In this paper, we propose a different approach -- we suggest the possibility of synthesizing the desired image by compositing it from different parts of generated images. In our approach, users can first generate many results (roll the dice first) and then choose exactly what they want (composite across the dice rolls).

Our key idea is to treat generated images as intermediate outputs, let users select desired parts from the generated results, and then composite the user-selected regions to form the final image. This approach allows users to take advantage of the model's generative capabilities while retaining fine-grained control over the final result.


Method

Our method takes in a stack of generated images and produces a final image based on sparse user strokes.

(a) In our image stack, images are generated normally through ControlNet, using one or more prompts. The generated images share common spatial structures, as they are produced using the same input condition (e.g., edge maps or depth maps).
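As a rough illustration of step (a), here is a minimal sketch using the Hugging Face diffusers library (our own illustration, not the authors' code); the model IDs, file name, and seed count are placeholder assumptions:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Canny-edge ControlNet on top of Stable Diffusion 1.5 (placeholder choices).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edge_map = load_image("edges.png")  # the shared input condition
prompt = "a robot from the future"

# Same condition, different seeds -> a spatially aligned image stack.
stack = [
    pipe(prompt, image=edge_map,
         generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    for seed in range(4)
]
```

Because every image in the stack is conditioned on the same edge map, the results share spatial structure, which is what makes region-level compositing feasible later.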

(b) Upon browsing the image stack, the user selects desired objects and regions via broad brush strokes on the images. In the example below, the user wishes to remove the rock at the apple bite in the first image and add the red leaf from the third image. To do so, the user draws strokes on the base rock in the first image, the patch of grass in the second image, and the red leaf in the third image. Our system takes the user input and performs a multi-label graph-cut optimization in self-attention feature space (K features) to find a segmentation of image regions across the stack that minimizes seams.
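The following is a simplified, self-contained sketch of this step (our illustration, not the authors' implementation). It scores a seam between two neighboring pixels by the distance between the two candidate images' K features there, treats user strokes as hard constraints, and uses iterated conditional modes (ICM) as a stand-in for the paper's multi-label graph-cut solver (e.g., alpha-expansion):

```python
import numpy as np

def composite_labels(K_feats, strokes, n_iters=10):
    """K_feats: (N, H, W, D) self-attention key features, one per stack image.
    strokes: (H, W) int map, -1 where unlabeled, else the index of the
    user-selected image. Returns an (H, W) label map."""
    N, H, W, D = K_feats.shape
    labels = np.where(strokes >= 0, strokes, 0)

    def pair_cost(a, b, y, x, y2, x2):
        # Seams are free within one image; across images, the cost is the
        # K-feature disagreement at both endpoints of the seam.
        if a == b:
            return 0.0
        return (np.linalg.norm(K_feats[a, y, x] - K_feats[b, y, x])
                + np.linalg.norm(K_feats[a, y2, x2] - K_feats[b, y2, x2]))

    for _ in range(n_iters):  # coordinate descent stand-in for graph cut
        for y in range(H):
            for x in range(W):
                if strokes[y, x] >= 0:  # user strokes are hard constraints
                    continue
                costs = np.zeros(N)
                for a in range(N):
                    for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                        y2, x2 = y + dy, x + dx
                        if 0 <= y2 < H and 0 <= x2 < W:
                            costs[a] += pair_cost(a, labels[y2, x2],
                                                  y, x, y2, x2)
                labels[y, x] = int(np.argmin(costs))
    return labels
```

Measuring seam costs in diffusion feature space rather than pixel space is the point here: features agree where two generated images depict compatible content, even when their raw colors differ.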

(c) The graph-cut result is then used to form composite Q, K, V features, which are injected into the self-attention layers. The final image is a harmonious composite of the user-selected regions.
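A minimal sketch of this blending at a single self-attention layer (the shapes and the single-layer, single-head view are simplifying assumptions; in the paper the composite features are injected across the network's self-attention layers during denoising):

```python
import torch

def blended_self_attention(Q, K, V, labels):
    """Q, K, V: (N, L, D) self-attention features from the N stack images at
    one layer (L = H*W tokens). labels: (L,) long tensor giving each token's
    source image from the graph cut. Returns composite attention output."""
    L = labels.shape[0]
    tok = torch.arange(L)
    Q_c = Q[labels, tok]  # (L, D) queries gathered per-token by label
    K_c = K[labels, tok]  # (L, D) composite keys
    V_c = V[labels, tok]  # (L, D) composite values
    d = Q_c.shape[-1]
    attn = torch.softmax(Q_c @ K_c.T / d ** 0.5, dim=-1)
    return attn @ V_c  # every token attends over the composite features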


Results

Appearance Mixing

Here, we show applications in creative and artistic design, where users refine images based on subjective preference. This is useful in cases where the user may not realize what they want until they see it (e.g., creative exploration).


Shape and Artifacts Correction

While users can provide a sketch to guide ControlNet's output, ControlNet may fail to adhere to the user's input condition, especially when asked to generate objects with uncommon shapes. In such cases, our method can be used to "correct" object shapes and scene layouts, given a replaceable image patch within the stack.


Prompt Alignment

In addition, Generative Photomontage can be used to improve prompt alignment in cases where the generated output does not accurately follow the input prompt. For example, it is difficult for ControlNet to follow all aspects of long, complicated prompts (a). Using our method, users can create the desired image by breaking it up into simpler prompts and selectively combining the outputs (b).
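Continuing the earlier diffusers sketch (same `pipe` and `edge_map`; the sub-prompts here are made-up placeholders), generating the stack from simpler sub-prompts just means varying the prompt while keeping the condition fixed:

```python
# Same shared condition, different (simpler) prompts; regions from each
# output can then be composited with the brush-stroke workflow above.
sub_prompts = [
    "a stone bridge over a river",         # placeholder sub-prompt
    "a sky filled with hot air balloons",  # placeholder sub-prompt
]
stack = [
    pipe(p, image=edge_map,
         generator=torch.Generator("cuda").manual_seed(0)).images[0]
    for p in sub_prompts
]
```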


Qualitative Comparison with Related Works

Below, we show qualitative comparisons between our method and related works. In Interactive Digital Photomontage [Agarwala et al. 2004], pixel-space graph cut may cause seams to fall on undesired edges, and their gradient-domain blending in general does not preserve color, e.g., the bird's yellow beak is not preserved in (f). Blended Latent Diffusion [Avrahami et al. 2023] and MasaCtrl+ControlNet [Cao et al. 2023] may also lead to color changes (c, f) and structural changes (a, b, d, e).
