论文翻译《LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation》

论文地址:https://arxiv.org/abs/2303.17189
代码地址:https://github.com/ZGCTroy/LayoutDiffusion

Abstract

Recently, diffusion models have achieved great success in image synthesis. However, when it comes to layout-to-image generation, where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatial related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID, CAS by relatively 46.35%, 26.70% on COCO-stuff and 44.29%, 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.

最近,扩散模型在图像合成方面取得了巨大成功。然而,在布局到图像的生成任务中,图像往往是由多个物体组成的复杂场景,如何对全局布局图和每个细节物体进行强有力的控制仍然是一项具有挑战性的任务。在本文中,我们提出了一种名为 LayoutDiffusion 的扩散模型,与之前的研究相比,它能获得更高的生成质量和更强的可控性。为了克服图像与布局多模态融合的困难,我们提出利用区域信息构建结构化图像块,并将分块后的图像视为一种特殊布局,以统一的形式与普通布局融合。此外,我们还提出了布局融合模块(LFM)和对象感知交叉注意(OaCA)来建模多个对象之间的关系,并将其设计为对象感知和位置敏感的,从而可以精确控制与空间相关的信息。大量实验表明,我们的 LayoutDiffusion 相比之前的 SOTA 方法,在 COCO-stuff 上的 FID、CAS 分别相对提升 46.35%、26.70%,在 VG 上分别相对提升 44.29%、41.82%。代码见 https://github.com/ZGCTroy/LayoutDiffusion。

1. Introduction


Figure 1. Compared to text, the layout allows diffusion models to obtain more control over the objects while maintaining high quality. Unlike the prevailing methods, we propose a diffusion model named LayoutDiffusion for layout-to-image generation. We transform the difficult multimodal fusion of the image and layout into a unified form by constructing a structural image patch with region information and regarding the patched image as a special layout.

图1。与文本相比,布局允许扩散模型在保持高质量的同时获得对对象更多的控制。与流行的方法不同,我们提出了一个名为LayoutDiffusion的扩散模型用于布局到图像的生成。我们通过构建具有区域信息的结构化图像块,并将拼接后的图像视为一种特殊的布局,将图像和布局的困难多模态融合转化为统一的形式。

Recently, the diffusion model has achieved encouraging progress in conditional image generation, especially in text-to-image generation such as GLIDE [28], Imagen [36], and Stable Diffusion [35]. However, text-guided diffusion models may still fail in the following situations. As shown in Fig. 1 (a), when aiming to generate a complex image with multiple objects, it is hard to design a prompt properly and comprehensively. Even when input with well-designed prompts, problems such as missing objects and incorrectly generated positions, shapes, and categories of objects still occur in the state-of-the-art text-guided diffusion models [28, 35, 36]. This is mainly due to the ambiguity of text and its weakness in precisely expressing positions in the image space [6, 15, 22, 43–45]. Fortunately, this is not a problem when using the coarse layout as guidance, which is a set of objects with the annotation of the bounding box (bbox) and object category. With both spatial and high-level semantic information, the diffusion model can obtain more powerful controllability while maintaining the high quality.

最近,扩散模型在条件图像生成方面取得了令人鼓舞的进展,尤其是在文本到图像的生成方面,如 GLIDE [28]、Imagen [36] 和 Stable Diffusion [35]。然而,文本引导的扩散模型在以下情况下仍有可能失败。如图 1 (a)所示,当目标是生成包含多个对象的复杂图像时,很难设计出适当而全面的提示。即使输入了设计良好的提示,在最先进的文本引导扩散模型 [28, 35, 36]中仍然会出现遗漏对象和错误生成对象的位置、形状和类别等问题。这主要是由于文本的模糊性及其在精确表达图像空间位置方面的弱点[6, 15, 22, 43-45]。幸运的是,当使用粗布局作为指导时,这并不是一个问题,粗布局是一组标注了边界框(bbox)和对象类别的对象。有了空间信息和高级语义信息,扩散模型就能在保持高质量的同时获得更强的可控性。

However, early studies [2, 16, 47, 51] on layout-to-image generation are almost limited to generative adversarial networks (GANs) and often suffer from unstable convergence [1] and mode collapse [31]. Despite the advantages of diffusion models in easy training [11] and significant quality improvement [8], few studies have considered applying diffusion in the layout-to-image generation task. To our knowledge, only LDM [35] supports the condition of layout and has shown encouraging progress in this field.

然而,早期关于从布局到图像生成的研究[2, 16, 47, 51]几乎仅限于生成式对抗网络(GANs),而且经常出现收敛不稳定[1]和模式崩溃[31]的问题。尽管扩散模型具有易于训练[11]和显著提高质量[8]的优势,但很少有研究考虑将扩散模型应用于从布局到图像的生成任务中。据我们所知,只有 LDM [35] 支持布局条件,并在这一领域取得了令人鼓舞的进展。

In this paper, different from LDM that applies the simple multimodal fusion method (e.g., the cross attention) or direct input concatenation for all conditional input, we aim to specifically design the fusion mechanism between layout and image. Moreover, instead of conditioning only in the second stage like LDM, we propose an end-to-end one-stage model that considers the condition for the whole process, which may have the potential to help mitigate loss in the task that requires fine-grained accuracy in pixel space [35]. The fusion between image and layout is a difficult multimodal fusion problem. Compared to the fusion of text and image, the layout has more restrictions on the position, size, and category of objects. This requires a higher controllability of the model and often leads to a decrease in the naturalness and diversity of the generated image. Furthermore, the layout is more sensitive to each token and the loss in token of layout will directly lead to the missing objects.

在本文中,不同于 LDM 采用简单的多模态融合方法(如交叉注意)或直接对所有条件输入进行输入串联,我们旨在专门设计布局和图像之间的融合机制。此外,与LDM仅在第二阶段进行条件限制不同,我们提出了一个端到端的单阶段模型,该模型考虑了整个过程的条件,这可能有助于减轻在像素空间中要求细粒度精度的任务中的损失[ 35 ]。图像与布局之间的融合是一个困难的多模态融合问题。相比于文字和图像的融合,布局对物体的位置、大小和类别有更多的限制。这对模型的可控性要求较高,往往会导致生成图像的自然性和多样性降低。此外,布局对每个token更加敏感,布局中token的丢失将直接导致对象的丢失。

To address the problems mentioned above, we propose treating the patched image and the input layout in a unified form. Specifically, we construct a structural image patch at multi-resolution by adding the concept of region that contains information of position and size. As a result, each patch of the image is transformed into a special type of object, and the entire patched image will also be regarded as a layout. Finally, the difficult problem of multimodal fusion between image and layout will be transformed into a simple fusion with a unified form in the same spatial space of the image. We name our model LayoutDiffusion, a layout-conditional diffusion model with Layout Fusion Module (LFM), Object-aware Cross Attention (OaCA), and the corresponding classifier-free training and sampling scheme. In detail, LFM fuses the information of each object and models the relationship among multiple objects, providing a latent representation of the entire layout. To make the model pay more attention to the information related to the object, we propose an object-aware fusion module named OaCA. Cross-attention is made between the image patch feature and layout in a unified coordinate space by representing the positions of both of them as bounding boxes. To further improve the user experience of LayoutDiffusion, we also make several optimizations on the speed of the classifier-free sampling process and can significantly outperform the SOTA models in 25 iterations.

为了解决上述问题,我们建议以统一的形式处理图像块和输入布局。具体来说,我们通过添加包含位置和大小信息的区域概念来构建多分辨率的结构化图像块。因此,图像的每个块都被转化为一种特殊类型的对象,而整个分块后的图像也将被视为一个布局。最后,图像与布局之间的多模态融合难题将被转化为在同一图像空间内以统一形式进行的简单融合。我们将模型命名为 LayoutDiffusion,它是一种布局条件扩散模型,包含布局融合模块(LFM)、对象感知交叉注意机制(OaCA)以及相应的无分类器训练和采样方案。具体来说,LFM 融合了每个对象的信息,并对多个对象之间的关系进行建模,从而提供了整个布局的潜在表示。为了使模型更加关注与对象相关的信息,我们提出了一个名为 OaCA 的对象感知融合模块。通过将图像块特征和布局的位置表示为边界框,在统一的坐标空间中实现了图像块特征和布局之间的交叉注意。为了进一步改善 LayoutDiffusion 的用户体验,我们还对无分类器采样过程的速度进行了多项优化,只需 25 次迭代即可明显优于 SOTA 模型。

Experiments are conducted on COCO-stuff [5] and Visual Genome (VG) [21]. Various metrics ranging from quality, diversity, and controllability show that LayoutDiffusion significantly outperforms both state-of-the-art GAN-based and diffusion-based methods.

实验在COCO - stuff [ 5 ]和Visual Genome ( VG ) [ 21 ]上进行。从质量、多样性和可控性等多个指标来看,LayoutDiffusion明显优于现有的基于GAN和基于扩散的方法。

Our main contribution is listed below.

我们的主要贡献如下。

Instead of using the dominant GAN-based methods, we propose a diffusion model named LayoutDiffusion for layout-to-image generation, which can generate images with both high quality and diversity while maintaining precise control over the position and size of multiple objects.

与主流的基于GAN的方法相比,我们提出了一种用于布局到图像生成的扩散模型LayoutDiffusion,该模型可以生成高质量和多样性的图像,同时保持对多个对象的位置和大小的精确控制。

We propose to treat each patch of the image as a special object and accomplish the difficult multimodal fusion of layout and image in a unified form. LFM and OaCA are then proposed to fuse the multi-resolution image patches with user’s input layout.

我们提出将图像的每个块作为一个特殊的对象,以统一的形式完成布局和图像的困难多模态融合。然后提出LFM和OaCA,将多分辨率图像块与用户输入布局进行融合。

LayoutDiffusion outperforms the SOTA layout-to-image generation method on FID, DS, CAS by relatively around 46.35%, 9.61%, 26.70% on COCO-stuff and 44.29%, 11.30%, 41.82% on VG.

LayoutDiffusion 在 FID、DS、CAS 上比 SOTA 布局到图像生成方法在 COCO-stuff 上分别相对提升约 46.35%、9.61%、26.70%,在 VG 上分别相对提升约 44.29%、11.30%、41.82%。

2. Related work

The related works are mainly from layout-to-image generation and diffusion models.

相关工作主要涉及布局到图像生成和扩散模型两个方面。

Layout-to-Image Generation. Before layout-to-image generation was formally proposed, the layout was usually used as a complementary feature [17, 34, 49] or an intermediate representation in text-to-image [13] and scene-to-image generation [16]. The first image generation directly from the layout appears in Layout2Im [56], where the layout is defined as a set of objects annotated with category and bbox. Models that work well with fine-grained semantic maps at the pixel level can also be easily transformed to this setting [14, 30, 52]. Inspired by StyleGAN [18], LostGAN-v1 [46] and LostGAN-v2 [47] used a reconfigurable layout to obtain better control over individual objects. For interactive image synthesis, PLGAN [51] employed panoptic theory [20] by constructing stuff and instance layouts into separate branches and proposed Instance- and Stuff-Aware Normalization to fuse them into panoptic layouts. Despite encouraging progress in this field, almost all approaches are limited to the generative adversarial network (GAN) and may suffer from unstable convergence [1] and mode collapse [31]. As a multimodal diffusion model, LDM [35] supports the condition of coarse layout and has shown great potential in layout-guided image generation.

Layout-to-Image 生成。 在正式提出布局到图像的生成之前,布局通常被用作补充特征 [17, 34, 49] 或文本到图像 [13]、场景到图像 [16] 生成的中间表示。第一个直接从布局生成图像的方法出现在 Layout2Im [56] 中,布局被定义为一组标注了类别和 bbox 的对象。在像素级细粒度语义图上表现良好的模型也可以很容易地转换到这种设置 [14、30、52]。受 StyleGAN [18] 的启发,LostGAN-v1 [46] 和 LostGAN-v2 [47] 采用了可重新配置的布局,以更好地控制单个对象。在交互式图像合成方面,PLGAN [51] 采用了全景理论 [20],将 stuff 和实例布局构建成不同的分支,并提出了实例和 stuff 感知归一化(Instance- and Stuff-Aware Normalization),以融合成全景布局。尽管该领域取得了令人鼓舞的进展,但几乎所有方法都局限于生成式对抗网络(GAN),可能会出现收敛不稳定 [1] 和模式崩溃 [31]。作为一种多模态扩散模型,LDM [35] 支持粗布局条件,并在布局引导图像生成方面显示出巨大潜力。

Diffusion Model. Diffusion models [3, 11, 29, 35, 39, 41, 42, 53] are being recognized as a promising family of generative models that have proven to be state-of-the-art sample quality for a variety of image generation benchmarks [7,50,54], including class-conditional image generation [8,57], text-to-image generation [28,35,36], and imageto-image translation [19, 26, 37]. Classifier guidance was introduced in ADM-G [8] to allow diffusion models to condition the class label. The gradient of the classifier trained on noised images could be added to the image during the sampling process. Then Ho et al. [12] proposed a classifierfree training and sampling strategy by interpolating between predictions of a diffusion model with and without condition input. For the acceleration of training and sampling speed, LDM proposed to first compress the image into smaller resolution and then apply denoising training in the latent space.

扩散模型。 扩散模型[3, 11, 29, 35, 39, 41, 42, 53]被认为是很有前途的生成模型系列,已被证明在各种图像生成基准[7,50,54]中具有最先进的样本质量,包括类条件图像生成[8,57]、文本到图像生成[28,35,36]和图像到图像翻译[19,26,37]。ADM-G [8] 引入了分类器引导(classifier guidance),使扩散模型能够以类别标签为条件。在采样过程中,可以将在噪声图像上训练的分类器的梯度添加到图像中。随后,Ho 等人[12]提出了一种无分类器训练和采样策略,即在有条件输入和无条件输入的扩散模型预测之间进行插值。为了加快训练和采样速度,LDM 建议首先将图像压缩到更小的分辨率,然后在潜空间进行去噪训练。

3. Method


Figure 2. The whole pipeline of LayoutDiffusion. The layout that consists of bounding boxes $b$ and object categories $c$ is transformed into embeddings $B_{\mathcal{L}}, C_{\mathcal{L}}, L$. Then the Layout Fusion Module fuses the layout embedding $L$ to output the fused layout embedding $L^{\prime}$. Finally, the Image-Layout Fusion Module, including direct addition used for global conditioning and Object-aware Cross Attention (OaCA) used for local conditioning, fuses the layout-related $B_{\mathcal{L}}, C_{\mathcal{L}}, L^{\prime}$ and the image feature $I$ at multiple resolutions.

图 2。 LayoutDiffusion 的整个流程。由边界框 $b$ 和对象类别 $c$ 组成的布局被转换为 embedding $B_{\mathcal{L}}, C_{\mathcal{L}}, L$。然后,布局融合模块将布局 embedding $L$ 融合,输出融合后的布局 embedding $L^{\prime}$。最后,图像-布局融合模块(包括用于全局条件化的直接加法和用于局部条件化的对象感知交叉注意 OaCA)将在多个分辨率下融合与布局相关的 $B_{\mathcal{L}}, C_{\mathcal{L}}, L^{\prime}$ 和图像特征 $I$。

In this section, we present our LayoutDiffusion, as shown in Fig. 2. The whole framework consists mainly of four parts: (a) the layout embedding that preprocesses the layout input, (b) the layout fusion module that encourages more interaction between objects of the layout, (c) the image-layout fusion module that constructs the structural image patch and the object-aware cross attention specifically designed for layout and image fusion, and (d) the layout-conditional diffusion model with training and accelerated sampling methods.

在本节中,我们提出 LayoutDiffusion,如图 2 所示。整个框架主要包括四个部分:(a) 布局嵌入,对布局输入进行预处理;(b) 布局融合模块,鼓励布局中对象之间更多的交互;(c) 图像-布局融合模块,构建结构化图像块,并针对布局与图像的融合专门设计对象感知交叉注意力;(d) 带有训练和加速采样方法的布局条件扩散模型。

3.1. Layout Embedding

A layout $l=\{o_1,o_2,\cdots,o_n\}$ is a set of $n$ objects. Each object $o_i$ is represented as $o_i=\{b_i,c_i\}$, where $b_i=(x_0^i,y_0^i,x_1^i,y_1^i)\in[0,1]^4$ denotes a bounding box (bbox) and $c_i\in[0,\mathcal{C}+1]$ is its category id.

一个布局 $l=\{o_1,o_2,\cdots,o_n\}$ 是 $n$ 个对象的集合。每个对象 $o_i$ 表示为 $o_i=\{b_i,c_i\}$,其中 $b_i=(x_0^i,y_0^i,x_1^i,y_1^i)\in[0,1]^4$ 表示一个边界框(bbox),$c_i\in[0,\mathcal{C}+1]$ 是它的类别 id。

To support the input of a variable-length sequence, we need to pad $l$ to a fixed length $k$ by adding one $o_l$ at the front and some padding $o_p$ at the end, where $o_l$ represents the entire layout and $o_p$ represents no object. Specifically, $b_l=(0,0,1,1),\,c_l=0$ denotes an object that covers the whole image, and $b_p=(0,0,0,0),\,c_p=\mathcal{C}+1$ denotes an empty object that has no shape or does not appear in the image.

为了支持可变长度序列的输入,我们需要通过在前面添加一个 $o_l$、在后面添加若干填充 $o_p$,将 $l$ 填充到固定长度 $k$,其中 $o_l$ 表示整个布局,$o_p$ 表示没有对象。具体来说,$b_l=(0,0,1,1),\,c_l=0$ 表示覆盖整个图像的对象,$b_p=(0,0,0,0),\,c_p=\mathcal{C}+1$ 表示没有形状或未出现在图像中的空对象。

After the padding process, we can get a padded layout $l=\{o_1,o_2,\cdots,o_k\}$ consisting of $k$ objects, and each object has its specific position, size, and category. Then, the layout $l$ is transformed into a layout embedding $L=\{O_1,O_2,\cdots,O_k\}\in\mathbb{R}^{k\times d_{\mathcal{L}}}$ by the projection matrices $W_{\mathcal{B}}\in\mathbb{R}^{4\times d_{\mathcal{L}}}$ and $W_{\mathcal{C}}\in\mathbb{R}^{1\times d_{\mathcal{L}}}$ using the following equations:
$$L=B_{\mathcal{L}}+C_{\mathcal{L}} \tag{1}$$
$$B_{\mathcal{L}}=bW_{\mathcal{B}} \tag{2}$$
$$C_{\mathcal{L}}=cW_{\mathcal{C}} \tag{3}$$
where $B_{\mathcal{L}},C_{\mathcal{L}}\in\mathbb{R}^{k\times d_{\mathcal{L}}}$ are the bounding box embedding and the category embedding of a layout $l$, respectively. As a result, $L$ is defined as the sum of $B_{\mathcal{L}}$ and $C_{\mathcal{L}}$ to include both the content and positional information of the entire layout, and $d_{\mathcal{L}}$ is the dimension of the layout embedding.

经过填充过程,我们可以得到一个由 $k$ 个对象组成的填充布局 $l=\{o_1,o_2,\cdots,o_k\}$,每个对象都有其特定的位置、大小和类别。然后,通过投影矩阵 $W_{\mathcal{B}}\in\mathbb{R}^{4\times d_{\mathcal{L}}}$ 和 $W_{\mathcal{C}}\in\mathbb{R}^{1\times d_{\mathcal{L}}}$,利用下式将布局 $l$ 转化为布局 embedding $L=\{O_1,O_2,\cdots,O_k\}\in\mathbb{R}^{k\times d_{\mathcal{L}}}$:
$$L=B_{\mathcal{L}}+C_{\mathcal{L}} \tag{1}$$
$$B_{\mathcal{L}}=bW_{\mathcal{B}} \tag{2}$$
$$C_{\mathcal{L}}=cW_{\mathcal{C}} \tag{3}$$
其中,$B_{\mathcal{L}},C_{\mathcal{L}}\in\mathbb{R}^{k\times d_{\mathcal{L}}}$ 分别为布局 $l$ 的边界框 embedding 和类别 embedding。因此,$L$ 被定义为 $B_{\mathcal{L}}$ 与 $C_{\mathcal{L}}$ 之和,以同时包含整个布局的内容和位置信息,$d_{\mathcal{L}}$ 是布局 embedding 的维度。
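
To make the padding and embedding steps above concrete, here is a minimal PyTorch-style sketch of Eqs. (1)-(3). It is not taken from the official repository: the class name, the hidden size, the maximum number of objects, and the realization of $cW_{\mathcal{C}}$ as an embedding table over the $\mathcal{C}+2$ category ids are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Sketch of Eqs. (1)-(3): L = b W_B + embed(c). All names/sizes are illustrative."""
    def __init__(self, num_classes: int, d_layout: int = 256, max_objs: int = 10):
        super().__init__()
        self.max_objs = max_objs
        self.num_classes = num_classes                      # C
        self.W_B = nn.Linear(4, d_layout, bias=False)       # bbox (4) -> d_L
        # cW_C realized as a lookup table over the C+2 ids (0 .. C+1)
        self.W_C = nn.Embedding(num_classes + 2, d_layout)

    def pad(self, boxes, classes):
        """Prepend the whole-image object o_l and append empty objects o_p."""
        k = self.max_objs
        b = torch.zeros(k, 4)
        c = torch.full((k,), self.num_classes + 1, dtype=torch.long)  # o_p: category C+1
        b[0] = torch.tensor([0.0, 0.0, 1.0, 1.0])                     # o_l covers the image
        c[0] = 0
        n = min(len(boxes), k - 1)
        b[1:1 + n] = boxes[:n]
        c[1:1 + n] = classes[:n]
        return b, c

    def forward(self, boxes, classes):
        b, c = self.pad(boxes, classes)
        B_L = self.W_B(b)            # (k, d_L) bounding-box embedding, Eq. (2)
        C_L = self.W_C(c)            # (k, d_L) category embedding, Eq. (3)
        return B_L + C_L, B_L, C_L   # L, B_L, C_L, Eq. (1)
```

A call such as `LayoutEmbedding(num_classes=171)(torch.tensor([[0.1, 0.2, 0.5, 0.9]]), torch.tensor([3]))` would return the padded $L$, $B_{\mathcal{L}}$, and $C_{\mathcal{L}}$ for a single layout.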

3.2. Layout Fusion Module

Currently, each object in the layout has no relationship with the other objects. This leads to a poor understanding of the whole scene, especially when multiple objects overlap and occlude each other. Therefore, to encourage more interaction among the multiple objects of the layout and better understand the entire layout before the layout embedding is input to the subsequent modules, we propose the Layout Fusion Module (LFM), a transformer encoder that uses multiple layers of self-attention to fuse the layout embedding and can be denoted as
$$L^{\prime}=\mathrm{LFM}(L) \tag{4}$$
where the output is a fused layout embedding $L^{\prime}=\{O_1^{\prime},O_2^{\prime},\cdots,O_k^{\prime}\}\in\mathbb{R}^{k\times d_{\mathcal{L}}}$.

目前,布局中的每个对象都与其他对象没有关系。这导致对整个场景的理解程度很低,尤其是当多个对象相互重叠和遮挡时。因此,为了鼓励布局中多个对象之间进行更多交互,在输入布局 embedding 之前更好地理解整个布局,我们提出了布局融合模块(LFM)。它是一个使用多层自注意力来融合布局 embedding 的 Transformer 编码器,可表示为
$$L^{\prime}=\mathrm{LFM}(L) \tag{4}$$
其中,输出是融合后的布局 embedding $L^{\prime}=\{O_1^{\prime},O_2^{\prime},\cdots,O_k^{\prime}\}\in\mathbb{R}^{k\times d_{\mathcal{L}}}$。
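
Since the text describes LFM simply as a transformer encoder with several self-attention layers, a plain `nn.TransformerEncoder` is enough to sketch Eq. (4); the depth, head count, and width below are guesses rather than the paper's actual hyperparameters.

```python
import torch.nn as nn

def make_layout_fusion_module(d_layout: int = 256, n_layers: int = 6, n_heads: int = 8):
    """Sketch of Eq. (4): a plain transformer encoder used as LFM (hyperparameters assumed)."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_layout, nhead=n_heads,
        dim_feedforward=4 * d_layout, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

# L' = LFM(L), with L of shape (batch, k, d_L)
```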

3.3. Image-Layout Fusion Module

Structural Image Patch. The fusion of image and layout is a difficult multimodal fusion problem, and one of the most important parts lies in the fusion of position and size. However, the image patch is limited to the semantic information of the whole feature and lacks the spatial information. Therefore, we construct a structural image patch by adding the concept of region that contains the information of position and size.

结构化图像patch。 图像与布局的融合是一个困难的多模态融合问题,其中最重要的部分在于位置和尺寸的融合。然而,图像patch仅限于整个特征的语义信息,缺乏空间信息。因此,我们通过添加包含位置和尺寸信息的区域概念来构建结构图像patch。

Specifically, $I\in\mathbb{R}^{h\times w\times d_{\mathcal{I}}}$ denotes the feature map of an entire image with height $h$, width $w$, and channel dimension $d_{\mathcal{I}}$. We define $I_{u,v}$ as the patch in the $u^{th}$ row and $v^{th}$ column of $I$, and its bounding box, i.e., the region information, is defined as $b_{\mathcal{I}_{u,v}}$ by the following equation:
$$b_{\mathcal{I}_{u,v}}=\left(\frac{u}{h},\frac{v}{w},\frac{u+1}{h},\frac{v+1}{w}\right) \tag{5}$$
The set of bounding boxes of the patched image $I$ is defined as $b_{\mathcal{I}}=\{b_{\mathcal{I}_{u,v}}\mid u\in[0,h),\,v\in[0,w)\}$. As a result, the positional information of image patches and layout objects is contained in unified bounding boxes defined in the same spatial space, leading to better fusion of image and layout.

具体来说,$I\in\mathbb{R}^{h\times w\times d_{\mathcal{I}}}$ 表示高度为 $h$、宽度为 $w$、通道数为 $d_{\mathcal{I}}$ 的整幅图像的特征图。我们将 $I_{u,v}$ 定义为 $I$ 的第 $u$ 行、第 $v$ 列的 patch,其边界框(即区域信息)通过下式定义为 $b_{\mathcal{I}_{u,v}}$:
$$b_{\mathcal{I}_{u,v}}=\left(\frac{u}{h},\frac{v}{w},\frac{u+1}{h},\frac{v+1}{w}\right) \tag{5}$$
分块后的图像 $I$ 的边界框集合定义为 $b_{\mathcal{I}}=\{b_{\mathcal{I}_{u,v}}\mid u\in[0,h),\,v\in[0,w)\}$。因此,图像 patch 和布局对象的位置信息都包含在同一空间中定义的统一边界框中,从而更好地实现图像和布局的融合。
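
Eq. (5) assigns every patch of an $h\times w$ feature map a normalized bounding box. A small helper that enumerates these boxes might look as follows; the function name and the flattened $(hw,4)$ output layout are assumptions.

```python
import torch

def patch_bboxes(h: int, w: int) -> torch.Tensor:
    """Eq. (5): bbox (u/h, v/w, (u+1)/h, (v+1)/w) for every patch of an h x w feature map."""
    u = torch.arange(h).view(h, 1).expand(h, w).float()
    v = torch.arange(w).view(1, w).expand(h, w).float()
    boxes = torch.stack([u / h, v / w, (u + 1) / h, (v + 1) / w], dim=-1)
    return boxes.view(h * w, 4)   # b_I, one box per patch
```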

Positional Embedding in Unified Space. We define the positional embeddings of the image and the layout as $P_{\mathcal{I}}$ and $P_{\mathcal{L}}$ as follows:
$$B_{\mathcal{I}}=b_{\mathcal{I}}W_{\mathcal{B}} \tag{6}$$
$$P_{\mathcal{I}}=B_{\mathcal{I}}W_{\mathcal{P}} \tag{7}$$
$$P_{\mathcal{L}}=B_{\mathcal{L}}W_{\mathcal{P}} \tag{8}$$
where $W_{\mathcal{B}}\in\mathbb{R}^{4\times d_{\mathcal{L}}}$ is defined in Eq. 2 and works as a shared projection matrix that transforms the coordinates of a bounding box into an embedding of dimension $d_{\mathcal{L}}$. $W_{\mathcal{P}}\in\mathbb{R}^{d_{\mathcal{L}}\times d_{\mathcal{I}}}$ is the projection matrix that transforms the bounding box embedding $B$ into the positional embedding $P$.

统一空间中的位置嵌入。 我们将图像和布局的位置 embedding 分别定义为 $P_{\mathcal{I}}$ 和 $P_{\mathcal{L}}$,如下:
$$B_{\mathcal{I}}=b_{\mathcal{I}}W_{\mathcal{B}} \tag{6}$$
$$P_{\mathcal{I}}=B_{\mathcal{I}}W_{\mathcal{P}} \tag{7}$$
$$P_{\mathcal{L}}=B_{\mathcal{L}}W_{\mathcal{P}} \tag{8}$$
其中 $W_{\mathcal{B}}\in\mathbb{R}^{4\times d_{\mathcal{L}}}$ 的定义如公式 2 所示,它是一个共享投影矩阵,用于将边界框的坐标转换为 $d_{\mathcal{L}}$ 维的 embedding。$W_{\mathcal{P}}\in\mathbb{R}^{d_{\mathcal{L}}\times d_{\mathcal{I}}}$ 是将边界框 embedding $B$ 转换为位置 embedding $P$ 的投影矩阵。
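
Given the patch boxes $b_{\mathcal{I}}$ and the layout box embedding $B_{\mathcal{L}}$, Eqs. (6)-(8) reuse the shared $W_{\mathcal{B}}$ and a projection $W_{\mathcal{P}}$ to place both modalities in the same positional space. A minimal sketch, assuming illustrative sizes $d_{\mathcal{L}}=256$ and $d_{\mathcal{I}}=512$:

```python
import torch.nn as nn

# Shared projections (illustrative sizes): W_B as in Eq. (2), W_P maps d_L -> d_I.
W_B = nn.Linear(4, 256, bias=False)     # d_L = 256
W_P = nn.Linear(256, 512, bias=False)   # d_I = 512

def positional_embeddings(b_I, B_L):
    """Eqs. (6)-(8): image and layout positions embedded in the same space."""
    B_I = W_B(b_I)        # (hw, d_L)
    P_I = W_P(B_I)        # (hw, d_I)
    P_L = W_P(B_L)        # (k,  d_I)
    return P_I, P_L
```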

Pointwise Addition for Global Conditioning. With the help of LFM in Eq. 4, $O_1^{\prime}$ can be considered as the global information of the entire layout, and $O_i^{\prime}\,(i\in[2,k])$ is considered as the local information embedding of a single object together with the other related objects. One of the easiest ways to condition the image on the layout is to directly add $O_1^{\prime}$, the global information of the layout, to the image features at multiple resolutions. Specifically, the conditioning process can be defined as
$$I^{\prime}=I+O_1^{\prime}W \tag{9}$$
where $W\in\mathbb{R}^{d_{\mathcal{L}}\times d_{\mathcal{I}}}$ is a projection matrix and $I^{\prime}$ is the image feature conditioned on the global embedding of the layout.

全局条件化的逐点加法。 借助式 (4) 中的 LFM,可以将 $O_1^{\prime}$ 视为整个布局的全局信息,将 $O_i^{\prime}\,(i\in[2,k])$ 视为单个对象连同与其相关的其他对象的局部信息 embedding。将布局条件注入图像的最简单方法之一,是直接将布局的全局信息 $O_1^{\prime}$ 加到多个分辨率的图像特征上。具体来说,条件化过程可以定义为
$$I^{\prime}=I+O_1^{\prime}W \tag{9}$$
其中 $W\in\mathbb{R}^{d_{\mathcal{L}}\times d_{\mathcal{I}}}$ 为投影矩阵,$I^{\prime}$ 为以布局的全局 embedding 为条件的图像特征。
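
Eq. (9) broadcasts the projected global token $O_1^{\prime}$ over all spatial positions of the image feature. A sketch, assuming a channels-last $(B, h, w, d_{\mathcal{I}})$ feature layout and the same illustrative sizes as above:

```python
import torch.nn as nn

W = nn.Linear(256, 512, bias=False)   # d_L -> d_I projection (illustrative sizes)

def global_condition(I, O1_prime):
    """Eq. (9): add the projected global layout token to every spatial location."""
    # I: (batch, h, w, d_I); O1_prime: (batch, d_L)
    return I + W(O1_prime)[:, None, None, :]
```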

Object-aware Cross Attention for Local Conditioning. Cross attention is successfully applied in [28] to condition the image feature on text, where the sequence of image patches is used as the query and the concatenated sequence of image patches and text is used as key and value. The cross attention is defined as
$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{10}$$
where $Q,K,V$ represent the embeddings of query, key, and value, respectively. In the following, we use the subscripts $\mathcal{I}$ and $\mathcal{L}$ to denote the image patch feature and the layout feature, respectively.

用于局部条件化的对象感知交叉注意力。 在 [28] 中,交叉注意力被成功地应用于将图像特征以文本为条件,其中图像块序列被用作查询,图像块和文本拼接而成的序列被用作键和值。交叉注意力的公式定义为
$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{10}$$
其中,$Q,K,V$ 分别代表查询、键和值的 embedding。在下文中,我们分别用下标 $\mathcal{I}$ 和 $\mathcal{L}$ 表示图像块特征和布局特征。

In text-to-image generation, each token in the text sequence is a word, and the aggregation of these words constitutes the semantics of a sentence. After the transformer encoder, the first token of the text sequence is a good semantic summary of the whole text, but it may not preserve the semantic meaning of each individual word. However, the loss of information in one token is much more serious for a layout than for text. Each token in the layout sequence is a single object with a specific category, size, and position, and the loss of information in a layout token will directly lead to a missing or wrong object in the generated image pixel space.

在文本到图像的生成中,文本序列中的每个 token 都是一个词,这些词的集合构成了一个句子的语义。经过 Transformer 编码器之后,文本序列中的第一个 token 是概括整个文本的语义信息,但不一定保留每个词各自的语义。然而,单个 token 的信息丢失对布局而言比对文本更为严重:布局序列中的每个 token 都是一个具有特定类别、大小和位置的对象,布局 token 信息的丢失将直接导致生成的图像像素空间中出现缺失或错误的对象。

Therefore, we take into account the fusion of the positions, sizes, and categories of objects and define our object-aware cross attention (OaCA) as
$$Q=\Psi_1(Q_{\mathcal{I}},P_{\mathcal{I}}) \tag{11}$$
$$K=\Psi_1(\Psi_2(K_{\mathcal{I}},K_{\mathcal{L}}),\Psi_2(P_{\mathcal{I}},P_{\mathcal{L}})) \tag{12}$$
$$V=\Psi_2(V_{\mathcal{I}},V_{\mathcal{L}}) \tag{13}$$
where the query $Q\in\mathbb{R}^{hw\times 2d_{\mathcal{I}}}$, $K\in\mathbb{R}^{(hw+k)\times 2d_{\mathcal{I}}}$, and $V\in\mathbb{R}^{(hw+k)\times d_{\mathcal{I}}}$. $\Psi_1$ and $\Psi_2$ denote concatenation along the channel dimension and along the sequence length, respectively.

因此,我们综合考虑对象的位置、大小和类别的融合,将我们的对象感知交叉注意力(Object-aware Cross Attention,OaCA)定义为
$$Q=\Psi_1(Q_{\mathcal{I}},P_{\mathcal{I}}) \tag{11}$$
$$K=\Psi_1(\Psi_2(K_{\mathcal{I}},K_{\mathcal{L}}),\Psi_2(P_{\mathcal{I}},P_{\mathcal{L}})) \tag{12}$$
$$V=\Psi_2(V_{\mathcal{I}},V_{\mathcal{L}}) \tag{13}$$
其中,查询 $Q\in\mathbb{R}^{hw\times 2d_{\mathcal{I}}}$,$K\in\mathbb{R}^{(hw+k)\times 2d_{\mathcal{I}}}$,$V\in\mathbb{R}^{(hw+k)\times d_{\mathcal{I}}}$。$\Psi_1$ 和 $\Psi_2$ 分别表示在通道维度和序列长度维度上的拼接。

We first construct the key and value of the layout:
$$K_{\mathcal{L}},V_{\mathcal{L}}=\mathrm{Conv}\!\left(\tfrac{1}{2}\big(\mathrm{Norm}(C_{\mathcal{L}})+L^{\prime}\big)\right) \tag{14}$$
where $K_{\mathcal{L}},V_{\mathcal{L}}\in\mathbb{R}^{k\times d_{\mathcal{I}}}$ and $\mathrm{Conv}$ is a convolution operation. The key and value embeddings of the layout are related to the category embedding $C_{\mathcal{L}}$ and the fused layout embedding $L^{\prime}$. $C_{\mathcal{L}}$ focuses on the category information of the layout, and $L^{\prime}$ concentrates on the comprehensive information of both the object itself and the other objects that may have a relationship with it. By averaging $L^{\prime}$ and $C_{\mathcal{L}}$, we obtain the general information of the object while also emphasizing its category information.

我们首先构建布局的键和值:
$$K_{\mathcal{L}},V_{\mathcal{L}}=\mathrm{Conv}\!\left(\tfrac{1}{2}\big(\mathrm{Norm}(C_{\mathcal{L}})+L^{\prime}\big)\right) \tag{14}$$
其中,$K_{\mathcal{L}},V_{\mathcal{L}}\in\mathbb{R}^{k\times d_{\mathcal{I}}}$,Conv 是卷积运算。布局的键和值 embedding 与类别 embedding $C_{\mathcal{L}}$ 和融合布局 embedding $L^{\prime}$ 有关。$C_{\mathcal{L}}$ 关注的是布局的类别信息,而 $L^{\prime}$ 关注的是对象本身以及与之可能有关系的其他对象的综合信息。通过对 $L^{\prime}$ 和 $C_{\mathcal{L}}$ 取平均,我们既可以得到对象的一般信息,也可以强调对象的类别信息。

We construct the query, key, and value of the image feature as follows:
$$Q_{\mathcal{I}},K_{\mathcal{I}},V_{\mathcal{I}}=\mathrm{Conv}(\mathrm{Norm}(I)) \tag{15}$$

我们按如下方式构建图像特征的查询、键和值:
$$Q_{\mathcal{I}},K_{\mathcal{I}},V_{\mathcal{I}}=\mathrm{Conv}(\mathrm{Norm}(I)) \tag{15}$$
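
Putting Eqs. (10)-(15) together, the following sketch shows one way the OaCA block could be wired up. It is an approximation of the description above, not the official implementation: the 1×1 `Conv` is realized as a `Linear` over tokens, a single attention head is used instead of multi-head attention, and the output residual connection and projection are assumptions.

```python
import math
import torch
import torch.nn as nn

class ObjectAwareCrossAttention(nn.Module):
    """Sketch of OaCA, Eqs. (10)-(15). Shapes follow the text; layer choices are guesses."""
    def __init__(self, d_img: int, d_layout: int):
        super().__init__()
        self.norm_img = nn.LayerNorm(d_img)
        self.norm_cat = nn.LayerNorm(d_layout)
        self.to_qkv_img = nn.Linear(d_img, 3 * d_img)        # Eq. (15), 1x1 Conv as Linear
        self.to_kv_layout = nn.Linear(d_layout, 2 * d_img)   # Eq. (14)
        self.proj_out = nn.Linear(d_img, d_img)

    def forward(self, I, C_L, L_prime, P_I, P_L):
        # I: (B, hw, d_I) image patch features
        # C_L, L_prime: (B, k, d_L) category / fused layout embeddings
        # P_I: (B, hw, d_I), P_L: (B, k, d_I) positional embeddings from Eqs. (6)-(8)
        q_i, k_i, v_i = self.to_qkv_img(self.norm_img(I)).chunk(3, dim=-1)                    # Eq. (15)
        k_l, v_l = self.to_kv_layout(0.5 * (self.norm_cat(C_L) + L_prime)).chunk(2, dim=-1)   # Eq. (14)

        Q = torch.cat([q_i, P_I], dim=-1)                                  # Eq. (11): (B, hw, 2d_I)
        K = torch.cat([torch.cat([k_i, k_l], dim=1),
                       torch.cat([P_I, P_L], dim=1)], dim=-1)              # Eq. (12): (B, hw+k, 2d_I)
        V = torch.cat([v_i, v_l], dim=1)                                   # Eq. (13): (B, hw+k, d_I)

        attn = torch.softmax(Q @ K.transpose(-1, -2) / math.sqrt(Q.shape[-1]), dim=-1)        # Eq. (10)
        return I + self.proj_out(attn @ V)   # residual connection (an assumption)
```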

3.4. Layout-conditional Diffusion Model

Here, we follow the Gaussian diffusion models improved by [11, 41]. Given a data point sampled from the real data distribution $x_0\sim q(x_0)$, a forward diffusion process is defined by adding small amounts of Gaussian noise to $x_0$ over $T$ steps:
$$q(x_t|x_{t-1}):=\mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1},(1-\alpha_t)\mathbf{I}) \tag{16}$$
If the total noise added throughout the Markov chain is large enough, $x_T$ will be well approximated by $\mathcal{N}(0,\mathbf{I})$. If the noise added at each step has a sufficiently small magnitude $1-\alpha_t$, the posterior $q(x_{t-1}|x_t)$ will be well approximated by a diagonal Gaussian. This nice property ensures that we can reverse the above forward process and sample starting from $x_T\sim\mathcal{N}(0,\mathbf{I})$, which is pure Gaussian noise. However, since the entire dataset would be needed, we cannot easily estimate this posterior. Instead, we learn a model $p_\theta(x_{t-1}|x_t)$ to approximate it:
$$p_\theta(x_{t-1}|x_t):=\mathcal{N}(\mu_\theta(x_t),\Sigma_\theta(x_t)) \tag{17}$$
Instead of optimizing the tractable variational lower bound (VLB) on $\log p_\theta(x_0)$, Ho et al. [11] proposed to reweight the terms of the VLB and optimize a surrogate objective. Specifically, we first add $t$ steps of Gaussian noise to a clean sample $x_0$ to generate a noised sample $x_t\sim q(x_t|x_0)$, and then train a model $\epsilon_\theta$ to predict the added noise with the following loss:
$$\mathcal{L}:=E_{t\sim[1,T],\,x_0\sim q(x_0),\,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\right] \tag{18}$$
which is a standard mean-squared error loss.

在此,我们沿用经 [11, 41] 改进的高斯扩散模型。给定一个从真实数据分布 $x_0\sim q(x_0)$ 中采样的数据点,前向扩散过程定义为在 $T$ 步中向 $x_0$ 逐步添加少量高斯噪声:
$$q(x_t|x_{t-1}):=\mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1},(1-\alpha_t)\mathbf{I}) \tag{16}$$
如果在整个马尔可夫链中添加的总噪声足够大,$x_T$ 将很好地近似于 $\mathcal{N}(0,\mathbf{I})$。如果我们在每一步添加的噪声幅度 $1-\alpha_t$ 足够小,后验 $q(x_{t-1}|x_t)$ 将被一个对角高斯很好地近似。这一良好性质确保我们可以反转上述前向过程,从高斯噪声 $x_T\sim\mathcal{N}(0,\mathbf{I})$ 开始采样。然而,由于估计该后验需要整个数据集,我们无法轻松地直接计算它,而必须学习一个模型 $p_\theta(x_{t-1}|x_t)$ 来逼近它:
$$p_\theta(x_{t-1}|x_t):=\mathcal{N}(\mu_\theta(x_t),\Sigma_\theta(x_t)) \tag{17}$$
Ho 等人 [11] 没有直接优化 $\log p_\theta(x_0)$ 的变分下界(VLB),而是对 VLB 的各项重新加权,以优化一个替代目标。具体来说,我们首先向干净样本 $x_0$ 添加 $t$ 步高斯噪声,生成带噪样本 $x_t\sim q(x_t|x_0)$,然后训练一个模型 $\epsilon_\theta$,使用以下损失来预测所添加的噪声:
$$\mathcal{L}:=E_{t\sim[1,T],\,x_0\sim q(x_0),\,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\right] \tag{18}$$
这是一个标准的均方误差损失。
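
A minimal training-step sketch of Eq. (18), using the standard closed form $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ implied by Eq. (16) and the fixed-probability replacement of $l$ by $l_\phi$ described in Sec. 3.4; `eps_model`, `null_layout`, and the 10% drop probability are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def training_loss(eps_model, x0, layout, alphas_cumprod, null_layout=None, drop_prob=0.1):
    """Sketch of Eq. (18) with classifier-free condition dropping (Sec. 3.4)."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    a_bar = alphas_cumprod.to(x0.device)[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps        # closed form of q(x_t | x_0)

    # with a fixed probability, replace the layout l with the padding layout l_phi
    if null_layout is not None and torch.rand(()) < drop_prob:
        layout = null_layout

    return F.mse_loss(eps_model(x_t, t, layout), eps)          # Eq. (18)
```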

To support the layout condition, we apply classifier-free guidance, a technique proposed by Ho et al. [12] for conditional generation that requires no additional training of a classifier. It is accomplished by interpolating between the predictions of a diffusion model with and without the condition input. For the layout condition, we first construct a padding layout $l_\phi=\{o_l,o_p,\cdots,o_p\}$. During training, the layout condition $l$ of the diffusion model is replaced with $l_\phi$ with a fixed probability. When sampling, the following equation is used to sample a layout-conditional image:
$$\hat{\epsilon}_\theta(x_t,t|l)=(1-s)\cdot\epsilon_\theta(x_t,t|l_\phi)+s\cdot\epsilon_\theta(x_t,t|l) \tag{19}$$
where the scale $s$ can be used to increase the gap between $\epsilon_\theta(x_t,t|l_\phi)$ and $\epsilon_\theta(x_t,t|l)$ to enhance the strength of the conditional guidance.

为了支持布局条件,我们采用无分类器引导(classifier-free guidance),这是 Ho 等人 [12] 提出的一种无需额外训练分类器的条件生成技术。它通过在有条件输入和无条件输入的扩散模型预测之间进行插值来实现。对于布局条件,我们首先构建一个填充布局 $l_\phi=\{o_l,o_p,\cdots,o_p\}$。在训练过程中,扩散模型的布局条件 $l$ 将以固定概率被替换为 $l_\phi$。在采样时,使用以下公式采样布局条件图像:
$$\hat{\epsilon}_\theta(x_t,t|l)=(1-s)\cdot\epsilon_\theta(x_t,t|l_\phi)+s\cdot\epsilon_\theta(x_t,t|l) \tag{19}$$
其中,引导尺度 $s$ 可以用来增大 $\epsilon_\theta(x_t,t|l_\phi)$ 与 $\epsilon_\theta(x_t,t|l)$ 之间的差距,以增强条件引导的强度。
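
Eq. (19) itself is a two-line computation at sampling time; in the sketch below, `eps_model` is the same placeholder denoiser as above and `s` is the guidance scale.

```python
import torch

@torch.no_grad()
def guided_eps(eps_model, x_t, t, layout, null_layout, s: float = 2.0):
    """Eq. (19): classifier-free guidance for the layout condition."""
    eps_cond = eps_model(x_t, t, layout)          # epsilon_theta(x_t, t | l)
    eps_uncond = eps_model(x_t, t, null_layout)   # epsilon_theta(x_t, t | l_phi)
    return (1 - s) * eps_uncond + s * eps_cond
```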

To further improve the user experience of LayoutDiffusion, we also make several optimizations on the speed of the classifier-free sampling process and can significantly outperform the SOTA models within 25 iterations. Specifically, we adapt DPM-Solver [25], a fast dedicated high-order solver for diffusion ODEs [42] with a convergence order guarantee, to conditional classifier-free sampling to accelerate the conditional sampling speed.

为了进一步改善 LayoutDiffusion 的用户体验,我们还对无分类器采样过程的速度进行了多项优化,只需 25 次迭代即可明显优于 SOTA 模型。具体来说,我们将 DPM-Solver [25](一种具有收敛阶保证的扩散 ODE [42] 快速专用高阶求解器)适配到条件无分类器采样中,以加快条件采样速度。

4. Experiments

In this section, we evaluate our LayoutDiffusion on different benchmarks in terms of various metrics. First, we introduce the datasets and evaluation metrics. Second, we show the qualitative and quantitative results compared with other strategies. Finally, some ablation studies and analysis are also mentioned. More details can be found in Appendix, including model architecture, training hyperparameters, reproduction results, more experimental results and visualizations.

在本节中,我们将根据各种指标对不同基准上的 LayoutDiffusion 进行评估。首先,我们介绍数据集和评估指标。其次,我们展示了与其他策略相比的定性和定量结果。最后,我们还提到了一些消融研究和分析。更多详情可参见附录,包括模型架构、训练超参数、重现结果、更多实验结果和可视化效果。

4.1. Datasets

We conduct our experiments on two popular datasets, COCO-Stuff [5] and Visual Genome [21].

我们在两个流行的数据集 COCO-Stuff [5] 和 Visual Genome [21] 上进行了实验。

COCO-Stuff has 164K images from COCO 2017, which contain bounding boxes and pixel-level segmentation masks for 80 thing categories and 91 stuff categories, respectively. Following the settings of LostGAN-v2 [47], we use the COCO 2017 Stuff Segmentation Challenge subset that contains 40K / 5k / 5k images for the train / val / test-dev sets, respectively. We use images in the train and val sets with 3 to 8 objects that cover more than 2% of the image and are not annotated as crowd. Finally, there are 25,210 train and 3,097 val images.

COCO-Stuff 包含来自 COCO 2017 的 164K 幅图像,其中分别为 80 个 thing 类别和 91 个 stuff 类别提供了边界框和像素级分割掩码。按照 LostGAN-v2 [47] 的设置,我们使用 COCO 2017 Stuff Segmentation Challenge 子集,其 train / val / test-dev 集分别包含 40K / 5k / 5k 幅图像。我们使用 train 和 val 集中包含 3 至 8 个对象的图像,这些对象覆盖图像 2% 以上的面积且未被标注为 crowd。最终得到 25,210 幅训练图像和 3,097 幅验证图像。

Visual Genome collects 108,077 images with dense annotations of objects, attributes, and relationships. Following the setting of SG2Im [16], we divide the data into 80%, 10%, 10% for the train, val, and test sets, respectively. We select the object and relationship categories occurring at least 2000 and 500 times in the train set, respectively, and select the images with 3 to 30 bounding boxes, ignoring all small objects. Finally, the training / validation / test sets have 62,565 / 5,062 / 5,096 images, respectively.

Visual Genome 收集了 108,077 幅图像,带有密集的对象、属性和关系标注。按照 SG2Im [16] 的设置,我们将数据按 80%、10%、10% 划分为训练集、验证集和测试集。我们分别选择在训练集中至少出现 2000 次的对象类别和至少出现 500 次的关系类别,并选择包含 3 至 30 个边界框的图像,忽略所有小对象。最终,训练集/验证集/测试集分别包含 62,565 / 5,062 / 5,096 幅图像。

4.2. Evaluation Metrics & Protocols

We use five metrics to evaluate the quality, diversity, and controllability of generation.

我们使用五个指标来评估生成的质量、多样性和可控性。

Fréchet Inception Distance (FID) [10] shows the overall visual quality of the generated image by measuring the difference in the distribution of features between the real images and the generated images on an ImageNet-pretrained Inception-V3 [48] network.

Fréchet Inception Distance (FID) [10] 通过测量真实图像和生成图像在 ImageNet 预训练的 Inception-V3 [48] 网络上特征分布的差异来表示生成图像的整体视觉质量。

Inception Score (IS) [38] uses an Inception-V3 [48] network pretrained on ImageNet to compute the statistical score of the outputs on the generated images.

Inception Score (IS) [38] 使用在 ImageNet 上预训练的 Inception-V3 [48] 网络,计算生成图像输出的统计分数。

Diversity Score (DS) calculates the diversity between two generated images of the same layout by comparing the LPIPS [55] metric in a DNN feature space between them.

多样性得分(Diversity Score,DS) 通过比较 DNN 特征空间中的 LPIPS [55] 指标,计算出相同布局的两个生成图像之间的多样性。

Classification Score (CAS) [32] first crops the ground-truth box areas of the images and resizes them to a resolution of 32×32 together with their class labels. A ResNet-101 [9] classifier is then trained on the generated images and tested on real images.

分类得分(CAS)[32] 首先裁剪图像的 ground truth 边界框区域,并连同其类别标签一起缩放到 32×32 的分辨率。然后用生成图像训练一个 ResNet-101 [9] 分类器,并在真实图像上进行测试。

YOLOScore [23] evaluates 80 thing categories bbox mAP on generated images using a pretrained YOLOv4 [4] model, and shows the precision of control in one generated model.

YOLOScore [23] 使用预训练的 YOLOv4 [4] 模型对生成图像上的 80 个事物类别 bbox mAP 进行了评估,并显示了一个生成模型的控制精度。

In summary, FID and IS show the generation quality, DS shows the diversity, CAS and YOLOScore represent the controllability. We follow the architecture of ADM [8], which is mainly a UNet. All experiments are conducted on 32 NVIDIA 3090s with mixed precision training [27]. We set batch size 24, learning rate 1e-5. We adopt the fixed linear variance schedule. More details can be found in the Appendix.

总之,FID 和 IS 表示生成质量,DS 表示多样性,CAS 和 YOLOScore 表示可控性。我们沿用了 ADM [8] 的架构,其主体是一个 UNet。所有实验均在 32 块 NVIDIA 3090 GPU 上以混合精度训练 [27] 进行。我们设定批量大小为 24,学习率为 1e-5,并采用固定的线性方差调度。更多详情请参见附录。

4.3. Qualitative results


Figure 3. Visualization of comparison with SOTA methods on COCO-stuff 256×256. LayoutDiffusion has better generation quality and stronger controllability compared to the other methods.

图3。在COCO-stuff 256 × 256数据集上与SOTA方法的可视化比较。LayoutDiffusion与其他方法相比,具有更好的生成质量和更强的可控性。


Figure 4. The diversity of LayoutDiffusion. Each row of images is from the same layout yet shows great differences.

图 4。LayoutDiffusion的多样性。每行图片来自同一布局,但差异很大。


Figure 5. The interactivity of LayoutDiffusion. We continuously add extra layout objects, and the new objects are also generated with high quality.

图 5。LayoutDiffusion 的交互性。我们不断添加额外的布局,新对象的质量也很高。

A comparison of 256 × 256 images generated on COCO-Stuff [5] by our method and previous works [2, 47, 51] is shown in Fig. 3.

图 3 展示了我们的方法与以前的工作 [2, 47, 51] 在 COCO-Stuff [5] 上生成的 256 × 256 图像的比较。

LayoutDiffusion generates more accurate, higher-quality images that have more recognizable and accurate objects corresponding to their layouts. Grid2Im [2], LostGAN-v2 [47] and PLGAN [51] generate images with distorted and unreal objects.

LayoutDiffusion 能生成更准确的高质量图像,图像中的物体与其布局相对应,具有更高的可识别性和准确性。Grid2Im [2]、LostGAN-v2 [47] 和 PLGAN [51] 生成的图像则存在扭曲和不真实的物体。

Especially when the input is a set of multiple objects with complex relationships, previous works can hardly generate recognizable objects at the positions corresponding to the layouts. For example, in Fig. 3 (a), (c), and (e), the main objects (e.g., train, zebra, bus) in the images are poorly generated by previous works, while our LayoutDiffusion generates them well. In Fig. 3 (b), only our LayoutDiffusion generates the laptop in the right place. The images generated by our LayoutDiffusion are perceptually more similar to the real ones.

特别是当输入一组具有复杂关系的多个物体时,以往的工作很难在布局对应的位置生成可识别的物体。例如,在图 3 (a)、(c) 和 (e) 中,图像中的主要物体(如火车、斑马、公交车)在以前的工作中生成得很差,而我们的 LayoutDiffusion 生成得很好。在图 3 (b) 中,只有我们的 LayoutDiffusion 能在正确的位置生成笔记本电脑。我们的 LayoutDiffusion 生成的图像在感官上更接近真实图像。

We show the diversity of LayoutDiffusion in Fig. 4. Images from the same layouts have high quality and diversity (different lighting, textures, colors, and details).

我们在图 4 中展示了 LayoutDiffusion 的多样性。来自相同布局的图像具有很高的质量和多样性(不同的光照、纹理、颜色和细节)。

We continuously add an additional layout from the initial layout, the one in the upper left corner, as shown in Fig. 5. In each step, LayoutDiffusion adds the new object in very precise locations with consistent image quality, showing user-friendly interactivity.

如图 5 所示,我们在初始布局(即左上角的布局)的基础上不断添加新的布局。在每一步中,LayoutDiffusion 都会在非常精确的位置添加新对象,而且图像质量始终如一,显示出用户友好的交互性。

4.4. Quantitative results


Table 1. Quantitative results on COCO-stuff [5] and VG [21]. The proposed diffusion method has made great progress in all evaluation metrics, showing better quality, controllability, diversity, and accuracy than previous works. For COCO-stuff, we evaluate on 3097 layout and sample 5 images for each layout. For VG, we evaluate on 5096 layout and sample 1 image for each layout. We also report reproduction scores of previous works in Appendix.

表 1。COCO-stuff [5] 和 VG [21] 的定量结果。所提出的扩散方法在所有评价指标上都取得了长足进步,在质量、可控性、多样性和准确性上都优于之前的研究成果。对于 COCO-stuff,我们在 3097 个布局上进行了评估,每个布局抽样 5 幅图像。对于 VG,我们对 5096 个布局进行了评估,每个布局取样 1 幅图像。我们还在附录中报告了以前作品的再现分数。


Table 2. Ablation study of Layout Fusion Module (LFM), Object-aware Cross Attention (OaCA), Cross Attention (CA). We use the model trained for 300,000 iterations on COCO-stuff 128×128. The value in brackets denotes the discrepancy to our proposed method(+LFM+OaCA), where red denotes better and green denotes worse.

表 2。布局融合模块(LFM)、物体感知交叉注意(OaCA)、交叉注意(CA)的消融研究。我们使用在 COCO-stuff 128×128 上经过 300,000 次迭代训练的模型。括号中的值表示与我们提出的方法(+LFM+OaCA)的差异,红色表示更好,绿色表示更差。


Table 3. Comparison with the SOTA diffusion-based method LDM on COCO-stuff 256×256. We generate the same 2048 images as LDM for a fair comparison.

表 3。在 COCO-stuff 256×256 上与基于 SOTA 扩散方法 LDM 的比较。为了进行公平比较,我们生成了与 LDM 相同的 2048 幅图像。

Tab. 1 provides the comparison between previous works and our method in FID, IS, DS, CAS and YOLOScore. Compared to the SOTA methods, the proposed method achieves the best performance.

表 1 比较了前人和我们的方法在 FID、IS、DS、CAS 和 YOLOScore 方面的表现。与 SOTA 方法相比,我们提出的方法取得了最佳性能。

In overall generation quality, our LayoutDiffusion outperforms the SOTA model by at most 46.35% and 29.29% in FID and IS, respectively. While maintaining high overall image quality, we also show precise and accurate controllability: LayoutDiffusion outperforms the SOTA model by at most 122.22% and 41.82% on YOLOScore and CAS, respectively. As for diversity, our LayoutDiffusion still achieves at most an 11.30% improvement according to DS. Experiments on these metrics show that our method can successfully generate higher-quality images with better location and quantity control.

在整体生成质量方面,我们的 LayoutDiffusion 在 FID 和 IS 上分别最多比 SOTA 模型高出 46.35% 和 29.29%。在保持高整体图像质量的同时,我们还展示了精确的可控性:LayoutDiffusion 在 YOLOScore 和 CAS 上分别最多比 SOTA 模型高出 122.22% 和 41.82%。在多样性方面,根据 DS 指标,我们的 LayoutDiffusion 最多仍有 11.30% 的提升。对这些指标的实验表明,我们的方法可以成功生成质量更高的图像,并具有更好的位置和数量控制。

In particular, we conduct experiments compared to LDM [35] in Tab. 3. “Ours-small” uses comparable GPU resources and obtains better FID with far fewer parameters and higher throughput than LDM-8, while outperforming LDM-4 in all respects. The results of “Ours” indicate that LayoutDiffusion can reach a better FID of 31.6 at a higher cost. From these results, LayoutDiffusion always achieves better performance than LDM [35] at different cost levels.

特别地,我们在表 3 中与 LDM [35] 进行了对比实验。“Ours-small” 使用相当的 GPU 资源,以更少的参数量和更高的吞吐量取得了比 LDM-8 更好的 FID,同时在各方面都优于 LDM-4。“Ours” 的结果表明,LayoutDiffusion 可以在更高的开销下获得更好的 FID(31.6)。从这些结果来看,与 LDM [35] 相比,LayoutDiffusion 总能在不同的开销水平上取得更好的性能。

4.5. Ablation studies

We validate the effectiveness of LFM and OaCA in Tab. 2, using the evaluation metrics in Sec. 4.2. The significant improvement on FID, IS, CAS, and YOLOScore proves that the application of LFM and OaCA allows for higher generation quality and diversity, along with more controllability. Furthermore, when applying both, considerable performance, 13.37 / 6.58 / 39.77 / 27.00 on FID / IS / CAS / YOLOScore, is gained.

我们在表 2 中使用第 4.2 节中的评估指标验证了 LFM 和 OaCA 的有效性。FID、IS、CAS 和 YOLOScore 的显著改善证明,应用 LFM 和 OaCA 可以获得更高的生成质量和多样性,以及更强的可控性。此外,当同时应用这两种方法时,在 FID / IS / CAS / YOLOScore 上分别获得了 13.37 / 6.58 / 39.77 / 27.00 的可观性能。

An interesting phenomenon is that the change of the Diversity Score (DS) is in the opposite direction of other metrics. This is because DS, which stands for diversity, is physically the opposite of the controllability represented by other metrics such as CAS and YOLOScore. The precise control offered on generated image leads to more constraints on diversity. As a result, the Diversity Score (DS) has a slight drop compared to the baseline.

一个有趣的现象是,多样性得分(DS)的变化方向与其他指标相反。这是因为代表多样性的 DS 与 CAS 和 YOLOScore 等其他指标所代表的可控性正好相反。对生成图像的精确控制导致对多样性的更多限制。因此,与基线相比,多样性得分(DS)略有下降。

5. Limitations & Societal Impacts

Limitations. Despite the significant improvements in various metrics, it is still difficult to generate a realistic image with no distortion and overlap, especially for a complex multi-object layout. Moreover, the model is trained from scratch on a specific dataset that requires detection labels. How to combine text-guided diffusion models and inherit parameters pre-trained on massive text-image datasets remains future research.

局限性。 尽管在各种指标上都有很大改进,但要生成没有失真和重叠的真实图像仍然很困难,尤其是对于复杂的多目标布局。此外,模型是在需要检测标签的特定数据集上从头开始训练的。如何结合文本引导的扩散模型,并继承在海量文本图像数据集上预先训练的参数,仍是未来研究的重点。

Societal Impacts. Trained on the real-world datasets such as COCO [5] and VG [21], LayoutDiffusion has the powerful ability to learn the distribution of data and we should pay attention to some potential copyright infringement issues.

社会影响。 在COCO [ 5 ]、VG [ 21 ]等真实数据集上进行训练,LayoutDiffusion具有强大的学习数据分布的能力,我们应该关注一些潜在的版权侵权问题。

6. Conclusion

In this paper, we have proposed a one-stage end-to-end diffusion model named LayoutDiffusion, which is novel for the task of layout-to-image generation. With the guidance of layout, the diffusion model allows more control over the individual objects while maintaining higher quality than the prevailing GAN-based methods. By constructing a structural image patch with region information, we regard each patch as a special object and accomplish the difficult multimodal image-layout fusion in a unified form. Specifically, the Layout Fusion Module and Object-aware Cross Attention are proposed to model the relationship among multiple objects and to fuse the patched image feature with the layout at multiple resolutions, respectively. Experiments on the challenging COCO-stuff and Visual Genome (VG) show that our proposed method significantly outperforms both state-of-the-art GAN-based and diffusion-based methods in various evaluation metrics.

在本文中,我们提出了一种用于布局到图像生成任务的新型单阶段端到端扩散模型 LayoutDiffusion。在布局的指导下,扩散模型允许对单个对象进行更多的控制,同时保持比当前主流的基于 GAN 的方法更高的质量。通过构造带有区域信息的结构化图像块,我们将每个图像块视为一个特殊的对象,以统一的形式完成了困难的多模态图像-布局融合。具体而言,我们分别提出布局融合模块(Layout Fusion Module)和对象感知交叉注意(Object-aware Cross Attention),用于建模多个对象之间的关系,以及在多个分辨率下融合图像块特征与布局。在具有挑战性的 COCO-stuff 和 Visual Genome (VG) 上的实验表明,我们提出的方法在各种评估指标上都明显优于最先进的基于 GAN 和基于扩散的方法。
