MyDLNote - GAN: CVPR 2020, Disentangled Image Generation Through Structured Noise Injection

Phoenixtree_DongZhao

Article link: https://blog.csdn.net/u014546828/article/details/108152806

Disentangled Image Generation Through Structured Noise Injection

Yazeed Alharbi King Abdullah University for Science and Technology (KAUST) yazeed.alharbi@kaust.edu.sa

Peter Wonka KAUST

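This paper proposes a structured noise injection method: the injected noise code is spatial, and each location disentangles a different set of features. In other words, it is a spatial, multi-part disentangled representation.

The difficulty is that spatial disentanglement requires the codes of the individual spatial cells to be independent of one another.

Implementation: simply divide the input tensor into regions, give each region its own noise code, and map each code to its region with a separate mapping. This alone is enough to obtain spatial disentanglement.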

Abstract

We explore different design choices for injecting noise into generative adversarial networks (GANs) with the goal of disentangling the latent space.

Instead of traditional approaches, we propose feeding multiple noise codes through separate fully-connected layers respectively. The aim is restricting the influence of each noise code to specific parts of the generated image. We show that disentanglement in the first layer of the generator network leads to disentanglement in the generated image.

Through a grid-based structure, we achieve several aspects of disentanglement without complicating the network architecture and without requiring labels. We achieve spatial disentanglement, scale-space disentanglement, and disentanglement of the foreground object from the background style allowing fine-grained control over the generated images.

Examples include changing facial expressions in face images, changing beak length in bird images, and changing car dimensions in car images. This empirically leads to better disentanglement scores than state-of-the-art methods on the FFHQ dataset.

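In short: the paper studies how noise should be injected into a GAN so that the latent space becomes disentangled. Rather than a single code, several noise codes are each fed through their own fully-connected layer, which restricts every code to a specific part of the generated image. Disentangling the first layer of the generator turns out to be enough to disentangle the output image.

The grid-based structure needs no labels and adds no architectural complexity. It yields spatial disentanglement, scale-space disentanglement, and separation of the foreground object from the background style, which gives fine-grained control: facial expression in faces, beak length in birds, car dimensions in cars. Empirically, the disentanglement scores on FFHQ are better than those of state-of-the-art methods.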

Problem

Improving disentanglement is an open area of research as one of the main criticisms of state-of-the-art GANs is the difficulty of controlling generated images. The goal is to change certain attributes of the generated image without changing the other attributes. For example, it would be desirable to be able to add smile to a face image without changing the identity or the background.

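In short: state-of-the-art GANs are criticized for being hard to control, so disentanglement remains an open problem. The aim is to edit one attribute of a generated image, such as adding a smile to a face, while identity, background, and all other attributes stay fixed.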

There are two main approaches to generating an image from the intermediate representation. The first approach maps the code using fully-connected layers to obtain a tensor with spatial dimensions that is up-sampled and convolved to generate an image. The second approach starts with a common input tensor, and uses the input code to modulate the feature maps in a spatially-invariable manner. Both approaches are inherently entangled structures, i.e. every element of the latent code can influence every part in the generated image.

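Current GAN architectures map a decorrelated input noise code to an intermediate representation that defines the generated image.

There are two main ways to generate an image from that intermediate representation.

The first maps the code through fully-connected layers to obtain a tensor with spatial dimensions, which is then upsampled and convolved to produce the image.

The second starts from a common input tensor and uses the input code to modulate the feature maps in a spatially-invariant way.

Both are inherently entangled structures: every element of the latent code can influence every part of the generated image.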

Motivation

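The paper argues that a general, fine-grained form of disentanglement can be achieved by structuring the noise code injection better. Concretely, the network is designed so that each part of the input noise controls a specific part of the generated image.

First, the authors propose separating spatially-invariant concepts from spatially-variable ones. To do this, the paper uses two input codes: a spatially-invariant code and a spatially-variable code.

The spatially-invariant code is used to compute the AdaIN [8] parameters. It acts on every pixel of a feature map in the same way, regardless of location.

The spatially-variable code produces the input tensor that is fed to the generator's upsampling and convolution layers. It has a high degree of spatial correspondence with the final image.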
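To make "spatially invariant" concrete, here is a minimal NumPy sketch of AdaIN-style modulation. The sizes `STYLE_DIM` and `CHANNELS`, the affine matrix `A`, and the `1 + scale` parameterization are assumptions chosen for illustration, not values taken from the paper. The point it shows is that the style code only sets one scale and one shift per channel, so the operation does not care where a pixel sits.

```python
import numpy as np

rng = np.random.default_rng(0)

STYLE_DIM, CHANNELS = 64, 32   # hypothetical sizes

# Affine map from the style code to per-channel (scale, shift).
# Random stand-in for learned weights.
A = rng.standard_normal((2 * CHANNELS, STYLE_DIM)) * 0.1

def adain(feat, style_code, eps=1e-5):
    """feat: (H, W, C). Normalize each channel over space, then rescale and shift."""
    params = A @ style_code
    scale, shift = 1.0 + params[:CHANNELS], params[CHANNELS:]
    mean = feat.mean(axis=(0, 1), keepdims=True)
    std = feat.std(axis=(0, 1), keepdims=True)
    return scale * (feat - mean) / (std + eps) + shift

feat = rng.standard_normal((8, 8, CHANNELS))
style = rng.standard_normal(STYLE_DIM)
out = adain(feat, style)

# Shuffling the pixel positions before or after AdaIN gives the same result:
# the modulation is identical at every location.
perm = rng.permutation(8 * 8)
shuffled_in = feat.reshape(64, CHANNELS)[perm].reshape(8, 8, CHANNELS)
shuffled_out = out.reshape(64, CHANNELS)[perm].reshape(8, 8, CHANNELS)
commutes = np.allclose(adain(shuffled_in, style), shuffled_out)
print(commutes)
```

Because the style code never sees pixel coordinates, it can change how the whole feature map looks but cannot move or reshape a particular region. That job is left to the spatially-variable code.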

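Second, the authors propose using a structured spatially-variable code to control specific regions of the generated image. The spatially-variable code contains codes specific to each location, codes shared among some locations, and codes shared among all locations.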

Related Works

Disentanglement in StyleGAN [12] is mainly scale-based. Low-level features can be changed while maintaining high-level features, but it is incredibly difficult to change specific attributes individually.

On the other hand, HoloGAN [17] disentangles pose from identity, but it uses a specific geometry-based architecture that does not apply to other attributes.

Method

The paper's core method is called:

Structured noise injection

The method can be described in the following parts:

Motivation and intuition

There are several motivating observations for our method. The first observation is the evident difficulty of controlling the output image of state-of-the-art GANs. Previous methods map the input noise through a linear layer or several fully-connected layers to produce a tensor with spatial dimensions. We refer to this tensor as the input tensor, as it is typically the first input to the upsampling and convolution blocks of GANs.

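The method is motivated by the following observations.

The first observation:

Controlling the output image of a state-of-the-art GAN is very difficult. Previous methods pass the input noise through a linear layer or several fully-connected layers to obtain a tensor with spatial dimensions. This tensor is called the input tensor, because it is the first input to the GAN's upsampling and convolution blocks.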

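$$\mathrm{InputTensor} = W z + b \tag{1}$$

$$W \in \mathbb{R}^{(4 \cdot 4 \cdot 512) \times 128}, \quad z \in \mathbb{R}^{128 \times 1}, \quad b \in \mathbb{R}^{512 \times 1}, \quad \mathrm{InputTensor} \in \mathbb{R}^{4 \times 4 \times 512}$$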

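As Eq. (1) shows, the traditional approach learns a matrix W that maps the entire input noise to a vector, which is then reshaped into a tensor with width, height, and channel dimensions.

This design choice is inherently entangled, because every entry of the input noise is allowed to modify every spatial location of the input tensor.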
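The entanglement of the dense mapping in Eq. (1) is easy to check numerically. The sketch below uses the sizes from Eq. (1) with random stand-in weights. Treating `b` as a per-channel bias that is broadcast over the 4x4 positions is an assumption read off the stated shape of b, not something the note spells out.

```python
import numpy as np

rng = np.random.default_rng(0)

Z_DIM, GRID_H, GRID_W, CHANNELS = 128, 4, 4, 512   # sizes from Eq. (1)

# Dense mapping parameters (random stand-ins for learned weights).
W = rng.standard_normal((GRID_H * GRID_W * CHANNELS, Z_DIM)) * 0.01
b = rng.standard_normal(CHANNELS) * 0.01   # per-channel bias (assumption)

def dense_input_tensor(z):
    """Map the whole noise vector z to a 4x4x512 input tensor."""
    flat = W @ z                                        # shape (4*4*512,)
    return flat.reshape(GRID_H, GRID_W, CHANNELS) + b   # bias broadcast over positions

z = rng.standard_normal(Z_DIM)
t0 = dense_input_tensor(z)

# Perturb a single entry of z and record which spatial positions change.
z2 = z.copy()
z2[0] += 1.0
t1 = dense_input_tensor(z2)
changed = np.abs(t1 - t0).max(axis=-1) > 0   # (4, 4) boolean map
print(changed.all())
```

Every one of the 16 positions moves when a single noise entry is perturbed, which is exactly the behaviour the structured mapping is designed to remove.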

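This observation motivates the authors to use a separate noise code for each spatial location, which in essence limits the communication between the sources of spatial variation (this is the core idea). How is that achieved?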

We find that by simply dividing the input tensor into regions, providing a different noise code per region, and a separate mapping from each code to the corresponding region, we are able to obtain spatial disentanglement. We restructure the mapping in our method such that z consists of independently sampled parts that are each mapped using an independent part of W.

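That is, spatial disentanglement follows from three simple ingredients: divide the input tensor into regions, give each region its own noise code, and map each code to its region with a separate mapping.

In the paper's method the mapping is restructured so that z consists of independently sampled parts, each mapped by an independent part of W. In this case W is sparse, as shown in Figure 3 (the formula below).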

Figure 3

Figure 2: An illustration of the spatially-variable code mapping in our method. Our noise injection structure utilizes separating mapping parameters per code grid cell. Each cell contains a mixture of unique location-specific codes, codes that are shared with neighbors, and codes that are shared with all cells. We show that disentanglement in the input tensor leads to disentanglement in the generated images.

A key contribution of this paper is showing how the noise injection structure can lead to spatial disentanglement.

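In other words, a key contribution of the paper is showing how the structure of the noise injection leads to spatial disentanglement.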
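The per-region mapping from the first observation can be sketched in NumPy as follows. The grid size, code length, and channel count are small illustrative values rather than the paper's, and the weights are random stand-ins. The sketch checks two things: one independent linear map per cell is the same as a single large matrix W that is zero outside its diagonal blocks, and resampling one cell's code changes only that cell of the input tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID, CODE_DIM, CHANNELS = 4, 8, 32   # illustrative sizes, not the paper's
N_CELLS = GRID * GRID

# One independent linear map per grid cell (random stand-ins for learned weights).
W_cells = rng.standard_normal((N_CELLS, CHANNELS, CODE_DIM))
b_cells = rng.standard_normal((N_CELLS, CHANNELS))

def structured_input_tensor(z_cells):
    """z_cells: (N_CELLS, CODE_DIM), one independently sampled code per cell."""
    out = np.einsum("kcd,kd->kc", W_cells, z_cells) + b_cells
    return out.reshape(GRID, GRID, CHANNELS)

# The same mapping written as one large sparse (block-diagonal) matrix.
W_big = np.zeros((N_CELLS * CHANNELS, N_CELLS * CODE_DIM))
for k in range(N_CELLS):
    rows = slice(k * CHANNELS, (k + 1) * CHANNELS)
    cols = slice(k * CODE_DIM, (k + 1) * CODE_DIM)
    W_big[rows, cols] = W_cells[k]

z_cells = rng.standard_normal((N_CELLS, CODE_DIM))
t_struct = structured_input_tensor(z_cells)
t_big = (W_big @ z_cells.reshape(-1) + b_cells.reshape(-1)).reshape(GRID, GRID, CHANNELS)
same = np.allclose(t_struct, t_big)

# Resample the code of cell 5 only and list the cells whose output changed.
z_new = z_cells.copy()
z_new[5] = rng.standard_normal(CODE_DIM)
diff = np.abs(structured_input_tensor(z_new) - t_struct).max(axis=-1)
changed_cells = np.flatnonzero(diff.reshape(-1) > 1e-12)
print(same, changed_cells)
```

Compared with the dense mapping of Eq. (1), the only change is which entries of W are allowed to be non-zero. No extra loss term or label is involved.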

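The second observation:

It relates to style transfer methods, and further guides the specific choice of noise injection structure.

The style transfer problem is usually split into two competing parts: preserving the original content and adding the exemplar style.

Content corresponds to the identity of the image (a specific kind of car, or a building with a specific layout). It typically enters the loss through a per-pixel distance between the features of the generated image and the features of the input content image.

Style corresponds to concepts such as color scheme and edge strokes. It enters the loss through some correlation measure between the feature map statistics of the generated image and those of the style exemplar, usually computed with Gram matrices.

Content loss depends on the spatial arrangement of values far more than style loss does. Content loss is location-based and usually measured with an L2 distance, whereas style loss is based on summary statistics over whole feature maps and usually measured with correlations.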
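The difference between the two losses can be seen with a small NumPy experiment. The function names and feature sizes below are made up for illustration, and the losses are the generic L2 and Gram-matrix forms used in style transfer, not code from the paper. Shuffling the spatial positions of a feature map changes the content loss a lot, while the Gram-based style loss stays at zero up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

def content_loss(feat, target):
    """Location-based: mean squared (L2) distance between aligned features."""
    return float(np.mean((feat - target) ** 2))

def gram(feat):
    """Channel-by-channel correlations, averaged over all spatial positions."""
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def style_loss(feat, target):
    """Summary-statistic based: distance between Gram matrices."""
    return float(np.mean((gram(feat) - gram(target)) ** 2))

feat = rng.standard_normal((8, 8, 16))

# Shuffle the spatial positions: same feature vectors, different arrangement.
perm = rng.permutation(8 * 8)
shuffled = feat.reshape(64, 16)[perm].reshape(8, 8, 16)

c_loss = content_loss(shuffled, feat)   # large: positions no longer match
s_loss = style_loss(shuffled, feat)     # ~0: the Gram matrix ignores arrangement
print(c_loss, s_loss)
```

This is the intuition behind pairing a spatially-variable code (arrangement, hence content) with a feature-map-wide style code (statistics, hence style).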

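Another key contribution of the paper is a noise injection structure that disentangles the foreground object (spatially-variable content) from the background style (feature-map-wide style), so that changes to foreground spatial regions do not affect the background style.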

Structure design space

One factor is the complexity of the mapping. Some methods [20, 2] use a single linear layer to move from noise to latent space, while other methods [12, 1] employ a fully-connected network with several layers.

Another factor is the way the noise code is used, whether it is used to generate an input tensor to upsample and convolve or used to modulate feature map statistics.

We find that sampling independent noise codes spatially and pushing them through independent mapping layers is sufficient to achieve disentanglement.

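The authors choose to inject two types of noise code, combining the advantages of traditional and more recent noise codes:

Style code: used to adjust feature map statistics.
Spatially-variable code: used in the traditional way to generate a tensor with spatial dimensions that is fed to the upsampling and convolution blocks. This code consists of an independent local code for each pixel of the input tensor, plus shared codes.

The composition of the spatially-variable code is shown in Figure 2. It includes 8x8 cell codes, a code shared by every 2x2 group of cells (called the 2x2 code in the paper), and one code shared by all cells. Together these give disentanglement at both the local and the global level.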
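The composition of the spatially-variable code can be sketched as below. The code lengths `LOCAL_DIM`, `SHARED_DIM`, and `GLOBAL_DIM` are hypothetical, and `BLOCK = 2` follows this note's reading that every 2x2 group of cells shares one code. If the paper's "2x2 code" is meant differently, only `BLOCK` would need to change.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 8     # 8x8 cells, as in the note
BLOCK = 2    # cells per side that share one neighbor code (assumed reading)
LOCAL_DIM, SHARED_DIM, GLOBAL_DIM = 16, 16, 32   # hypothetical code lengths

def sample_spatial_code():
    """Sample the three independent parts of the spatially-variable code."""
    local = rng.standard_normal((GRID, GRID, LOCAL_DIM))
    shared = rng.standard_normal((GRID // BLOCK, GRID // BLOCK, SHARED_DIM))
    global_ = rng.standard_normal(GLOBAL_DIM)
    return local, shared, global_

def assemble(local, shared, global_):
    """Concatenate the local, block-shared, and global parts for every cell."""
    shared_full = np.repeat(np.repeat(shared, BLOCK, axis=0), BLOCK, axis=1)
    global_full = np.broadcast_to(global_, (GRID, GRID, GLOBAL_DIM))
    return np.concatenate([local, shared_full, global_full], axis=-1)

local, shared, global_ = sample_spatial_code()
codes = assemble(local, shared, global_)
print(codes.shape)   # one 64-dimensional code per cell of the 8x8 grid
```

Each cell's assembled code would then go through that cell's own mapping to produce one location of the input tensor, so resampling `global_` touches every cell, resampling one entry of `shared` touches one 2x2 block, and resampling one entry of `local` touches a single cell.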

With this design, three kinds of disentanglement are obtained:

Foreground-background disentanglement

The first kind of disentanglement obtained with the method is between the background style and the foreground of the image.

By providing independent spatially-varying values, allowing them to completely define earlier layers, and applying style modulation in later layers, we encourage the spatially-variable code to define the foreground, and the style code to define background and general image appearance.

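That is, the independent spatially-varying values are allowed to define the earlier layers, and style modulation is applied in the later layers.

The spatially-variable code then defines the foreground, and the style code defines the background and the general appearance of the image.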

Local disentanglement

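The second aspect of disentanglement is between local regions within the foreground. One of the main reasons the method can disentangle coupled concepts is the noise-to-tensor mapping. The authors view the input tensor produced by the mapping as a grid, in which each cell is the result of pushing a location-specific part of the input noise through a location-specific mapping function. The method therefore allows a single cell to be changed in order to change the corresponding part of the generated image. Although these changes are highly localized, a change in one cell is expected to cause slight changes in the surrounding regions, because of the spatial overlap in the generator's layers.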
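The "spatial overlap" effect can be illustrated with a toy example. The sketch below is a deliberate simplification, with a single channel, a random 3x3 filter, and no upsampling or nonlinearity, so it is not the paper's generator. It only shows how an edit to one grid cell spreads to its neighbours as convolution layers are stacked.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3_same(x, kernel):
    """Single-channel 3x3 convolution with zero padding (cross-correlation form)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

x = rng.standard_normal((8, 8))        # one channel of an 8x8 input tensor
kernel = rng.standard_normal((3, 3))   # random stand-in for a learned filter

x2 = x.copy()
x2[4, 4] += 1.0                        # edit a single grid cell

# Which output positions differ after one and after two conv layers?
one_layer = np.abs(conv3x3_same(x2, kernel) - conv3x3_same(x, kernel)) > 1e-12
two_layers = np.abs(
    conv3x3_same(conv3x3_same(x2, kernel), kernel)
    - conv3x3_same(conv3x3_same(x, kernel), kernel)
) > 1e-12

print(int(one_layer.sum()), int(two_layers.sum()))
```

The affected area grows from a 3x3 window after one layer to a 5x5 window after two, which matches the note's remark that edits stay localized but leak slightly into surrounding regions.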

Local-global disentanglement

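The third aspect of disentanglement is between local and global effects. Using only local values to determine the final image has drawbacks. Some concepts in an image are inherently global (such as pose and gender), so the network is forced to tie global concepts to many local values. For example, changing the code of a grid cell that covers part of the mouth or chin sometimes also changes pose and gender, which is undesirable. It is preferable to separate the global aspects of the image from the local ones, so that changing a cell containing the mouth may change the smile or facial hair but not the pose or age.

The paper proposes reserving some entries of the spatially-variable noise code as global and sharing them across the local cells. The global code is therefore concatenated with each local code before the noise-to-tensor mapping. This is not very different from FineGAN, which concatenates the parent code with the code of the detail layer. Interestingly, without any supervision, the network trained on FFHQ face images learns to associate the global entries with pose. Several experiments with global codes of different lengths lead to this conclusion: the more expressive the global code, the smaller the influence of the local codes, and the more entangled the global code becomes.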
