Generative network paper reading, StyleGAN1 (Part 2): A Style-Based Generator Architecture for Generative Adversarial Networks

CUHK-SZ-relu

Last modified 2022-09-21 16:21:32


First published 2022-09-21 16:20:34

Article link: https://blog.csdn.net/qq_43210957/article/details/126948896


For a quick overview of the whole paper before reading this post, see the brief StyleGAN1 introduction.

0.Abstract

0.1 Sentence-by-sentence walkthrough

We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature.
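Gloss: the paper proposes an alternative GAN generator architecture that borrows from the style transfer literature. (Note: that is, it takes the concept of "style" from that line of work.)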

The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis.
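Gloss: the architecture learns, without supervision, to separate high-level attributes (for example pose and identity for faces) from stochastic variation (for example freckles and hair), and it gives intuitive control of synthesis at specific scales.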

The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation.
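Gloss: the new generator improves on the state of the art in traditional distribution quality metrics, interpolates demonstrably better, and disentangles the latent factors of variation better.
(Note: "interpolation properties" here means that as the input changes gradually, the output image also changes gradually, a little at a time.)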

To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture.
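Gloss: to quantify interpolation quality and disentanglement, the authors propose two new automated methods that apply to any generator architecture.
(Note: in effect, two new evaluation metrics.)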

Finally, we introduce a new, highly varied and high-quality dataset of human faces.
Gloss: finally, the authors introduce a new, highly varied, high-quality dataset of human faces.

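0.2 Summary

The paper does roughly four things:

1. Builds a network architecture that can separate styles (although its ability to separate them is still limited).
2. Achieves good interpolation behavior.
3. Proposes two evaluation metrics.
4. Provides a new dataset.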

1. Introduction

1.1 Sentence-by-sentence walkthrough

Paragraph 1 (the current state of research and its shortcomings)

The resolution and quality of images produced by generative methods— especially generative adversarial networks (GAN) [21]— have seen rapid improvement recently [28,41, 4].
Gloss: the resolution and quality of images produced by generative methods, GANs [21] in particular, have improved rapidly in recent work [28, 41, 4].

Yet the generators continue to operate as black boxes, and despite recent efforts [2], the understanding of various aspects of the image synthesis process, e.g., the origin of stochastic features, is still lacking.
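Gloss: generators still operate as black boxes and, despite recent efforts [2], parts of the image synthesis process, such as where stochastic features come from, are still not well understood. (Note: the origin of details and style has not been studied enough; even where it has been studied, the source of these features and variations remains unclear.)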

The properties of the latent space are also poorly understood, and the commonly demonstrated latent space interpolations [12, 48, 34] provide no quantitative way to compare different generators against each other.
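Gloss: the properties of the latent space are also poorly understood, and the commonly shown latent space interpolations [12, 48, 34] offer no quantitative way to compare generators. (Note: that is, prior work gave no evaluation metric for this.)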

Paragraph 2 (the motivation and effect of this work, and how it differs from other work)

Motivated by style transfer literature [26], we re-design the generator architecture in a way that exposes novel ways to control the image synthesis process.
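Gloss: motivated by the style transfer literature [26], the authors redesign the generator so that it exposes new ways to control image synthesis.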

Our generator starts from a learned constant input and adjusts the “style” of the image at each convolution layer based on the latent code, therefore directly controlling the strength of image features at different scales.
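Gloss: the generator starts from a learned constant input and, based on the latent code, adjusts the "style" of the image at each convolution layer, which directly controls the strength of image features at different scales.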

Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes (e.g., pose, identity) from stochastic variation (e.g., freckles, hair) in the generated images, and enables intuitive scale-specific mixing and interpolation operations.
Gloss: together with noise injected directly into the network, this change separates high-level attributes (pose, identity) from stochastic variation (freckles, hair) automatically and without supervision, and it makes scale-specific mixing and interpolation intuitive.

We do not modify the discriminator or the loss function in any way, and our work is thus orthogonal to the ongoing discussion about GAN loss functions, regularization, and hyper-parameters [23, 41, 4, 37, 40, 33].
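Gloss: the discriminator and the loss function are left completely unchanged, so the work is orthogonal to the ongoing discussion about GAN loss functions, regularization, and hyperparameters [23, 41, 4, 37, 40, 33]. (Note: "orthogonal" means independent. The change touches only the generator, so it is a separate contribution from the listed work and can be combined with it.)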

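Paragraph 3 (the latent space used inside the generator should not be forced to follow the distribution of the dataset; the authors pass the input through a network that maps it into an intermediate space)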

Our generator embeds the input latent code into an intermediate latent space, which has a profound effect on how the factors of variation are represented in the network.
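Gloss: the generator embeds the input latent code into an intermediate latent space, which strongly affects how the factors of variation are represented in the network.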

The input latent space must follow the probability density of the training data, and we argue that this leads to some degree of unavoidable entanglement.
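Gloss: the input latent space has to follow the probability density of the training data, which the authors argue causes some unavoidable entanglement.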

Our intermediate latent space is free from that restriction and is therefore allowed to be disentangled.
Gloss: the intermediate latent space has no such restriction, so it is free to be disentangled.

As previous methods for estimating the degree of latent space disentanglement are not directly applicable in our case, we propose two new automated metrics— perceptual path length and linear separability— for quantifying these aspects of the generator.
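Gloss: because earlier methods for estimating latent space disentanglement do not apply directly here, the authors propose two new automated metrics, perceptual path length and linear separability, to quantify these aspects of the generator.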

Using these metrics, we show that compared to a traditional generator architecture, our generator admits a more linear, less entangled representation of different factors of variation.
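Gloss: with these metrics, the authors show that their generator admits a more linear, less entangled representation of the factors of variation than a traditional generator architecture.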

Paragraph 4 (the dataset released with the paper)

Finally, we present a new dataset of human faces (Flickr-Faces-HQ, FFHQ) that offers much higher quality and covers considerably wider variation than existing high-resolution datasets (Appendix A).
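Gloss: finally, the authors present a new face dataset (Flickr-Faces-HQ, FFHQ) with much higher quality and considerably wider variation than existing high-resolution datasets (Appendix A).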

We have made this dataset publicly available, along with our source code and pretrained networks. (Note: the authors give a link here.)
Gloss: the dataset, the source code, and the pretrained networks have all been released publicly.

The accompanying video can be found under the same link.
Gloss: the accompanying video is available under the same link.

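1.2 Summary

1. Implements style-based generation that automatically separates high-level attributes from stochastic variation.
2. Argues that the input latent space needs to be remapped into an intermediate latent space.
3. Introduces a new dataset.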

2. Style-based generator

2.0 Overview

2.0.1 Sentence-by-sentence walkthrough

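[Figure 1: (a) traditional generator; (b) style-based generator, with the mapping network f on the left and the synthesis network g on the right]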

Paragraph 1 (how the style is injected; this one paragraph describes almost the entire network of the paper)

Traditionally the latent code is provided to the generator through an input layer, i.e., the first layer of a feedforward network (Figure 1a).
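Gloss: traditionally, the latent code is fed to the generator through an input layer, that is, the first layer of a feedforward network (Figure 1a).
(Note: "traditionally" here effectively refers to PGGAN.)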

We depart from this design by omitting the input layer altogether and starting from a learned constant instead (Figure 1b, right).
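Gloss: the authors depart from this design by omitting the input layer altogether and starting from a learned constant (Figure 1b, right).
(Note: the distinctive point is that the very first input is a learned constant.)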

Given a latent code z in the input latent space Z, a non-linear mapping network f : Z → W first produces w ∈ W (Figure 1b, left). For simplicity, we set the dimensionality of both spaces to 512, and the mapping f is implemented using an 8-layer MLP, a decision we will analyze in Section 4.1.
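Gloss: given a latent code z in the input latent space Z, a non-linear mapping network f: Z → W first produces w ∈ W (Figure 1b, left). For simplicity both spaces have dimensionality 512, and f is an 8-layer MLP, a choice analyzed in Section 4.1.
(Note: Z is Gaussian, that is, normally distributed. After the mapping, w is no longer constrained to be normally distributed and can adapt better to the dataset.)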
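The mapping network described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `init_mapping` and `mapping`, the He-style random initialization, and the leaky ReLU slope of 0.2 are choices made for this sketch, and the weights are random rather than trained. Only the 512-dimensional spaces and the 8 layers come from the quoted text.

```python
import numpy as np

def init_mapping(rng, dim=512, n_layers=8):
    """Create random (weight, bias) pairs for a hypothetical MLP f: Z -> W."""
    scale = np.sqrt(2.0 / dim)  # He-style init, an assumption for this sketch
    return [(rng.standard_normal((dim, dim)) * scale, np.zeros(dim))
            for _ in range(n_layers)]

def mapping(z, layers, slope=0.2):
    """Map latent codes z of shape (batch, dim) to w of shape (batch, dim)."""
    h = z
    for weight, bias in layers:
        h = h @ weight + bias              # fully connected layer
        h = np.where(h > 0, h, slope * h)  # leaky ReLU (slope is an assumption)
    return h

rng = np.random.default_rng(0)
layers = init_mapping(rng)
z = rng.standard_normal((4, 512))  # z ~ N(0, I), sampled from the input space Z
w = mapping(z, layers)
print(w.shape)  # (4, 512)
```

Because f is learned, the distribution of w does not have to stay Gaussian, which is the point made in the note above.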

Learned affine transformations then specialize w to styles y = (ys, yb) that control adaptive instance normalization (AdaIN) [26, 16, 20, 15] operations after each convolution layer of the synthesis network g. The AdaIN operation is defined as
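Gloss: learned affine transformations then specialize w into styles y = (ys, yb), which control the adaptive instance normalization (AdaIN) [26, 16, 20, 15] operations after each convolution layer of the synthesis network g. The AdaIN operation is defined as
$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
(Note: earlier work had already established that changing the feature statistics in this way changes the style of the output.)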

where each feature map xi is normalized separately, and then scaled and biased using the corresponding scalar components from style y.
Gloss: each feature map xi is normalized separately, then scaled and biased with the corresponding scalar components of the style y.

Thus the dimensionality of y is twice the number of feature maps on that layer.
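Gloss: the dimensionality of y is therefore twice the number of feature maps on that layer. (Note: why twice? The scale and bias are applied once per channel, and each channel i needs one ys,i and one yb,i, so y has two entries per channel.)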
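A minimal NumPy sketch of the AdaIN step may make the "twice the number of feature maps" point concrete. The function name `adain`, the channels-first layout (C, H, W), the ordering of y as scales followed by biases, and the small `eps` added for numerical safety are all assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def adain(x, y, eps=1e-8):
    """Apply AdaIN to feature maps x of shape (C, H, W).

    y has shape (2*C,): the first C entries are the scales y_s,
    the last C entries are the biases y_b (ordering is an assumption).
    """
    c = x.shape[0]
    assert y.shape == (2 * c,), "style needs one scale and one bias per channel"
    y_s, y_b = y[:c], y[c:]
    mu = x.mean(axis=(1, 2), keepdims=True)    # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True)  # per-channel standard deviation
    x_norm = (x - mu) / (sigma + eps)          # normalize each feature map separately
    return y_s[:, None, None] * x_norm + y_b[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 4))            # 3 feature maps
y = np.array([2.0, 2.0, 2.0, 5.0, 5.0, 5.0])  # 3 scales, then 3 biases
out = adain(x, y)
print(out.mean(axis=(1, 2)), out.std(axis=(1, 2)))
```

After the call, every channel has mean close to its bias and standard deviation close to its scale, regardless of the statistics x had before.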

Paragraph 2 (mainly how the style of the image is adjusted)

Combined with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes
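Gloss: together with noise injected directly into the network, this architectural change leads to automatic, unsupervised separation of high-level attributes.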

Paragraph 3 (adding noise adds detail)

Finally, we provide our generator with a direct means to generate stochastic detail by introducing explicit noise inputs.
Gloss: finally, explicit noise inputs give the generator a direct way to produce stochastic detail.

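These are single-channel images consisting of uncorrelated Gaussian noise, and we feed a dedicated noise image to each layer of the synthesis network.
Gloss: the noise inputs are single-channel images of uncorrelated Gaussian noise, and each layer of the synthesis network gets its own noise image.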

The noise image is broadcasted to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution, as illustrated in Figure 1b.
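Gloss: the noise image is broadcast to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution, as shown in Figure 1b.
(Note: every channel receives the same single-channel noise image, but scaled by its own learned factor, so the noise that is actually added differs from channel to channel.)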
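The broadcast-and-add step can be sketched as follows. The function name `add_noise` and the channels-first layout are assumptions of this sketch; in a real network `scale` would be a learned parameter, while here it is set by hand for illustration.

```python
import numpy as np

def add_noise(x, noise, scale):
    """Add one single-channel noise image to every feature map.

    x:     (C, H, W) output of a convolution
    noise: (H, W)    uncorrelated Gaussian noise, shared by all channels
    scale: (C,)      per-feature scaling factors (learned in a real network)
    """
    return x + scale[:, None, None] * noise[None, :, :]

rng = np.random.default_rng(0)
x = np.zeros((2, 3, 3))
noise = rng.standard_normal((3, 3))
scale = np.array([0.0, 2.0])  # channel 0 ignores the noise, channel 1 doubles it
out = add_noise(x, noise, scale)
print(np.allclose(out[0], 0.0), np.allclose(out[1], 2.0 * noise))  # True True
```

A scale of zero lets a channel ignore the noise entirely, which is one way to see how the network can decide per feature how much stochastic detail to use.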

The implications of adding the noise inputs are discussed in Sections 3.2 and 3.3.
Gloss: the implications of adding the noise inputs are discussed in Sections 3.2 and 3.3.

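2.0.2 Summary

The three paragraphs cover three things:
1. How does the latent code enter the network and affect the output? How is it used to shape the style step by step?
The main difference from prior work is the mapping from the original Gaussian distribution to a learned distribution; the style is then changed by shifting feature statistics, as studied in earlier work.
2. How are the styles separated by layer?
The goal is to tell the styles apart and let them act layer by layer; once the styles are applied layer by layer, the separation emerges automatically.
3. How is detail produced?
Adding noise adds detail.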

2.1. Quality of generated images

Paragraph 1 (covers almost everything: the good results and the training details)

Before studying the properties of our generator, we demonstrate experimentally that the redesign does not compromise image quality but, in fact, improves it considerably.
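Gloss: before studying the generator's properties, the authors show experimentally that the redesign does not compromise image quality and in fact improves it considerably. (Note: the earlier sections described the design; from here on the paper reports the results.)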

Table 1 gives Fréchet inception distances (FID) [24] for various generator architectures in CelebA-HQ [28] and our new FFHQ dataset (Appendix A).
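Gloss: Table 1 gives Fréchet inception distances (FID) [24] for various generator architectures on CelebA-HQ [28] and the new FFHQ dataset (Appendix A).
(Note: Table 1 shows how each of the changes described above affects generation quality.)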
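For readers unfamiliar with FID, the usual definition is the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are introduced here for illustration and do not appear in the post: \mu_r, \Sigma_r are the feature mean and covariance for real images, and \mu_g, \Sigma_g are the same statistics for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, so a lower FID in Table 1 indicates better image quality.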

Results for other datasets are given in the supplement. Our baseline configuration (a) is the Progressive GAN setup of Karras et al. [28], from which we inherit the networks and all hyperparameters except where stated otherwise. We first switch to an improved baseline (b) by using bilinear up/downsampling operations [58], longer training, and tuned hyperparameters.
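Gloss: results for other datasets are in the supplement. The baseline configuration (a) is the Progressive GAN setup of Karras et al. [28], from which the networks and all hyperparameters are inherited unless stated otherwise. The authors first move to an improved baseline (b) by using bilinear up/downsampling operations [58], longer training, and tuned hyperparameters.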

A detailed description of training setups and hyperparameters is included in the supplement.
Gloss: the training setups and hyperparameters are described in detail in the supplement.

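We then improve this new baseline further by adding the mapping network and AdaIN operations (c), and make a surprising observation that the network no longer benefits from feeding the latent code into the first convolution layer.
Gloss: adding the mapping network and the AdaIN operations (c) improves the new baseline further, and leads to a surprising observation: the network no longer benefits from feeding the latent code into the first convolution layer.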

We therefore simplify the architecture by removing the traditional input layer and starting the image synthesis from a learned 4 × 4 × 512 constant tensor (d).
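Gloss: the architecture is therefore simplified by removing the traditional input layer and starting image synthesis from a learned 4 × 4 × 512 constant tensor (d).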

We find it quite remarkable that the synthesis network is able to produce meaningful results even though it receives input only through the styles that control the AdaIN operations.
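Gloss: the authors find it remarkable that the synthesis network produces meaningful results even though its only input arrives through the styles that control the AdaIN operations.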
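The idea that the synthesis network receives input only through the styles can be illustrated with a simplified sketch: every sample starts from the same constant, and only the per-sample style makes the features differ. The function name `start_features`, the channels-first layout, and the reduced channel count are assumptions of this sketch; the noise inputs and convolutions of the real first block are left out, and the constant is random here rather than learned.

```python
import numpy as np

def start_features(const, styles, eps=1e-8):
    """Modulate one shared constant with a different style per sample.

    const:  (C, H, W)  shared constant (learned in a real network)
    styles: (N, 2*C)   per-sample styles: C scales followed by C biases
    Returns an array of shape (N, C, H, W).
    """
    c = const.shape[0]
    mu = const.mean(axis=(1, 2), keepdims=True)
    sigma = const.std(axis=(1, 2), keepdims=True)
    normed = (const - mu) / (sigma + eps)  # identical for every sample
    y_s = styles[:, :c, None, None]        # (N, C, 1, 1)
    y_b = styles[:, c:, None, None]
    return y_s * normed[None] + y_b

rng = np.random.default_rng(0)
const = rng.standard_normal((8, 4, 4))  # 8 channels here; the paper uses 512
styles = rng.standard_normal((2, 16))   # two samples, two different styles
feats = start_features(const, styles)
print(feats.shape)  # (2, 8, 4, 4)
```

Two different styles give two different feature tensors from the same constant, while identical styles would give identical tensors.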