DCGAN Paper Translation

UNSUPERVISED REPRESENTATION LEARNING WITH DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS

ABSTRACT

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.

1 INTRODUCTION

Learning reusable feature representations from large unlabeled datasets has been an area of active research. In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks such as image classification. We propose that one way to build good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques. One can additionally argue that their learning process and the lack of a heuristic cost function (such as pixel-wise independent mean-square error) are attractive to representation learning. GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of multi-layer GANs.
In this paper, we make the following contributions:
• We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN)
• We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.
• We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.
• We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.

2 RELATED WORK

2.1 REPRESENTATION LEARNING FROM UNLABELED DATA

Unsupervised representation learning is a fairly well studied problem in general computer vision research, as well as in the context of images. A classic approach to unsupervised representation learning is to do clustering on the data (for example using K-means), and leverage the clusters for improved classification scores. In the context of images, one can do hierarchical clustering of image patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that encode an image into a compact code, and decode the code to reconstruct the image as accurately as possible. These methods have also been shown to learn good feature representations from image pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning hierarchical representations.

2.2 GENERATING NATURAL IMAGES

Generative image models are well studied and fall into two categories: parametric and non-parametric. The non-parametric models often do matching from a database of existing images, often matching patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution (Freeman et al., 2002) and in-painting (Hays & Efros, 2007). Parametric models for generating images have been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images of the real world had not had much success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer from being blurry. Another approach generates images using an iterative forward diffusion process (Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated images suffering from being noisy and incomprehensible. A Laplacian pyramid extension to this approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects looking wobbly because of noise introduced in chaining multiple models. A recurrent network approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have also recently had some success with generating natural images. However, they have not leveraged the generators for supervised tasks.

2.3 VISUALIZING THE INTERNALS OF CNNS

One constant criticism of using neural networks has been that they are black-box methods, with little understanding of what the networks do in the form of a simple human-consumable algorithm. In the context of CNNs, Zeiler et al. (Zeiler & Fergus, 2014) showed that by using deconvolutions and filtering the maximal activations, one can find the approximate purpose of each convolution filter in the network. Similarly, using a gradient descent on the inputs lets us inspect the ideal image that activates certain subsets of filters (Mordvintsev et al.).

3 APPROACH AND MODEL ARCHITECTURE

Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to iteratively upscale low resolution generated images which can be modeled more reliably. We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models.

Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures.

The first is the all convolutional net (Springenberg et al., 2014) which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in our generator, allowing it to learn its own spatial upsampling, and in our discriminator.
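
As a concrete sketch of this first change (PyTorch is assumed and the channel counts are arbitrary), a fixed pooling stage and its learned, strided replacement look like this:

import torch.nn as nn

# Fixed, deterministic downsampling: halves the spatial size with no learned parameters.
pool_downsample = nn.MaxPool2d(kernel_size=2, stride=2)

# Learned downsampling: a stride-2 convolution also halves the spatial size,
# but lets the network learn how to summarize each neighborhood.
conv_downsample = nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1)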

Second is the trend towards eliminating fully connected layers on top of convolutional features. The strongest example of this is global average pooling which has been utilized in state of the art image classification models (Mordvintsev et al.). We found global average pooling increased model stability but hurt convergence speed. A middle ground of directly connecting the highest convolutional features to the input and output respectively of the generator and discriminator worked well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example model architecture.
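
As a sketch of that "fully connected in name only" first layer, with sizes taken from Fig. 1 (a 100-dimensional Z projected and reshaped to a 4x4x1024 tensor; other models may differ):

import torch
import torch.nn as nn

z = torch.rand(128, 100) * 2 - 1          # a batch of uniform noise in [-1, 1]
project = nn.Linear(100, 4 * 4 * 1024)    # just a matrix multiplication
h0 = project(z).view(-1, 1024, 4, 4)      # reshaped into a 4-D tensor: the start
                                          # of the convolution stack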

Figure 1: DCGAN generator used for LSUN scene modeling. A 100-dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high-level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.

Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get deep generators to begin learning, preventing the generator from collapsing all samples to a single point, which is a common failure mode observed in GANs. Directly applying batchnorm to all layers, however, resulted in sample oscillation and model instability. This was avoided by not applying batchnorm to the generator output layer and the discriminator input layer.

The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output layer, which uses the Tanh function. We observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which used the maxout activation (Goodfellow et al., 2013).
  
Architecture guidelines for stable Deep Convolutional GANs (a code sketch follows this list):
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
• Use batchnorm in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in the generator for all layers except for the output, which uses Tanh.
• Use LeakyReLU activation in the discriminator for all layers.
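
A minimal PyTorch sketch of a 64×64 generator/discriminator pair that follows all five guidelines (the channel widths ngf/ndf are taken from common DCGAN implementations of Fig. 1, not from the authors' released code, so treat this as an illustration under those assumptions):

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, nz=100, ngf=64, nc=3):
        super().__init__()
        self.net = nn.Sequential(
            # "project and reshape": a stride-1 transposed conv takes Z (N, nz, 1, 1) to 4x4
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(),
            # fractional-strided convolutions double the spatial size at each step
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 8x8
            nn.BatchNorm2d(ngf * 4), nn.ReLU(),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 16x16
            nn.BatchNorm2d(ngf * 2), nn.ReLU(),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 32x32
            nn.BatchNorm2d(ngf), nn.ReLU(),
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),           # 64x64
            nn.Tanh(),           # bounded output; no batchnorm on the output layer
        )

    def forward(self, z):
        return self.net(z)       # z: (N, nz, 1, 1)

class Discriminator(nn.Module):
    def __init__(self, ndf=64, nc=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),      # strided conv, no pooling;
            nn.LeakyReLU(0.2),                            # no batchnorm on the input layer
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),   # 4x4 -> single sigmoid output
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1)

Note where the guidelines show up: strided and fractional-strided convolutions do all the resampling, batchnorm is omitted only at the generator output and discriminator input, and the activations follow the last two rules.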

4 DETAILS OF ADVERSARIAL TRAINING

We trained DCGANs on three datasets, Large-scale Scene Understanding (LSUN) (Yu et al., 2015), Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets are given below.

No pre-processing was applied to training images besides scaling to the range of the tanh activation function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001 to be too high, using 0.0002 instead. Additionally, we found leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability while reducing it to 0.5 helped stabilize training.
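
These hyperparameters amount to a few lines of setup. A minimal sketch, assuming PyTorch and the Generator/Discriminator classes from the sketch in Section 3:

import torch.nn as nn
import torch.optim as optim

def weights_init(m):
    # all weights from a zero-centered Normal with std 0.02, as stated above
    # (how batchnorm parameters were initialized is not specified; left at defaults)
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

netG, netD = Generator(), Discriminator()   # classes from the Section 3 sketch
netG.apply(weights_init)
netD.apply(weights_init)

# Adam with the tuned hyperparameters: lr 0.0002, beta1 lowered from 0.9 to 0.5
optG = optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
optD = optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))

# the only preprocessing: scale pixels into tanh's [-1, 1] range, e.g. x / 127.5 - 1.0
# mini-batch size is 128; the LeakyReLU slope 0.2 lives in the model definition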

4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.

Figure 2: Generated bedrooms after one training pass through the dataset. In theory the model could learn to memorize training examples, but experimentally this appears unlikely since we train with a small learning rate and mini-batch SGD; we are aware of no prior evidence that SGD with a small learning rate memorizes training samples.

Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples, such as the baseboards of some of the beds.

4.1.1 DEDUPLICATION
To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized RELU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation, which has been shown to be an effective information preserving technique (Srivastava et al., 2014), and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.
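
A sketch of this de-duplication pipeline (assumptions: PyTorch, a dropout rate of 0.5, and a zero threshold for binarization; the text fixes only the 3072-128-3072 shape and the ReLU thresholding):

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    # 3072-128-3072: 32x32x3 center crops flattened to 3072 pixel values
    def __init__(self):
        super().__init__()
        self.drop = nn.Dropout(0.5)       # denoising regularization (rate assumed)
        self.enc = nn.Linear(3072, 128)
        self.dec = nn.Linear(128, 3072)

    def forward(self, x):
        code = torch.relu(self.enc(self.drop(x)))   # 128-d ReLU code layer
        return self.dec(code), code

def semantic_hash(code):
    # binarize by thresholding the ReLU activations; images whose hashes
    # collide are flagged as near-duplicates, giving linear-time de-duplication
    return (code > 0).to(torch.uint8)

Training the autoencoder to reconstruct the crops and then grouping images by their binary code yields the near-duplicate candidates described above.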

4.2 FACES

We scraped images containing human faces from random web image queries of peoples names. The people names were acquired from dbpedia, with a criterion that they were born in the modern era. This dataset has 3M images from 10K people. We run an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.
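
A sketch of the harvesting step (assumptions: OpenCV's stock frontal-face Haar cascade and a 64×64 minimum size; the text says only that an OpenCV face detector was run and sufficiently high-resolution detections were kept):

import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
img = cv2.imread("scraped_image.jpg")        # hypothetical scraped web image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(64, 64))  # keep only higher-res detections
face_crops = [img[y:y + h, x:x + w] for (x, y, w, h) in boxes]  # training face boxes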

4.3 IMAGENET-1K

We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.

5 EMPIRICAL VALIDATION OF DCGANS CAPABILITIES

5.1 CLASSIFYING CIFAR-10 USING GANS AS A FEATURE EXTRACTOR

One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features. On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm. When using a very large amount of feature maps (4800) this technique achieves 80.6% accuracy. An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates & Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator's convolutional features from all layers, maxpooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means based approaches. Notably, the discriminator has many fewer feature maps (512 in the highest layer) compared to K-means based techniques, but does result in a larger total feature vector size due to the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented, exemplar samples from the source dataset. Further improvements could be made by finetuning the discriminator's representations, but we leave this for future work. Additionally, since our DCGAN was never trained on CIFAR-10 this experiment also demonstrates the domain robustness of the learned features.
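
A sketch of that pipeline (it assumes the discriminator's convolutional blocks are available as a list; with the paper's discriminator the concatenated vector is the 28672-dimensional one quoted above):

import torch
import torch.nn.functional as F

def dcgan_features(conv_blocks, x):
    # max-pool every layer's representation down to a 4x4 spatial grid,
    # flatten, and concatenate into one feature vector per image
    feats, h = [], x
    for block in conv_blocks:        # conv_blocks: the discriminator's conv layers
        h = block(h)
        feats.append(F.adaptive_max_pool2d(h, 4).flatten(1))
    return torch.cat(feats, dim=1)

# a regularized linear L2-SVM (e.g. sklearn.svm.LinearSVC) is then trained
# on these vectors for CIFAR-10 classification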

Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN was trained only on Imagenet-1k, never on CIFAR-10, and its features were used directly to classify CIFAR-10 images.

5.2 CLASSIFYING SVHN DIGITS USING GANS AS A FEATURE EXTRACTOR

On the StreetView House Numbers dataset (SVHN) (Netzer et al., 2011), we use the features of the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of 10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000 uniformly class distributed training examples are randomly selected and used to train a regularized linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally, we validate that the CNN architecture used in DCGAN is not the key contributing factor of the model's performance by training a purely supervised CNN with the same architecture on the same data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio, 2012). It achieves a significantly higher 28.87% validation error.

6 INVESTIGATING AND VISUALIZING THE INTERNALS OF THE NETWORKS

We investigate the trained generators and discriminators in a variety of ways. We do not do any kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.

6.1 WALKING IN THE LATENT SPACE

The first experiment we did was to understand the landscape of the latent space. Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations. The results are shown in Fig.4.
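
Walking the latent space is just decoding points along a path between Z samples; a minimal sketch, assuming a trained generator handle netG:

import torch

def latent_walk(generator, z0, z1, steps=9):
    # interpolate linearly between two latent points and decode each step;
    # smooth image transitions (no sharp jumps) argue against memorization
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    zs = (1 - alphas) * z0 + alphas * z1      # (steps, nz, 1, 1)
    with torch.no_grad():
        return generator(zs)

z0 = torch.rand(100, 1, 1) * 2 - 1            # two random points in uniform Z
z1 = torch.rand(100, 1, 1) * 2 - 1
# images = latent_walk(netG, z0, z1)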

Figure 4: Top rows: interpolation between nine random points in Z shows that the learned space has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, a room without a window slowly transforms into a room with a giant window. In the 10th row, what appears to be a TV slowly transforms into a window.

6.2 VISUALIZING THE DISCRIMINATOR FEATURES

Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting. Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig.5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.
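
A sketch of guided backpropagation with PyTorch backward hooks (my rendering of the Springenberg et al. technique, not the authors' code; it assumes the model's rectifiers are not in-place):

import torch
import torch.nn as nn

def add_guided_backprop_hooks(model):
    def hook(module, grad_input, grad_output):
        # guided backprop: clamp the gradients flowing back through the
        # rectifier so only positive influence reaches the input image
        return (torch.clamp(grad_input[0], min=0.0),)
    for m in model.modules():
        if isinstance(m, (nn.ReLU, nn.LeakyReLU)):
            m.register_full_backward_hook(hook)

# usage: add hooks to the discriminator, forward an image with
# x.requires_grad_(True), call .backward() on a chosen feature activation,
# and visualize x.grad to see what that filter responds to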

Figure 5: On the right, guided backpropagation visualizations of the maximal responses for the first six learned convolutional features from the last convolution layer of the discriminator. Note that a significant minority of features respond to beds, the central object in the LSUN bedrooms dataset. On the left is a random filter baseline; compared with the responses on the right, there is little discrimination and only random structure.

6.3 MANIPULATING THE GENERATOR REPRESENTATION

6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS
In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggest that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.

On 150 samples, 52 window bounding boxes were drawn manually. On the second highest convolution layer features, logistic regression was fit to predict whether a feature activation was on a window (or not), by using the criterion that activations inside the drawn bounding boxes are positives and random samples from the same images are negatives.

Using this simple model, all feature maps with weights greater than zero (200 in total) were dropped from all spatial locations. Then, random new samples were generated with and without the feature map removal. The generated images with and without the window dropout are shown in Fig.6, and interestingly, the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.
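
A sketch of this probe (assumptions: scikit-learn for the logistic regression and random stand-in arrays for the activations and labels; only the procedure mirrors the text):

import numpy as np
from sklearn.linear_model import LogisticRegression

# feats: activations from the second highest conv layer, shape (N, C, H, W);
# labels: 1 where a spatial location falls inside a hand-drawn window box,
# 0 for random locations from the same images (stand-in data below)
rng = np.random.default_rng(0)
feats = rng.standard_normal((150, 512, 8, 8)).astype(np.float32)
labels = rng.integers(0, 2, size=150 * 8 * 8)

X = feats.transpose(0, 2, 3, 1).reshape(-1, feats.shape[1])  # one row per location
clf = LogisticRegression(max_iter=1000).fit(X, labels)

window_maps = np.flatnonzero(clf.coef_[0] > 0)  # maps the probe ties to windows
feats[:, window_maps] = 0.0  # drop them everywhere, then re-run the later layers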

Figure 6: Top row: unmodified samples from the model. Bottom row: the same samples generated with the "window" filters dropped out. Some windows are removed; others are transformed into visually similar objects such as doors or mirrors. Although visual quality decreases, the overall scene composition stays similar, suggesting the generator does a good job of disentangling the scene representation from object representations. Extended experiments could remove other objects from the image and modify which objects the generator draws.

6.3.2 VECTOR ARITHMETIC ON FACE SAMPLES
In the context of evaluating learned representations of words, Mikolov et al. (2013) demonstrated that simple arithmetic operations revealed rich linear structure in representation space. One canonical example demonstrated that vector("King") - vector("Man") + vector("Woman") resulted in a vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic. In addition to the object manipulation shown in (Fig. 7), we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).
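
In code the arithmetic is a few vector operations on Z; a sketch where the exemplar vectors are random stand-ins (in practice they come from samples visually identified with each concept) and netG is the trained generator:

import torch

def concept_vector(zs):
    # average the Z vectors of three exemplars of a visual concept;
    # single exemplars were unstable, the three-way average is stable
    return torch.stack(zs).mean(dim=0)

nz = 100
smiling_woman = concept_vector([torch.rand(nz) * 2 - 1 for _ in range(3)])
neutral_woman = concept_vector([torch.rand(nz) * 2 - 1 for _ in range(3)])
neutral_man   = concept_vector([torch.rand(nz) * 2 - 1 for _ in range(3)])

y = smiling_woman - neutral_woman + neutral_man   # "smiling man" direction
variants = y + (torch.rand(8, nz) * 0.5 - 0.25)   # +-0.25 uniform noise, 8 samples
# images = netG(variants.view(8, nz, 1, 1))       # decode with the generator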

Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of the samples are averaged. Arithmetic is then performed on the mean vectors, creating a new vector Y. The center sample on the right-hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capability of the generator, uniform noise in the range ±0.25 was added to Y to produce the eight other samples. Applying the arithmetic in the input (pixel) space instead (bottom two examples) results in noisy overlap. (Arithmetic on a single image's Z is unstable; averaging three exemplars reliably captures attributes such as smiles and sunglasses.)

Figure 8: A "turn" vector was created from four averaged samples of faces looking left versus four looking right. By adding interpolations along this axis to random samples, we were able to reliably transform their pose.

These demonstrations suggest interesting applications can be developed using Z representations learned by our models. It has been previously demonstrated that conditional generative models can learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al., 2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised models. Further exploring and developing the above mentioned vector arithmetic could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.

7 CONCLUSION AND FUTURE WORK

We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling. There are still some forms of model instability remaining - we noticed as models are trained longer they sometimes collapse a subset of filters to a single oscillating mode.
Further work is needed to tackle this form of instability. We think that extending this framework to other domains such as video (for frame prediction) and audio (pre-trained features for speech synthesis) should be very interesting. Further investigations into the properties of the learnt latent space would be interesting as well.
