Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G:X→Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F:Y→X and introduce a cycle consistency loss to enforce F(G(X))≈X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

1. Introduction

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.

We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these two sets, and thereby imagine what a scene might look like if we were to “translate” it from one set into the other.

In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples.

This problem can be more broadly described as image-to-image translation, converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs {xi, yi} are available. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation, and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration, the desired output is not even well-defined.

We therefore seek an algorithm that can learn to translate between domains without paired input-output examples. We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X→Y such that the output ŷ = G(x) is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution pdata(y). The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress.

These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be “cycle consistent”, in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence. Mathematically, if we have a translator G:X→Y and another translator F:Y→X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss that encourages F (G(x)) ≈ x and G(F (y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.

We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website.

2. Related work

Generative Adversarial Networks (GANs) have achieved impressive results in image generation, image editing, and representation learning. Recent methods adopt the same idea for conditional image generation applications, such as text2image, image inpainting, and future prediction, as well as to other domains like videos and 3D data. The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain.


Image-to-Image Translation The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies, who employ a non-parametric texture model on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs. Our approach builds on the “pix2pix” framework of Isola et al., which uses a conditional generative adversarial network to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches or from attribute and semantic layouts. However, unlike the above prior work, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN and cross-modal scene networks use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. extend the above framework with a combination of variational autoencoders and generative adversarial networks. Another line of concurrent work encourages the input and output to share specific “content” features even though they may differ in “style”. These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space, image pixel space, and image feature space.

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators, as well as by machines. More recently, higher-order cycle consistency has been used in structure from motion, 3D shape matching, cosegmentation, dense semantic alignment, and depth estimation. Of these, Zhou et al. and Godard et al. are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation.

Neural Style Transfer is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting→photo, object transfiguration, etc., where single-sample transfer methods do not perform well. We compare these two methods in Section 5.2.

3. Formulation

Our goal is to learn mapping functions between two domains X and Y given training samples {xi} and {yj}. We denote the data distribution as x∼pdata(x) and y∼pdata(y). As illustrated in Figure 3(a), our model includes two mappings G:X→Y and F:Y→X. In addition, we introduce two adversarial discriminators DX and DY, where DX aims to distinguish between images {x} and translated images {F(y)}; in the same way, DY aims to discriminate between {y} and {G(x)}. Our objective contains two types of terms: adversarial losses for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.
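For concreteness, this setup can be sketched in PyTorch roughly as follows. This is an illustrative sketch rather than the released implementation: the tiny placeholder networks stand in for the actual architectures of Section 4, and the Adam betas are an assumption (only the learning rate comes from this text).

```python
import itertools
import torch
import torch.nn as nn

# Toy placeholder networks standing in for the generator and discriminator
# architectures described in Section 4 (hypothetical stand-ins only).
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))    # mapping G : X -> Y
F = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))    # mapping F : Y -> X
D_X = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2))   # distinguishes {x} from {F(y)}
D_Y = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2))   # distinguishes {y} from {G(x)}

# The two generators are optimized jointly; the two discriminators likewise.
opt_G = torch.optim.Adam(itertools.chain(G.parameters(), F.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
```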

3.1. Adversarial Loss

We apply adversarial losses to both mapping functions. For the mapping function G:X→Y and its discriminator DY, we express the objective as:
LGAN(G, DY, X, Y) = E_{y∼pdata(y)}[log DY(y)] + E_{x∼pdata(x)}[log(1 − DY(G(x)))],

where G tries to generate images G(x) that look similar to images from domain Y, while DY aims to distinguish between translated samples G(x) and real samples y; G thus minimizes this objective against an adversary DY that tries to maximize it, i.e., min_G max_DY LGAN(G, DY, X, Y). We introduce an analogous adversarial loss for the mapping function F:Y→X and its discriminator DX: min_F max_DX LGAN(F, DX, Y, X).

3.2. Cycle Consistency Loss

Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input xi to a desired output yi. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3(b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x→G(x)→F(G(x))≈x. We call this forward cycle consistency. Similarly, as illustrated in Figure 3(c), for each image y from domain Y, G and F should also satisfy backward cycle consistency: y→F(y)→G(F(y))≈y. We incentivize this behavior using a cycle consistency loss:

Lcyc(G, F) = E_{x∼pdata(x)}[‖F(G(x)) − x‖1] + E_{y∼pdata(y)}[‖G(F(y)) − y‖1].
In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance.

The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images F(G(x)) end up matching closely to the input images x.
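As a minimal sketch, the cycle consistency term can be written as an L1 penalty over both reconstruction directions; here G and F are any modules implementing the two mappings.

```python
import torch.nn.functional as nnf

def cycle_consistency_loss(G, F, real_x, real_y):
    """L1 cycle consistency loss over both translation directions (a sketch)."""
    rec_x = F(G(real_x))   # forward cycle:  x -> G(x) -> F(G(x)), should return to x
    rec_y = G(F(real_y))   # backward cycle: y -> F(y) -> G(F(y)), should return to y
    return nnf.l1_loss(rec_x, real_x) + nnf.l1_loss(rec_y, real_y)
```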

3.3. Full Objective

Our full objective is:

L(G, F, DX, DY) = LGAN(G, DY, X, Y) + LGAN(F, DX, Y, X) + λ Lcyc(G, F),

where λ controls the relative importance of the two objectives. We aim to solve:

G*, F* = arg min_{G,F} max_{DX,DY} L(G, F, DX, DY).
Notice that our model can be viewed as training two “autoencoders”: we learn one autoencoder F◦G:X→X jointly with another G◦F:Y→Y. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of “adversarial autoencoders”, which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In our case, the target distribution for the X→X autoencoder is that of the domain Y.

In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss LGAN alone and the cycle consistency loss Lcyc alone, and empirically show that both objectives play critical roles in arriving at high-quality results. We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.
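Putting the pieces together, a single generator step on the full objective might look like the sketch below. It is illustrative only: the adversarial terms already use the least-squares form adopted in Section 4 rather than the log-likelihood form of Equation 1, and the helper name is hypothetical.

```python
import torch
import torch.nn.functional as nnf

def generator_step(G, F, D_X, D_Y, real_x, real_y, opt_G, lam=10.0):
    """One optimization step for G and F on the full objective (a sketch)."""
    fake_y = G(real_x)   # translate X -> Y
    fake_x = F(real_y)   # translate Y -> X

    # Adversarial terms: each generator tries to make its discriminator predict "real".
    pred_y = D_Y(fake_y)
    pred_x = D_X(fake_x)
    loss_gan = (nnf.mse_loss(pred_y, torch.ones_like(pred_y)) +
                nnf.mse_loss(pred_x, torch.ones_like(pred_x)))

    # Cycle consistency terms, weighted by lambda (set to 10 in Section 4).
    loss_cyc = (nnf.l1_loss(F(fake_y), real_x) +
                nnf.l1_loss(G(fake_x), real_y))

    loss = loss_gan + lam * loss_cyc
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```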

4. Implementation

Network Architecture. We adopt the architecture for our generative networks from Johnson et al., who have shown impressive results for neural style transfer and super-resolution. This network contains three convolutions, several residual blocks, two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. We use 6 blocks for 128×128 images and 9 blocks for 256×256 and higher-resolution training images. Similar to Johnson et al., we use instance normalization. For the discriminator networks we use 70×70 PatchGANs, which aim to classify whether 70×70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion.
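For illustration, a 70×70 PatchGAN discriminator of the kind described above can be sketched as follows. The specific layer widths (a C64-C128-C256-C512 stack of 4×4 convolutions) are an assumption in the spirit of the pix2pix-style discriminator, not a specification taken from this text.

```python
import torch.nn as nn

def patch_discriminator(in_channels=3):
    """Sketch of a 70x70 PatchGAN: outputs a map of real/fake scores,
    each covering a 70x70 receptive field of the input image."""
    def block(cin, cout, stride, norm=True):
        layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(cout))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(in_channels, 64, stride=2, norm=False),
        *block(64, 128, stride=2),
        *block(128, 256, stride=2),
        *block(256, 512, stride=1),
        # One-channel prediction map scoring overlapping patches as real or fake.
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
    )
```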

Training details. We apply two techniques from recent works to stabilize our model training procedure. First, for LGAN, we replace the negative log likelihood objective by a least-squares loss. This loss is more stable during training and generates higher quality results.
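Concretely, the least-squares discriminator objective can be sketched as below (the corresponding generator-side term appears in the Section 3.3 sketch); D is either of the two discriminators.

```python
import torch
import torch.nn.functional as nnf

def discriminator_loss_lsgan(D, real, fake):
    """Least-squares GAN loss for a discriminator: D(real) is pushed toward 1
    and D(fake) toward 0, in place of the negative log-likelihood objective."""
    pred_real = D(real)
    pred_fake = D(fake.detach())   # do not backpropagate into the generator here
    return (nnf.mse_loss(pred_real, torch.ones_like(pred_real)) +
            nnf.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
```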

Second, to reduce model oscillation [15], we follow Shrivastava et al.’s strategy and update the discriminators using a history of generated images rather than the ones produced by the latest generators. We keep an image buffer that stores the 50 previously created images.
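A minimal sketch of such a buffer, assuming batch size 1: new images fill the buffer until it holds 50 entries, after which each query either returns the new image or swaps it for a stored one. The 50/50 replacement rule here is one common implementation choice, not something specified in the text.

```python
import random

class ImageBuffer:
    """History buffer of previously generated images used when updating the
    discriminators (a sketch)."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.images = []

    def query(self, image):
        # Fill the buffer first, returning the freshly generated image.
        if len(self.images) < self.capacity:
            self.images.append(image.detach().clone())
            return image
        # Afterwards, half the time return a stored image and replace it
        # with the new one; otherwise return the new image directly.
        if random.random() < 0.5:
            idx = random.randrange(self.capacity)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()
            return old
        return image
```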

For all the experiments, we set λ = 10 in Equation 3. We use the Adam solver with a batch size of 1. All networks were trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures.
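The constant-then-linear-decay schedule can be expressed with a standard PyTorch LambdaLR scheduler, for example:

```python
import torch

def make_lr_scheduler(optimizer, n_epochs=100, n_epochs_decay=100):
    """Keep the base learning rate for the first n_epochs, then decay it
    linearly to zero over the following n_epochs_decay epochs (a sketch)."""
    def lr_lambda(epoch):
        return 1.0 - max(0, epoch - n_epochs) / float(n_epochs_decay)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```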

5. Results

5.1.1 Evaluation Metrics

FCN score Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the “FCN score”, and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm. The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of “car on the road”, then we have succeeded if the FCN applied to the generated photo detects “car on the road”.

Semantic segmentation metrics. To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark, including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (IoU).
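For reference, these three scores can be computed from a class confusion matrix as in the sketch below (a generic implementation of the standard definitions, not code taken from the benchmark):

```python
import numpy as np

def segmentation_scores(conf):
    """Per-pixel accuracy, per-class accuracy, and mean class IoU from a
    confusion matrix where conf[i, j] counts pixels of true class i
    predicted as class j."""
    tp = np.diag(conf).astype(float)
    per_pixel_acc = tp.sum() / conf.sum()
    per_class_acc = np.nanmean(tp / conf.sum(axis=1))
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return per_pixel_acc, per_class_acc, np.nanmean(iou)
```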

AMT perceptual studies. On the map↔aerial photograph translation task, we run “real vs. fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of the outputs.

5.2 Comparison of methods

