MyDLNote-Network: CVPR 2020 — A U-Net Based Discriminator for Generative Adversarial Networks

This post covers U-Net GAN, a generative adversarial network (GAN) with a U-Net based discriminator, which targets the global and local coherence of synthesized images. The U-Net architecture lets the discriminator give the generator both global and per-pixel feedback, and a CutMix-based per-pixel consistency regularization further improves image quality and structural realism. Experiments show that, compared with the BigGAN baseline, U-Net GAN improves sample quality on several datasets.

A U-Net Based Discriminator for Generative Adversarial Networks

[paper] [github]

Abstract

Among the major remaining challenges for generative adversarial networks (GANs) is the capacity to synthesize globally and locally coherent images with object shapes and textures indistinguishable from real images.

To target this issue we propose an alternative U-Net based discriminator architecture, borrowing the insights from the segmentation literature.

The proposed U-Net based architecture allows to provide detailed per-pixel feedback to the generator while maintaining the global coherence of synthesized images, by providing the global image feedback as well. Empowered by the per-pixel response of the discriminator, we further propose a per-pixel consistency regularization technique based on the CutMix data augmentation, encouraging the U-Net discriminator to focus more on semantic and structural changes between real and fake images. This improves the U-Net discriminator training, further enhancing the quality of generated samples. The novel discriminator improves over the state of the art in terms of the standard distribution and image quality metrics, enabling the generator to synthesize images with varying structure, appearance and levels of detail, maintaining global and local realism.

Compared to the BigGAN baseline, we achieve an average improvement of 2.7 FID points across FFHQ, CelebA, and the proposed COCO-Animals dataset.

Part one, background and problem: one of the major remaining challenges for GANs is the capacity to synthesize globally and locally coherent images whose object shapes and textures are indistinguishable from real images.

Part two, the core of this paper (goal and key idea): to target this issue, the authors propose an alternative U-Net based discriminator architecture, borrowing insights from the segmentation literature.

Part three, the network and its advantages: the proposed U-Net based architecture provides detailed per-pixel feedback to the generator while maintaining the global coherence of synthesized images. Building on the discriminator's per-pixel response, the authors further propose a per-pixel consistency regularization technique based on the CutMix data augmentation, encouraging the U-Net discriminator to focus more on semantic and structural changes between real and fake images. This improves the U-Net discriminator's training and further enhances the quality of generated samples. The proposed discriminator improves over the state of the art on standard distribution and image-quality metrics, enabling the generator to synthesize images with varying structure, appearance, and levels of detail while maintaining global and local realism.

Final part, experimental conclusions.

Note that the abstract says little about what the network actually looks like and mainly stresses its advantages. Presumably this is because the network itself is not complicated, while the benefits are substantial.

 

Introduction

The quality of synthetic images produced by generative adversarial networks (GANs) has seen tremendous improvement recently [5, 20]. The progress is attributed to large-scale training [32, 5], architectural modifications [50, 19, 20, 27], and improved training stability via the use of different regularization techniques [34, 51]. However, despite the recent advances, learning to synthesize images with global semantic coherence, long-range structure and the exactness of detail remains challenging.

The first paragraph covers several things (the opening paragraph best shows the authors' skill: in a few sentences it summarizes the state of the field and states the core problem the paper addresses):

Research area: GANs.

State of the field: GAN research currently advances along three lines, namely large-scale training, architectural design, and regularization techniques.

The open problem: learning to synthesize images with global semantic coherence, long-range structure, and exact detail remains challenging.

In my view, good work starts with a good question: a novel, real problem the field has not yet noticed; a major open problem the field cares about but has not solved well; or an important, perhaps misunderstood, flaw in a classic algorithm, not a minor one. It then proposes a sound, even ingenious, method for that question.

 

One source of the problem lies potentially in the discriminator network. The discriminator aims to model the data distribution, acting as a loss function to provide the generator a learning signal to synthesize realistic image samples. The stronger the discriminator is, the better the generator has to become. In the current state-of-the-art GAN models, the discriminator being a classification network learns only a representation that allows to efficiently penalize the generator based on the most discriminative difference between real and synthetic images. Thus, it often focuses either on the global structure or local details. The problem amplifies as the discriminator has to learn in a non-stationary environment: the distribution of synthetic samples shifts as the generator constantly changes through training, and is prone to forgetting previous tasks [7] (in the context of the discriminator training, learning semantics, structures, and textures can be considered different tasks). This discriminator is not incentivized to maintain a more powerful data representation, learning both global and local image differences. This often results in the generated images with discontinued and mottled local structures [27] or images with incoherent geometric and structural patterns (e.g. asymmetric faces or animals with missing legs) [50].

This paragraph describes the specific problems with current GAN discriminators. The angle is fairly novel and, of course, sets up the proposed U-Net GAN. It earns the paper a lot of credit: only someone with a deep understanding of the field can spot a new problem, and that is the soul of a good paper. So, what are the problems?

The first half describes the discriminator's role: it aims to model the data distribution and acts as a loss function, giving the generator a learning signal to synthesize realistic samples. Being a classification network, it learns only a representation that penalizes the most discriminative differences between real and synthetic images, so it tends to focus either on global structure or on local detail.

The second half analyzes the drawbacks of this conventional discriminator: in the context of discriminator training, learning semantics, structures, and textures can be considered different tasks; the problem amplifies because the discriminator learns in a non-stationary environment, where the distribution of synthetic samples keeps shifting as the generator changes through training, and it is prone to forgetting previous tasks. Such a discriminator is not incentivized to maintain a more powerful data representation that captures both global and local image differences. Concretely, this yields generated images with discontinuous, mottled local structures, or with incoherent geometric and structural patterns (e.g. asymmetric faces or animals with missing legs).
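To make the contrast concrete, here is a minimal PyTorch sketch of the kind of encoder-only discriminator criticized above: the whole image collapses into a single real/fake logit, so the learning signal sent back to the generator carries no spatial detail. The architecture and channel sizes are purely illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class GlobalDiscriminator(nn.Module):
    """Standard (encoder-only) discriminator: one logit per image.
    Toy architecture for illustration only."""
    def __init__(self, ch=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.classifier = nn.Linear(ch * 4, 1)

    def forward(self, x):
        h = self.features(x)
        h = h.mean(dim=(2, 3))        # global average pool: all spatial detail collapses here
        return self.classifier(h)     # a single real/fake logit per image

d = GlobalDiscriminator()
out = d(torch.randn(4, 3, 64, 64))
print(out.shape)  # torch.Size([4, 1])
```

Whatever local flaw the generator produces, the only feedback that survives the pooling step is this one scalar per image.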

 

To mitigate this problem, we propose an alternative discriminator architecture, which outputs simultaneously both global (over the whole image) and local (per-pixel) decision of the image belonging to either the real or fake class, see Figure 1. Motivated by the ideas from the segmentation literature, we re-design the discriminator to take a role of both a classifier and segmenter. We change the architecture of the discriminator network to a U-Net [39], where the encoder module performs per-image classification, as in the standard GAN setting, and the decoder module outputs per-pixel class decision, providing spatially coherent feedback to the generator, see Figure 2. This architectural change leads to a stronger discriminator, which is encouraged to maintain a more powerful data representation, making the generator task of fooling the discriminator more challenging and thus improving the quality of generated samples (as also reflected in the generator and discriminator loss behavior in Figure S1). Note that we do not modify the generator in any way, and our work is orthogonal to the ongoing research on architectural changes of the generator [20, 27], divergence measures [25, 1, 37], and regularizations [40, 15, 34].

 

This paragraph introduces the U-Net GAN architecture. Its key features:

1. It simultaneously outputs a global (whole-image) and a local (per-pixel) decision on whether the image belongs to the real or the fake class; the discriminator acts as both a classifier and a segmenter;

2. The encoder module performs per-image classification, as in the standard GAN setting;

3. The decoder module outputs per-pixel decisions, providing spatially coherent feedback to the generator;

4. The generator is not modified in any way; U-Net GAN is orthogonal to ongoing research on generator architectures [20, 27], divergence measures [25, 1, 37], and regularization [40, 15, 34].
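The two-headed structure described in points 1–3 can be sketched in a few lines of PyTorch. This is a toy U-Net with one skip connection, not the paper's BigGAN-based configuration; layer and channel choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNetDiscriminator(nn.Module):
    """Toy U-Net discriminator sketch: the encoder ends in a per-image
    logit (classifier head), the decoder ends in a per-pixel logit map
    (segmenter head). Architecture is illustrative only."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 4, 2, 1)                 # H -> H/2
        self.enc2 = nn.Conv2d(ch, ch * 2, 4, 2, 1)            # H/2 -> H/4
        self.global_head = nn.Linear(ch * 2, 1)               # per-image decision
        self.dec2 = nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1)   # H/4 -> H/2
        self.dec1 = nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1)   # H/2 -> H (skip concat in)
        self.pixel_head = nn.Conv2d(ch, 1, 3, 1, 1)           # per-pixel decision

    def forward(self, x):
        e1 = F.leaky_relu(self.enc1(x), 0.2)
        e2 = F.leaky_relu(self.enc2(e1), 0.2)
        # encoder output: one real/fake logit per image
        global_logit = self.global_head(e2.mean(dim=(2, 3)))        # [B, 1]
        d2 = F.leaky_relu(self.dec2(e2), 0.2)
        d1 = F.leaky_relu(self.dec1(torch.cat([d2, e1], 1)), 0.2)   # U-Net skip connection
        # decoder output: one real/fake logit per pixel
        pixel_logits = self.pixel_head(d1)                          # [B, 1, H, W]
        return global_logit, pixel_logits

d = UNetDiscriminator()
g, p = d(torch.randn(2, 3, 64, 64))
print(g.shape, p.shape)  # torch.Size([2, 1]) torch.Size([2, 1, 64, 64])
```

During training, both heads contribute to the adversarial loss, so the generator is penalized for global incoherence and for locally unrealistic pixels at the same time.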

 

The proposed U-Net based discriminator allows to employ the recently introduced CutMix [47] augmentation, which is shown to be effective for classification networks, for consistency regularization in the two-dimensional output space of the decoder. Inspired by [47], we cut and mix the patches from real and synthetic images together, where the ground truth label maps are spatially combined with respect to the real and fake patch class for the segmenter (U-Net decoder) and the class labels are set to fake for the classifier (U-Net encoder), as globally the CutMix image should be recognized as fake, see Figure 3. Empowered by per-pixel feedback of the U-Net discriminator, we further employ these CutMix images for consistency regularization, penalizing per-pixel inconsistent predictions of the discriminator under the CutMix transformations. This fosters the discriminator to focus more on semantic and structural changes between real and fake images and to attend less to domain-preserving perturbations. Moreover, it also helps to improve the localization ability of the decoder. Employing the proposed consistency regularization leads to a stronger generator, which pays more attention to local and global image realism. We call our model U-Net GAN.

This paragraph digs into more details of U-Net GAN:

The proposed U-Net based discriminator allows the use of the recently introduced CutMix [47] augmentation, shown to be effective for classification networks, for consistency regularization in the decoder's two-dimensional output space. Inspired by [47], patches from real and synthetic images are cut and mixed together. For the segmenter (U-Net decoder), the ground-truth label map is spatially composed from the real and fake classes of the patches; for the classifier (U-Net encoder), the class label is set to fake, since globally a CutMix image should be recognized as fake. Empowered by the U-Net discriminator's per-pixel feedback, these CutMix images are further used for consistency regularization, penalizing per-pixel discriminator predictions that are inconsistent under CutMix transformations. This makes the discriminator focus more on semantic and structural changes between real and fake images and attend less to domain-preserving perturbations; it also improves the localization ability of the decoder. The proposed consistency regularization leads to a stronger generator, which pays more attention to local and global image realism.
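The mixing and the consistency penalty described above can be sketched as follows. This is a simplified, single-box version of the paper's CutMix scheme; the box-sampling details and loss weighting are assumptions for illustration. `mask` is 1 on real pixels and 0 on pasted fake pixels, and it doubles as the decoder's ground-truth label map.

```python
import torch

def cutmix(real, fake):
    """Cut a random rectangle from `fake` and paste it into `real`.
    Returns the mixed batch and the per-pixel target map
    (1 = real region, 0 = fake region). Simplified sketch."""
    B, _, H, W = real.shape
    lam = torch.rand(1).item()                       # area ratio ~ U(0, 1), as in CutMix
    cut_h = int(H * (1 - lam) ** 0.5)
    cut_w = int(W * (1 - lam) ** 0.5)
    cy = torch.randint(H, (1,)).item()
    cx = torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    mask = torch.ones(B, 1, H, W)                    # 1 = real pixels
    mask[:, :, y1:y2, x1:x2] = 0.0                   # pasted fake region
    mixed = mask * real + (1 - mask) * fake
    return mixed, mask

def consistency_loss(pred_on_mixed, pred_on_real, pred_on_fake, mask):
    """Penalize the decoder when its prediction on the CutMix image
    differs from the CutMix of its predictions on the unmixed images."""
    target = mask * pred_on_real + (1 - mask) * pred_on_fake
    return ((pred_on_mixed - target) ** 2).mean()
```

For the encoder head, the whole mixed image is simply labeled fake; the mask only supervises the decoder. Because the mask records exactly which pixels came from which source, inconsistent per-pixel predictions under the transformation are penalized directly.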

 

We evaluate the proposed U-Net GAN model across several datasets using the state-of-the-art BigGAN model [5] as a baseline and observe an improved quality of the generated samples in terms of the FID and IS metrics. For unconditional image synthesis on FFHQ [20] at resolution 256 × 256, our U-Net GAN model improves 4 FID points over the BigGAN model, synthesizing high quality human faces (see Figure 4). On CelebA [29] at resolution 128×128 we achieve 1.6 point FID gain, yielding to the best of our knowledge the lowest known FID score of 2.95. For class-conditional image synthesis on the introduced COCO-Animals dataset [28, 24] at resolution 128×128 we observe an improvement in FID from 16.37 to 13.73, synthesizing diverse images of different animal classes (see Figure 5).

This paragraph gives the experimental results. In one sentence: compared with BigGAN, the proposed model does better.

[It's 3 a.m. and I'm sleepy; I'll keep updating this post later...]
