Self-attention
Motivation:
- Since the convolution operator has a local receptive field, long range dependencies can only be processed after passing through several convolutional layers. This could prevent learning about long-term dependencies for a variety of reasons:
- (i) a small model may not be able to represent them
- (ii) optimization algorithms may have trouble discovering parameter values that carefully coordinate multiple layers to capture these dependencies
- (iii) these parameterizations may be statistically brittle and prone to failure when applied to previously unseen inputs.
- Increasing the size of the convolution kernels can increase the representational capacity of the network but doing so also loses the computational and statistical efficiency obtained by using local convolutional structure.
SAGAN
- SAGAN allows attention-driven, long-range dependency modeling for image generation tasks (convolution kernels readily capture local information, whereas SAGAN introduces long-range dependencies via an attention mechanism).
- In SAGAN, the proposed attention module is applied to both the generator and the discriminator.
- (1) Generator: Details can be generated using cues from all feature locations.
- (2) Discriminator: the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.
- Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
Self-attention - computing global spatial information
- $x \in \R^{C \times N}$: image features from the previous hidden layer, where $C$ is the number of channels and $N$ is the number of feature locations.
- $f(x) = W_f x,\ g(x) = W_g x$: transform $x$ into two feature spaces $f$ (key) and $g$ (query) to calculate the attention.
- $\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}$, where $s_{ij} = f(x_i)^{\top} g(x_j)$; $\beta_{j,i}$ indicates the extent to which the model attends to the $i$th location when synthesizing the $j$th region.
- $W_g \in \R^{\bar C \times C},\ W_f \in \R^{\bar C \times C}$; attention map: $N \times N$.
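As a sketch, the attention map can be computed in NumPy; the sizes and the random weight matrices below are illustrative stand-ins for the learned $1 \times 1$ convolutions, not values from the paper:

```python
import numpy as np

# Illustrative sizes; W_f and W_g stand in for the 1x1 key/query convolutions.
rng = np.random.default_rng(0)
C, C_bar, N = 8, 1, 16                 # C_bar = C / k with k = 8
x = rng.standard_normal((C, N))        # features from the previous layer
W_f = rng.standard_normal((C_bar, C))  # key projection:   f(x) = W_f x
W_g = rng.standard_normal((C_bar, C))  # query projection: g(x) = W_g x

s = (W_f @ x).T @ (W_g @ x)            # s[i, j] = f(x_i)^T g(x_j), shape (N, N)
e = np.exp(s - s.max(axis=0, keepdims=True))  # numerically stable softmax over i
beta = (e / e.sum(axis=0, keepdims=True)).T   # beta[j, i]; each row sums to 1
```

Note that the softmax normalizes over the attended locations $i$, so each synthesized location $j$ distributes a total weight of 1 across the feature map.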
- The output of the attention layer is $o = (o_1, o_2, \dots, o_j, \dots, o_N) \in \R^{C \times N}$, where $o_j = W_v\left(\sum_{i=1}^{N} \beta_{j,i}\, W_h x_i\right)$.
- W h ∈ R C ˉ × C , W v ∈ R C × C ˉ W_h\in\R^{\bar C\times C},W_v\in\R^{C\times\bar C} Wh∈RCˉ×C,Wv∈RC×Cˉ
- $W_g, W_f, W_h, W_v$ are implemented as $1 \times 1$ convolutions. We did not notice any significant performance decrease when reducing the channel number $\bar C$ to $C/k$, where $k = 1, 2, 4, 8$, after a few training epochs on ImageNet. For memory efficiency, we choose $k = 8$ (i.e., $\bar C = C/8$) in all our experiments.
Self-attention - combining global spatial information with local information
- In addition, we further multiply the output of the attention layer by a scale parameter and add back the input feature map, so the final output is $y_i = \gamma\, o_i + x_i$, where $\gamma$ is a learnable scalar initialized to 0.
- Introducing the learnable $\gamma$ allows the network to first rely on the cues in the local neighborhood (since this is easier) and then gradually learn to assign more weight to the non-local evidence.
- The intuition for why we do this is straightforward: we want to learn the easy task first and then progressively increase the complexity of the task.
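Putting the pieces together, here is a minimal NumPy sketch of the whole layer; the weight matrices stand in for the $1 \times 1$ convolutions and all sizes are illustrative:

```python
import numpy as np

def self_attention(x, W_f, W_g, W_h, W_v, gamma):
    """SAGAN-style attention on x of shape (C, N); returns gamma * o + x."""
    s = (W_f @ x).T @ (W_g @ x)                   # s[i, j] = f(x_i)^T g(x_j)
    e = np.exp(s - s.max(axis=0, keepdims=True))  # stable softmax over i
    beta = (e / e.sum(axis=0, keepdims=True)).T   # beta[j, i]
    o = W_v @ ((W_h @ x) @ beta.T)                # o_j = W_v sum_i beta_ji W_h x_i
    return gamma * o + x

rng = np.random.default_rng(1)
C, C_bar, N = 8, 1, 16
x = rng.standard_normal((C, N))
W_f, W_g, W_h = (rng.standard_normal((C_bar, C)) for _ in range(3))
W_v = rng.standard_normal((C, C_bar))
y = self_attention(x, W_f, W_g, W_h, W_v, gamma=0.0)
# With gamma initialized to 0 the layer reduces to the identity: y == x
```

This makes the "start local, grow global" intuition concrete: at initialization the residual branch contributes nothing, and only as $\gamma$ moves away from 0 does non-local evidence enter the output.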
Loss
- In SAGAN, the generator and the discriminator are trained in an alternating fashion by minimizing the hinge version of the adversarial loss (Lim & Ye, 2017; Tran et al., 2017; Miyato et al., 2018 (SNGAN)).
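A minimal sketch of these hinge losses in their unconditional form, assuming `d_real` and `d_fake` are raw (unbounded) discriminator scores:

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    # L_D = E[max(0, 1 - D(x))] + E[max(0, 1 + D(G(z)))]
    return (np.maximum(0.0, 1.0 - d_real).mean()
            + np.maximum(0.0, 1.0 + d_fake).mean())

def g_hinge_loss(d_fake):
    # L_G = -E[D(G(z))]
    return -d_fake.mean()

# Scores already past the unit margin incur zero discriminator loss.
loss = d_hinge_loss(np.array([2.0, 3.0]), np.array([-2.0, -1.5]))  # -> 0.0
```

The margin means the discriminator stops receiving gradient from examples it already classifies confidently, which tends to stabilize training compared with the saturating cross-entropy loss.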
Spectral normalization for both generator and discriminator
- In SNGAN, SN is only applied to $D$. Here, SAGAN applies spectral normalization to both the GAN generator and discriminator.
- Spectral normalization in the generator can prevent the escalation of parameter magnitudes and avoid unusual gradients.
- We find empirically that spectral normalization of both generator and discriminator makes it possible to use fewer discriminator updates per generator update, thus significantly reducing the computational cost of training. The approach also shows more stable training behavior.
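For intuition, spectral normalization divides each weight matrix by an estimate of its largest singular value, obtained cheaply by power iteration; the sketch below illustrates the idea (it is not SAGAN's implementation, and in practice a single iteration per training step suffices because the weights change slowly):

```python
import numpy as np

def spectral_normalize(W, u, n_iters=1):
    """Divide W by its largest singular value, estimated via power iteration."""
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v          # estimate of the top singular value
    return W / sigma, u        # u is reused across training steps

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 6))
u = rng.standard_normal(4)
W_sn, u = spectral_normalize(W, u, n_iters=50)
# np.linalg.svd(W_sn, compute_uv=False)[0] is now close to 1
```

Constraining the spectral norm to 1 bounds the Lipschitz constant of each layer, which is what keeps parameter magnitudes and gradients from escalating.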
Imbalanced learning rate for generator and discriminator updates
- In previous work, regularization of the discriminator (SNGAN; WGAN-GP) often slows down the GANs’ learning process.
- In practice, methods using regularized discriminators typically require multiple (e.g., 5) discriminator update steps per generator update step during training.
- Independently, Heusel et al. (Heusel et al., 2017) have advocated using separate learning rates (TTUR; Two-Timescale Update Rule) for the generator and the discriminator.
- We propose using TTUR specifically to compensate for the problem of slow learning in a regularized discriminator, making it possible to use fewer discriminator steps per generator step. Using this approach, we are able to produce better results given the same wall-clock time.
- lr for Discriminator: 0.0004
- lr for Generator: 0.0001
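A toy sketch of the TTUR idea: one discriminator step and one generator step per iteration, with the discriminator taking 4x larger steps. The quadratic "losses" here are illustrative, not GAN objectives:

```python
# SAGAN's imbalanced learning rates
lr_d, lr_g = 4e-4, 1e-4

theta_d, theta_g = 0.0, 0.0
for _ in range(1000):
    grad_d = theta_d - 1.0     # gradient of 0.5 * (theta_d - 1)^2
    theta_d -= lr_d * grad_d   # one D step ...
    grad_g = theta_g - 1.0     # gradient of 0.5 * (theta_g - 1)^2
    theta_g -= lr_g * grad_g   # ... then one G step (1:1 update ratio)
# theta_d approaches its optimum faster than theta_g
```

The faster-moving discriminator compensates for the slowdown introduced by regularization, removing the need for multiple discriminator updates per generator update.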