Self-attention
Motivation:
- Since the convolution operator has a local receptive field, long range dependencies can only be processed after passing through several convolutional layers. This could prevent learning about long-term dependencies for a variety of reasons:
- (i) a small model may not be able to represent them
- (ii) optimization algorithms may have trouble discovering parameter values that carefully coordinate multiple layers to capture these dependencies
- (iii) these parameterizations may be statistically brittle and prone to failure when applied to previously unseen inputs.
- Increasing the size of the convolution kernels can increase the representational capacity of the network but doing so also loses the computational and statistical efficiency obtained by using local convolutional structure.
SAGAN
- SAGAN allows attention-driven, long-range dependency modeling for image generation tasks (convolution kernels readily capture local information, whereas SAGAN introduces long-range dependencies via an attention mechanism).
- In SAGAN, the proposed attention module is applied to both the generator and the discriminator.
- (1) Generator: Details can be generated using cues from all feature locations.
- (2) Discriminator: the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.
- Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
Self-attention - computing global spatial information
- $x \in \R^{C \times N}$: image features from the previous hidden layer, where $C$ is the number of channels and $N$ is the number of feature locations.
- $f(x) = W_f x,\ g(x) = W_g x$: transform $x$ into two feature spaces $f$ (key) and $g$ (query) to calculate the attention.
- $\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}$, where $s_{ij} = f(x_i)^{\top} g(x_j)$; $\beta_{j,i}$ indicates the extent to which the model attends to the $i$th location when synthesizing the $j$th region.
- $W_g \in \R^{\bar C \times C},\ W_f \in \R^{\bar C \times C}$; attention map: $N \times N$.
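As a sketch, the attention map can be computed in NumPy; the sizes and the random weight matrices below are illustrative stand-ins for the learned $1 \times 1$ convolutions, not values from the paper:

```python
import numpy as np

# Illustrative sizes; W_f and W_g stand in for the 1x1 key/query convolutions.
rng = np.random.default_rng(0)
C, C_bar, N = 8, 1, 16                 # C_bar = C / k with k = 8
x = rng.standard_normal((C, N))        # features from the previous layer
W_f = rng.standard_normal((C_bar, C))  # key projection:   f(x) = W_f x
W_g = rng.standard_normal((C_bar, C))  # query projection: g(x) = W_g x

s = (W_f @ x).T @ (W_g @ x)            # s[i, j] = f(x_i)^T g(x_j), shape (N, N)
e = np.exp(s - s.max(axis=0, keepdims=True))  # numerically stable softmax over i
beta = (e / e.sum(axis=0, keepdims=True)).T   # beta[j, i]; each row sums to 1
```

Note that the softmax normalizes over the attended locations $i$, so each synthesized location $j$ distributes a total weight of 1 across the feature map.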
- The output of the attention layer is $o = (o_1, o_2, \dots, o_j, \dots, o_N) \in \R^{C \times N}$, where $o_j = W_v\left(\sum_{i=1}^{N} \beta_{j,i}\, W_h x_i\right)$.
- W h ∈ R C ˉ × C , W v ∈ R C × C ˉ W_h\in\R^{\bar C\times C},W_v\in\R^{C\times\bar C} Wh∈RCˉ×C,Wv∈RC×Cˉ
- $W_g, W_f, W_h, W_v$ are implemented as $1 \times 1$ convolutions. We did not notice any significant performance decrease when reducing the channel number $\bar C$ to $C/k$, where $k = 1, 2, 4, 8$, after a few training epochs on ImageNet. For memory efficiency, we choose $k = 8$ (i.e., $\bar C = C/8$) in all our experiments.
Self-attention - combining global spatial information with local information
- In addition, we further multiply the output of the attention layer by a scale parameter and add back the input feature map, so the final output is $y_i = \gamma\, o_i + x_i$, where $\gamma$ is a learnable scalar initialized to 0.
- Introducing the learnable $\gamma$ allows the network to first rely on the cues in the local neighborhood (since this is easier) and then gradually learn to assign more weight to the non-local evidence.
- The intuition for why we do this is straightforward: we want to learn the easy task first and then progressively increase the complexity of the task.
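Putting the pieces together, here is a minimal NumPy sketch of the whole layer; the weight matrices stand in for the $1 \times 1$ convolutions and all sizes are illustrative:

```python
import numpy as np

def self_attention(x, W_f, W_g, W_h, W_v, gamma):
    """SAGAN-style attention on x of shape (C, N); returns gamma * o + x."""
    s = (W_f @ x).T @ (W_g @ x)                   # s[i, j] = f(x_i)^T g(x_j)
    e = np.exp(s - s.max(axis=0, keepdims=True))  # stable softmax over i
    beta = (e / e.sum(axis=0, keepdims=True)).T   # beta[j, i]
    o = W_v @ ((W_h @ x) @ beta.T)                # o_j = W_v sum_i beta_ji W_h x_i
    return gamma * o + x

rng = np.random.default_rng(1)
C, C_bar, N = 8, 1, 16
x = rng.standard_normal((C, N))
W_f, W_g, W_h = (rng.standard_normal((C_bar, C)) for _ in range(3))
W_v = rng.standard_normal((C, C_bar))
y = self_attention(x, W_f, W_g, W_h, W_v, gamma=0.0)
# With gamma initialized to 0 the layer reduces to the identity: y == x
```

This makes the "start local, grow global" intuition concrete: at initialization the residual branch contributes nothing, and only as $\gamma$ moves away from 0 does non-local evidence enter the output.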
Loss
- In SAGAN, the generator and the discriminator are trained in an alternating fashion by minimizing the hinge version of the adversarial loss (Lim & Ye, 2017; Tran et al., 2017; Miyato et al., 2018 (SNGAN)).
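A minimal sketch of these hinge losses in their unconditional form, assuming `d_real` and `d_fake` are raw (unbounded) discriminator scores:

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    # L_D = E[max(0, 1 - D(x))] + E[max(0, 1 + D(G(z)))]
    return (np.maximum(0.0, 1.0 - d_real).mean()
            + np.maximum(0.0, 1.0 + d_fake).mean())

def g_hinge_loss(d_fake):
    # L_G = -E[D(G(z))]
    return -d_fake.mean()

# Scores already past the unit margin incur zero discriminator loss.
loss = d_hinge_loss(np.array([2.0, 3.0]), np.array([-2.0, -1.5]))  # -> 0.0
```

The margin means the discriminator stops receiving gradient from examples it already classifies confidently, which tends to stabilize training compared with the saturating cross-entropy loss.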
Spectral normalization for both generator and discriminator
- In SNGAN, SN is only applied to $D$. Here, SAGAN applies spectral normalization to both the GAN generator and discriminator.
- Spectral normalization in the generator can prevent the escalation of parameter magnitudes and avoid unusual gradients.
- We find empirically that spectral normalization of both generator and discriminator makes it possible to use fewer discriminator updates per generator update, thus significantly reducing the computational cost of training. The approach also shows more stable training behavior.
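For intuition, spectral normalization divides each weight matrix by an estimate of its largest singular value, obtained cheaply by power iteration; the sketch below illustrates the idea (it is not SAGAN's implementation, and in practice a single iteration per training step suffices because the weights change slowly):

```python
import numpy as np

def spectral_normalize(W, u, n_iters=1):
    """Divide W by its largest singular value, estimated via power iteration."""
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v          # estimate of the top singular value
    return W / sigma, u        # u is reused across training steps

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 6))
u = rng.standard_normal(4)
W_sn, u = spectral_normalize(W, u, n_iters=50)
# np.linalg.svd(W_sn, compute_uv=False)[0] is now close to 1
```

Constraining the spectral norm to 1 bounds the Lipschitz constant of each layer, which is what keeps parameter magnitudes and gradients from escalating.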
Imbalanced learning rate for generator and discriminator updates
- In previous work, regularization of the discriminator (SNGAN; WGAN-GP) often slows down the GANs’ learning process.
- In practice, methods using regularized discriminators typically require multiple (e.g., 5) discriminator update steps per generator update step during training.
- Independently, Heusel et al. (Heusel et al., 2017) have advocated using separate learning rates (TTUR; Two-Timescale Update Rule) for the generator and the discriminator.
- We propose using TTUR specifically to compensate for the problem of slow learning in a regularized discriminator, making it possible to use fewer discriminator steps per generator step. Using this approach, we are able to produce better results given the same wall-clock time.
- lr for Discriminator: 0.0004
- lr for Generator: 0.0001
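A toy sketch of the TTUR idea: one discriminator step and one generator step per iteration, with the discriminator taking 4x larger steps. The quadratic "losses" here are illustrative, not GAN objectives:

```python
# SAGAN's imbalanced learning rates
lr_d, lr_g = 4e-4, 1e-4

theta_d, theta_g = 0.0, 0.0
for _ in range(1000):
    grad_d = theta_d - 1.0     # gradient of 0.5 * (theta_d - 1)^2
    theta_d -= lr_d * grad_d   # one D step ...
    grad_g = theta_g - 1.0     # gradient of 0.5 * (theta_g - 1)^2
    theta_g -= lr_g * grad_g   # ... then one G step (1:1 update ratio)
# theta_d approaches its optimum faster than theta_g
```

The faster-moving discriminator compensates for the slowdown introduced by regularization, removing the need for multiple discriminator updates per generator update.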