BigGAN: Large scale GAN training for high fidelity natural image synthesis

Introduction

Motivation

  • In this work, we set out to close the gap in fidelity and variety between images generated by GANs and real-world images from the ImageNet dataset.
  • To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale.

Contribution

  • (1) We demonstrate that GANs benefit dramatically from scaling, and train models with two to four times as many parameters and eight times the batch size compared to prior art. We introduce two simple, general architectural changes that improve scalability, and modify a regularization scheme to improve conditioning (i.e. orthogonal regularization), demonstrably boosting performance.
  • (2) As a side effect of our modifications, our models become amenable to the “truncation trick”, a simple sampling technique that allows explicit, fine-grained control of the trade-off between sample variety and fidelity by reducing the variance of the Generator’s input.
  • (3) We discover instabilities specific to large scale GANs, and characterize them empirically. Leveraging insights from this analysis, we demonstrate that a combination of novel and existing techniques can reduce these instabilities, but complete training stability can only be achieved at a dramatic cost to performance.

Result

  • Our modifications lead to models which set the new state of the art in class-conditional image synthesis.
    • When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 (compared to 233 for real data) and Fréchet Inception Distance (FID) of 7.4, improving over the previous best (SAGAN) IS of 52.52 and FID of 18.65.
  • We also successfully train BigGANs on ImageNet at 256×256 and 512×512 resolution, and achieve IS and FID of 232.5 and 8.1 at 256×256 and IS and FID of 241.5 and 11.5 at 512×512.
  • Finally, we train our models on an even larger dataset – JFT-300M – and demonstrate that our design choices transfer well from ImageNet.

Scaling Up GANs

  • In this section, we explore methods for scaling up GAN training to reap the performance benefits of larger models and larger batches.

Baseline

  • As a baseline, we employ the SA-GAN architecture, which uses the hinge loss GAN objective (a minimal sketch of which follows this list). We provide class information to G with class-conditional BatchNorm (de Vries et al., 2017; Dumoulin et al., 2017) and to D with projection (cGANs with Projection Discriminator).
    • The optimization settings follow Zhang et al. (2018) (notably employing Spectral Norm in G) with the modification that we halve the learning rates and take two D steps per G step.
  • For evaluation, we employ moving averages of G’s weights following Karras et al. (2018); Mescheder et al. (2018); Yazıcı et al. (2018), with a decay of 0.9999.
  • We use Orthogonal Initialization (Saxe et al., 2014), whereas previous works used N(0, 0.02·I) (Radford et al., 2016) or Xavier initialization.
  • Each model is trained on 128 to 512 cores of a Google TPUv3 Pod, and computes BatchNorm statistics in G across all devices, rather than per-device as is typical. We find progressive growing (Karras et al., 2018) unnecessary even for our 512×512 models.
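A minimal PyTorch sketch of the hinge GAN objective used by this baseline (function names are illustrative, not from the authors’ code):

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator hinge loss: raise the discriminator score on generated samples."""
    return -d_fake.mean()
```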

Increasing the batch size

  • We begin by increasing the batch size for the baseline model, and immediately find tremendous benefits in doing so. We conjecture that this is a result of each batch covering more modes, providing better gradients for both networks.
    • Rows 1-4 of Table 1 show that simply increasing the batch size by a factor of 8 improves the state-of-the-art IS by 46%.
  • One notable side effect of this scaling is that our models reach better final performance in fewer iterations, but become unstable and undergo complete training collapse. We discuss the causes and ramifications of this in Section 4 (“Analysis”). For these experiments, we report scores from checkpoints saved just before collapse. (In Table 1, once the batch size reaches 2048, the iteration at which the score is recorded is only 732, indicating that the model underwent training collapse shortly afterwards.)

Increasing the width

  • We then increase the width (number of channels) in each layer by 50%, approximately doubling the number of parameters in both models. This leads to a further IS improvement of 21%, which we posit is due to the increased capacity of the model relative to the complexity of the dataset.

Increasing the depth

  • Doubling the depth did not initially lead to improvement – we addressed this later in the BigGAN-deep model, which uses a different residual block structure.

Shared embedding

  • We note that the class embeddings c used for the conditional BatchNorm layers in G contain a large number of weights. Instead of having a separate layer for each embedding (SNGAN, SAGAN), we opt to use a shared embedding, which is linearly projected to each layer’s gains and biases (Perez et al., 2018); a sketch of this projection follows below.
  • This reduces computation and memory costs, and improves training speed (in number of iterations required to reach a given performance) by 37%.
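A minimal sketch of a shared class embedding driving class-conditional BatchNorm through linear projections to per-layer gains and biases (module and dimension names are illustrative assumptions, not the authors’ implementation):

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """BatchNorm whose gain and bias are linear projections of a conditioning vector."""
    def __init__(self, num_features: int, cond_dim: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)  # statistics only, no learned affine
        self.to_gain = nn.Linear(cond_dim, num_features)       # gain projection, centered at 1
        self.to_bias = nn.Linear(cond_dim, num_features)       # bias projection, centered at 0

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gain = 1.0 + self.to_gain(cond)
        bias = self.to_bias(cond)
        return self.bn(x) * gain[..., None, None] + bias[..., None, None]

# One shared class embedding is reused by every conditional BatchNorm layer in G.
shared_embedding = nn.Embedding(num_embeddings=1000, embedding_dim=128)
```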

Skip-z

  • Next, we add direct skip connections (skip-z) from the noise vector z to multiple layers of G rather than just the initial layer.
    • The intuition behind this design is to allow G to use the latent space to directly influence features at different resolutions and levels of hierarchy.
  • In BigGAN (hierarchical latent spaces), this is accomplished by splitting z into one chunk per resolution, and concatenating each chunk to the conditional vector c, which gets projected to the BatchNorm gains and biases. In BigGAN-deep, we use an even simpler design, concatenating the entire z with the conditional vector without splitting it into chunks (see the sketch after this list).
  • Skip-z provides a modest performance improvement of around 4%, and improves training speed by a further 18%.
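A minimal sketch of the hierarchical skip-z conditioning: z is split into one chunk per block, and each chunk is concatenated with the shared class embedding to condition that block (batch size, chunk count, and dimensions are illustrative):

```python
import torch

z = torch.randn(8, 120)                    # e.g. a 120-D latent, as used for 128x128 BigGAN
class_emb = torch.randn(8, 128)            # shared class embedding
chunks = torch.chunk(z, chunks=6, dim=1)   # six 20-D chunks: one for the initial layer, one per residual block

# Each per-block conditioning vector is later projected to that block's BatchNorm gains and biases.
block_conditions = [torch.cat([chunk, class_emb], dim=1) for chunk in chunks]
```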

Table 1

(Table 1 of the paper, the FID/IS ablation over batch size, channel width, shared embeddings, skip-z, and orthogonal regularization, is not reproduced here.)

Trading off Variety and Fidelity with the Truncation Trick

  • The vast majority of previous works have chosen to draw z (the prior distribution) from either N(0, I) or U[−1, 1]. We question the optimality of this choice.

Truncation Trick

  • Truncating a z vector by resampling the values with magnitude above a chosen threshold leads to improvement in individual sample quality at the cost of reduction in overall sample variety. (i.e., sample z from a standard normal distribution; if a component’s absolute value exceeds the threshold, resample it until it falls within the threshold.) This technique allows fine-grained, post-hoc selection of the trade-off between sample quality and variety for a given G; a minimal sketch appears at the end of this section.
    • Figure 2(a) demonstrates this: as the threshold is reduced, and elements of z are truncated towards zero (the mode of the latent distribution), individual samples approach the mode of G’s output distribution.
  • Notably, we can compute FID and IS for a range of thresholds, obtaining the variety-fidelity curve reminiscent of the precision-recall curve (Figure 17). As IS does not penalize lack of variety in class-conditional models, reducing the truncation threshold leads to a direct increase in IS (analogous to precision). FID penalizes lack of variety (analogous to recall) but also rewards precision, so we initially see a moderate improvement in FID, but as truncation approaches zero and variety diminishes, FID sharply worsens.
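A minimal sketch of the truncation trick, assuming components whose magnitude exceeds the threshold are resampled until every component lies inside it; the generator call in the comment is hypothetical:

```python
import torch

def truncated_noise(shape, threshold: float = 0.5) -> torch.Tensor:
    """Sample z ~ N(0, I) and resample any component with |z_i| > threshold.

    Lower thresholds trade variety for fidelity."""
    z = torch.randn(shape)
    while True:
        out_of_range = z.abs() > threshold
        if not out_of_range.any():
            return z
        z[out_of_range] = torch.randn(int(out_of_range.sum()))  # resample only the offending components

# samples = generator(truncated_noise((batch_size, z_dim), threshold=0.5), class_labels)
```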

Orthogonal Regularization

  • The distribution shift caused by sampling with different latents than those seen in training is problematic for many models. Some of our larger models are not amenable to truncation, producing saturation artifacts (Figure 2(b)) when fed truncated noise.

  • To counteract this, we seek to enforce amenability to truncation by conditioning G to be smooth, so that the full space of z will map to good output samples. For this, we turn to Orthogonal Regularization (Brock et al., 2017), which directly enforces the orthogonality condition:
    R_β(W) = β ‖WᵀW − I‖²_F, where W is a weight matrix and β a hyperparameter. (The goal of orthogonal regularization is to make the columns of the weight matrix form an orthonormal set; the penalty is simply added to the loss.)
    • (Brock et al., 2017): Orthogonality is a desirable quality in ConvNet filters, partially because multiplication by an orthogonal matrix leaves the norm of the original matrix unchanged. This property is valuable in deep or recurrent networks, where repeated matrix multiplication can result in signals vanishing or exploding.
    • This regularization is known to often be too limiting (Miyato et al., 2018; SNGAN): the orthonormal regularization destroys the information about the spectrum by setting all the singular values to one and puts equal emphasis on all feature dimensions. On the other hand, spectral normalization only scales the spectrum so that its maximum will be one.

Relaxing the constraint

  • So we explore several variants designed to relax the constraint while still imparting the desired smoothness to our models. The version we find to work best removes the diagonal terms from the regularization, and aims to minimize the pairwise cosine similarity between filters but does not constrain their norm (a sketch of this relaxed penalty follows this list):
    R_β(W) = β ‖WᵀW ⊙ (1 − I)‖²_F, where 1 denotes a matrix with all elements set to 1 and ⊙ denotes element-wise multiplication.
    • We sweep β values and select 10⁻⁴, finding this small added penalty sufficient to improve the likelihood that our models will be amenable to truncation.
    • Across runs in Table 1, we observe that without Orthogonal Regularization, only 16% of models are amenable to truncation, compared to 60% when trained with Orthogonal Regularization.
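A minimal sketch of the relaxed penalty: it penalizes the off-diagonal entries of the filter Gram matrix (pairwise similarity between filters) without constraining filter norms. Conv kernels are flattened to 2-D; the β default and parameter filtering are illustrative choices:

```python
import torch
import torch.nn as nn

def relaxed_ortho_penalty(model: nn.Module, beta: float = 1e-4) -> torch.Tensor:
    """Relaxed orthogonal penalty: beta * || gram ⊙ (1 - I) ||_F^2, summed over weight matrices."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.ndim < 2 or "bias" in name:
            continue
        w = param.view(param.shape[0], -1)                           # filters as rows: (out, in*k*k)
        gram = w @ w.t()                                             # pairwise dot products between filters
        off_diag = gram * (1.0 - torch.eye(gram.shape[0], device=gram.device))
        penalty = penalty + off_diag.pow(2).sum()
    return beta * penalty

# generator_loss = g_hinge_loss(d_fake) + relaxed_ortho_penalty(generator)   # names are hypothetical
```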

Analysis: Stability and Collapse

  • The symptoms of collapse are sharp and sudden, with sample quality dropping from its peak to its lowest value over the course of a few hundred iterations.
    • The instabilities we observe occur for settings which are stable at small scale, necessitating direct analysis at large scale.

Characterizing Instability: the Generator

Metric: top three singular values

  • We monitor a range of weight, gradient, and loss statistics during training, in search of a metric which might presage the onset of training collapse, similar to Odena et al. (2018). We found the top three singular values σ₀, σ₁, σ₂ of each weight matrix to be the most informative.
    • They can be efficiently computed using the Arnoldi iteration method (Golub & Van der Vorst, 2000), which extends the power iteration method, used in Miyato et al. (2018), to estimation of additional singular vectors and values. (A sketch of the basic power-iteration estimate follows below.)
  • A clear pattern emerges, as can be seen in Figure 3(a) and Appendix F: most G layers have well-behaved spectral norms, but some layers (typically the first layer in G, which is over-complete and not convolutional) are ill-behaved, with spectral norms that grow throughout training and explode at collapse.
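A minimal sketch of estimating the top singular value of a weight by power iteration (the paper's Arnoldi iteration generalizes this to the top three); the iteration count is an illustrative choice:

```python
import torch

@torch.no_grad()
def top_singular_value(weight: torch.Tensor, n_iters: int = 30) -> torch.Tensor:
    """Estimate sigma_0 of a weight (conv kernels flattened to 2-D) by power iteration."""
    w = weight.reshape(weight.shape[0], -1)
    v = torch.randn(w.shape[1], device=w.device)
    v = v / v.norm()
    for _ in range(n_iters):
        u = w @ v
        u = u / (u.norm() + 1e-12)
        v = w.t() @ u
        v = v / (v.norm() + 1e-12)
    return torch.dot(u, w @ v)    # sigma_0 ≈ u^T W v

# An exact (but costlier) alternative for monitoring the top three values: torch.linalg.svdvals(w)[:3]
```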

Counteract spectral explosion

  • To ascertain if this pathology is a cause of collapse or merely a symptom, we study the effects of imposing additional conditioning on G to explicitly counteract spectral explosion.

    • First, we directly regularize the top singular value σ₀ of each weight, either towards a fixed value σ_reg or towards some ratio r of the second singular value, r · sg(σ₁) (with sg the stop-gradient operation, used to prevent the regularization from increasing σ₁).
    • Alternatively, we employ a partial singular value decomposition to instead clamp σ₀. Given a weight W, its first singular vectors u₀ and v₀, and σ_clamp the value to which σ₀ will be clamped, our weights become (see the sketch after this list):
      W = W − max(0, σ₀ − σ_clamp) v₀u₀ᵀ, where σ_clamp is set to either σ_reg or r · sg(σ₁).
  • We observe that both with and without Spectral Normalization these techniques have the effect of preventing the gradual increase and explosion of either σ₀ or σ₀/σ₁, but even though in some cases they mildly improve performance, no combination prevents training collapse. This evidence suggests that while conditioning G might improve stability, it is insufficient to ensure stability. We accordingly turn our attention to D.
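A minimal sketch of the σ₀-clamping intervention, written with torch's SVD convention (W = U·diag(s)·Vᴴ) rather than the paper's u₀/v₀ ordering, and using a full SVD for clarity where the paper uses a partial one:

```python
import torch

@torch.no_grad()
def clamp_top_singular_value(weight: torch.Tensor, sigma_clamp: float) -> None:
    """In place: remove max(0, sigma_0 - sigma_clamp) along the top singular direction."""
    w = weight.view(weight.shape[0], -1)                 # flattened view shares storage with the parameter
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    excess = torch.clamp(s[0] - sigma_clamp, min=0.0)
    w -= excess * torch.outer(u[:, 0], vh[0])            # reduces sigma_0 to at most sigma_clamp
```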

Characterizing Instability: the Discriminator

  • As with G, we analyze the spectra of D’s weights to gain insight into its behavior, then seek to stabilize training by imposing additional constraints.

Noisy Spectra

  • Figure 3(b) displays a typical plot of σ₀ for D (with further plots in Appendix F). Unlike G, we see that the spectra are noisy, σ₀ is well-behaved, and the singular values grow throughout training but only jump at collapse, instead of exploding.
  • The spikes in D’s spectra might suggest that it periodically receives very large gradients, but we observe that the Frobenius norms are smooth (Appendix F), suggesting that this effect is primarily concentrated on the top few singular directions.

Methods to improve training stability

  • We posit that this noise is a result of optimization through the adversarial training process, where G periodically produces batches which strongly perturb D. If this spectral noise is causally related to instability, a natural counter is to employ gradient penalties, which explicitly regularize changes in D’s Jacobian. We explore the R₁ zero-centered gradient penalty from Mescheder et al. (2018), sketched after this list:
    R₁ := (γ/2) E_{p_D(x)}[‖∇D(x)‖²_F]. With the default suggested γ strength of 10, training becomes stable and improves the smoothness and boundedness of spectra in both G and D, but performance severely degrades, resulting in a 45% reduction in IS. Reducing the penalty partially alleviates this degradation, but results in increasingly ill-behaved spectra; even with the penalty strength reduced to 1 (the lowest strength for which sudden collapse does not occur) the IS is reduced by 20%.
  • Repeating this experiment with various strengths of Orthogonal Regularization, DropOut, and L2 reveals similar behaviors for these regularization strategies: with high enough penalties on D, training stability can be achieved, but at a substantial cost to performance.
    • With current techniques, better final performance can be achieved by relaxing this conditioning and allowing collapse to occur at the later stages of training, by which time a model is sufficiently trained to achieve good results.
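A minimal sketch of the R₁ zero-centered gradient penalty on real data (the discriminator signature and names are illustrative):

```python
import torch

def r1_penalty(discriminator, real_images: torch.Tensor, labels: torch.Tensor,
               gamma: float = 10.0) -> torch.Tensor:
    """R1 := gamma/2 * E[ || grad_x D(x) ||_F^2 ], evaluated on real samples."""
    real_images = real_images.detach().requires_grad_(True)
    scores = discriminator(real_images, labels)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=real_images, create_graph=True)
    return 0.5 * gamma * grads.flatten(start_dim=1).pow(2).sum(dim=1).mean()
```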

  • We also observe that D’s loss approaches zero during training, but undergoes a sharp upward jump at collapse (Appendix F).
    • One possible explanation for this behavior is that D is overfitting to the training set, memorizing training examples rather than learning some meaningful boundary between real and generated images.
    • As a simple test for D’s memorization (related to Gulrajani et al. (2017)), we evaluate uncollapsed discriminators on the ImageNet training and validation sets, and measure what percentage of samples are classified as real or generated. While the training accuracy is consistently above 98%, the validation accuracy falls in the range of 50-55%, no better than random guessing (regardless of regularization strategy). This confirms that D is indeed memorizing the training set; we deem this in line with D’s role, which is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G. Additional experiments and discussion are provided in Appendix G.

Intervening before Collapse

  • We test whether it is possible to prevent or delay collapse by taking a model checkpoint several thousand iterations before collapse and continuing training with some hyperparameters modified (e.g., the learning rate).
    • We found that increasing the learning rates (relative to their initial values) in either G or D, or in both G and D, led to immediate collapse.
    • We also tried changing the momentum terms (Adam’s β₁ and β₂), or resetting the momentum vectors to zero, but this tended to either make no difference or, when increasing the momentum, cause immediate collapse.
    • We found that decreasing the learning rate in G, while keeping the learning rate in D unchanged, could delay collapse (in some cases by over one hundred thousand iterations), but also crippled training: once the learning rate in G was decayed, performance either stayed constant or slowly decayed.
    • Conversely, reducing the learning rate in D while keeping G’s learning rate unchanged led to immediate collapse. We hypothesize that this is because of the need for D to remain optimal throughout training; if its learning rate is reduced, it can no longer “keep up” with G, and training collapses.
    • With this in mind, we also tried increasing the number of D steps per G step, but this either had no effect, or delayed collapse at the cost of crippling training (similar to decaying G’s learning rate).

  • To further illuminate these dynamics, we construct two additional intervention experiments: one where we freeze G before collapse (by ceasing all parameter updates) and observe whether D remains stable, and the reverse.
    • We find that when G is frozen, D remains stable, and slowly reduces both components of its loss towards zero. However, when D is frozen, G immediately and dramatically collapses, maxing out D’s loss to values upwards of 300, compared to the normal range of 0 to 3.
    • This leads to two conclusions:
      • (1) D must remain optimal with respect to G both for stability and to provide useful gradient information. The consequence of G being allowed to win the game is a complete breakdown of the training process, regardless of G’s conditioning or optimization settings.
      • (2) Favoring D over G (either by training it with a larger learning rate, or for more steps) is insufficient to ensure stability even if D is well-conditioned.
  • This suggests either that in practice, (1) an optimal D is necessary but insufficient for training stability, or that (2) some aspect of the system results in D not being trained towards optimality. With the latter possibility in mind, we take a closer look at the noise in D’s spectra in the following section.

Spikes in the Discriminator’s Spectra

  • If some element of D’s training process results in undesirable dynamics, it follows that the behavior of D’s spectra may hold clues as to what that element is.

  • The top three singular values of D differ from G’s in that they have a large noise component, tend to grow throughout training but only show a small response to collapse, and the ratio of the first two singular values tends to be centered around one, suggesting that the spectra of D have a slow decay.
  • When viewed up close, the noise spikes resemble an impulse response: at each spike, the spectra jump upwards, then slowly decrease, with some oscillation.
    • One possible explanation is that this behavior is a consequence of D memorizing the training data. As it approaches perfect memorization, it receives less and less signal from real data, as both the original GAN loss and the hinge loss provide zero gradients when D outputs a confident and correct prediction for a given example. If the gradient signal from real data attenuates to zero, D can eventually become biased, because it exclusively receives gradients that encourage its outputs to be negative. If this bias passes a certain threshold, D will eventually misclassify a large number of real examples and receive a large gradient encouraging positive outputs, resulting in the observed impulse responses.
    • This argument suggests several fixes.
      • First, one might consider an unbounded loss (such as the Wasserstein loss) which would not suffer this gradient attenuation.
        • We found that even with gradient penalties and brief retuning of optimizer hyperparameters, our models did not stably train for more than a few thousand iterations with this loss.
      • We instead explored changing the margin of the hinge loss as a partial compromise: for a given model and minibatch of data, increasing the margin will result in more examples falling within the margin, and thus contributing to the loss. (Unconstrained models could easily learn a different output scale to account for this margin, but the use of Spectral Normalization constrains our models and makes the specific selection of the margin meaningful.) A sketch of the hinge loss with an adjustable margin follows this list.
        • Training with a smaller margin (by a factor of 2) measurably reduces performance, but training with a larger margin (by up to a factor of 3) does not prevent collapse or reduce the noise in D’s spectra. Increasing the margin beyond 3 results in unstable training similar to using the Wasserstein loss.
      • Finally, the memorization argument might suggest that using a smaller D or using dropout in D would improve training by reducing its capacity to memorize, but in practice this degrades training.
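A minimal sketch of the hinge discriminator loss with an adjustable margin, the partial compromise discussed above (margin = 1 recovers the standard hinge loss):

```python
import torch
import torch.nn.functional as F

def d_hinge_loss_with_margin(d_real: torch.Tensor, d_fake: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """Larger margins keep more examples inside the margin, so they keep contributing gradients."""
    return F.relu(margin - d_real).mean() + F.relu(margin + d_fake).mean()
```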

Experiments

Evaluation on ImageNet

  • We evaluate our models on ImageNet ILSVRC 2012 at 128×128, 256×256, and 512×512 resolutions, employing the settings from Table 1, row 8. We report IS and FID in Table 2.

  • As our models are able to trade sample variety for quality, it is unclear how best to compare against prior art; we accordingly report values at three settings, with complete curves in Appendix D.
    • First, we report the FID/IS values at the truncation setting which attains the best FID.
    • Second, we report the FID at the truncation setting for which our model’s IS is the same as that attained by the real validation data, reasoning that this is a passable measure of maximum sample variety achieved while still achieving a good level of “objectness.”
    • Third, we report FID at the maximum IS achieved by each model, to demonstrate how much variety must be traded off to maximize quality.
  • In all three cases, our models outperform the previous state-of-the-art IS and FID scores.

BigGAN-deep

  • BigGAN-deep: a 4x deeper model which uses a different configuration of residual blocks.
    • As can be seen from Table 2, BigGAN-deep substantially outperforms BigGAN across all resolutions and metrics.
    • This confirms that our findings extend to other architectures, and that increased depth leads to improvement in sample quality.

Whether or not G simply memorizes training points

  • To test this, we perform class-wise nearest neighbors analysis in pixel space and the feature space of pre-trained classifier networks.
  • In addition, we present both interpolations between samples and class-wise interpolations (where z is held constant) in Figures 8 and 9.
  • Our model convincingly interpolates between disparate samples, and the nearest neighbors for its samples are visually distinct, suggesting that our model does not simply memorize training data.

Failure modes

  • We note that some failure modes of our partially-trained models are distinct from those previously observed. Most previous failures involve local artifacts (Odena et al., 2016), images consisting of texture blobs instead of objects (Salimans et al., 2016), or the canonical mode collapse.
  • We observe class leakage, where images from one class contain properties of another, as exemplified by Figure 4(d).
  • We also find that many classes on ImageNet are more difficult than others for our model; our model is more successful at generating dogs (which make up a large portion of the dataset, and are mostly distinguished by their texture) than crowds (which comprise a small portion of the dataset and have more large-scale structure). Further discussion is available in Appendix A.

Additional Evaluation on JFT-300M

  • To confirm that our design choices are effective for even larger and more complex and diverse datasets, we also present results of our system on a subset of JFT-300M.
    • The full JFT-300M dataset contains 300M real-world images labeled with 18K categories.
    • Since the category distribution is heavily long-tailed, we subsample the dataset to keep only images with the 8.5K most common labels. The resulting dataset contains 292M images – two orders of magnitude larger than ImageNet.
    • For images with multiple labels, we sample a single label randomly and independently whenever an image is sampled.
    • To compute IS and FID for the GANs trained on this dataset, we use an Inception v2 classifier trained on this dataset. Quantitative results are presented in Table 3.
    • All models are trained with batch size 2048.
  • We compare an ablated version of our model – comparable to SA-GAN but with the larger batch size – against a “full” BigGAN model that makes use of all of the techniques applied to obtain the best results on ImageNet (shared embedding, skip-z, and orthogonal regularization).
    • Our results show that these techniques substantially improve performance even in the setting of this much larger dataset at the same model capacity (64 base channels). We further show that for a dataset of this scale, we see significant additional improvements from expanding the capacity of our models to 128 base channels, while for ImageNet GANs that additional capacity was not beneficial.
    • In Figure 19, we present truncation plots for models trained on this dataset. Unlike for ImageNet, where truncation limits of σ ≈ 0 tend to produce the highest fidelity scores, IS is typically maximized for our JFT-300M models when the truncation value σ ranges from 0.5 to 1. We suspect that this is at least partially due to the intra-class variability of JFT-300M labels, as well as the relative complexity of the image distribution, which includes images with multiple objects at a variety of scales.
    • Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization, the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues.

Architectural Details

BigGAN

  • We use the ResNet GAN architecture of SAGAN, which is identical to that used by (Miyato et al., 2018), but with the channel pattern in D modified so that the number of filters in the first convolutional layer of each block is equal to the number of output filters.
  • We use a single shared class embedding in G, and skip connections for the latent vector z (skip-z). In particular, we employ hierarchical latent spaces, so that the latent vector z is split along its channel dimension into chunks of equal size (20-D in our case), and each chunk is concatenated to the shared class embedding and passed to a corresponding residual block as a conditioning vector. The conditioning of each block is linearly projected to produce per-sample gains and biases for the BatchNorm layers of the block. The bias projections are zero-centered, while the gain projections are centered at 1.
    • Since the number of residual blocks depends on the image resolution, the full dimensionality of z is 120 for 128×128, 140 for 256×256, and 160 for 512×512 images.

BigGAN-deep

  • The BigGAN-deep model differs from BigGAN in several aspects.
  • It uses a simpler variant of skip-z conditioning: instead of first splitting z into chunks, we concatenate the entire z with the class embedding, and pass the resulting vector to each residual block through skip connections.
  • BigGAN-deep is based on residual blocks with bottlenecks, which incorporate two additional 1×1 convolutions: the first reduces the number of channels by a factor of 4 before the more expensive 3×3 convolutions; the second produces the required number of output channels.
    • While BigGAN relies on 1×1 convolutions in the skip connections whenever the number of channels needs to change, in BigGAN-deep we use a different strategy aimed at preserving identity throughout the skip connections (see the sketch at the end of this section).
      • In G, where the number of channels needs to be reduced, we simply retain the first group of channels and drop the rest to produce the required number of channels.
      • In D, where the number of channels should be increased, we pass the input channels unperturbed, and concatenate them with the remaining channels produced by a 1×1 convolution.
    • Despite their increased depth, the BigGAN-deep models have significantly fewer parameters, mainly due to the bottleneck structure of their residual blocks. For example, the 128×128 BigGAN-deep G and D have 50.4M and 34.6M parameters respectively, while the corresponding original BigGAN models have 70.4M and 88.0M parameters.
  • As far as the network configuration is concerned, the discriminator is an exact reflection of the generator. There are two blocks at each resolution (BigGAN uses one), and as a result BigGAN-deep is four times deeper than BigGAN.
  • All BigGAN-deep models use attention at 64×64 resolution, channel width multiplier ch = 128, and z ∈ ℝ¹²⁸.
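A minimal sketch of the identity-preserving skip connections in BigGAN-deep: channel dropping in G, and concatenation with a 1×1 convolution in D (module names are illustrative):

```python
import torch
import torch.nn as nn

def g_skip(x: torch.Tensor, out_channels: int) -> torch.Tensor:
    """Generator skip: keep the first out_channels channels and drop the rest."""
    return x[:, :out_channels]

class DSkip(nn.Module):
    """Discriminator skip: pass input channels through unchanged, make up the rest with a 1x1 conv."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        assert out_channels > in_channels          # used only where the channel count increases
        self.extra = nn.Conv2d(in_channels, out_channels - in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x, self.extra(x)], dim=1)
```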

Experimental Details

  • We employ the architectures detailed in Appendix B, with non-local blocks inserted at a single stage in each network.
  • Both G and D networks are initialized with Orthogonal Initialization (Saxe et al., 2014).
  • We use the Adam optimizer with β₁ = 0 and β₂ = 0.999 and a constant learning rate.
    • For BigGAN models at all resolutions, we use 2·10⁻⁴ in D and 5·10⁻⁵ in G.
    • For BigGAN-deep, we use a learning rate of 2·10⁻⁴ in D and 5·10⁻⁵ in G for 128×128 models, and 2.5·10⁻⁵ in both D and G for 256×256 and 512×512 models.
  • We experimented with the number of D steps per G step (varying it from 1 to 6) and found that two D steps per G step gave the best results.
  • We use an exponential moving average of the weights of G at sampling time, with a decay rate set to 0.9999 (a sketch of this averaging follows this list). We employ cross-replica BatchNorm (Ioffe & Szegedy, 2015) in G, where batch statistics are aggregated across all devices, rather than a single device as in standard implementations.
  • Spectral Normalization is used in both G and D, following SA-GAN.
  • We train on a Google TPU v3 Pod, with the number of cores proportional to the resolution: 128 for 128×128, 256 for 256×256, and 512 for 512×512. Training takes between 24 and 48 hours for most models.
  • We increase ε from the default 10⁻⁸ to 10⁻⁴ in BatchNorm and Spectral Norm to mollify low-precision numerical issues.
  • We preprocess data by cropping along the long edge and rescaling to a given resolution with area resampling.
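A minimal sketch of the exponential moving average of G's weights used at sampling time (names are illustrative; BatchNorm buffers are handled separately, e.g. via the standing statistics described below):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_generator: torch.nn.Module, generator: torch.nn.Module,
               decay: float = 0.9999) -> None:
    """ema_param <- decay * ema_param + (1 - decay) * param, called after every G update."""
    for ema_p, p in zip(ema_generator.parameters(), generator.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# ema_generator = copy.deepcopy(generator)   # initialize once; sample from ema_generator at evaluation time
```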

BatchNorm Statistics and Sampling

  • The default behavior with batch normalized classifier networks is to use a running average of the activation moments at test time. Previous works (Radford et al., 2016) have instead used batch statistics when sampling images. While this is not technically an invalid way to sample, it means that results are dependent on the test batch size (and how many devices it is split across), and further complicates reproducibility.
  • We find that this detail is extremely important, with changes in test batch size producing drastic changes in performance. This is further exacerbated when one uses exponential moving averages of G’s weights for sampling, as the BatchNorm running averages are computed with non-averaged weights and are poor estimates of the activation statistics for the averaged weights.
  • To counteract both these issues, we employ “standing statistics,” where we compute activation statistics at sampling time by running G through multiple forward passes (typically 100), each with different batches of random noise, and storing means and variances aggregated across all forward passes (a sketch follows). Analogous to using running statistics, this results in G’s outputs becoming invariant to batch size and the number of devices, even when producing a single sample.
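A minimal sketch of accumulating standing statistics, under the assumption that G uses nn.BatchNorm2d internally; setting momentum to None makes the running estimates a plain average over the forward passes (an implementation convenience, not the authors' code):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def accumulate_standing_stats(generator: nn.Module, z_dim: int, num_classes: int,
                              n_passes: int = 100, batch_size: int = 64) -> None:
    """Estimate BatchNorm activation statistics by averaging over many random forward passes."""
    for m in generator.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()
            m.momentum = None                 # cumulative average instead of exponential average
    generator.train()                          # use (and update) batch statistics
    for _ in range(n_passes):
        z = torch.randn(batch_size, z_dim)
        y = torch.randint(0, num_classes, (batch_size,))
        generator(z, y)                        # the (z, y) signature is an assumption
    generator.eval()                           # sampling now uses the accumulated "standing" statistics
```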

Inception Scores of ImageNet Images

  • We compute the IS for both the training and validation sets of ImageNet.
    • At 128×128 the training data has an IS of 233, and the validation data has an IS of 166.
    • At 256×256 the training data has an IS of 377, and the validation data has an IS of 234.
    • At 512×512 the training data has an IS of 348, and the validation data has an IS of 241.
  • The discrepancy between training and validation scores is due to the Inception classifier having been trained on the training data, resulting in high-confidence outputs that are preferred by the Inception Score.

CIFAR-10

  • We run our networks on CIFAR-10 using the settings from Table 1, row 8, and achieve an IS of 9.22 and an FID of 14.73 without truncation.

Choosing Latent Spaces

  • We explore the choice of latents by considering an array of possible designs. The two latents which we find to work best without truncation are Bernoulli {0, 1} and Censored Normal max(N(0, I), 0), both of which improve speed of training and lightly improve final performance, but are less amenable to truncation.
  • We also ablate the choice of latent space dimensionality (which by default is z ∈ ℝ¹²⁸), finding that we are able to successfully train with latent dimensions as low as z ∈ ℝ⁸, and that with z ∈ ℝ³² we see a minimal drop in performance.

Latents

  • N(0, I). A standard choice of the latent space, which we use in the main experiments.
  • U[−1, 1]. Another standard choice; we find that it performs similarly to N(0, I).
  • Bernoulli {0, 1}. A discrete latent might reflect our prior that underlying factors of variation in natural images are not continuous, but discrete (one feature is present, another is not). This latent outperforms N(0, I) (in terms of IS) by 8% and requires 60% fewer iterations.
  • max(N(0, I), 0), also called Censored Normal. This latent is designed to introduce sparsity in the latent space (reflecting our prior that certain latent features are sometimes present and sometimes not), but also allows those latents to vary continuously, expressing different degrees of intensity for latents which are active. This latent outperforms N(0, I) (in terms of IS) by 15-20% and tends to require fewer iterations. (Sketches of these latent distributions follow below.)
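Minimal sketches of sampling from the latent distributions listed above (batch size and dimensionality are illustrative):

```python
import torch

batch, z_dim = 8, 128

z_normal    = torch.randn(batch, z_dim)                          # N(0, I)
z_uniform   = torch.rand(batch, z_dim) * 2.0 - 1.0               # U[-1, 1]
z_bernoulli = torch.bernoulli(torch.full((batch, z_dim), 0.5))   # Bernoulli{0, 1}
z_censored  = torch.clamp(torch.randn(batch, z_dim), min=0.0)    # Censored Normal max(N(0, I), 0)
```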

Negative Results

  • We found that doubling the depth (by inserting an additional Residual block after every up- or down-sampling block) hampered performance; making the network deeper does not necessarily improve it.
  • We experimented with sharing class embeddings between both G and D (as opposed to just within G). This is accomplished by replacing D’s class embedding with a projection from G’s embeddings, as is done in G’s BatchNorm layers. In our initial experiments this seemed to help and accelerate training, but we found this trick scaled poorly and was sensitive to optimization hyperparameters, particularly the choice of number of D steps per G step.
  • We tried replacing BatchNorm in G with WeightNorm (Salimans & Kingma, 2016), but this crippled training. We also tried removing BatchNorm and only having Spectral Normalization, but this also crippled training.
  • We tried adding BatchNorm to D (both class-conditional and unconditional) in addition to Spectral Normalization, but this crippled training.
  • We tried varying the choice of location of the attention block in G and D (and inserting multiple attention blocks at different resolutions) but found that at 128×128 there was no noticeable benefit to doing so, and compute and memory costs increased substantially. We found a benefit to moving the attention block up one stage when moving to 256×256, which is in line with our expectations given the increased resolution.
  • We tried using filter sizes of 5 or 7 instead of 3 in either G or D or both. We found that a filter size of 5 in G only provided a small improvement over the baseline but came at an unjustifiable compute cost. All other settings degraded performance.
  • We tried varying the dilation for convolutional filters in both G and D at 128×128, but found that even a small amount of dilation in either network degraded performance.
  • We tried bilinear upsampling in G in place of nearest-neighbors upsampling, but this degraded performance.
  • In some of our models, we observed class-conditional mode collapse, where the model would only output one or two samples for a subset of classes but was still able to generate samples for all other classes. We noticed that the collapsed classes had embeddings which had become very large relative to the other embeddings, and attempted to ameliorate this issue by applying weight decay to the shared embedding only. We found that small amounts of weight decay (10⁻⁶) instead degraded performance, and that only even smaller values (10⁻⁸) did not degrade performance, but these values were also too small to prevent the class vectors from exploding. Higher-resolution models appear to be more resilient to this problem, and none of our final models appear to suffer from this type of collapse.
  • We experimented with using MLPs instead of linear projections from G’s class embeddings to its BatchNorm gains and biases, but did not find any benefit to doing so. We also experimented with Spectrally Normalizing these MLPs, and with providing these (and the linear projections) with a bias at their output, but did not notice any benefit.
  • We tried gradient norm clipping (both the global variant typically used in recurrent networks, and a local version where the clipping value is determined on a per-parameter basis) but found this did not alleviate instability.