SNGAN: Spectral Normalization for GANs

Latest recommended article published on 2023-12-04 14:55:46

连理o

Article link: https://blog.csdn.net/weixin_42437114/article/details/119268413

Spectral Norm Regularization (SNR)

Motivation

In this study, we consider the generalizability of deep learning from the perspective of sensitivity to input perturbation. We hypothesize that high sensitivity to the perturbation of data degrades the performance on that data. (The paper later verifies this hypothesis experimentally.)
- Intuitively, if a trained model is insensitive or sensitive to the perturbation of an input, then the model is confident or not confident about the output, respectively. As the performance on test data is important, models that are insensitive to the perturbation of test data are required. (For example, a model that returns a completely different class after a single pixel of an image is changed is too sensitive.)
- A conventional method of understanding the generalizability of a trained model is the notion of the flatness/sharpness of a local minimum. A local minimum is (informally) referred to as flat if its loss value does not increase significantly when it is perturbed; otherwise, it is referred to as sharp. In general, the high sensitivity of a training function at a sharp local minimizer negatively affects the generalizability of the trained model.
To reduce the sensitivity to perturbation, we propose a simple and effective regularization method, referred to as spectral norm regularization, which penalizes the high spectral norm of weight matrices in neural networks.

Spectral Norm

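The spectral norm of a matrix $A\in\R^{m\times n}$ is defined as
$\sigma(A)=\max_{\xi\in\R^n,\xi\neq 0}\frac{||A\xi||_2}{||\xi||_2}$
which corresponds to the largest singular value of $A$ .
- Proof: The value of $\frac{||A\xi||_2}{||\xi||_2}$ clearly does not depend on the length of $\xi$ , so we may assume $||\xi||_2=1$ . The problem then becomes $\max_{\xi\in\R^n,||\xi||_2=1}\sqrt{\xi^TA^TA\xi}$ . This is a constrained optimization problem whose maximum is the square root of the largest eigenvalue of $A^TA$ , that is, the largest singular value of $A$ .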
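As a quick numerical check of the statement that the spectral norm equals the largest singular value, the following NumPy sketch compares three ways of computing it. The matrix size and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))  # arbitrary example matrix

# Largest singular value taken from the SVD.
sigma_svd = np.linalg.svd(A, compute_uv=False)[0]
# Square root of the largest eigenvalue of A^T A (eigvalsh sorts ascending).
sigma_eig = np.sqrt(np.linalg.eigvalsh(A.T @ A)[-1])
# Matrix 2-norm as computed by NumPy.
sigma_norm = np.linalg.norm(A, 2)

# The ratio ||A xi|| / ||xi|| for random directions never exceeds it.
xis = rng.standard_normal((6, 1000))
ratios = np.linalg.norm(A @ xis, axis=0) / np.linalg.norm(xis, axis=0)
```

Random directions only give a lower estimate of the maximum; the exact value is attained at the first right singular vector.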

Spectral Norm Regularization

General idea

We consider feed-forward neural networks as a simple example to explain the intuition behind spectral norm regularization.

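A feed-forward neural network can be represented as cascaded computations,
$x^{\ell}=f^{\ell}(W^{\ell}x^{\ell-1}+b^{\ell})$
for $\ell = 1, \dots, L$ for some $L$ , where $x^{\ell-1}\in\R^{n_{\ell-1}}$ is the input of the $\ell$ -th layer, $W^{\ell}\in\R^{n_\ell\times n_{\ell-1}}$ and $b^{\ell}\in\R^{n_\ell}$ are the layer-wise weight matrix and bias vector, and $f^{\ell}: \R^{n_{\ell}} \rightarrow \R^{n_{\ell}}$ is a (non-linear) activation function.
- Hence, the parameters of the feed-forward network above are given by the set $\Theta=\{W^{\ell},b^{\ell}\}_{\ell=1}^L$ , and the whole network can be written as a function $f_\Theta:\R^{n_0}\rightarrow\R^{n_L}$ (with $f_\Theta(x^0)=x^L$ ).
- Given training data, $(x_i, y_i)_{i=1}^K$ , the loss function is defined as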
  $\frac{1}{K}\sum_{i=1}^KL(f_\Theta(x_i),y_i)$
Our goal is to obtain a model $\Theta$ that is insensitive to the perturbation of the input, that is, to make the perturbation index $P$ as small as possible:
$P=\frac{||f(x+\xi)-f(x)||_2}{||\xi||_2}$
A key observation is that most practically used neural networks exhibit nonlinearity only because they use piecewise linear functions, such as ReLU, as activation functions. In such a case, function $f_\Theta$ is a piecewise linear function. Hence, if we consider a small neighborhood of $x$ , we can regard $f_\Theta$ as a linear function.
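- In other words, we can represent $f_\Theta$ by an affine map, $x\mapsto W_{\Theta,x}x + b_{\Theta,x}$ . Then, for a small perturbation, $\xi \in \R^{n_0}$ , we have
  $\frac{||f_\Theta(x+\xi)-f_\Theta(x)||_2}{||\xi||_2}=\frac{||(W_{\Theta,x}(x+\xi)+b_{\Theta,x})-(W_{\Theta,x}x+b_{\Theta,x})||_2}{||\xi||_2}=\frac{||W_{\Theta,x}\xi||_2}{||\xi||_2}\leq\sigma(W_{\Theta,x})$
  where $\sigma(W_{\Theta,x})$ is the spectral norm of $W_{\Theta,x}$ .
- That is, $\sigma(W_{\Theta,x})$ is an upper bound on the perturbation index. Making $\sigma(W_{\Theta,x})$ as small as possible reduces the network's sensitivity to input perturbation.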
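To further investigate the property of $W_{\Theta,x}$ , let us assume that each activation function, $f^\ell$ , is an element-wise ReLU (the argument can be easily generalized to other piecewise linear functions). Note that, for a given vector, $x$ , $f^\ell$ acts as a diagonal matrix, $D_{\Theta,x}^\ell\in\R^{n_\ell\times n_\ell}$ , where an element in the diagonal is equal to one if the corresponding element in $x^{\ell-1}$ is positive; otherwise, it is equal to zero. Then, we can rewrite $W_{\Theta,x}$ as
$W_{\Theta,x}=D_{\Theta,x}^LW^LD_{\Theta,x}^{L-1}W^{L-1}\cdots D_{\Theta,x}^1W^1$
By the submultiplicativity of the matrix norm ( $||AB||\leq||A||\,||B||$ ) and $\sigma(D_{\Theta,x}^\ell)\leq1$ for every $\ell$ , we have
$\sigma(W_{\Theta,x})\leq\sigma(D_{\Theta,x}^L)\sigma(W^L)\sigma(D_{\Theta,x}^{L-1})\sigma(W^{L-1})\cdots\sigma(D_{\Theta,x}^1)\sigma(W^1)\leq\prod_{\ell=1}^L\sigma(W^\ell)$
This inequality gives a further upper bound on $\sigma(W_{\Theta,x})$ . It also suggests that reducing the spectral norm of each layer's weight matrix improves the model's robustness to input perturbation. This leads to spectral norm regularization.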
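The bound of the perturbation index by the product of the layer-wise spectral norms can be checked numerically. The sketch below builds a small ReLU network with random weights (the layer widths, seed, and perturbation scale are assumptions chosen for illustration) and compares the largest observed ratio $||f(x+\xi)-f(x)||_2/||\xi||_2$ with $\prod_\ell\sigma(W^\ell)$ .

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]  # hypothetical layer widths n_0, ..., n_L
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    # x^l = ReLU(W^l x^{l-1} + b^l) for every layer.
    for W, b in zip(Ws, bs):
        x = np.maximum(W @ x + b, 0.0)
    return x

# Upper bound: product of the spectral norms of all weight matrices.
bound = np.prod([np.linalg.norm(W, 2) for W in Ws])

x = rng.standard_normal(sizes[0])
worst = 0.0
for _ in range(1000):
    xi = 1e-3 * rng.standard_normal(sizes[0])  # small random perturbation
    P = np.linalg.norm(forward(x + xi) - forward(x)) / np.linalg.norm(xi)
    worst = max(worst, P)
```

In practice `worst` is usually far below `bound`, since the bound ignores which units are active and how the singular directions of consecutive layers align.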

Details of spectral norm regularization

Spectral Norm Regularizer

To bound the spectral norm of each weight matrix, $W^\ell$ , we consider the following empirical risk minimization problem:
$\min_\Theta\frac{1}{K}\sum_{i=1}^KL(f_\Theta(x_i),y_i)+\frac{\lambda}{2}\sum_{\ell=1}^L\sigma(W^\ell)^2$
where $\lambda \in \R_+$ is a regularization factor. We refer to the second term as the spectral norm regularizer. It decreases the spectral norms of the weight matrices.

Calculate the gradient of the spectral norm regularizer

When performing SGD, we need to calculate the gradient of the spectral norm regularizer. To this end, let us consider the gradient of $\sigma(W^\ell)_2^2/2$ for a particular $\ell \in \{1, 2, \dots, L\}$ . (Author's note: it is unclear what the subscript 2 means here; the next sentence writes the same quantity as $\sigma(W^\ell)^2/2$ .)
Let $σ_1 = σ(W^\ell)$ and $σ_2$ be the first and second singular values, respectively. If $σ_1 > σ_2$ , then the gradient of $σ(W^\ell)^2/2$ is $σ_1u_1v_1^T$ , where $u_1$ and $v_1$ are the first left and right singular vectors, respectively. If $σ_1 = σ_2$ , then $σ(W^\ell)^2_2$ is not differentiable. (Author's note: I do not fully understand this. Does "not differentiable" mean that the gradient is no longer $σ_1u_1v_1^T$ when $σ_1 = σ_2$ ? One way to read it: when the two largest singular values coincide, $u_1$ and $v_1$ are no longer unique, so $σ_1u_1v_1^T$ does not define a single gradient.) However, for practical purposes, we can assume that this case never occurs because numerical errors prevent $σ_1$ and $σ_2$ from being exactly equal.
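- Proof: Let $L_\ell=\frac{1}{2}σ(W^\ell)^2$ . Then, from the singular value decomposition of $W^\ell$ (which gives $\frac{\partial σ(W^\ell)}{\partial W^\ell}=u_1v_1^T$ ),
  $\frac{\partial L_\ell}{\partial W^\ell}=\frac{\partial L_\ell}{\partial σ(W^\ell)}\frac{\partial σ(W^\ell)}{\partial W^\ell}=σ(W^\ell)\cdot(u_1v_1^T)=σ_1u_1v_1^T$
- Hence, during backpropagation we only need to add $\lambda σ_1u_1v_1^T$ to the weight gradient of each layer. What remains is a fast way to compute the largest singular value $\sigma_1$ of $W^{\ell}$ together with the corresponding left singular vector $u_1$ and right singular vector $v_1$ .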
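The gradient formula $σ_1u_1v_1^T$ can be verified against central finite differences. This is only a sanity check on a small random matrix (the size, seed, and step `eps` are arbitrary choices); it uses a full SVD, which is exactly what the power iteration method is meant to avoid during training.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))  # arbitrary example weight matrix

def half_sq_sigma(M):
    # The regularizer term sigma(M)^2 / 2.
    return 0.5 * np.linalg.norm(M, 2) ** 2

# Analytic gradient: sigma_1 * u_1 v_1^T from the SVD.
U, S, Vt = np.linalg.svd(W)
analytic = S[0] * np.outer(U[:, 0], Vt[0])

# Numerical gradient by central differences, one entry at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (half_sq_sigma(W + E) - half_sq_sigma(W - E)) / (2 * eps)
```

The outer product $u_1v_1^T$ is unaffected by the sign ambiguity of the SVD, because flipping $u_1$ also flips $v_1$ .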

Power iteration method

Starting with a randomly initialized $v\in\R^{n_{\ell-1}}$ , we iteratively perform the following procedure a sufficient number of times:
$u\leftarrow W^\ell v,\quad v\leftarrow(W^\ell)^Tu,\quad\sigma\leftarrow||u||_2/||v||_2$
(Author's note: I still do not fully understand this iteration. Why are $u$ and $v$ not normalized? The way $\sigma$ is computed is also unclear to me. The iteration in the Spectral Normalization paper is easier to follow, so for now I set aside why this one approximates $σ_1$ , $u_1$ , and $v_1$ and focus on the idea.)
Then, $σ$ , $u$ , and $v$ converge to $σ_1$ , $u_1$ , and $v_1$ , respectively (if $σ_1 > σ_2$ ).
To approximate $σ_1$ , $u_1$ , and $v_1$ in the next iteration of SGD, we can reuse $v$ as the initial vector. In our experiments, we performed only one iteration because it was adequate for obtaining a sufficiently good approximation.
- Because the weights change slowly, we only need to perform a single power iteration on the current version of these vectors for each step of learning; this is why spectral norm regularization is computationally efficient
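A minimal sketch of how the regularizer gradient $\lambda σ_1u_1v_1^T$ could be approximated with one power-iteration step per call and a reused vector $v$ . Note the assumptions: the helper name `spectral_reg_grad` is hypothetical, the vectors are normalized at every step (a common variant, not necessarily the exact update used in the paper), and the test matrix is built with a clear gap between $σ_1$ and $σ_2$ so that convergence is fast.

```python
import numpy as np

def spectral_reg_grad(W, v, lam):
    """One power-iteration step, then the approximate gradient lam * sigma * u v^T."""
    u = W @ v
    u /= np.linalg.norm(u)
    v = W.T @ u
    sigma = np.linalg.norm(v)  # ||W^T u|| approaches sigma_1 as u approaches u_1
    v /= sigma
    return lam * sigma * np.outer(u, v), v

rng = np.random.default_rng(0)
# Hypothetical 6x4 weight matrix with singular values 3, 2, 1, 0.5.
Q, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W = (Q * np.array([3.0, 2.0, 1.0, 0.5])) @ V.T

v = rng.standard_normal(4)
for _ in range(100):  # repeated calls on a fixed W, only to check convergence
    grad, v = spectral_reg_grad(W, v, lam=0.01)

# Reference value from a full SVD.
U, S, Vt = np.linalg.svd(W)
exact = 0.01 * S[0] * np.outer(U[:, 0], Vt[0])
```

During training, `spectral_reg_grad` would be called once per SGD step with the `v` returned by the previous step, and `grad` would be added to the weight gradient of the data loss.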


Convolutions

Consider a convolutional layer with $a$ input channels, $b$ output channels, and a $k_w × k_h$ -sized kernel. Note that a value in an output channel is determined using $ak_wk_h$ values in the input channels.
Hence, we align the parameters as a matrix of size $b × ak_wk_h$ and apply the abovementioned power iteration method to the matrix to calculate its spectral norm and gradient.
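The reshaping of a convolution kernel into a $b × ak_wk_h$ matrix can be sketched as follows. The `(out, in, height, width)` memory layout and the concrete sizes are assumptions for illustration; frameworks differ in how they store kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, k_h, k_w = 3, 8, 3, 3  # input channels, output channels, kernel size
kernel = rng.standard_normal((b, a, k_h, k_w))  # assumed (out, in, height, width) layout

# Align the parameters as a b x (a * k_h * k_w) matrix.
W_mat = kernel.reshape(b, a * k_h * k_w)
sigma = np.linalg.norm(W_mat, 2)  # spectral norm of the reshaped matrix

# One output position: each of the b values is a dot product with a
# flattened input patch of a * k_h * k_w values.
patch = rng.standard_normal((a, k_h, k_w))
out_tensor = np.tensordot(kernel, patch, axes=3)  # shape (b,)
out_matrix = W_mat @ patch.ravel()
```

The two ways of computing a single output position agree, which is why the power iteration can be applied to `W_mat` directly.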

Spectral Normalization for GANs

Introduction

In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our normalization enjoys the following favorable properties.
- Lipschitz constant is the only hyper-parameter to be tuned, and the algorithm does not require intensive tuning of the only hyper-parameter for satisfactory performance.
- Implementation is simple and the additional computational cost is small.

The centrality of Lipschitz continuity in GANs

Notation

Let us consider a simple discriminator made of a neural network of the following form, with the input $x$ :
$f(x,\theta)=W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\dots a_1(W^1x)\dots))))$
where $\theta := \{W^1, \dots, W^L, W^{L+1}\}$ is the learning parameters set, $W^l \in \R^{d_l\times d_{l-1}}$ , $W^{L+1} \in \R^{1\times d_L}$ , and $a_l$ is an element-wise non-linear activation function. We omit the bias term of each layer for simplicity. The final output of the discriminator is given by
$D(x,\theta)=\mathcal A(f(x,\theta))$
where $\mathcal A$ is an activation function.

Standard formulation of GANs

The standard formulation of GANs is given by
$\min_G\max_DV(G,D)$
where $\min$ and $\max$ of $G$ and $D$ are taken over the set of generator and discriminator functions, respectively.
- The conventional form of $V (G, D)$ is given by
  $E_{x\sim q_{data}}[\log D(x)]+E_{x'\sim p_G}[\log(1-D(x'))]$
  where $q_{data}$ is the data distribution and $p_G$ is the (model) generator distribution to be learned through the adversarial min-max optimization.
- The activation function $\mathcal A$ is some continuous function with range $[0, 1]$ (e.g, sigmoid function).
- It is known that, for a fixed generator $G$ , the optimal discriminator is given by
  $D_G^*(x)=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)}$
The machine learning community has been pointing out recently that the function space from which the discriminators are selected crucially affects the performance of GANs.
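- e.g. The optimal discriminator of GANs on the above standard formulation takes the form
  $D_G^*(x)=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)}=\mathrm{sigmoid}(f^*(x))$ , where $f^*(x)=\log q_{data}(x)-\log p_G(x)$ ,
  and its derivative
  $\nabla_xf^*(x)=\frac{1}{q_{data}(x)}\nabla_xq_{data}(x)-\frac{1}{p_G(x)}\nabla_xp_G(x)$
  can be unbounded or even incomputable. This prompts us to introduce some regularity condition to the derivative of $f (x)$ .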
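For readers who want the intermediate steps, the following short derivation shows why the pre-activation of the optimal discriminator is a log-density ratio and where the unbounded derivative comes from. It assumes $\mathcal A$ is the sigmoid, $\mathrm{sigmoid}(z)=1/(1+e^{-z})$ , and that $q_{data}(x)>0$ and $p_G(x)>0$ at the point considered.

```latex
\begin{aligned}
\mathrm{sigmoid}(f^*(x)) = \frac{1}{1+e^{-f^*(x)}}
  &= \frac{q_{data}(x)}{q_{data}(x)+p_G(x)}
   = \frac{1}{1+p_G(x)/q_{data}(x)} \\
\Rightarrow\quad e^{-f^*(x)} &= \frac{p_G(x)}{q_{data}(x)} \\
\Rightarrow\quad f^*(x) &= \log q_{data}(x)-\log p_G(x) \\
\nabla_x f^*(x) &= \frac{\nabla_x q_{data}(x)}{q_{data}(x)}-\frac{\nabla_x p_G(x)}{p_G(x)}
\end{aligned}
```

Both terms of the gradient divide by a density, so wherever $q_{data}(x)$ or $p_G(x)$ is close to zero the gradient can become arbitrarily large.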

Lipschitz continuity

A number of works (Uehara et al., 2016; Qi (Loss-Sensitive GAN; LSGAN), 2017; Gulrajani et al. (WGAN-GP), 2017) advocate the importance of Lipschitz continuity in assuring the boundedness of statistics.
- If we are training a Wasserstein GAN, then Kantorovich-Rubinstein duality requires it.
- However, when we’re training using the standard KL-divergence loss, there’s a looser but still intuitive explanation. Mandating Lipschitz continuity bounds the gradients in our discriminator.
Particularly successful works in this array are (Qi (Loss-Sensitive GAN; LSGAN), 2017; Arjovsky et al. (WGAN), 2017; Gulrajani et al. (WGAN-GP), 2017), which proposed methods to control the Lipschitz constant of the discriminator by adding regularization terms defined on input examples $x$ . We follow in their footsteps and search for the discriminator $D$ from the set of $K$ -Lipschitz continuous functions, that is,
$\arg\max_{||f||_{Lip}\leq K}V(G,D)$
where we mean by $||f||_{Lip}$ the smallest value $M$ such that $||f(x)-f(x')||/||x-x'||\leq M$ for any $x, x'$ , with the norm being the $\ell_2$ norm.

Definition of Lipschitz continuity

Suppose we have a GAN discriminator $D:I→\R$ , where $I$ is the space of images (e.g., $\R^{32\times32}$ ). Because both the domain and codomain of this function have inner products, we have a natural metric (distance function) in both spaces: the L2 distance.
If our discriminator is $K$ -Lipschitz continuous, then for all $x$ and $y$ in $I$ ,
$||D(x)-D(y)||\leq K||x-y||$
where $||\cdot||$ is the L2 norm. Here, if $K$ is the smallest value for which this holds, then it is called the Lipschitz constant of the discriminator.
- Let’s look at the 1D case to illustrate Lipschitz continuity geometrically. Suppose $D(x) = \sin(x)$ . If $D$ is Lipschitz continuous, then we can draw a cone centered at every point on its graph such that the graph lies outside of this cone. It’s clear that $\sin$ is 1-Lipschitz continuous.
- What about ReLU, our favorite activation function? - This is also obviously 1-Lipschitz continuous.
One interesting fact to notice from these examples is that if a 1D function is differentiable, then its Lipschitz constant is just the maximum absolute value of its derivative.
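The two 1D examples can be checked numerically by measuring the steepest slope between neighbouring points of a fine grid. This gives a lower estimate of the Lipschitz constant; the helper name `max_slope` and the grid are arbitrary choices for illustration.

```python
import numpy as np

def max_slope(f, xs):
    """Largest |f(x) - f(y)| / |x - y| over neighbouring grid points."""
    ys = f(xs)
    return np.max(np.abs(np.diff(ys)) / np.diff(xs))

xs = np.linspace(-10.0, 10.0, 200001)  # grid spacing of 1e-4
lip_sin = max_slope(np.sin, xs)
lip_relu = max_slope(lambda x: np.maximum(x, 0.0), xs)
```

Both estimates come out at (or just below) 1, matching the largest absolute derivative of $\sin$ and the slope of ReLU on the positive half-line.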

The Lipschitz constant of a linear function is its spectral norm

Suppose we have a linear function $A:\R^n→\R^m$ . This function could be the pre-activation operation of one layer in a multilayer perceptron.
What is the Lipschitz constant of this function, if it exists?
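- Since $A$ is linear, if $A$ is $K$ -Lipschitz at zero, then it is $K$ -Lipschitz everywhere. (proof sketch: $||A(x-y)-A0||\leq K||(x-y)-0||\Rightarrow||Ax-Ay||\leq K||x-y||$ )
- This simplifies the Lipschitz continuity requirement to
  $||Ax||\leq K||x||$
  for all $x \in \R^n$ . This is equivalent to the statement
  $\langle Ax,Ax\rangle\leq K^2\langle x,x\rangle$
  which in turn is equivalent to (with the identity matrix, written $I$ here, made explicit after $K^2$ )
  $\langle(A^TA-K^2I)x,x\rangle\leq0$
  If we expand $x$ in the orthonormal basis of eigenvectors of $A^TA$ (i.e., $x=\sum_ix_iv_i$ ), we can then write out this inner product explicitly (using basic properties of inner product spaces; $\lambda_i$ is the eigenvalue corresponding to the eigenvector $v_i$ ):
  $\langle(A^TA-K^2I)x,x\rangle=\langle(A^TA-K^2I)\sum_ix_iv_i,\sum_jx_jv_j\rangle=\sum_i\sum_jx_ix_j\langle(A^TA-K^2I)v_i,v_j\rangle=\sum_i(\lambda_i-K^2)x_i^2\leq0$
  that is, $\sum_i(K^2-\lambda_i)x_i^2\geq0$ .
  Note that since $A^TA$ is positive semidefinite, all the $λ_i$ s must be nonnegative. To guarantee that the above sum is nonnegative for every $x$ , each term must be nonnegative, so we have
  $K^2-\lambda_i\geq0$ for all $i=1,\dots,n$
  Since we choose $K$ to be the minimum value satisfying the above constraints, we immediately see that $K$ is the square root of the largest eigenvalue of $A^TA$ . Therefore, the Lipschitz constant of a linear function is its largest singular value, or its spectral norm.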
It then follows that dividing $A$ by $K$ (the spectral norm of $A$ ) makes it 1-Lipschitz continuous:
$||\frac{A}{K}x-\frac{A}{K}y||=\frac{1}{K}||Ax-Ay||\leq||x-y||$
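A small NumPy check that dividing a linear map by its spectral norm yields a 1-Lipschitz map. The matrix size, seed, and number of sampled pairs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))  # arbitrary linear map from R^7 to R^5

K = np.linalg.norm(A, 2)  # spectral norm = Lipschitz constant of A
A_hat = A / K             # normalized map

# Compare output distances with input distances for many random pairs.
x, y = rng.standard_normal((2, 7, 1000))
ratio = np.linalg.norm(A_hat @ x - A_hat @ y, axis=0) / np.linalg.norm(x - y, axis=0)
```

The spectral norm of `A_hat` is exactly 1 (up to rounding), and no sampled pair is stretched by more than a factor of 1.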

Composition of functions

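For composite functions, we have the following theorem:
$||f\circ g||_{Lip}\leq||f||_{Lip}\cdot||g||_{Lip}$
where $||\cdot||_{Lip}$ denotes the Lipschitz constant of a function.
A neural network is exactly such a nested composition of functions. The most common pattern is a convolution, then an activation, then another convolution, then another activation, and so on. The usual activation functions, ReLU and Leaky ReLU, are 1-Lipschitz, so they do not increase the product in the bound above. It is therefore enough to make the convolutional and fully connected layers 1-Lipschitz continuous to guarantee that the whole network is 1-Lipschitz continuous. Our spectral normalization normalizes the spectral norm of the weight matrix $W$ so that it satisfies the Lipschitz constraint $\sigma(W) = 1$ :
$\bar W_{SN}(W):=W/\sigma(W)$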
- For the evaluation of the spectral norm for the convolutional weight $W ∈ \R^{d_{out}×d_{in}×h×w}$ , we treated the operator as a 2-D matrix of dimension $d_{out} × (d_{in}hw)$ (Note that, since we are conducting the convolution discretely, the spectral norm will depend on the size of the stride and padding. However, the answer will only differ by some predefined $K$ .)
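To illustrate the composition argument, the sketch below normalizes every layer of a small fully connected network by its spectral norm and checks that the whole network never stretches distances by more than a factor of 1. The layer widths, the Leaky ReLU slope of 0.2, and the use of an exact spectral norm (instead of power iteration) are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(z, slope=0.2):
    # 1-Lipschitz activation: its slopes are 1 and 0.2.
    return np.where(z > 0, z, slope * z)

sizes = [10, 32, 32, 1]  # hypothetical discriminator widths
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
# Spectral normalization: every layer now has sigma(W) = 1.
Ws_sn = [W / np.linalg.norm(W, 2) for W in Ws]

def f(x):
    # Bias terms are omitted, as in the notation of the text.
    for W in Ws_sn[:-1]:
        x = leaky_relu(W @ x)
    return Ws_sn[-1] @ x  # no activation on the final layer

worst = 0.0
for _ in range(1000):
    x, y = rng.standard_normal((2, sizes[0]))
    worst = max(worst, np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))
```

The observed `worst` ratio is typically well below 1: the product bound is an upper bound and is rarely tight for a multi-layer network.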

Fast Approximation of the Spectral Norm $\sigma(W)$

Power iteration

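The idea is the same as in Spectral Norm Regularization: the power iteration method is used to quickly compute the spectral norm together with the corresponding left and right singular vectors.
update rule:
$\tilde v\leftarrow\frac{W^T\tilde u}{||W^T\tilde u||_2},\quad\tilde u\leftarrow\frac{W\tilde v}{||W\tilde v||_2}$
We can then approximate the spectral norm of $W$ with the pair of so-approximated singular vectors:
$\sigma(W)\approx\tilde u^TW\tilde v$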
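- Proof ( $\tilde u$ and the spectral norm follow directly from the definition of the SVD, so the following mainly explains why the iteration approaches $v$ ):
  - Let $A$ be a full-rank $n \times n$ matrix with unit eigenvectors $v_1, v_2,..., v_n$ and corresponding eigenvalues $λ_1, λ_2,..., λ_n$ (sorted in descending order), with $λ_1>λ_2$ . Any vector $x$ can be written as $x=\sum_ix_iv_i$ , so
    $Ax=A\sum_ix_iv_i=\sum_ix_i\lambda_iv_i$
    After $k$ iterations we have
    $A^kx=\sum_ix_i\lambda_i^kv_i=\lambda_1^k\left[x_1v_1+\sum_{i=2}^nx_i\left(\frac{\lambda_i}{\lambda_1}\right)^kv_i\right]$
    Therefore, as $k\rightarrow\infty$ , $\frac{1}{\lambda_1^k}A^kx\rightarrow x_1v_1$ ; that is, $A^kx$ points almost in the direction of $v_1$ (provided $x_1\neq0$ ).
  - Each update of $\tilde v$ amounts to $\tilde v\leftarrow W^TW\tilde v$ (followed by normalization). After enough updates, $\tilde v$ therefore points in the direction of the eigenvector of $W^TW$ with the largest eigenvalue, and normalizing it gives $v$ .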
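The power iteration with normalized vectors can be sketched in a few lines of NumPy. The function name `power_iteration` and the test matrix are assumptions; the matrix is built with known singular values so that the gap between $σ_1$ and $σ_2$ (and hence the convergence speed) is under control.

```python
import numpy as np

def power_iteration(W, n_iters, rng):
    """Approximate sigma_1 and the first singular vectors of W."""
    u = rng.standard_normal(W.shape[0])  # random initial u
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v, u, v

rng = np.random.default_rng(0)
# Hypothetical 20x10 matrix with singular values 3 * 0.7**i.
Q, _ = np.linalg.qr(rng.standard_normal((20, 10)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
W = (Q * (3.0 * 0.7 ** np.arange(10))) @ V.T

sigma_est, u, v = power_iteration(W, n_iters=50, rng=rng)

# Reference values from a full SVD.
_, S, Vt = np.linalg.svd(W)
alignment = abs(v @ Vt[0])  # 1.0 means v is parallel to the true v_1
```

The error of the direction shrinks roughly by a factor $(σ_2/σ_1)^2$ per iteration, so convergence is slow when the two largest singular values are close.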

Details (computational efficiency)

If we use SGD for updating $W$ , the change in $W$ at each update would be small, and hence the change in its largest singular value. In our implementation, we took advantage of this fact and reused the $\tilde u$ computed at each step of the algorithm as the initial vector in the subsequent step. In fact, with this ‘recycle’ procedure, one round of power iteration was sufficient in the actual experiment to achieve satisfactory performance.
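The effect of recycling $\tilde u$ can be simulated without a real training loop. In the sketch below, small random changes of $W$ stand in for SGD updates (an assumption made purely for illustration, not a real gradient), and only one round of power iteration is performed per step, starting from the $\tilde u$ of the previous step.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrix with a clear gap between sigma_1 and sigma_2.
Q, _ = np.linalg.qr(rng.standard_normal((20, 10)))
V, _ = np.linalg.qr(rng.standard_normal((10, 10)))
W = (Q * (3.0 * 0.7 ** np.arange(10))) @ V.T

u = rng.standard_normal(20)  # persistent vector, kept across steps
for step in range(200):
    # Stand-in for a small SGD update of W.
    W = W + 1e-3 * rng.standard_normal(W.shape)
    # One round of power iteration, starting from the recycled u.
    v = W.T @ u
    v /= np.linalg.norm(v)
    u = W @ v
    u /= np.linalg.norm(u)
    sigma_est = u @ W @ v

sigma_true = np.linalg.norm(W, 2)
rel_err = abs(sigma_est - sigma_true) / sigma_true
```

Because the estimate from the previous step is already close, a single iteration per step is enough to keep tracking the slowly moving spectral norm.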

Algorithm

[Image not available: the algorithm listing shown in the original post.]
