SNGAN: Spectral Normalization for GANs

Spectral Norm Regularization (SNR)

Motivation

  • In this study, we consider the generalizability of deep learning from the perspective of sensitivity to input perturbation. We hypothesize that high sensitivity to perturbations of the data degrades performance on that data. (The paper verifies this hypothesis experimentally.)
    • Intuitively, if a trained model is insensitive or sensitive to the perturbation of an input, then the model is confident or not confident about the output, respectively. As the performance on test data is important, models that are insensitive to the perturbation of test data are required. (E.g., changing a single pixel of an image should not flip the classification to a completely different label.)
    • A conventional way of understanding the generalizability of a trained model is the notion of the flatness/sharpness of a local minimum. A local minimum is (informally) referred to as flat if its loss value does not increase significantly when it is perturbed; otherwise, it is referred to as sharp. In general, the high sensitivity of a training function at a sharp local minimizer negatively affects the generalizability of the trained model.
  • To reduce the sensitivity to perturbation, we propose a simple and effective regularization method, referred to as spectral norm regularization, which penalizes a high spectral norm of the weight matrices in neural networks.

Spectral Norm

  • The spectral norm of a matrix $A\in\mathbb{R}^{m\times n}$ is defined as
    $\sigma(A) = \max_{\xi\in\mathbb{R}^n,\ \xi\neq 0}\dfrac{\|A\xi\|_2}{\|\xi\|_2},$ which corresponds to the largest singular value of $A$.
    • Proof: The value of $\frac{\|A\xi\|_2}{\|\xi\|_2}$ is clearly independent of the length of $\xi$, so we may assume $\|\xi\|_2=1$. The problem then becomes $\max_{\xi\in\mathbb{R}^n,\,\|\xi\|_2=1}\sqrt{\xi^TA^TA\xi}$. This is a constrained optimization problem whose maximum is the square root of the largest eigenvalue of $A^TA$, i.e., the largest singular value of $A$.
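A quick numerical check of this definition (a NumPy sketch, not from the paper; the matrix shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

# Spectral norm = largest singular value of A.
sigma = np.linalg.svd(A, compute_uv=False)[0]
assert np.isclose(sigma, np.linalg.norm(A, 2))  # equivalent definition

# sigma upper-bounds the amplification ||A xi||_2 / ||xi||_2 for any xi,
# with equality at the first right singular vector.
xi = rng.standard_normal(6)
print(np.linalg.norm(A @ xi) / np.linalg.norm(xi), "<=", sigma)
```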

Spectral Norm Regularization

General idea

  • We consider feed-forward neural networks as a simple example to explain the intuition behind spectral norm regularization.

  • A feed-forward neural network propagates its input as
    $x^{\ell} = f^{\ell}(W^{\ell}x^{\ell-1} + b^{\ell})$ for $\ell = 1,\dots,L$ for some $L$, where $W^{\ell}\in\mathbb{R}^{n_{\ell}\times n_{\ell-1}}$, $b^{\ell}\in\mathbb{R}^{n_{\ell}}$, and $f^{\ell}:\mathbb{R}^{n_{\ell}}\rightarrow\mathbb{R}^{n_{\ell}}$ is a (non-linear) activation function.
    • Thus the parameters of this feed-forward network are given by the set $\Theta=\{W^{\ell},b^{\ell}\}_{\ell=1}^L$, and the whole network is the function $f_\Theta:\mathbb{R}^{n_0}\rightarrow\mathbb{R}^{n_L}$ with $f_\Theta(x^0)=x^L$.
    • Given training data $(x_i, y_i)_{i=1}^K$, the loss function is defined as
      $\frac{1}{K}\sum_{i=1}^K L(f_\Theta(x_i),y_i).$
  • Our goal is to obtain a model $\Theta$ that is insensitive to perturbations of the input, i.e., to keep the following perturbation ratio $P$ small:
    $P=\dfrac{\|f_\Theta(x+\xi)-f_\Theta(x)\|_2}{\|\xi\|_2}.$
  • A key observation is that most practically used neural networks exhibit nonlinearity only because they use piecewise linear functions, such as ReLU, as activation functions. In such a case, the function $f_\Theta$ is itself piecewise linear. Hence, if we consider a small neighborhood of $x$, we can regard $f_\Theta$ as a linear function.
    • In other words, we can represent $f_\Theta$ locally by an affine map, $x\mapsto W_{\Theta,x}x + b_{\Theta,x}$. Then, for a small perturbation $\xi\in\mathbb{R}^{n_0}$, we have
      $\dfrac{\|f_\Theta(x+\xi)-f_\Theta(x)\|_2}{\|\xi\|_2} = \dfrac{\|W_{\Theta,x}\xi\|_2}{\|\xi\|_2} \leq \sigma(W_{\Theta,x}),$ where $\sigma(W_{\Theta,x})$ is the spectral norm of $W_{\Theta,x}$.
    • In other words, $\sigma(W_{\Theta,x})$ is an upper bound on the perturbation ratio: making $\sigma(W_{\Theta,x})$ small reduces the network's sensitivity to input perturbations.
  • To further investigate the properties of $W_{\Theta,x}$, let us assume that each activation function $f^\ell$ is an element-wise ReLU (the argument easily generalizes to other piecewise linear functions). Note that, for a given input $x$, $f^\ell$ acts as a diagonal matrix $D_{\Theta,x}^\ell\in\mathbb{R}^{n_\ell\times n_\ell}$, where a diagonal element equals one if the corresponding element of the pre-activation $W^\ell x^{\ell-1}+b^\ell$ is positive and zero otherwise. Then, we can rewrite $W_{\Theta,x}$ as
    $W_{\Theta,x} = D_{\Theta,x}^L W^L D_{\Theta,x}^{L-1} W^{L-1}\cdots D_{\Theta,x}^1 W^1.$ Using the submultiplicativity of the spectral norm ($\|AB\|\leq\|A\|\,\|B\|$) and $\sigma(D_{\Theta,x}^\ell)\leq 1$, we obtain
    $\sigma(W_{\Theta,x}) \leq \prod_{\ell=1}^L \sigma(D_{\Theta,x}^\ell)\,\sigma(W^\ell) \leq \prod_{\ell=1}^L \sigma(W^\ell).$ This gives a further upper bound on $\sigma(W_{\Theta,x})$ and tells us that reducing the spectral norm of each layer's weight matrix strengthens the model against input perturbations, which is exactly what spectral norm regularization does (a numerical check follows below).
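The chain of inequalities above can be verified on a toy ReLU network (a NumPy sketch; layer sizes and weights are arbitrary, biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
# A 3-layer ReLU network (biases omitted; the last layer is linear).
Ws = [rng.standard_normal((8, 10)),
      rng.standard_normal((8, 8)),
      rng.standard_normal((4, 8))]

x = rng.standard_normal(10)

# Local linear map W_{Theta,x}: interleave each weight with its ReLU mask D^l.
h, J = x, np.eye(10)
for W in Ws[:-1]:
    pre = W @ h
    D = np.diag((pre > 0).astype(float))   # D^l for this particular x
    J = D @ W @ J
    h = np.maximum(pre, 0.0)
J = Ws[-1] @ J                              # = W^3 D^2 W^2 D^1 W^1

local_sigma = np.linalg.norm(J, 2)                         # sigma(W_{Theta,x})
layer_bound = np.prod([np.linalg.norm(W, 2) for W in Ws])  # prod_l sigma(W^l)
print(local_sigma, "<=", layer_bound)
```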

Details of spectral norm regularization

Spectral Norm Regularizer

  • To bound the spectral norm of each weight matrix $W^\ell$, we consider the following empirical risk minimization problem:
    $\min_\Theta\ \dfrac{1}{K}\sum_{i=1}^K L(f_\Theta(x_i),y_i) + \dfrac{\lambda}{2}\sum_{\ell=1}^L \sigma(W^\ell)^2,$ where $\lambda\in\mathbb{R}_+$ is a regularization factor. We refer to the second term as the spectral norm regularizer. It decreases the spectral norms of the weight matrices.
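For reference, this objective can be written down directly (a PyTorch sketch; the model, loss, and λ are placeholders, and the spectral norm is computed here with an exact SVD rather than the paper's power iteration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()
lam = 0.01  # regularization factor lambda

def regularized_loss(x, y):
    loss = criterion(model(x), y)
    # Add (lambda / 2) * sum_l sigma(W^l)^2 over all weight matrices.
    for m in model:
        if isinstance(m, nn.Linear):
            sigma = torch.linalg.matrix_norm(m.weight, ord=2)  # largest singular value
            loss = loss + 0.5 * lam * sigma ** 2
    return loss
```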

Calculate the gradient of the spectral norm regularizer

  • When performing SGD, we need to calculate the gradient of the spectral norm regularizer. To this end, let us consider the gradient of $\sigma(W^\ell)^2/2$ for a particular $\ell\in\{1,2,\dots,L\}$.
  • Let $\sigma_1 = \sigma(W^\ell)$ and $\sigma_2$ be the first and second singular values, respectively. If $\sigma_1 > \sigma_2$, then the gradient of $\sigma(W^\ell)^2/2$ is $\sigma_1u_1v_1^T$, where $u_1$ and $v_1$ are the first left and right singular vectors, respectively. If $\sigma_1 = \sigma_2$, then $\sigma(W^\ell)^2$ is not differentiable (the leading singular vectors are no longer unique, so $\sigma_1u_1v_1^T$ is not well defined). However, for practical purposes, we can assume that this case never occurs, because numerical errors prevent $\sigma_1$ and $\sigma_2$ from being exactly equal.
    • Proof: Let $L_\ell=\frac{1}{2}\sigma(W^\ell)^2$. From the SVD of $W^\ell$ we have $\frac{\partial \sigma(W^\ell)}{\partial W^\ell}=u_1v_1^T$, so
      $\dfrac{\partial L_\ell}{\partial W^\ell}=\dfrac{\partial L_\ell}{\partial \sigma(W^\ell)}\dfrac{\partial \sigma(W^\ell)}{\partial W^\ell}=\sigma(W^\ell)\cdot(u_1v_1^T)=\sigma_1u_1v_1^T.$
    • Therefore, during back-propagation we only need to add $\lambda\sigma_1u_1v_1^T$ to each layer's weight gradient. What we still need is a fast way to compute the largest singular value $\sigma_1$ of $W^\ell$ together with its left singular vector $u_1$ and right singular vector $v_1$.

Power iteration method

  • Starting with a randomly initialized $v\in\mathbb{R}^{n_{\ell-1}}$, we iteratively perform the following procedure a sufficient number of times. (I still do not fully understand the iteration formulas as printed in this paper: why are $u$ and $v$ not normalized, and how exactly is $\sigma$ computed? The update rule in the Spectral Normalization paper below is much easier to follow, so for now let us not worry about why this particular iteration approximates $\sigma_1$, $u_1$, and $v_1$ and just take away the idea.)
    [power-iteration update as printed in the paper] Then, $\sigma$, $u$, and $v$ converge to $\sigma_1$, $u_1$, and $v_1$, respectively (if $\sigma_1 > \sigma_2$).
  • To approximate $\sigma_1$, $u_1$, and $v_1$ in the next iteration of SGD, we can reuse $v$ as the initial vector. In our experiments, we performed only one iteration, because it was adequate for obtaining a sufficiently good approximation.
    • Because the weights change slowly, we only need to perform a single power iteration on the current version of these vectors at each step of learning; this is why spectral norm regularization is computationally efficient.

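Putting the regularizer together with the power iteration, a training step might look like the following sketch (PyTorch; the model, λ, and learning rate are placeholders, and the normalized update rule from the Spectral Normalization paper is used in place of the iteration shown in this paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
lam = 0.01

# One persistent v per weight matrix, reused across SGD steps ("recycling").
vs = {id(m): F.normalize(torch.randn(m.weight.shape[1]), dim=0)
      for m in model if isinstance(m, nn.Linear)}

def training_step(x, y):
    opt.zero_grad()
    criterion(model(x), y).backward()
    with torch.no_grad():
        for m in model:
            if not isinstance(m, nn.Linear):
                continue
            W, v = m.weight, vs[id(m)]
            u = F.normalize(W @ v, dim=0)        # one power iteration
            v = F.normalize(W.t() @ u, dim=0)
            vs[id(m)] = v
            sigma1 = u @ W @ v                   # ~ largest singular value of W
            m.weight.grad += lam * sigma1 * torch.outer(u, v)  # add lambda*sigma1*u1 v1^T
    opt.step()
```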


Convolutions

  • Consider a convolutional layer with $a$ input channels, $b$ output channels, and a $k_w\times k_h$-sized kernel. Note that a value in an output channel is determined using $ak_wk_h$ values from the input channels.
  • Hence, we align the parameters as a matrix of size $b\times ak_wk_h$ and apply the abovementioned power iteration method to that matrix to calculate its spectral norm and gradient, as in the sketch below.
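A minimal sketch of this reshaping (PyTorch; the channel counts and kernel size are arbitrary):

```python
import torch

# Hypothetical conv weight: b = 16 output channels, a = 3 input channels, 5x5 kernel.
W_conv = torch.randn(16, 3, 5, 5)

# Flatten to a (b, a*k_h*k_w) matrix and take its spectral norm.
W_mat = W_conv.reshape(W_conv.shape[0], -1)      # shape (16, 75)
sigma = torch.linalg.matrix_norm(W_mat, ord=2)   # largest singular value
print(W_mat.shape, sigma)
```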

Spectral Normalization for GANs

Introduction

  • In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our normalization enjoys the following favorable properties.
    • The Lipschitz constant is the only hyper-parameter to be tuned, and the algorithm does not require intensive tuning of it for satisfactory performance.
    • Implementation is simple and the additional computational cost is small.

The centrality of Lipschitz continuity in GANs

Notation

  • Let us consider a simple discriminator made of a neural network of the following form, with input $x$:
    $f(x,\theta) = W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\dots a_1(W^1x)\dots)))),$ where $\theta := \{W^1,\dots,W^L,W^{L+1}\}$ is the set of learning parameters, $W^l\in\mathbb{R}^{d_l\times d_{l-1}}$, $W^{L+1}\in\mathbb{R}^{1\times d_L}$, and $a_l$ is an element-wise non-linear activation function. We omit the bias term of each layer for simplicity. The final output of the discriminator is given by
    $D(x,\theta) = \mathcal{A}(f(x,\theta)),$ where $\mathcal{A}$ is an activation function.

Standard formulation of GANs

  • The standard formulation of GANs is given by
    $\min_G\max_D V(G,D),$ where $\min$ and $\max$ of $G$ and $D$ are taken over the set of generator and discriminator functions, respectively.
    • The conventional form of $V(G,D)$ is given by
      $V(G,D) = \mathbb{E}_{x\sim q_{data}}[\log D(x)] + \mathbb{E}_{x'\sim p_G}[\log(1-D(x'))],$ where $q_{data}$ is the data distribution and $p_G$ is the (model) generator distribution to be learned through the adversarial min-max optimization.
    • The activation function $\mathcal{A}$ is some continuous function with range $[0,1]$ (e.g., the sigmoid function).
    • It is known that, for a fixed generator $G$, the optimal discriminator is given by
      $D^*_G(x) = \dfrac{q_{data}(x)}{q_{data}(x)+p_G(x)}.$
  • The machine learning community has been pointing out recently that the function space from which the discriminators are selected crucially affects the performance of GANs.
    • E.g., the optimal discriminator of GANs for the above standard formulation takes the form
      $D^*_G(x) = \dfrac{q_{data}(x)}{q_{data}(x)+p_G(x)} = \mathrm{sigmoid}(f^*(x)),\quad f^*(x) = \log q_{data}(x) - \log p_G(x),$ and its derivative
      $\nabla_x f^*(x) = \dfrac{1}{q_{data}(x)}\nabla_x q_{data}(x) - \dfrac{1}{p_G(x)}\nabla_x p_G(x)$ can be unbounded or even incomputable. This prompts us to introduce some regularity condition on the derivative of $f(x)$.

Lipschitz continuity

  • A number of works (Uehara et al., 2016; Qi, 2017 (Loss-Sensitive GAN); Gulrajani et al., 2017 (WGAN-GP)) advocate the importance of Lipschitz continuity in assuring the boundedness of statistics.
    • If we are training a Wasserstein GAN, the Kantorovich-Rubinstein duality requires it.
    • However, when we are training with the standard GAN loss, there is a looser but still intuitive explanation: mandating Lipschitz continuity bounds the gradients of our discriminator.
  • Particularly successful works in this line are (Qi, 2017 (Loss-Sensitive GAN); Arjovsky et al., 2017 (WGAN); Gulrajani et al., 2017 (WGAN-GP)), which proposed methods to control the Lipschitz constant of the discriminator by adding regularization terms defined on input examples $x$. We follow their footsteps and search for the discriminator $D$ within the set of $K$-Lipschitz continuous functions, that is,
    $\arg\max_{\|f\|_{Lip}\leq K} V(G,D),$ where $\|f\|_{Lip}$ denotes the smallest value $M$ such that $\|f(x)-f(x')\|/\|x-x'\|\leq M$ for any $x,x'$, with the norm being the $\ell_2$ norm.

Definition of Lipschitz continuity

  • Suppose we have a GAN discriminator $D: I\rightarrow\mathbb{R}$, where $I$ is the space of images (e.g., $\mathbb{R}^{32\times 32}$). Because both the domain and codomain of this function have inner products, we have a natural metric (distance function) in both spaces: the L2 distance.
  • If our discriminator is $K$-Lipschitz continuous, then for all $x$ and $y$ in $I$,
    $\|D(x)-D(y)\|\leq K\|x-y\|,$ where $\|\cdot\|$ is the L2 norm. The smallest such $K$ is called the Lipschitz constant of the discriminator.
    • Let's look at the 1D case to illustrate Lipschitz continuity geometrically. Suppose $D(x)=\sin(x)$. If $D$ is $K$-Lipschitz continuous, then we can draw a cone (with slope $K$) centered at every point on its graph such that the graph lies outside of this cone. It is clear that $\sin$ is 1-Lipschitz continuous.
      [figure: the graph of $\sin(x)$ with a slope-1 cone centered at a point]
    • What about ReLU, our favorite activation function? It is also obviously 1-Lipschitz continuous.
      [figure: the same cone construction on the graph of ReLU]
  • One interesting fact to notice from these examples is that if a 1D function is differentiable, then its Lipschitz constant is just the supremum of the absolute value of its derivative.

The Lipschitz constant of a linear function is its spectral norm

  • Suppose we have a linear function $A:\mathbb{R}^n\rightarrow\mathbb{R}^m$. This function could be the pre-activation operation of one layer in a multilayer perceptron.
  • What is the Lipschitz constant of this function, if it exists?
    • Since $A$ is linear, if $A$ is $K$-Lipschitz at zero, then it is $K$-Lipschitz everywhere (proof sketch: $\|A(x-y)-0\|\leq K\|(x-y)-0\|\Rightarrow\|Ax-Ay\|\leq K\|x-y\|$).
    • This simplifies the Lipschitz continuity requirement to
      $\|Ax\|\leq K\|x\|$ for all $x\in\mathbb{R}^n$. This is equivalent to the statement
      $\langle Ax, Ax\rangle \leq K^2\langle x, x\rangle,$ which in turn is equivalent to
      $\langle x, (K^2I - A^TA)x\rangle \geq 0.$ If we expand $x$ in the orthonormal basis of eigenvectors of $A^TA$ (i.e., $x=\sum_i x_iv_i$, where $\lambda_i$ is the eigenvalue corresponding to $v_i$), we can write this inner product out explicitly:
      $\langle x, (K^2I - A^TA)x\rangle = \sum_i (K^2-\lambda_i)x_i^2 \geq 0.$ Note that since $A^TA$ is positive semidefinite, all the $\lambda_i$ are nonnegative. For the above sum to be nonnegative for every $x$, each coefficient must be nonnegative (take $x=v_i$), so we have
      $K^2\geq\lambda_i$ for all $i$. Since we choose $K$ to be the minimum value satisfying these constraints, we immediately see that $K$ is the square root of the largest eigenvalue of $A^TA$. Therefore, the Lipschitz constant of a linear function is its largest singular value, i.e., its spectral norm.
  • Consequently, by the inequality below, dividing $A$ by $K$ (the spectral norm of $A$) makes $A$ satisfy 1-Lipschitz continuity:
    $\left\|\dfrac{A}{\sigma(A)}x\right\| = \dfrac{\|Ax\|}{\sigma(A)} \leq \|x\|.$
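A quick numerical sanity check of the last two points (a NumPy sketch; the matrix and test points are random):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
K = np.linalg.norm(A, 2)   # spectral norm = Lipschitz constant of x -> Ax

# After dividing by K, the map x -> (A/K) x never expands distances.
A1 = A / K
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(A1 @ (x - y)) <= np.linalg.norm(x - y) + 1e-12
```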

Composition of functions

  • For composed functions we have the following inequality:
    $\|g_1\circ g_2\|_{Lip} \leq \|g_1\|_{Lip}\cdot\|g_2\|_{Lip},$ where $\|\cdot\|_{Lip}$ denotes the Lipschitz constant of a function.
  • A neural network is exactly such a nested composition of functions. The most common pattern alternates a convolution with an activation function, layer after layer. The usual activation choices, ReLU and Leaky ReLU, are 1-Lipschitz, so they do not affect the overall Lipschitz constant in the product above. Hence, as long as the convolutional and fully connected parts are 1-Lipschitz continuous, the whole network is 1-Lipschitz continuous. Our spectral normalization normalizes the spectral norm of the weight matrix $W$ so that it satisfies the Lipschitz constraint $\sigma(W)=1$:
    $\bar{W}_{SN}(W) := \dfrac{W}{\sigma(W)}.$
    • For the evaluation of the spectral norm of the convolutional weight $W\in\mathbb{R}^{d_{out}\times d_{in}\times h\times w}$, we treat the operator as a 2-D matrix of dimension $d_{out}\times(d_{in}hw)$. (Note that, since the convolution is conducted discretely, the spectral norm of the actual convolution operator will depend on the stride and padding; however, the answer only differs by some predefined $K$.)
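In practice, `torch.nn.utils.spectral_norm` provides this normalization (dividing the weight by a power-iteration estimate of $\sigma(W)$, with conv weights reshaped along the output-channel dimension); a minimal usage sketch for a small discriminator on 32x32 RGB images (the architecture is made up for illustration):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer divides its weight by an estimate of sigma(W)
# obtained with one power iteration per forward pass.
D = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),    # 32x32 -> 16x16
    nn.LeakyReLU(0.1),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),  # 16x16 -> 8x8
    nn.LeakyReLU(0.1),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),
)
```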

Fast Approximation of the Spectral Norm $\sigma(W)$

Power iteration

  • The idea is the same as in Spectral Norm Regularization: use the power iteration method to quickly compute the spectral norm together with the corresponding left and right singular vectors.
  • Update rule:
    $\tilde{v} \leftarrow \dfrac{W^T\tilde{u}}{\|W^T\tilde{u}\|_2},\qquad \tilde{u} \leftarrow \dfrac{W\tilde{v}}{\|W\tilde{v}\|_2}.$ We can then approximate the spectral norm of $W$ with the pair of so-approximated singular vectors:
    $\sigma(W) \approx \tilde{u}^TW\tilde{v}.$
    • Proof ($\tilde{u}$ and the spectral norm estimate follow directly from the definition of the SVD, so below we mainly explain why the iteration approaches $v$):
      • Let $A$ be an $n\times n$ full-rank matrix with unit eigenvectors $v_1,v_2,\dots,v_n$ and corresponding eigenvalues $\lambda_1,\lambda_2,\dots,\lambda_n$ (in decreasing order), with $\lambda_1>\lambda_2$. (Here $A=W^TW$, which is symmetric, so such an orthonormal eigenbasis exists.) Then any vector $x$ can be written as $x=\sum_i x_iv_i$, and therefore
        $Ax = \sum_i \lambda_ix_iv_i.$ After $k$ iterations we have
        $A^kx = \sum_i \lambda_i^kx_iv_i = \lambda_1^k\Big(x_1v_1 + \sum_{i=2}^n\big(\tfrac{\lambda_i}{\lambda_1}\big)^kx_iv_i\Big).$ Therefore, as $k\rightarrow\infty$, $\frac{1}{\lambda_1^k}A^kx\rightarrow x_1v_1$, i.e., $A^kx$ points almost exactly in the direction of $v_1$.
      • Each combined update of $\tilde{v}$ above has the form $\tilde{v}\leftarrow W^TW\tilde{v}$ (followed by normalization), so after sufficiently many updates $\tilde{v}$ points in the direction of the eigenvector of $W^TW$ with the largest eigenvalue; after normalization this is exactly $v$.
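The convergence can be observed directly with the update rule above (a NumPy sketch; the matrix size and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))

u = rng.standard_normal(64)
u /= np.linalg.norm(u)
for _ in range(50):                       # a handful of iterations already suffices
    v = W.T @ u
    v /= np.linalg.norm(v)                # v~ <- W^T u~ / ||W^T u~||
    u = W @ v
    u /= np.linalg.norm(u)                # u~ <- W v~ / ||W v~||

sigma_hat = u @ W @ v                     # sigma(W) ~= u~^T W v~
sigma_true = np.linalg.svd(W, compute_uv=False)[0]
print(sigma_hat, sigma_true)              # the two values agree closely
```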

Details (computational efficiency)

  • If we use SGD for updating $W$, the change in $W$ at each update is small, and hence so is the change in its largest singular value. In our implementation, we take advantage of this fact and reuse the $\tilde u$ computed at each step of the algorithm as the initial vector in the subsequent step. In fact, with this 'recycling' procedure, one round of power iteration was sufficient in our experiments to achieve satisfactory performance.

Algorithm

[Algorithm: SGD with spectral normalization, as given in the paper]
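Putting everything together, here is a minimal sketch of a spectrally normalized linear layer in the spirit of this algorithm (PyTorch; bias omitted, initialization and layer sizes are placeholders — `torch.nn.utils.spectral_norm` shown earlier is the production-ready equivalent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SNLinear(nn.Module):
    """Linear layer whose weight is divided by an approximate spectral norm."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Persistent left singular vector estimate, reused across SGD steps.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x):
        W = self.weight
        with torch.no_grad():                 # one power iteration per forward pass
            v = F.normalize(W.t() @ self.u, dim=0)
            self.u = F.normalize(W @ v, dim=0)
        sigma = self.u @ W @ v                # sigma(W) ~= u~^T W v~
        return x @ (W / sigma).t()            # use W_SN = W / sigma(W)

# Usage in a discriminator stack:
D = nn.Sequential(SNLinear(784, 256), nn.LeakyReLU(0.1), SNLinear(256, 1))
print(D(torch.randn(8, 784)).shape)           # torch.Size([8, 1])
```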
