Spectral Norm Regularization (SNR)
- paper: Spectral Norm Regularization for Improving the Generalizability of Deep Learning
- Ref: 谱范数正则 (Spectral Norm Regularization) 的理解
Motivation
- In this study, we consider the generalizability of deep learning from the perspective of sensitivity to input perturbation. We hypothesize that high sensitivity to the perturbation of data degrades the performance on it. (The paper later verifies this hypothesis experimentally.)
- Intuitively, if a trained model is insensitive or sensitive to the perturbation of an input, then the model is confident or not confident about the output, respectively. As the performance on test data is important, models that are insensitive to the perturbation of test data are required. (For example, a classifier that returns a completely different label after a single pixel of the image is changed is highly sensitive.)
- A conventional method of understanding the generalizability of a trained model is the notion of the flatness/sharpness of a local minimum. A local minimum is (informally) referred to as flat if its loss value does not increase significantly when it is perturbed; otherwise, it is referred to as sharp. In general, the high sensitivity of a training function at a sharp local minimizer negatively affects the generalizability of the trained model.
- To reduce the sensitivity to perturbation, we propose a simple and effective regularization method, referred to as spectral norm regularization, which penalizes the high spectral norm of weight matrices in neural networks.
Spectral Norm
- The spectral norm of a matrix $A\in\mathbb{R}^{m\times n}$ is defined as
$$\sigma(A)=\max_{\xi\in\mathbb{R}^n,\ \xi\neq 0}\frac{\|A\xi\|_2}{\|\xi\|_2},$$
which corresponds to the largest singular value of $A$.
- Proof: The value of $\frac{\|A\xi\|_2}{\|\xi\|_2}$ does not depend on the length of $\xi$, so we may assume $\|\xi\|_2=1$. The problem then becomes $\max_{\xi\in\mathbb{R}^n,\ \|\xi\|_2=1}\sqrt{\xi^TA^TA\xi}$. This is a constrained optimization problem whose maximum is the square root of the largest eigenvalue of $A^TA$, that is, the largest singular value of $A$.
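The equivalence stated above can be checked numerically. The following is a minimal NumPy sketch, assuming a small random matrix as a stand-in for $A$; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # arbitrary test matrix (assumption)

# Largest singular value taken from the SVD.
sigma_svd = np.linalg.svd(A, compute_uv=False)[0]

# Square root of the largest eigenvalue of A^T A (eigvalsh sorts ascending).
sigma_eig = np.sqrt(np.linalg.eigvalsh(A.T @ A)[-1])

# Induced 2-norm as computed directly by NumPy.
sigma_norm = np.linalg.norm(A, 2)

# The ratio ||A xi|| / ||xi|| for random xi never exceeds the spectral norm.
xi = rng.standard_normal((3, 1000))
ratios = np.linalg.norm(A @ xi, axis=0) / np.linalg.norm(xi, axis=0)

print(sigma_svd, sigma_eig, sigma_norm, ratios.max())
```

All three quantities agree up to floating-point error, and the sampled ratios stay below them.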
Spectral Norm Regularization
General idea
- We consider feed-forward neural networks as a simple example to explain the intuition behind spectral norm regularization.
- A feed-forward neural network can be represented as cascaded computations,
$$x^\ell=f^\ell(W^\ell x^{\ell-1}+b^\ell)$$
for $\ell = 1,\dots,L$ for some $L$, where $x^{\ell-1}\in\mathbb{R}^{n_{\ell-1}}$ is the input of the $\ell$-th layer, $W^\ell\in\mathbb{R}^{n_\ell\times n_{\ell-1}}$ and $b^\ell\in\mathbb{R}^{n_\ell}$ are its weight matrix and bias vector, and $f^{\ell}: \mathbb{R}^{n_{\ell}} \rightarrow \mathbb{R}^{n_{\ell}}$ is a (non-linear) activation function.
- The parameters of this network are therefore the set $\Theta=\{W^{\ell},b^{\ell}\}_{\ell=1}^L$, and the whole network is the function $f_\Theta:\mathbb{R}^{n_0}\rightarrow\mathbb{R}^{n_L}$ with $f_\Theta(x^0)=x^L$.
- Given training data, $(x_i, y_i)_{i=1}^K$, the loss function is defined as
$$\frac{1}{K}\sum_{i=1}^KL(f_\Theta(x_i),y_i)$$
- Our goal is to obtain a model $\Theta$ that is insensitive to the perturbation of the input, i.e., to make the perturbation ratio $P$ as small as possible:
$$P=\frac{\|f_\Theta(x+\xi)-f_\Theta(x)\|_2}{\|\xi\|_2}$$
- A key observation is that most practically used neural networks exhibit nonlinearity only because they use piecewise linear functions, such as ReLU, as activation functions. In such a case, function $f_\Theta$ is a piecewise linear function. Hence, if we consider a small neighborhood of $x$, we can regard $f_\Theta$ as a linear function.
- In other words, we can represent $f_\Theta$ by an affine map, $x\mapsto W_{\Theta,x}x + b_{\Theta,x}$. Then, for a small perturbation, $\xi \in \mathbb{R}^{n_0}$, we have
$$\frac{\|f_\Theta(x+\xi)-f_\Theta(x)\|_2}{\|\xi\|_2}=\frac{\|(W_{\Theta,x}(x+\xi)+b_{\Theta,x})-(W_{\Theta,x}x+b_{\Theta,x})\|_2}{\|\xi\|_2}=\frac{\|W_{\Theta,x}\xi\|_2}{\|\xi\|_2}\leq\sigma(W_{\Theta,x}),$$
where $\sigma(W_{\Theta,x})$ is the spectral norm of $W_{\Theta,x}$.
- That is, $\sigma(W_{\Theta,x})$ is an upper bound on $P$. Making $\sigma(W_{\Theta,x})$ as small as possible therefore reduces the network's sensitivity to input perturbation.
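- To further investigate the property of $W_{\Theta,x}$, let us assume that each activation function, $f^\ell$, is an element-wise ReLU (the argument can be easily generalized to other piecewise linear functions). Note that, for a given vector, $x$, $f^\ell$ acts as a diagonal matrix, $D_{\Theta,x}^\ell\in\mathbb{R}^{n_\ell\times n_\ell}$, where an element in the diagonal is equal to one if the corresponding element of the pre-activation $W^\ell x^{\ell-1}+b^\ell$ is positive; otherwise, it is equal to zero. Then, we can rewrite $W_{\Theta,x}$ as
$$W_{\Theta,x}=D_{\Theta,x}^LW^LD_{\Theta,x}^{L-1}W^{L-1}\cdots D_{\Theta,x}^1W^1.$$
By the submultiplicativity of the matrix norm ($\|AB\|\leq\|A\|\|B\|$) and the fact that $\sigma(D_{\Theta,x}^\ell)\leq 1$, we have
$$\sigma(W_{\Theta,x})\leq\sigma(D_{\Theta,x}^L)\sigma(W^L)\sigma(D_{\Theta,x}^{L-1})\sigma(W^{L-1})\cdots\sigma(D_{\Theta,x}^1)\sigma(W^1)\leq\prod_{\ell=1}^L\sigma(W^\ell).$$
This gives a further upper bound on $\sigma(W_{\Theta,x})$. It also suggests that reducing the spectral norm of each layer's weight matrix makes the model more robust to input perturbation, which leads to spectral norm regularization.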
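The bound above can be illustrated on a toy network. The following is a minimal NumPy sketch, assuming a small random ReLU network with hypothetical layer widths; `local_linear_map` is an illustrative helper, not something defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # n_0, n_1, n_2, n_3 (hypothetical layer widths)
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.standard_normal(sizes[i + 1]) for i in range(3)]

def local_linear_map(x):
    """Return W_{Theta,x} = D^L W^L ... D^1 W^1 for a ReLU network at x."""
    J = np.eye(len(x))
    h = x
    for W, b in zip(Ws, bs):
        z = W @ h + b
        D = np.diag((z > 0).astype(float))  # ReLU acts as a 0/1 diagonal matrix
        J = D @ W @ J
        h = np.maximum(z, 0.0)
    return J

x = rng.standard_normal(sizes[0])
sigma_local = np.linalg.norm(local_linear_map(x), 2)      # sigma(W_{Theta,x})
bound = np.prod([np.linalg.norm(W, 2) for W in Ws])       # product of sigma(W^l)

print(sigma_local, bound)
```

The local spectral norm is typically far below the product bound, which is loose but easy to control layer by layer.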
Details of spectral norm regularization
Spectral Norm Regularizer
- To bound the spectral norm of each weight matrix, $W^\ell$, we consider the following empirical risk minimization problem:
$$\min_\Theta\ \frac{1}{K}\sum_{i=1}^KL(f_\Theta(x_i),y_i)+\frac{\lambda}{2}\sum_{\ell=1}^L\sigma(W^\ell)^2,$$
where $\lambda \in \mathbb{R}_+$ is a regularization factor. We refer to the second term as the spectral norm regularizer. It decreases the spectral norms of the weight matrices.
Calculate the gradient of the spectral norm regularizer
- When performing SGD, we need to calculate the gradient of the spectral norm regularizer. To this end, let us consider the gradient of $\sigma(W^\ell)^2/2$ for a particular $\ell \in \{1, 2, \dots, L\}$. (The source writes this quantity as $\sigma(W^\ell)_2^2/2$; the subscript 2 is unexplained, and the gradient formula only requires $\sigma(W^\ell)^2/2$.)
- Let $\sigma_1 = \sigma(W^\ell)$ and $\sigma_2$ be the first and second singular values, respectively. If $\sigma_1 > \sigma_2$, then the gradient of $\sigma(W^\ell)^2/2$ is $\sigma_1u_1v_1^T$, where $u_1$ and $v_1$ are the first left and right singular vectors, respectively. If $\sigma_1 = \sigma_2$, then $\sigma(W^\ell)^2$ is not differentiable. (In that case the leading singular vectors are no longer unique, so $\sigma_1u_1v_1^T$ is not a single well-defined gradient but only one of several subgradients.) However, for practical purposes, we can assume that this case never occurs because numerical errors prevent $\sigma_1$ and $\sigma_2$ from being exactly equal.
- Proof: Let $L_\ell=\frac{1}{2}\sigma(W^\ell)^2$. From the singular value decomposition, $\frac{\partial \sigma(W^\ell)}{\partial W^\ell}=u_1v_1^T$, so by the chain rule
$$\frac{\partial L_\ell}{\partial W^\ell}=\frac{\partial L_\ell}{\partial \sigma(W^\ell)}\frac{\partial \sigma(W^\ell)}{\partial W^\ell}=\sigma(W^\ell)\cdot(u_1v_1^T)=\sigma_1u_1v_1^T$$
- Therefore, during backpropagation we only need to add $\lambda \sigma_1u_1v_1^T$ to the gradient of each layer's weights. What remains is a fast method for computing the largest singular value $\sigma_1$ of $W^{\ell}$ together with the corresponding left singular vector $u_1$ and right singular vector $v_1$.
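The gradient formula can be sanity-checked against finite differences. The sketch below assumes a random matrix as a stand-in for $W^\ell$ and uses a full SVD for the analytic side; `half_sigma_sq` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # stand-in for one weight matrix (assumption)

def half_sigma_sq(M):
    """L = sigma(M)^2 / 2, with sigma the spectral norm."""
    return 0.5 * np.linalg.norm(M, 2) ** 2

# Analytic gradient: sigma_1 * u_1 * v_1^T from the SVD.
U, S, Vt = np.linalg.svd(W)
analytic = S[0] * np.outer(U[:, 0], Vt[0])

# Central finite differences, one entry of W at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (half_sigma_sq(W + E) - half_sigma_sq(W - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The sign ambiguity of the singular vectors does not matter here, because flipping both $u_1$ and $v_1$ leaves $u_1v_1^T$ unchanged.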
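Power iteration method
- Starting with a randomly initialized $v \in \mathbb{R}^{n_{\ell-1}}$, we iteratively perform the following procedure a sufficient number of times:
$$u\leftarrow W^\ell v,\quad v\leftarrow (W^\ell)^Tu,\quad \sigma\leftarrow\frac{\|u\|_2}{\|v\|_2}$$
Then, $\sigma$, $u$, and $v$ converge to $\sigma_1$, $u_1$, and $v_1$, respectively (if $\sigma_1 > \sigma_2$).
- Note: it is not obvious from this formula why $u$ and $v$ are not normalized, or how $\sigma$ is obtained. Without normalization, $u$ and $v$ can only match $u_1$ and $v_1$ in direction, while the ratio $\|W^\ell v\|_2/\|v\|_2$ does not depend on the scale of $v$, so $\sigma$ can be estimated without normalizing. The iteration in the Spectral Normalization paper is easier to follow, so the details of why this one approximates $\sigma_1$, $u_1$, and $v_1$ are set aside here in favor of the general idea.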
- To approximate $\sigma_1$, $u_1$, and $v_1$ in the next iteration of SGD, we can reuse $v$ as the initial vector. In our experiments, we performed only one iteration because it was adequate for obtaining a sufficiently good approximation.
- Because the weights change slowly, we only need to perform a single power iteration on the current version of these vectors for each step of learning; this is why spectral norm regularization is computationally efficient
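A minimal NumPy sketch of power iteration for the leading singular triple follows. It is a variant that normalizes $u$ and $v$ at every step (an assumption made here so that the vectors stay unit length), and `power_iteration` is an illustrative name rather than code from the paper.

```python
import numpy as np

def power_iteration(W, v, n_iters=1):
    """Estimate (sigma_1, u_1, v_1) of W, starting from the vector v."""
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ W @ v
    return sigma, u, v

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))   # stand-in weight matrix (assumption)
v0 = rng.standard_normal(4)

sigma, u, v = power_iteration(W, v0, n_iters=500)
sigma_true = np.linalg.svd(W, compute_uv=False)[0]

# Term that would be scaled by lambda and added to the gradient of W.
reg_grad = sigma * np.outer(u, v)

print(sigma, sigma_true)
```

In training, the returned `v` would be kept and passed back in at the next step with `n_iters=1`, which is the reuse described above.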
Convolutions
- Consider a convolutional layer with $a$ input channels, $b$ output channels, and a $k_w \times k_h$-sized kernel. Note that a value in an output channel is determined using $ak_wk_h$ values in the input channels.
- Hence, we align the parameters as a matrix of size $b \times ak_wk_h$ and apply the abovementioned power iteration method to the matrix to calculate its spectral norm and gradient.
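The reshaping step can be sketched as follows. The kernel layout `(b, a, k_w, k_h)` and the concrete sizes are assumptions for illustration, and a full SVD is used in place of power iteration for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, k_w, k_h = 3, 16, 3, 3                      # input/output channels, kernel size
kernel = rng.standard_normal((b, a, k_w, k_h))    # assumed layout: (out, in, k_w, k_h)

# One row per output channel, a * k_w * k_h columns.
W_mat = kernel.reshape(b, a * k_w * k_h)

U, S, Vt = np.linalg.svd(W_mat, full_matrices=False)
sigma = S[0]

# Gradient of sigma^2 / 2, reshaped back to the kernel's layout.
grad_kernel = (sigma * np.outer(U[:, 0], Vt[0])).reshape(kernel.shape)

print(W_mat.shape, sigma)
```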
Spectral Normalization for GANs
- paper: Spectral Normalization for Generative Adversarial Networks
- Ref: Spectral Normalization Explained; 详解 GAN 的谱归一化 (Spectral Normalization)
Introduction
- In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our normalization enjoys the following favorable properties.
- Lipschitz constant is the only hyper-parameter to be tuned, and the algorithm does not require intensive tuning of the only hyper-parameter for satisfactory performance.
- Implementation is simple and the additional computational cost is small.
The centrality of Lipschitz continuity in GANs
Notation
- Let us consider a simple discriminator made of a neural network of the following form, with the input $x$:
$$f(x,\theta)=W^{L+1}a_L(W^L(a_{L-1}(W^{L-1}(\dots a_1(W^1x)\dots))))$$
where $\theta := \{W^1, \dots, W^L, W^{L+1}\}$ is the learning parameters set, $W^l \in \mathbb{R}^{d_l\times d_{l-1}}$, $W^{L+1} \in \mathbb{R}^{1\times d_L}$, and $a_l$ is an element-wise non-linear activation function. We omit the bias term of each layer for simplicity. The final output of the discriminator is given by
$$D(x,\theta)=\mathcal{A}(f(x,\theta))$$
where $\mathcal A$ is an activation function.
Standard formulation of GANs
- The standard formulation of GANs is given by
$$\min_G\max_DV(G,D)$$
where $\min$ and $\max$ of $G$ and $D$ are taken over the set of generator and discriminator functions, respectively.
- The conventional form of $V(G, D)$ is given by
$$V(G,D)=E_{x\sim q_{data}}[\log D(x)]+E_{x'\sim p_G}[\log(1-D(x'))]$$
where $q_{data}$ is the data distribution and $p_G$ is the (model) generator distribution to be learned through the adversarial min-max optimization.
- The activation function $\mathcal A$ is some continuous function with range $[0, 1]$ (e.g., sigmoid function).
- It is known that, for a fixed generator $G$, the optimal discriminator is given by
$$D_G^*(x)=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)}$$
- The machine learning community has been pointing out recently that the function space from which the discriminators are selected crucially affects the performance of GANs.
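- e.g. The optimal discriminator of GANs on the above standard formulation takes the form
$$D_G^*(x)=\frac{q_{data}(x)}{q_{data}(x)+p_G(x)}=\mathrm{sigmoid}(f^*(x)),\quad\text{where } f^*(x)=\log q_{data}(x)-\log p_G(x),$$
and its derivative
$$\nabla_xf^*(x)=\frac{1}{q_{data}(x)}\nabla_xq_{data}(x)-\frac{1}{p_G(x)}\nabla_xp_G(x)$$
can be unbounded or even incomputable. This prompts us to introduce some regularity condition to the derivative of $f(x)$.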
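A one-dimensional toy case makes the unbounded derivative concrete. The sketch below assumes two zero-mean Gaussians as stand-ins for $q_{data}$ and $p_G$ (an illustrative choice, not taken from the paper); in that case $f^*$ is quadratic and its derivative grows linearly in $x$.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log-density of a 1D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 2001)
log_q = log_normal_pdf(x, 0.0, 1.0)   # stand-in for q_data (assumption)
log_p = log_normal_pdf(x, 0.0, 2.0)   # stand-in for p_G (assumption)

f_star = log_q - log_p
d_star = np.exp(log_q) / (np.exp(log_q) + np.exp(log_p))
sigmoid_f = 1.0 / (1.0 + np.exp(-f_star))   # should coincide with d_star

grad_f = np.gradient(f_star, x)  # numerical derivative of f*

print(np.max(np.abs(d_star - sigmoid_f)), grad_f[0], grad_f[-1])
```

For these two Gaussians $f^*(x) = -\tfrac{3}{8}x^2 + \log 2$, so the derivative is $-\tfrac{3}{4}x$ and has no bound as $|x|$ grows.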
Lipschitz continuity
- A number of works (Uehara et al., 2016; Qi (Loss-Sensitive GAN; LSGAN), 2017; Gulrajani et al. (WGAN-GP), 2017) advocate the importance of Lipschitz continuity in assuring the boundedness of statistics.
- If we are training a Wasserstein GAN, then Kantorovich-Rubinstein duality requires it.
- However, when we’re training using the standard KL-divergence loss, there’s a looser but still intuitive explanation. Mandating Lipschitz continuity bounds the gradients in our discriminator.
- Particularly successful works in this array are (Qi (Loss-Sensitive GAN; LSGAN), 2017; Arjovsky et al. (WGAN), 2017; Gulrajani et al. (WGAN-GP), 2017), which proposed methods to control the Lipschitz constant of the discriminator by adding regularization terms defined on input examples $x$. We follow in their footsteps and search for the discriminator $D$ from the set of $K$-Lipschitz continuous functions, that is,
$$\arg\max_{\|f\|_{Lip}\leq K}V(G,D)$$
where we mean by $\|f\|_{Lip}$ the smallest value $M$ such that $\|f(x) - f(x')\|/\|x - x'\|\leq M$ for any $x, x'$, with the norm being the $\ell_2$ norm.
Definition of Lipschitz continuity
- Suppose we have a GAN discriminator $D:I\to\mathbb{R}$, where $I$ is the space of images (e.g., $\mathbb{R}^{32\times 32}$). Because both the domain and codomain of this function have inner products, we have a natural metric (distance function) in both spaces: the L2 distance.
- If our discriminator is $K$-Lipschitz continuous, then for all $x$ and $y$ in $I$,
$$\|D(x)-D(y)\|\leq K\|x-y\|$$
where $\|\cdot\|$ is the L2 norm. Here, if $K$ is the smallest value for which this holds, then it is called the Lipschitz constant of the discriminator.
- Let’s look at the 1D case to illustrate Lipschitz continuity geometrically. Suppose $D(x)=\sin(x)$. If $D$ is Lipschitz continuous, then we can draw a cone centered at every point on its graph such that the graph lies outside of this cone. It’s clear that $\sin$ is 1-Lipschitz continuous.
- What about ReLU, our favorite activation function? This is also obviously 1-Lipschitz continuous.
- One interesting fact to notice from these examples is that if a 1D function is differentiable, then its Lipschitz constant is just the supremum of the absolute value of its derivative.
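The two 1D examples can be checked numerically by measuring the steepest slope between neighbouring grid points. This is a rough sketch on a finite grid (an assumption; it gives a lower estimate of the true Lipschitz constant), and `empirical_lipschitz` is an illustrative helper name.

```python
import numpy as np

def empirical_lipschitz(f, xs):
    """Largest slope |f(x) - f(y)| / |x - y| over neighbouring grid points."""
    ys = f(xs)
    return np.max(np.abs(np.diff(ys)) / np.diff(xs))

xs = np.linspace(-10.0, 10.0, 100001)
lip_sin = empirical_lipschitz(np.sin, xs)
lip_relu = empirical_lipschitz(lambda x: np.maximum(x, 0.0), xs)

print(lip_sin, lip_relu)
```

Both estimates come out at (or just below) 1, matching the claim that $\sin$ and ReLU are 1-Lipschitz.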
The Lipschitz constant of a linear function is its spectral norm
- Suppose we have a linear function $A:\mathbb{R}^n\to\mathbb{R}^m$. This function could be the pre-activation operation of one layer in a multilayer perceptron.
- What is the Lipschitz constant of this function, if it exists?
- Since $A$ is linear, if $A$ is $K$-Lipschitz at zero, then it is $K$-Lipschitz everywhere. (proof sketch: $\|A(x-y)-0\|\leq K\|(x-y)-0\|\Rightarrow\|Ax-Ay\|\leq K\|x-y\|$)
- This simplifies the Lipschitz continuity requirement to
$$\|Ax\|\leq K\|x\|$$
for all $x\in\mathbb{R}^n$. This is equivalent to the statement
$$\langle Ax,Ax\rangle\leq K^2\langle x,x\rangle,$$
which in turn is equivalent to
$$\langle(A^TA-K^2I)x,x\rangle\leq 0.$$
If we expand $x$ in the orthonormal basis of eigenvectors of $A^TA$ (i.e., $x=\sum_ix_iv_i$, with $\lambda_i$ the eigenvalue belonging to the eigenvector $v_i$), we can then write out this inner product explicitly, using the linearity of the inner product and the orthonormality of the $v_i$:
$$\left\langle(A^TA-K^2I)\sum_ix_iv_i,\sum_jx_jv_j\right\rangle=\sum_i\sum_jx_ix_j\langle(A^TA-K^2I)v_i,v_j\rangle=\sum_i(\lambda_i-K^2)x_i^2\leq 0\ \Rightarrow\ \sum_i(K^2-\lambda_i)x_i^2\geq 0.$$
Note that since $A^TA$ is positive semidefinite, all the $\lambda_i$s must be nonnegative. To guarantee the above sum to be nonnegative for every $x$, each term must be nonnegative, so we have
$$K^2-\lambda_i\geq 0\quad\text{for all } i=1,\dots,n.$$
Since we choose $K$ to be the minimum value satisfying the above constraints, we immediately see that $K$ is the square root of the largest eigenvalue of $A^TA$. Therefore, the Lipschitz constant of a linear function is its largest singular value, or its spectral norm.
- It follows that dividing $A$ by $K$ (the spectral norm of $A$) makes $A$ 1-Lipschitz continuous, since $\|Ax\|\leq K\|x\|$ implies
$$\left\|\tfrac{1}{K}Ax\right\|\leq\|x\|.$$
Composition of functions
- For composite functions we have the following theorem:
$$\|g_1\circ g_2\|_{Lip}\leq\|g_1\|_{Lip}\cdot\|g_2\|_{Lip}$$
where $\|\cdot\|_{Lip}$ denotes the Lipschitz constant of a function.
- A neural network is exactly such a nested composition of functions. The most common pattern is a convolution, an activation, another convolution, another activation, and so on. The activations typically used, ReLU and Leaky ReLU, are 1-Lipschitz, so they do not increase the product bound above. We therefore only need to make the convolutional and fully connected parts 1-Lipschitz continuous to guarantee that the whole network is 1-Lipschitz continuous: Our spectral normalization normalizes the spectral norm of the weight matrix $W$ so that it satisfies the Lipschitz constraint $\sigma(W) = 1$:
$$\bar W_{SN}(W):=\frac{W}{\sigma(W)}$$
- For the evaluation of the spectral norm for the convolutional weight $W \in \mathbb{R}^{d_{out}\times d_{in}\times h\times w}$, we treated the operator as a 2-D matrix of dimension $d_{out} \times (d_{in}hw)$. (Note that, since we are conducting the convolution discretely, the spectral norm will depend on the size of the stride and padding. However, the answer will only differ by some predefined $K$.)
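The composition argument can be tried on a small fully connected network. The sketch below assumes hypothetical layer shapes, omits biases as in the notation above, and uses an exact spectral norm from NumPy rather than power iteration; `spectral_normalize` and `f` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_normalize(W):
    """Return W / sigma(W), whose spectral norm is 1."""
    return W / np.linalg.norm(W, 2)

# Hypothetical layer shapes; every weight matrix is spectrally normalized.
Ws = [spectral_normalize(rng.standard_normal(shape))
      for shape in [(16, 8), (16, 16), (1, 16)]]

def f(x):
    """ReLU network with normalized weights; the last layer is linear."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)  # ReLU is 1-Lipschitz
    return Ws[-1] @ h

# Empirical check of ||f(x) - f(y)|| <= ||x - y|| on random pairs.
xs = rng.standard_normal((1000, 8))
ys = rng.standard_normal((1000, 8))
ratios = [np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
          for x, y in zip(xs, ys)]

print(max(ratios))
```

The observed ratios stay at or below 1, and usually well below it, since the product of per-layer bounds is not tight.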
Fast Approximation of the Spectral Norm $\sigma(W)$
Power iteration
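- The idea is the same as in Spectral Norm Regularization: use power iteration to quickly compute the spectral norm and the corresponding left and right singular vectors.
- update rule:
$$\tilde v\leftarrow\frac{W^T\tilde u}{\|W^T\tilde u\|_2},\quad\tilde u\leftarrow\frac{W\tilde v}{\|W\tilde v\|_2}$$
We can then approximate the spectral norm of $W$ with the pair of so-approximated singular vectors:
$$\sigma(W)\approx\tilde u^TW\tilde v$$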
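- Proof ($\tilde u$ and the spectral norm follow directly from the definition of the SVD, so the argument below only explains why the iteration converges to $v$):
- Let $A$ be an $n \times n$ full-rank square matrix whose unit eigenvectors $v_1, v_2,\dots, v_n$ form a basis, with corresponding eigenvalues $\lambda_1, \lambda_2,\dots, \lambda_n$ (sorted in descending order) and $\lambda_1>\lambda_2$. Then any vector $x$ can be written as $x=\sum_i x_i v_i$, and therefore
$$Ax=A\sum_ix_iv_i=\sum_ix_iAv_i=\sum_ix_i\lambda_iv_i.$$
After $k$ iterations we have
$$A^kx=\sum_ix_i\lambda_i^kv_i=\lambda_1^k\left(x_1v_1+\sum_{i=2}^nx_i\left(\frac{\lambda_i}{\lambda_1}\right)^kv_i\right).$$
Hence, as $k\rightarrow\infty$, $\frac{1}{\lambda_1^k}A^kx\rightarrow x_1v_1$ (provided $x_1\neq 0$), i.e., $A^kx$ points almost exactly in the direction of $v_1$.
- Each update of $\tilde v$ amounts to $\tilde v\leftarrow W^TW\tilde v$ (followed by normalization). After enough updates, $\tilde v$ therefore points in the direction of the eigenvector of $W^TW$ that belongs to its largest eigenvalue, and normalizing it gives $v$.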
Details (computational efficiency)
- If we use SGD for updating $W$, the change in $W$ at each update would be small, and hence the change in its largest singular value. In our implementation, we took advantage of this fact and reused the $\tilde u$ computed at each step of the algorithm as the initial vector in the subsequent step. In fact, with this ‘recycle’ procedure, one round of power iteration was sufficient in the actual experiment to achieve satisfactory performance.
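The recycling trick can be simulated without a real training loop. In the sketch below, the small random drift added to $W$ is only a stand-in for an SGD update (an assumption, not an actual gradient step), and a single round of the update rule is run per step, warm-started from the previous $\tilde u$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))   # stand-in weight matrix (assumption)
u = rng.standard_normal(8)
u /= np.linalg.norm(u)

errors = []
for step in range(300):
    # One round of power iteration, reusing u from the previous step.
    v = W.T @ u
    v /= np.linalg.norm(v)
    u = W @ v
    u /= np.linalg.norm(u)
    sigma_est = u @ W @ v

    # Compare against the exact spectral norm of the current W.
    sigma_true = np.linalg.svd(W, compute_uv=False)[0]
    errors.append(abs(sigma_est - sigma_true) / sigma_true)

    # Stand-in for a small SGD update (random drift, not a real gradient).
    W = W + 1e-3 * rng.standard_normal(W.shape)

print(errors[0], errors[-1])
```

The relative error starts out large with a random $\tilde u$ and then stays small, because each step only has to correct for a slight change in $W$.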