A deep learning model can be very complex, and a very complex model can easily learn too much from the training set and fit the training data extremely well, which readily leads to overfitting. Regularization helps avoid this to some extent: it adds a term to the loss function that constrains the parameters, biasing the training process. In deep learning the regularization term is usually chosen to favor smaller parameter values, which makes the trained model smoother.
L2 regularization
Suppose the parameter set is $\theta=\{w_1,w_2,...,w_n\}$. The L2 norm (strictly speaking, the squared L2 norm) sums the squares of the parameters:

$$||\theta||_2=(w_1)^2+(w_2)^2+...+(w_n)^2$$
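As a quick sanity check, here is a minimal NumPy sketch (the vector `theta` is just an illustrative example) that computes this sum of squares and compares it against the square of the built-in Euclidean norm:

```python
import numpy as np

# Illustrative parameter vector (hypothetical values)
theta = np.array([0.5, -1.2, 3.0, 0.0])

# Sum of squared parameters, as defined above
l2_sq = np.sum(theta ** 2)

# The same quantity via the built-in Euclidean norm, squared
assert np.isclose(l2_sq, np.linalg.norm(theta) ** 2)
print(l2_sq)  # 10.69
```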
L2 in deep learning
- Add a regularization term to the loss function: $L'(\theta)=L(\theta)+\lambda \frac{1}{2}||\theta||_2$
- Take the gradient of the loss function with respect to parameter $w_i$: $\frac{\partial L'}{\partial w_i}=\frac{\partial L}{\partial w_i}+\lambda w_i$
- Update parameter $w_i$: $w_i^{t+1}=w_i^t-\eta \frac{\partial L'}{\partial w_i}=w_i^t-\eta(\frac{\partial L}{\partial w_i}+\lambda w_i^t)=(1-\eta \lambda)w_i^t-\eta \frac{\partial L}{\partial w_i}$
The regularization term $\lambda \frac{1}{2}||\theta||_2$ carries two factors. $\lambda$ is a hyperparameter that controls how much the regularization term influences the loss: the larger $\lambda$, the larger the influence, and thus the larger its effect on the gradient update. The factor $\frac{1}{2}$ is only there so that the coefficient $2$ produced by differentiating the square cancels out.
Compare the update rule $w_i^{t+1}=(1-\eta \lambda)w_i^t-\eta \frac{\partial L}{\partial w_i}$ with the original rule $w_i^{t+1}=w_i^t-\eta \frac{\partial L}{\partial w_i}$: before subtracting learning rate $\times$ gradient, $w_i^t$ is first multiplied by the coefficient $(1-\eta \lambda)$, i.e. it is shrunk once. Since $\eta$ is very small, the amount of shrinkage is governed by $\lambda$: a larger $\lambda$ (as long as the coefficient stays non-negative) shrinks more, but $\lambda$ is usually chosen small, so each update only shrinks $w$ slightly, and this shrinkage is applied at every update. Because L2 regularization keeps pushing the parameters toward smaller values in this way, it is also called Weight Decay.
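The update above translates directly into code. Below is a minimal sketch, assuming vanilla SGD, a NumPy parameter array `w`, and a hypothetical `grad_L` function that returns $\frac{\partial L}{\partial w}$ for the unregularized loss; it applies the L2-regularized update in its equivalent weight-decay form:

```python
import numpy as np

def sgd_step_l2(w, grad_L, eta=0.01, lam=1e-4):
    """One SGD step with L2 regularization (weight decay).

    Two equivalent views of the same update:
        w <- w - eta * (dL/dw + lam * w)
        w <- (1 - eta * lam) * w - eta * dL/dw
    """
    return (1.0 - eta * lam) * w - eta * grad_L(w)

# Toy example: a hypothetical loss L(w) = ||w||^2 / 2, so dL/dw = w
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = sgd_step_l2(w, grad_L=lambda w: w)
print(w)  # every entry has decayed toward 0
```

Most frameworks expose this directly; for example, PyTorch's SGD optimizer takes a `weight_decay` argument that adds exactly this $\lambda w$ term to the gradient.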
L1 regularization
Suppose the parameter set is $\theta=\{w_1,w_2,...,w_n\}$. The L1 norm sums the absolute values of the parameters:

$$||\theta||_1=|w_1|+|w_2|+...+|w_n|$$
L1 in deep learning
- Add a regularization term to the loss function: $L'(\theta)=L(\theta)+\lambda ||\theta||_1$ (unlike L2, no $\frac{1}{2}$ factor is needed, because there is no square to differentiate)
- Take the gradient of the loss function with respect to parameter $w_i$: $\frac{\partial L'}{\partial w_i}=\frac{\partial L}{\partial w_i}+\lambda \, sgn(w_i)$
- Update parameter $w_i$: $w_i^{t+1}=w_i^t-\eta \frac{\partial L'}{\partial w_i}=w_i^t-\eta(\frac{\partial L}{\partial w_i}+\lambda \, sgn(w_i^t))=w_i^t-\eta \frac{\partial L}{\partial w_i}-\eta \lambda \, sgn(w_i^t)$
When taking the gradient with respect to $w_i$, how do we differentiate the absolute-value term $|w_i|$, i.e. compute $\frac{d(|w_i|)}{dw_i}$? From the shape of the absolute-value function: when $w_i<0$ the derivative is $-1$, when $w_i>0$ it is $+1$, and at $w_i=0$ we take it to be $0$ by convention. All three cases are summarized by $sgn(w_i)$.
Compare the update rule $w_i^{t+1}=w_i^t-\eta \frac{\partial L}{\partial w_i}-\eta \lambda \, sgn(w_i^t)$ with the original rule $w_i^{t+1}=w_i^t-\eta \frac{\partial L}{\partial w_i}$: each update adds an extra term $(-\eta \lambda \, sgn(w_i^t))$. If the parameter $w_i^t$ is positive, $sgn(w_i^t)=+1$ and the update adds $(-\eta \lambda)$; if it is negative, $sgn(w_i^t)=-1$ and the update adds $(+\eta \lambda)$. Either way, $w_i^t$ is pushed toward a smaller absolute value.
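A minimal sketch of this update, under the same assumptions as the L2 sketch above (vanilla SGD, a NumPy parameter array, a hypothetical `grad_L`); `np.sign` plays the role of $sgn$ and returns $0$ at exactly $0$:

```python
import numpy as np

def sgd_step_l1(w, grad_L, eta=0.01, lam=1e-4):
    """One SGD step with L1 regularization (subgradient update).

    Implements  w <- w - eta * dL/dw - eta * lam * sgn(w).
    """
    return w - eta * grad_L(w) - eta * lam * np.sign(w)

# Toy example with the same hypothetical loss L(w) = ||w||^2 / 2
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = sgd_step_l1(w, grad_L=lambda w: w)
print(w)  # besides the gradient step, a fixed eta*lam is subtracted along sign(w)
```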
L1 vs L2
The update rules of L1 and L2 for parameter $w$:

$$L1: \quad w_i^{t+1}=w_i^t-\eta \frac{\partial L}{\partial w_i}-\eta \lambda \, sgn(w_i^t)$$

$$L2: \quad w_i^{t+1}=w_i^t-\eta \frac{\partial L}{\partial w_i}-\eta \lambda w_i^t$$
Both L1 and L2 push the absolute values of the parameters toward smaller values, but they regularize in different ways:
- L1 subtracts a static value of fixed magnitude: $(-\eta \lambda \, sgn(w_i^t))$
- L2 subtracts a dynamic value that depends on the current parameter: $(-\eta \lambda w_i^t)$
The size of L2's regularization step is proportional to $w$: the larger $|w|$, the larger the step and the faster the parameter shrinks, but once $|w|$ becomes small the shrinkage slows down sharply, so the trained parameters tend to be uniformly small on average. L1, by contrast, subtracts a fixed amount at every step, so the trained parameters tend to be sparse: many end up close to $0$, while some can remain large.
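This difference is easy to see in a deliberately artificial setting (same assumptions as the sketches above, with the data gradient set to zero so that only the regularizer acts): under L1 a small weight is driven essentially to $0$ while a large weight keeps most of its magnitude, whereas under L2 every weight is shrunk by the same multiplicative factor:

```python
import numpy as np

zero_grad = lambda w: np.zeros_like(w)  # pretend dL/dw = 0 to isolate the regularizer

w_l1 = np.array([5.0, 0.05])
w_l2 = np.array([5.0, 0.05])
eta, lam = 0.01, 1.0

for _ in range(200):
    w_l1 = w_l1 - eta * zero_grad(w_l1) - eta * lam * np.sign(w_l1)  # fixed-size step
    w_l2 = (1.0 - eta * lam) * w_l2 - eta * zero_grad(w_l2)          # proportional shrink

print(w_l1)  # roughly [3.0, ~0]: the small weight reaches (and then hovers within eta*lam of) 0
print(w_l2)  # roughly [0.67, 0.0067]: both weights scaled by the same factor (1 - eta*lam)^200
```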