L2 Normalization
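The second kind of normalization scales each sample to unit norm, so that every sample has a norm of 1. The main variants are L1 normalization (using the L1 norm) and L2 normalization (using the L2 norm).
The idea is to compute the p-norm of each sample and then divide every element of that sample by it. After this step, the p-norm of each processed sample (for example its l1-norm or l2-norm) equals 1.
The p-norm is computed as: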
$$||X||_p=\left(|x_1|^p+|x_2|^p+\dots+|x_n|^p\right)^{1/p}$$
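To make the formula concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The helper names `p_norm` and `normalize` are made up for this illustration, and leaving an all-zero sample unchanged is an assumption, not something the text above specifies.

```python
def p_norm(x, p):
    """Return the p-norm of a sequence of numbers (p >= 1)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def normalize(x, p):
    """Divide every element by the p-norm so the result has unit p-norm."""
    norm = p_norm(x, p)
    if norm == 0:
        return list(x)  # assumption: an all-zero sample is returned as-is
    return [v / norm for v in x]


sample = [3.0, -4.0]
l1 = normalize(sample, 1)  # the L1 norm of the sample is 3 + 4 = 7
l2 = normalize(sample, 2)  # the L2 norm of the sample is sqrt(9 + 16) = 5

print(l1, p_norm(l1, 1))
print(l2, p_norm(l2, 2))
```

After normalization, the p-norm of each result is 1 (up to floating-point rounding), which is exactly the property described above.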
In TensorFlow, this method is implemented by the following function:
tf.nn.l2_normalize(x,
dim,
epsilon=1e-12,
name=None)
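In the signature above:
x is the input tensor;
dim is the dimension (or dimensions) along which to apply L2 normalization; for a 2-D input it can be 0, 1, or [0, 1];
epsilon is a lower bound on the norm, which guards against division by zero;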
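The role of epsilon is easiest to see in code. As far as the TensorFlow documentation describes it, the function computes `x / sqrt(max(sum(x**2), epsilon))`. The sketch below reproduces that expression for a flat Python list; `l2_normalize_1d` is a hypothetical helper written for this illustration, not a TensorFlow function.

```python
import math


def l2_normalize_1d(x, epsilon=1e-12):
    """Sketch of x / sqrt(max(sum(x**2), epsilon)) for a flat list."""
    squared_sum = sum(v * v for v in x)
    # epsilon only matters when the squared sum is smaller than it
    divisor = math.sqrt(max(squared_sum, epsilon))
    return [v / divisor for v in x]


print(l2_normalize_1d([3.0, 4.0]))  # ordinary case: the divisor is the true norm
print(l2_normalize_1d([0.0, 0.0]))  # all-zero input: epsilon avoids 0 / 0
```

For an ordinary input the result is the same as dividing by the L2 norm; for an all-zero input the output is simply all zeros instead of NaN.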
Here is an example:
# -*- coding: utf-8 -*-
# Note: this example uses the TensorFlow 1.x API (tf.Session and the `dim` argument).
import tensorflow as tf

input_data = tf.constant([[1.0, 2, 3], [4.0, 5, 6], [7.0, 8, 9]])

# dim=0 normalizes each column, dim=1 each row, dim=[0, 1] the whole matrix
output_1 = tf.nn.l2_normalize(input_data, dim=0, epsilon=1e-10, name='nn_l2_norm')
output_2 = tf.nn.l2_normalize(input_data, dim=1, epsilon=1e-10, name='nn_l2_norm')
output_3 = tf.nn.l2_normalize(input_data, dim=[0, 1], epsilon=1e-10, name='nn_l2_norm')

with tf.Session() as sess:
    print(output_1.eval())
    print(output_2.eval())
    print(output_3.eval())

'''output:
[[0.12309149 0.20739034 0.26726127]
[0.49236596 0.51847583 0.53452253]
[0.86164045 0.82956135 0.80178374]]
[[0.26726124 0.5345225 0.8017837 ]
[0.45584232 0.5698029 0.6837635 ]
[0.5025707 0.5743665 0.64616233]]
[[0.05923489 0.11846977 0.17770466]
[0.23693955 0.29617444 0.35540932]
[0.4146442 0.4738791 0.53311396]]
'''
dim = 0 applies L2 normalization column by column, with one norm per column:
$$norm(1) = \sqrt{1^2+4^2+7^2}=\sqrt{66}$$
$$norm(2) = \sqrt{2^2+5^2+8^2}=\sqrt{93}$$
$$norm(3) = \sqrt{3^2+6^2+9^2}=\sqrt{126}$$
[[1./norm(1), 2./norm(2) , 3./norm(3) ]
[4./norm(1) , 5./norm(2) , 6./norm(3) ] =
[7./norm(1) , 8./norm(2) , 9./norm(3) ]]
[[0.12309149 0.20739034 0.26726127]
[0.49236596 0.51847583 0.53452253]
[0.86164045 0.82956135 0.80178374]]
dim = 1 applies L2 normalization row by row, with one norm per row:
$$norm(1) = \sqrt{1^2+2^2+3^2}=\sqrt{14}$$
$$norm(2) = \sqrt{4^2+5^2+6^2}=\sqrt{77}$$
$$norm(3) = \sqrt{7^2+8^2+9^2}=\sqrt{194}$$
[[1./norm(1), 2./norm(1) , 3./norm(1) ]
[4./norm(2) , 5./norm(2) , 6./norm(2) ] =
[7./norm(3) , 8./norm(3) , 9./norm(3) ]]
[[0.26726124 0.5345225 0.8017837 ]
[0.45584232 0.5698029 0.6837635 ]
[0.5025707 0.5743665 0.64616233]]
dim = [0, 1] applies L2 normalization over rows and columns together, with a single norm for the whole matrix:
$$norm=\sqrt{1^2+2^2+3^2+4^2+5^2+6^2+7^2+8^2+9^2}=\sqrt{285}\approx 16.8819$$
[[1./norm, 2./norm , 3./norm ]
[4./norm , 5./norm , 6./norm ] =
[7./norm , 8./norm , 9./norm ]]
[[0.05923489 0.11846977 0.17770466]
[0.23693955 0.29617444 0.35540932]
[0.4146442 0.4738791 0.53311396]]
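The three cases can be checked by hand without TensorFlow. The following plain-Python sketch recomputes them for the same 3x3 matrix; the variable names and the helper `l2` are chosen for this illustration only.

```python
import math

data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]


def l2(values):
    """L2 norm of a flat sequence of numbers."""
    return math.sqrt(sum(v * v for v in values))


# dim=0: one norm per column, each element divided by its column's norm
col_norms = [l2(col) for col in zip(*data)]
by_column = [[v / col_norms[j] for j, v in enumerate(row)] for row in data]

# dim=1: one norm per row, each element divided by its row's norm
row_norms = [l2(row) for row in data]
by_row = [[v / row_norms[i] for v in row] for i, row in enumerate(data)]

# dim=[0, 1]: a single norm over all nine entries
total = l2([v for row in data for v in row])
overall = [[v / total for v in row] for row in data]

for name, result in [("dim=0", by_column), ("dim=1", by_row), ("dim=[0, 1]", overall)]:
    print(name)
    for row in result:
        print(["%.6f" % v for v in row])
```

In each case the axis that was normalized ends up with unit L2 norm: every column for dim=0, every row for dim=1, and the flattened matrix for dim=[0, 1].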
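L1 and L2 regularization
L1 regularization
Linear models are commonly used for regression and classification. To keep a model from overfitting, L1 and L2 regularization are used to reduce its complexity. Many articles on regularizing linear regression note that L1 lowers complexity by making the parameters sparse (fewer non-zero parameters), while L2 lowers it by shrinking the parameter values.
The loss function with L1 regularization is: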
$$L(w)=E_D(w)+\frac{\lambda}{n}\sum_i^n|w_i|$$
In the expression above, $E_D(w)$ is the loss function and $L$ is the loss function with the regularization term added.
Taking the gradient of $L(w)$:
$$\frac{\partial L(w)}{\partial w}=\frac{\partial E_D(w)}{\partial w}+\left(\frac{\lambda}{n}\sum_i^n|w_i|\right)'$$
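The primed term is the derivative of the L1 penalty. For a non-zero weight it equals $\frac{\lambda}{n}$ times the sign of that weight; at exactly 0 the absolute value is not differentiable. The sketch below is one way to write this in plain Python. It reads n in the formula as the number of weights being summed, and it picks 0 as the derivative at a zero weight; both are assumptions of this illustration, as are the function names.

```python
def sign(v):
    """Return 1, -1 or 0 depending on the sign of v."""
    return (v > 0) - (v < 0)


def l1_penalty(weights, lam):
    """(lam / n) * sum(|w_i|), with n taken as the number of weights."""
    n = len(weights)
    return lam / n * sum(abs(w) for w in weights)


def l1_penalty_grad(weights, lam):
    """Derivative of the penalty with respect to each w_i.

    Assumption: at w_i == 0 we use 0, one valid subgradient choice.
    """
    n = len(weights)
    return [lam / n * sign(w) for w in weights]


weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, lam=0.3))
print(l1_penalty_grad(weights, lam=0.3))
```

Note that the magnitude of each gradient entry does not depend on how large the weight is, only on its sign.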
The weight update is:
$$w'=w-\eta\left(\frac{\partial E_D(w)}{\partial w}+\left(\frac{\lambda}{n}\sum_i^n|w_i|\right)'\right)$$
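Suppose every w is greater than 0. The derivative of each $|w_i|$ is then 1, and the update becomes:
$$w'=w-\eta\frac{\partial E_D(w)}{\partial w}-\frac{\eta\lambda}{n}$$
Because w > 0, every step subtracts the constant $\frac{\eta\lambda}{n}$, which pushes w toward 0. If w < 0, the sign flips and the same constant is added instead, which again pushes w toward 0. So the regularization term makes a positive w smaller and a negative w larger. L1 regularization therefore tends to drive parameters to exactly 0, that is, it makes the features sparse.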
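A small simulation shows this effect. The sketch below applies the L1 update repeatedly to one positive and one negative weight, with the data gradient set to 0 so that only the regularization term acts. The function `l1_step`, the chosen values of eta, lam and n, and the rule of stopping at zero when a step would cross it are all assumptions made for this illustration.

```python
def l1_step(w, eta, lam, n, data_grad=0.0):
    """One update w' = w - eta * (dE_D/dw + (lam / n) * sign(w))."""
    sign = (w > 0) - (w < 0)
    new_w = w - eta * (data_grad + lam / n * sign)
    # Assumption: if the step would cross zero, stop at zero instead
    if w * new_w < 0:
        return 0.0
    return new_w


w_pos, w_neg = 0.05, -0.05
for step in range(8):
    w_pos = l1_step(w_pos, eta=0.1, lam=1.0, n=10)
    w_neg = l1_step(w_neg, eta=0.1, lam=1.0, n=10)
    print(step, w_pos, w_neg)
```

Each step moves the weight by the same fixed amount, eta * lam / n, regardless of how small the weight already is, so both weights reach 0 after a handful of steps and stay there.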
L2 regularization
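The loss function with L2 regularization adds a penalty on the squared weights to $E_D(w)$.
Compared with the update without a regularization term, the regularized update contains one extra term, and that term is proportional to w. As w approaches 0 the term shrinks with it, so the parameter decreases very slowly. L2 regularization therefore pushes the parameters into a very small range, but not to exactly 0.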
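The slow approach to 0 can also be simulated. The L2 formula itself is not given here, so the sketch below assumes the common form $E_D(w)+\frac{\lambda}{2n}\sum_i^n w_i^2$, whose derivative adds $\frac{\lambda}{n}w$ to the gradient. The function `l2_step` and the values of eta, lam and n are likewise chosen only for illustration, and the data gradient is set to 0 so that only the regularization term acts.

```python
def l2_step(w, eta, lam, n, data_grad=0.0):
    """One update w' = w - eta * (dE_D/dw + (lam / n) * w).

    Assumes the penalty (lam / (2 * n)) * sum(w_i ** 2).
    """
    return w - eta * (data_grad + lam / n * w)


w = 0.05
history = []
for step in range(200):
    w = l2_step(w, eta=0.1, lam=1.0, n=10)
    history.append(w)

# Print the weight after 1, 50 and 200 steps
print(history[0], history[49], history[199])
```

With these values every step multiplies the weight by 0.99, so the step size shrinks along with the weight: after 200 steps the weight is much smaller but still not 0.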