L1 and L2 Norms, L1 and L2 Loss Functions, L1 and L2 Regularization
flyfish
L1 and L2 as norms (L1-Norm and L2-Norm)
L1-Norm
The L1 norm is also known as the Manhattan distance.
The Manhattan distance between two points P1: (X1, Y1) and P2: (X2, Y2) is
$$\| X \|_{1} = |x_1 - x_2| + |y_1 - y_2|$$
For example, if the coordinate differences are 3 and 4:
$$\| X \|_{1} = |3| + |4| = 7$$
A point in n-dimensional space is (x1, x2, …, xN); the Manhattan distance between two points P1: (X1, X2, ..., XN) and P2: (Y1, Y2, ..., YN) is
$$|x_1 - y_1| + |x_2 - y_2| + \dots + |x_N - y_N|$$
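A minimal NumPy sketch of this distance, with made-up points:

import numpy as np

p1 = np.array([1.0, 2.0, 3.0])
p2 = np.array([4.0, 0.0, 3.0])

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(p1 - p2))
print(manhattan)  # 5.0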
L2-Norm
The L2 norm is also known as the Euclidean distance.
The Euclidean distance between two points P1: (X1, Y1) and P2: (X2, Y2) is
$$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
In n-dimensional space, the Euclidean distance between two points P1: (X1, X2, …, XN) and P2: (Y1, Y2, …, YN) is
$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_N - y_N)^2}$$
For example:
$$u = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$
$$\|u\|_2 = \sqrt{|3|^2 + |4|^2} = \sqrt{25} = 5$$
So the L2 norm of u is 5.
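As a quick check, NumPy's np.linalg.norm computes both norms directly (a minimal sketch):

import numpy as np

u = np.array([3.0, 4.0])

# ord=1 gives the L1 norm, ord=2 gives the L2 norm
print(np.linalg.norm(u, ord=1))  # 7.0
print(np.linalg.norm(u, ord=2))  # 5.0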
L1 and L2 as loss functions (As an Error Function)
The L1 loss (the three formulas below say the same thing with different letter conventions):
$$L_1(\hat{y}, y) = \sum_{i=0}^{m} |y^{(i)} - \hat{y}^{(i)}|$$
$$S = \sum_{i=1}^{n} |Y_i - f(x_i)|$$
$$S = \sum_{i=0}^{n} |y_i - h(x_i)|$$
The L2 loss (again, three formulas saying the same thing with different letter conventions):
$$L_2(\hat{y}, y) = \sum_{i=0}^{m} (y^{(i)} - \hat{y}^{(i)})^2$$
$$S = \sum_{i=1}^{n} \big(Y_i - f(x_i)\big)^2$$
$$S = \sum_{i=0}^{n} (y_i - h(x_i))^2$$
Code implementation

import numpy as np

def L1(yhat, y):
    # L1 loss: sum of absolute errors
    loss = np.sum(np.abs(y - yhat))
    return loss

def L2(yhat, y):
    # L2 loss: sum of squared errors
    loss = np.sum(np.power((y - yhat), 2))
    return loss

# Example call
yhat = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = np.array([1, 1, 0, 1, 1])
print("L1 =", L1(yhat, y))
print("L2 =", L2(yhat, y))
L1 and L2 as regularization (As Regularization)
Definition of regularization: a modification made to a learning algorithm that is intended to reduce generalization error rather than training error.
Extra constraints and penalties can improve a model's performance on the test set.
For linear regression, the model with an added L1 penalty is called Lasso regression, and the model with an added L2 penalty is called Ridge regression.
When the regularization term is added to the loss function, a factor of 1/2 is often placed in front of the expression to make differentiation cleaner.
LASSO regression
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_{\theta}(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n} |\theta_j|, \quad \lambda > 0$$
Ridge regression
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_{\theta}(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n} \theta_j^2, \quad \lambda > 0$$
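As a sketch of what these two models look like in code, here is an illustrative scikit-learn example on made-up data; the alpha parameter plays the role of λ:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: 100 samples, 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1-penalized linear regression
ridge = Ridge(alpha=0.1).fit(X, y)   # L2-penalized linear regression

print(lasso.coef_)  # most of the irrelevant coefficients end up exactly 0
print(ridge.coef_)  # coefficients are shrunk toward 0, but usually not exactly 0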
OLS + regularization term, written with different letters
Ordinary Least Squares (OLS):
$$\begin{aligned} \hat{\beta}_{\text{OLS}} &= \arg\min_{\beta} \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p})\big)^2 \\ &= \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \end{aligned} \tag{1}$$
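A minimal NumPy sketch of solving the plain OLS problem on made-up data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # 50 samples, p = 3 features
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.1, size=50)   # intercept beta_0 = 1.0

# Prepend a column of ones for the intercept, then solve least squares
X1 = np.hstack([np.ones((50, 1)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)   # approximately [1.0, 2.0, -1.0, 0.5]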
OLS + regularization term:
$$\hat{\beta}_{L1}=\arg\min_{\beta}\left(\sum_{i=1}^{n}\big(y_{i}-(\beta_{0}+\beta_{1} x_{i,1}+\dots+\beta_{p} x_{i,p})\big)^{2}+\lambda \sum_{j=0}^{p}|\beta_{j}|\right)$$
$$\hat{\beta}_{L2}=\arg\min_{\beta}\left(\sum_{i=1}^{n}\big(y_{i}-(\beta_{0}+\beta_{1} x_{i,1}+\dots+\beta_{p} x_{i,p})\big)^{2}+\lambda \sum_{j=0}^{p}|\beta_{j}|^{2}\right)$$
What problem does this solve?
When I saw the figure in Pattern Recognition and Machine Learning where L2 is drawn as a circle and L1 as a diamond, I was deeply struck by the author's brilliance; he has truly grasped L1 and L2. The author is a god of learning, but gods stay up in the sky and speak only a few short, not very approachable sentences, so someone like me is needed to act as the oracle. Climbing the world tree, the ladder to heaven, I go up to consult the god of learning and relay the truth of L1 and L2 to the world.
The figure from the book
Drawing the figure in code (the two objectives being minimized are below, followed by a plotting sketch):
$$\hat{\theta}_{lasso} = \arg\min_{\theta \in \mathbb{R}^n} \sum_{i=1}^{m} (y_i - \mathbf{x}_i^T \theta)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$
$$\hat{\theta}_{ridge} = \arg\min_{\theta \in \mathbb{R}^n} \sum_{i=1}^{m} (y_i - \mathbf{x}_i^T \theta)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$
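A minimal matplotlib sketch of that classic figure; it is only an illustration (the contour center and constraint radius are made up), drawing the L1 diamond and the L2 circle as constraint regions under quadratic loss contours:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Contours of a quadratic loss centered away from the origin
t1, t2 = np.meshgrid(np.linspace(-2, 4, 400), np.linspace(-2, 4, 400))
loss = (t1 - 2.0) ** 2 + 2.0 * (t2 - 1.5) ** 2

for ax, title in zip(axes, ["L1 (diamond)", "L2 (circle)"]):
    ax.contour(t1, t2, loss, levels=10, colors="gray")
    ax.axhline(0, color="black", linewidth=0.5)
    ax.axvline(0, color="black", linewidth=0.5)
    ax.set_title(title)
    ax.set_aspect("equal")

# L1 constraint region |θ1| + |θ2| <= 1: a diamond
axes[0].add_patch(plt.Polygon([(1, 0), (0, 1), (-1, 0), (0, -1)], alpha=0.4))

# L2 constraint region θ1^2 + θ2^2 <= 1: a circle
axes[1].add_patch(plt.Circle((0, 0), 1.0, alpha=0.4))

plt.show()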
How things change as $\lambda$ varies: with $\lambda = 0$ there is no regularization term; as $\lambda$ gradually increases, the model moves from overfitting → good generalization → underfitting.
L2 regularization is one way to prevent a model from overfitting: it heavily penalizes large weight values and encourages small ones.
A bad fit (overfitted)
A good fit
In this regime we prefer a simple model over a complex one.
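A small sketch of this effect using scikit-learn's Ridge on made-up data; as alpha (playing the role of λ) grows, the weight vector shrinks toward zero and the model gets simpler:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([5.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the learned weights shrinks as the penalty grows
    print(alpha, np.round(np.linalg.norm(model.coef_), 3))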
Why L1 is used for model compression
The goal is to drive unimportant parameters to 0 and then discard them.
Look at the gradient of the L1 term.
Start with the derivative of the absolute value:
$$|x|^{\prime}=\frac{1}{2 \sqrt{x^{2}}} \cdot 2 x=\frac{x}{\sqrt{x^{2}}}=\frac{x}{|x|}$$
There are three cases:
When $x > 0$, the derivative is $1$.
When $x = 0$, the derivative is undefined.
When $x < 0$, the derivative is $-1$.
Another way to write the L1 gradient is:
$$L=L+\lambda \sum_{i=1}^{n}|w_{i}|$$
$$\frac{\partial L}{\partial w_{i}}=\frac{\partial L}{\partial w_{i}}+\lambda \operatorname{sign}(w_{i})$$
$$w_{i}=w_{i}-\eta \frac{\partial L}{\partial w_{i}}-\eta \lambda \operatorname{sign}(w_{i})$$
The main thing to look at is the parameter update step; several equivalent ways of writing it are given below, so pick whichever reads most easily.
$$w_{ij}^{(r)} \leftarrow w_{ij}^{(r)} - \eta \lambda\, \operatorname{sgn}\big(w_{ij}^{(r)}\big) - \eta\, \frac{\partial \mathcal{L}}{\partial w_{ij}^{(r)}}$$
or
$$w_{i}=w_{i}-\eta \frac{\partial L}{\partial w_{i}}-\eta \lambda \operatorname{sign}(w_{i})$$
or
$$\begin{aligned} w_{\text{new}} &= w-\eta \frac{\partial L_{1}}{\partial w} \\ &= w-\eta \cdot\left[2 x(w x+b-y)+\lambda \frac{d|w|}{d w}\right] \\ &= \begin{cases} w-\eta \cdot[2 x(w x+b-y)+\lambda] & w>0 \\ w-\eta \cdot[2 x(w x+b-y)-\lambda] & w<0 \end{cases} \end{aligned}$$
or
$$w_i \to w'_i \overset{\text{def}}{=} w_i - \eta\frac{\partial L}{\partial w_i} - \eta\frac{\gamma\ell_1}{n}\operatorname{sgn}(w_i)$$
Because $\eta\frac{\gamma\ell_1}{n} > 0$, the term $\eta\frac{\gamma\ell_1}{n}\operatorname{sgn}(w_i)$ keeps pushing $w_i$ toward $0$; once a weight reaches $0$, the parameter vector becomes sparse.
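To close, a minimal NumPy sketch of the update $w \leftarrow w - \eta\frac{\partial L}{\partial w} - \eta\lambda\,\operatorname{sign}(w)$; the data gradient is set to zero here to isolate the L1 term, and the clip-to-zero step when a weight changes sign is an illustrative choice (it keeps weights at exactly 0 instead of letting them oscillate around it):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)        # made-up weights
eta, lam = 0.1, 0.5           # made-up learning rate and L1 strength

for _ in range(100):
    data_grad = np.zeros_like(w)              # pretend the data-loss gradient is 0
    w_new = w - eta * (data_grad + lam * np.sign(w))
    # Weights that would cross zero are clipped to exactly zero -> sparsity
    w_new[np.sign(w_new) != np.sign(w)] = 0.0
    w = w_new

print(w)   # with these settings, every weight ends at exactly 0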