Basic form
The $L^1$ regularization of the model parameters $w$ has the general form

$$\Omega(\theta)=\|w\|_1=\sum_i |w_i|$$

that is, the sum of the absolute values of the individual parameters; here $\theta$ is simply $w$. If the parameters are instead regularized toward some non-zero value $w^{(o)}$, the $L^1$ regularization introduces a different term:

$$\Omega(\theta)=\|w-w^{(o)}\|_1=\sum_i |w_i-w_i^{(o)}|$$
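As a small illustration of the two penalty terms, the sketch below computes $\|w\|_1$ and $\|w-w^{(o)}\|_1$ for a toy vector. The function name `l1_penalty`, the argument `w_ref` (standing in for $w^{(o)}$), and the sample numbers are assumptions made for this example only.

```python
def l1_penalty(w, w_ref=None):
    """Return ||w - w_ref||_1; w_ref defaults to the zero vector."""
    if w_ref is None:
        w_ref = [0.0] * len(w)
    # Sum of absolute deviations, one term per parameter.
    return sum(abs(wi - ri) for wi, ri in zip(w, w_ref))


w = [0.5, -2.0, 0.0, 1.5]           # toy parameter vector (hypothetical)
w_o = [0.5, -1.0, 0.0, 1.0]         # toy non-zero reference point (hypothetical)

plain = l1_penalty(w)               # 0.5 + 2.0 + 0.0 + 1.5
shifted = l1_penalty(w, w_o)        # 0.0 + 1.0 + 0.0 + 0.5
print(plain, shifted)               # 4.0 1.5
```

With `w_ref` left out, the penalty pulls every parameter toward zero; with a reference point, it pulls each parameter toward the corresponding entry of that point.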
Regularized objective function
It takes the following form:

$$\tilde J(w;X,y)=\alpha\|w\|_1+J(w;X,y)$$
The corresponding gradient (strictly speaking a subgradient, because $|w_i|$ is not differentiable at $w_i=0$) is:

$$\nabla_w \tilde J(w;X,y)=\alpha\,\mathrm{sign}(w)+\nabla_w J(w;X,y)$$

where $\mathrm{sign}(w)$ simply takes the sign of each element of $w$.
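The gradient formula can be sketched in a few lines. Everything named here is an assumption for illustration: the helper `regularized_grad`, the toy loss $J(w)=\frac{1}{2}\sum_i (w_i-t_i)^2$ with gradient $w-t$, and the convention $\mathrm{sign}(0)=0$, which is one valid subgradient choice at zero.

```python
def sign(x):
    """Sign of a scalar: -1, 0 or 1 (sign(0) = 0 is a convention)."""
    return (x > 0) - (x < 0)


def regularized_grad(w, grad_J, alpha):
    """Elementwise alpha * sign(w) + grad J(w)."""
    g = grad_J(w)
    return [alpha * sign(wi) + gi for wi, gi in zip(w, g)]


# Hypothetical loss J(w) = 0.5 * sum((w_i - t_i)^2), so grad J(w) = w - t.
target = [1.0, -1.0, 0.0]


def grad_J(w):
    return [wi - ti for wi, ti in zip(w, target)]


g = regularized_grad([2.0, -3.0, 0.0], grad_J, alpha=0.5)
print(g)  # [1.5, -2.5, 0.0]
```

Note that the penalty contributes a constant magnitude $\alpha$ per coordinate regardless of how large $w_i$ is, unlike an $L^2$ penalty whose contribution scales with $w_i$.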
Approximation
Let $w^*$ be the weight vector at which the unregularized objective attains its minimum training error, i.e. $w^*=\arg\min_w J(w)$, and take a quadratic approximation of the objective in a neighborhood of $w^*$. If the objective is truly quadratic, this approximation is exact. The approximated $J(w)$ has roughly the following form:

$$J(w)\approx J(w^*)+(w-w^*)^{T}J'(w^*)+\frac{1}{2}(w-w^*)^{T}J''(w^*)(w-w^*)$$
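where $J'(w^*)$ is the first derivative of $J$ evaluated at the optimum $w=w^*$ and $J''(w^*)$ is the second derivative at the optimum. Because $w^*$ is a minimizer, the first derivative vanishes there: $J'(w^*)=0$.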
So after simplification:

$$J(w)\approx J(w^*)+\frac{1}{2}(w-w^*)^{T}J''(w^*)(w-w^*)$$
Here we write the second derivative as the Hessian matrix $H$ of $J$, evaluated at $w^*$:

$$J''(w^*)=H=\begin{bmatrix}
\frac{\partial^2 J}{\partial w_1^2} & \frac{\partial^2 J}{\partial w_1\,\partial w_2} & \cdots & \frac{\partial^2 J}{\partial w_1\,\partial w_n}\\
\frac{\partial^2 J}{\partial w_2\,\partial w_1} & \frac{\partial^2 J}{\partial w_2^2} & \cdots & \frac{\partial^2 J}{\partial w_2\,\partial w_n}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 J}{\partial w_n\,\partial w_1} & \frac{\partial^2 J}{\partial w_n\,\partial w_2} & \cdots & \frac{\partial^2 J}{\partial w_n^2}
\end{bmatrix}$$
The final simplified form is:

$$J(w)\approx J(w^*)+\frac{1}{2}(w-w^*)^{T}H(w-w^*)$$
Adding the $L^1$ penalty to this quadratic approximation gives the regularized objective:

$$\hat J(w)=J(w^*)+\frac{1}{2}(w-w^*)^{T}H(w-w^*)+\alpha\|w\|_1$$
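To make the regularized quadratic objective concrete, here is a minimal sketch that evaluates it for a diagonal Hessian, in which case the quadratic form reduces to a per-coordinate sum. The function name `quad_l1_objective`, the argument `j_star` (standing in for the constant $J(w^*)$), and all numbers are hypothetical.

```python
def quad_l1_objective(w, w_star, h_diag, alpha, j_star=0.0):
    """J(w*) + 0.5 * sum_i H_ii (w_i - w*_i)^2 + alpha * ||w||_1.

    Assumes the Hessian is diagonal, with entries given in h_diag.
    """
    quad = sum(0.5 * h * (wi - ws) ** 2
               for wi, ws, h in zip(w, w_star, h_diag))
    penalty = alpha * sum(abs(wi) for wi in w)
    return j_star + quad + penalty


w_star = [3.0, -0.5]    # hypothetical unregularized optimum
h_diag = [2.0, 1.0]     # hypothetical diagonal of H (all positive)
alpha = 1.0

at_optimum = quad_l1_objective(w_star, w_star, h_diag, alpha)
elsewhere = quad_l1_objective([2.5, 0.0], w_star, h_diag, alpha)
print(at_optimum, elsewhere)  # 3.5 2.875
```

The unregularized optimum $w^*$ is no longer the best point once the penalty is added: moving toward zero costs a little in the quadratic term but saves more in the penalty.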
Analysis of $w^*$
Differentiate $\hat J(w)$ with respect to $w$ and set the result to zero (here we assume the Hessian is a diagonal matrix):

$$\nabla_w \hat J(w)=0+2\cdot\frac{1}{2}H(w-w^*)+\alpha\,\mathrm{sign}(w)=H(w-w^*)+\alpha\,\mathrm{sign}(w)=0$$
For each component $i$, this reads:

$$H_{ii}(w_i-w_i^*)+\alpha\,\mathrm{sign}(w_i)=0$$
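Consider the case $w_i=0$. Looking at coordinate $i$ alone, the objective is $\hat J(w_i)=J(w^*)+\frac{1}{2}H_{ii}(w_i-w_i^*)^2+\alpha|w_i|$, which at $w_i=0$ equals $J(w^*)+\frac{1}{2}H_{ii}(w_i^*)^2$. Since $w^*$ is a known quantity, this value is a constant, and for $w_i=0$ to be the solution it must be the minimum of $\hat J$ along coordinate $i$. We illustrate this with the figure below: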
By the properties of a minimum at $w_i=0$ (and assuming $H_{ii}>0$), the left derivative must be non-positive and the right derivative non-negative:
- As $w_i\to 0^-$, $\mathrm{sign}(w_i)=-1$, so $H_{ii}(w_i-w_i^*)+\alpha\,\mathrm{sign}(w_i)=-H_{ii}w_i^*-\alpha\le 0$, hence $w_i^*\ge-\frac{\alpha}{H_{ii}}$.
- As $w_i\to 0^+$, $\mathrm{sign}(w_i)=+1$, so $H_{ii}(w_i-w_i^*)+\alpha\,\mathrm{sign}(w_i)=-H_{ii}w_i^*+\alpha\ge 0$, hence $w_i^*\le\frac{\alpha}{H_{ii}}$.
In summary, when $w_i=0$, we have $-\frac{\alpha}{H_{ii}}\le w_i^*\le\frac{\alpha}{H_{ii}}$.
Consider $w_i>0$: then $w_i=w_i^*-\frac{\alpha}{H_{ii}}$, i.e. $w_i^*=w_i+\frac{\alpha}{H_{ii}}>\frac{\alpha}{H_{ii}}$. So when $w_i^*>\frac{\alpha}{H_{ii}}$, $w_i=w_i^*-\mathrm{sign}(w_i)\frac{\alpha}{H_{ii}}=\mathrm{sign}(w_i^*)\left(|w_i^*|-\frac{\alpha}{H_{ii}}\right)$.
Consider $w_i<0$: then $w_i=w_i^*+\frac{\alpha}{H_{ii}}$, i.e. $w_i^*=w_i-\frac{\alpha}{H_{ii}}<-\frac{\alpha}{H_{ii}}$. So when $w_i^*<-\frac{\alpha}{H_{ii}}$, $w_i=w_i^*-\mathrm{sign}(w_i)\frac{\alpha}{H_{ii}}=\mathrm{sign}(w_i^*)\left(|w_i^*|-\frac{\alpha}{H_{ii}}\right)$.
In summary:
a. When $|w_i^*|\le\frac{\alpha}{H_{ii}}$, $w_i=0$.
b. When $w_i^*>\frac{\alpha}{H_{ii}}$, $w_i=\mathrm{sign}(w_i^*)\left(|w_i^*|-\frac{\alpha}{H_{ii}}\right)=w_i^*-\frac{\alpha}{H_{ii}}$.
c. When $w_i^*<-\frac{\alpha}{H_{ii}}$, $w_i=\mathrm{sign}(w_i^*)\left(|w_i^*|-\frac{\alpha}{H_{ii}}\right)=w_i^*+\frac{\alpha}{H_{ii}}$.
Therefore $w_i=\mathrm{sign}(w_i^*)\max\left(|w_i^*|-\frac{\alpha}{H_{ii}},\,0\right)$.
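The closed-form result is the soft-thresholding rule. The sketch below implements the three cases per coordinate and cross-checks them against a brute-force grid search on the one-dimensional objective $\frac{1}{2}H_{ii}(w_i-w_i^*)^2+\alpha|w_i|$. The function names, the grid range and step, and the sample values are all assumptions for illustration; a diagonal Hessian with $H_{ii}>0$ is assumed throughout.

```python
def soft_threshold(w_star_i, h_ii, alpha):
    """Minimizer of 0.5*h_ii*(w - w_star_i)^2 + alpha*|w| for h_ii > 0."""
    t = alpha / h_ii
    if w_star_i > t:        # case b: shrink toward zero from the right
        return w_star_i - t
    if w_star_i < -t:       # case c: shrink toward zero from the left
        return w_star_i + t
    return 0.0              # case a: |w*_i| <= alpha / H_ii


def brute_force_min(w_star_i, h_ii, alpha, lo=-4.0, hi=4.0, steps=8000):
    """Grid search for the same 1-D minimizer (illustrative check only)."""
    best_w, best_val = None, float("inf")
    for k in range(steps + 1):
        w = lo + (hi - lo) * k / steps
        val = 0.5 * h_ii * (w - w_star_i) ** 2 + alpha * abs(w)
        if val < best_val:
            best_w, best_val = w, val
    return best_w


w_star = [3.0, -0.5, -2.0]   # hypothetical unregularized optimum
h_diag = [2.0, 1.0, 4.0]     # hypothetical diagonal Hessian entries
alpha = 1.0

w_reg = [soft_threshold(ws, h, alpha) for ws, h in zip(w_star, h_diag)]
print(w_reg)  # [2.5, 0.0, -1.75]

for ws, h, w in zip(w_star, h_diag, w_reg):
    assert abs(brute_force_min(ws, h, alpha) - w) < 1e-3
```

The second coordinate shows the sparsity effect: its unregularized optimum $-0.5$ lies inside the interval $[-\alpha/H_{ii},\,\alpha/H_{ii}]=[-1,1]$, so the regularized solution is exactly zero.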