The proximal gradient method is a classical method for convex optimization problems whose objective contains a nondifferentiable term. Its core idea is to recast the nonsmooth part of the problem as an easily solvable proximal mapping, and to solve the original problem approximately by iterating that mapping.
一、Proximal Mapping
The proximal mapping is the core tool of the proximal gradient method. Suppose the function $f(\boldsymbol x)$ has domain $\boldsymbol U$. The proximal mapping of $\boldsymbol x$ is defined as:

$$\mathrm{prox}_f(\boldsymbol x)=\arg\min_{\boldsymbol u\in \boldsymbol U}\left(f(\boldsymbol u)+\frac{1}{2}\|\boldsymbol u-\boldsymbol x\|_2^2\right),\quad \forall\,\boldsymbol x\in \boldsymbol U$$

Informally, the proximal mapping of $\boldsymbol x$ is the point $\boldsymbol u$ in the same domain that minimizes "half the squared Euclidean distance to $\boldsymbol x$ plus the function value $f(\boldsymbol u)$".
Below are some examples of the proximal mapping for specific choices of $f(\boldsymbol x)$:
1) Constant function

Let $f(\boldsymbol x)=c$. Then:

$$\mathrm{prox}_f(\boldsymbol x)=\arg\min_{\boldsymbol u\in\boldsymbol U}\left(c+\frac{1}{2}\|\boldsymbol u-\boldsymbol x\|_2^2\right)$$

Clearly, in this case $\mathrm{prox}_f(\boldsymbol x)=\boldsymbol x$.
2) Affine function
Let $f(\boldsymbol x)=\boldsymbol a^T\boldsymbol x+b$. Then:

$$\mathrm{prox}_f(\boldsymbol x)=\arg\min_{\boldsymbol u\in\boldsymbol U}\left(\boldsymbol a^T\boldsymbol u+b+\frac{1}{2}\|\boldsymbol u-\boldsymbol x\|_2^2\right)$$

Differentiating with respect to $\boldsymbol u$ and setting the gradient to zero gives:

$$\mathrm{prox}_f(\boldsymbol x)=\boldsymbol x-\boldsymbol a$$
3) Quadratic function
Let $f(\boldsymbol x)=\frac{1}{2}\boldsymbol x^T\boldsymbol A\boldsymbol x+\boldsymbol b^T\boldsymbol x+c$, with $\boldsymbol A$ symmetric positive semidefinite. Then:

$$\mathrm{prox}_f(\boldsymbol x)=\arg\min_{\boldsymbol u\in\boldsymbol U}\left(\frac{1}{2}\boldsymbol u^T\boldsymbol A\boldsymbol u+\boldsymbol b^T\boldsymbol u+c+\frac{1}{2}\|\boldsymbol u-\boldsymbol x\|_2^2\right)$$

Differentiating with respect to $\boldsymbol u$ and setting the gradient to zero gives:

$$\mathrm{prox}_f(\boldsymbol x)=(\boldsymbol A+\boldsymbol I)^{-1}(\boldsymbol x-\boldsymbol b)$$
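The quadratic case is easy to sanity-check numerically: the closed form $(\boldsymbol A+\boldsymbol I)^{-1}(\boldsymbol x-\boldsymbol b)$ should zero the gradient of the prox objective. A minimal sketch (the matrices here are synthetic):

```python
import numpy as np

# Verify prox_f(x) = (A + I)^{-1} (x - b) for the quadratic case: the gradient
# of f(u) + 1/2 ||u - x||^2, namely A u + b + (u - x), must vanish at the
# proximal point.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T                                 # symmetric PSD, as the formula assumes
b = rng.standard_normal(3)
x = rng.standard_normal(3)

u = np.linalg.solve(A + np.eye(3), x - b)   # closed-form proximal point
grad = A @ u + b + (u - x)                  # gradient of the prox objective at u
print(np.allclose(grad, 0.0))               # True
```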
4) L1 norm
Let $f(\boldsymbol x)=t\|\boldsymbol x\|_1$. Then:

$$\mathrm{prox}_f(\boldsymbol x)=\arg\min_{\boldsymbol u\in\boldsymbol U}\left(t\|\boldsymbol u\|_1+\frac{1}{2}\|\boldsymbol u-\boldsymbol x\|_2^2\right)$$

Note that the L1 norm is not differentiable everywhere, so we need the subgradient optimality condition, which turns the problem into:

$$\begin{aligned}&0\in\partial\left(t\|\boldsymbol u\|_1+\frac{1}{2}\|\boldsymbol u-\boldsymbol x\|_2^2\right)\\\Rightarrow\ &\boldsymbol x-\boldsymbol u\in t\,\partial\|\boldsymbol u\|_1\end{aligned}$$

In this condition the components $u_i$ of $\boldsymbol u$ are not coupled, so each $u_i$ can be treated separately. Consider the sign of $u_i$ case by case:
(a) If $u_i>0$: then $\partial|u_i|=\{1\}$, so the condition gives $u_i=x_i-t$; consistency requires $x_i-t=u_i>0$, i.e. $x_i>t$.

(b) If $u_i<0$: then $\partial|u_i|=\{-1\}$, so $u_i=x_i+t$; consistency requires $x_i+t=u_i<0$, i.e. $x_i<-t$.

(c) If $u_i=0$: then $\partial|u_i|=[-1,1]$, so $x_i-u_i\in[-t,t]$, i.e. $|x_i|\le t$.

Putting the cases together, the solution of the optimization problem defining the proximal mapping is $u_i=S_t(x_i)$, where $S_t(\cdot)$ is called the soft-thresholding function:

$$S_t(x_i)=\begin{cases}x_i-t,&\text{if } x_i>t\\0,&\text{if } |x_i|\le t\\x_i+t,&\text{if } x_i<-t\end{cases}$$
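The soft-thresholding function has a direct vectorized implementation; a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def soft_threshold(x, t):
    """Componentwise S_t: shrink each entry toward 0 by t, clamping [-t, t] to 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(soft_threshold(np.array([3.0, 0.4, -2.0]), 1.0))  # [ 2.  0. -1.]
```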
二、The Proximal Gradient Method
Consider the following unconstrained problem:

$$\min f(\boldsymbol x)=g(\boldsymbol x)+h(\boldsymbol x)$$

where $g(\boldsymbol x)$ is convex and differentiable, and $h(\boldsymbol x)$ is convex but possibly nondifferentiable (when splitting the objective, $h$ should be kept as simple as possible).
对于 g ( x ) g(\boldsymbol x) g(x),在 x k \boldsymbol x_k xk处进行二阶泰勒展开的近似,可知存在值 t t t,使得: g ( x ) = g ( x k ) + ∇ g T ( x k ) ( x − x k ) + 1 2 t ( x − x k ) T ( x − x k ) g(\boldsymbol x)=g(\boldsymbol x_{k})+\nabla g^T(\boldsymbol x_{k})(\boldsymbol x-\boldsymbol x_{k})+\frac{1}{2t}(\boldsymbol x-\boldsymbol x_{k})^T(\boldsymbol x-\boldsymbol x_{k}) g(x)=g(xk)+∇gT(xk)(x−xk)+2t1(x−xk)T(x−xk)因此,最优化问题可写为 arg min x g ( x k ) + ∇ g T ( x k ) ( x − x k ) + 1 2 t ( x − x k ) T ( x − x k ) + h ( x ) = arg min x 1 2 t ∣ ∣ x − ( x k − t ∇ g ( x k ) ) ∣ ∣ 2 2 + h ( x ) \begin{aligned}&\arg\min_x g(\boldsymbol x_{k})+\nabla g^T(\boldsymbol x_{k})(\boldsymbol x-\boldsymbol x_{k})+\frac{1}{2t}(\boldsymbol x-\boldsymbol x_{k})^T(\boldsymbol x-\boldsymbol x_{k})+h(\boldsymbol x)\\=&\arg\min_x\frac{1}{2t}||\boldsymbol x-(\boldsymbol x_k-t\nabla g(\boldsymbol x_k))||_2^2+h(\boldsymbol x)\end{aligned} =argxming(xk)+∇gT(xk)(x−xk)+2t1(x−xk)T(x−xk)+h(x)argxmin2t1∣∣x−(xk−t∇g(xk))∣∣22+h(x)仔细观察上式,如果将 x k − t ∇ g ( x k ) \boldsymbol x_k-t\nabla g(\boldsymbol x_k) xk−t∇g(xk)视为一个整体 z \boldsymbol z z,同时忽略常系数,其本质上就是求其的proximal映射: p r o x h ( z ) = 1 2 t ∣ ∣ x − z ∣ ∣ 2 2 + h ( x ) prox_h(\boldsymbol z)=\frac{1}{2t}||\boldsymbol x-\boldsymbol z||_2^2+h(\boldsymbol x) proxh(z)=2t1∣∣x−z∣∣22+h(x)由此,便得到了近段梯度法的迭代公式: x k + 1 = p r o x h , t ( x k − t ∇ g ( x k ) ) \boldsymbol x_{k+1}=\mathop{prox}\limits_{h,t}(\boldsymbol x_k-t\nabla g(\boldsymbol x_k)) xk+1=h,tprox(xk−t∇g(xk))
三、The Proximal Gradient Method Applied to Lasso Regression
The proximal gradient method is commonly used to solve the Lasso regression problem:

$$\arg\min_{\boldsymbol W}\frac{1}{2}\|\boldsymbol{XW}-\boldsymbol y\|_2^2+\lambda\|\boldsymbol W\|_1$$

The first term of the loss is a differentiable convex function and the second is the L1 regularizer, so we can plug directly into the proximal gradient iteration using the L1 proximal mapping derived above. With $g(\boldsymbol W)=\frac{1}{2}\|\boldsymbol{XW}-\boldsymbol y\|_2^2$, so that $\nabla g(\boldsymbol W)=\boldsymbol X^T(\boldsymbol{XW}-\boldsymbol y)$, and $h(\boldsymbol W)=\lambda\|\boldsymbol W\|_1$, each coordinate updates by soft thresholding:

$$w_i^{k+1}=S_{t\lambda}\left(w_i^k-t\,\boldsymbol X^T_{\cdot,i}(\boldsymbol{XW}^k-\boldsymbol y)\right)$$

In particular, at a fixed point with $w_i=0$, the condition $|\boldsymbol X^T_{\cdot,i}(\boldsymbol{XW}-\boldsymbol y)|\le\lambda$ keeps $w_i$ at exactly zero. This shows how L1 regularization produces sparse coefficients.
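This update is the ISTA (iterative soft-thresholding) algorithm for Lasso. A minimal sketch on synthetic data, using the standard safe step size $t=1/\lambda_{\max}(\boldsymbol X^T\boldsymbol X)$ (an assumption of this sketch, not stated in the text):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 1/2 ||XW - y||^2 + lam * ||W||_1 by proximal gradient (ISTA)."""
    W = np.zeros(X.shape[1])
    t = 1.0 / np.linalg.norm(X.T @ X, 2)      # step size: inverse largest eigenvalue
    for _ in range(n_iter):
        Z = W - t * X.T @ (X @ W - y)         # gradient step on the smooth part
        W = np.sign(Z) * np.maximum(np.abs(Z) - t * lam, 0.0)  # soft-threshold at t*lam
    return W

# Synthetic sparse regression: only the first 3 of 10 coefficients are nonzero.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true
W = lasso_ista(X, y, lam=0.5)
print(W)  # slightly shrunken estimate of w_true; L1 drives small coordinates to 0
```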